July 16, 2021

Build a Reliable System in a Microservices World at Snap

How do you orchestrate a complex workflow across tens of microservices running on multiple cloud providers? How do you build it reliably while focusing on core business logic?

In this post, we will talk about how Snap embraces the Temporal open source project to solve microservice orchestration with its powerful and flexible workflow engine.

Microservices orchestration?

A microservice-oriented architecture has clear advantages over a monolith: a clear responsibility and ownership model, easier development and maintenance, easier scaling and regionalization, a smaller blast radius, and more. Snap Engineering has adopted a Service Oriented Architecture (SOA) built on Envoy and Service Mesh. This effort brought flexibility and efficiency to our teams while providing clear responsibility and separation for our organizational setup.

However, with microservices, things are distributed across multiple fleets, databases, regions and cloud providers. Building a reliable and efficient system that can communicate between services, maintain application states and gracefully deal with outages has become a hard problem. In order to properly orchestrate services and track application states, technologies including queues and databases are often used. Engineers end up spending lots of time implementing state tracking functionality and writing error handling code to keep their system intact and resilient to component failures.

Take Snap’s Asynchronous Ads Reporting as an example. This system helps advertisers generate analytics reports asynchronously based on their requirements and notifies them by email when the reports are ready.

In a simplified view, this involves three microservices:

  • Data Fetching Service: Fetches raw data based on the query, pre-processes it, and generates formatted files. Depending on the query, processing can run for minutes.

  • Report Generation Service: Reads the generated files, runs them through data analysis jobs, produces a downloadable report, and puts it into another, user-accessible file system.

  • Notification Service: Sends an email notification to the advertiser who requested the report.

Figure 1. Workflow View for the Asynchronous Ads Reporting System

At first glance, this system looks easy enough to build: we just need an async API that, in the background, calls each microservice one by one until the advertiser receives the notification.

But building such a system and making it run reliably is not that simple. In the real world, anything can happen. Any one of the microservices could go down, or there could be network issues at any point in time. For example, if a report is generated and the Notification Service then goes down, the advertiser will not receive the notification, and the processing done in the previous steps will have been wasted. To address this, we would usually add retries, timeouts, or other error handling logic so that a short outage does not void all the previous work.
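The kind of hand-written retry logic described above can be sketched as follows. This is a minimal illustration, not code from Snap's system: `TransientError`, `call_with_retries`, and `make_flaky_notifier` are hypothetical names, and the flaky notifier only simulates a Notification Service that is briefly unavailable.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a brief outage)."""


def call_with_retries(func, *, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_notifier(failures_before_success):
    """Build a stand-in for the Notification Service that fails a few times first."""
    state = {"calls": 0}

    def send_notification():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("notification service unavailable")
        return f"email sent on attempt {state['calls']}"

    return send_notification


if __name__ == "__main__":
    notifier = make_flaky_notifier(failures_before_success=2)
    # Two simulated failures are retried; the third attempt succeeds.
    print(call_with_retries(notifier, max_attempts=5, base_delay=0.01))
```

Even this small helper has to be repeated, tuned, and tested at every integration point, which is the boilerplate cost the post is describing.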

In other cases, an advertiser sends a request and the API receives it and starts the job, but the advertiser’s network goes down while the response is being sent back. The advertiser then cannot tell whether the request was received, so typically the same request is resent, which starts a duplicate job. This causes problems if the system is not idempotent or if jobs are extremely costly to run. Solving this requires tracking the application status, so that we know whether a job is already running for the same request.
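One common way to track status for duplicate requests is to derive a stable key from the request and look it up before starting a job. The sketch below assumes an in-memory dictionary in place of a real database; `ReportJobTracker`, its methods, and the request fields are hypothetical and chosen only for illustration.

```python
import hashlib
import json
import uuid


class ReportJobTracker:
    """In-memory stand-in for a persistent job-status store (hypothetical)."""

    def __init__(self):
        self._jobs = {}  # request key -> job record

    @staticmethod
    def request_key(advertiser_id, query):
        """Derive a stable key so that identical requests map to the same job."""
        payload = json.dumps(
            {"advertiser": advertiser_id, "query": query}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def submit(self, advertiser_id, query):
        """Start a job, or return the existing one if this request was seen before.

        Returns (job, created), where created is False for a duplicate request.
        """
        key = self.request_key(advertiser_id, query)
        if key in self._jobs:
            return self._jobs[key], False
        job = {"job_id": str(uuid.uuid4()), "status": "RUNNING"}
        self._jobs[key] = job
        return job, True


if __name__ == "__main__":
    tracker = ReportJobTracker()
    query = {"metric": "impressions", "range": "last_7_days"}
    first, created = tracker.submit("advertiser-1", query)
    resent, created_again = tracker.submit("advertiser-1", query)
    # The resent request maps to the job that is already running.
    print(created, created_again, first["job_id"] == resent["job_id"])
```

A production version would also need the lookup and insert to be atomic and the store to survive restarts, which is where the queue-plus-database effort mentioned earlier comes from.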

We can see from the above discussion that building a reliable “simple” microservices system is not simple. It often requires engineers to think through edge cases and write boilerplate error handling logic wherever systems are integrated. Beyond error handling, in certain systems such as payment or billing systems, failing to track application state could mean lost money or worse.

At Snap, we must have reliable services so that our advertisers have a consistently high-quality experience. At the same time, we want our engineers to be efficient and able to iterate fast while developing reliable services. We need a central orchestration solution to help orchestrate our services and track system state across them. Luckily, we have found such a solution. We give a brief introduction in the next section and show how Snap Engineering is embracing it.

Introducing Temporal

Temporal is an open source project that helps with service and workflow orchestration problems. In its own words:

Temporal is the open source microservices orchestration platform for running mission critical code at any scale.

Temporal solves the orchestration problem by preserving the state of your workflow execution and coordinating execution through internal distributed queues.

Figure 2. Temporal Orchestration Architecture

For our previous example, we create an Asynchronous Processing Orchestrator to orchestrate the processing workflow. We can then model the workflow as three activities, each of which calls into one of the microservices, as shown in Figure 2. Temporal Server orchestrates each activity’s execution and persists processing state into its persistence layer. The order of execution is defined by our workflow code and scheduled through Temporal Server, whose internal distributed task queue then distributes the execution tasks to our fleet of worker machines.
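To make the "workflow code defines the order, activities do the work" split concrete, here is a toy model in plain Python. It does not use the Temporal SDK and is not how Temporal Server is implemented: `WorkflowRecord`, the three activity functions, and `ads_reporting_workflow` are hypothetical, and the "persistence layer" is just an in-memory object.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowRecord:
    """State recorded for one workflow run (kept in memory in this sketch)."""
    workflow_id: str
    completed: dict = field(default_factory=dict)  # activity name -> result


# Hypothetical activities; in a real system each would call a microservice.
def fetch_data(query):
    return f"raw-files-for:{query}"


def generate_report(data_files):
    return f"report-from:{data_files}"


def notify_advertiser(report):
    return f"emailed:{report}"


def ads_reporting_workflow(record, query):
    """Workflow code: defines the order in which the activities run."""
    steps = [
        ("fetch_data", fetch_data),
        ("generate_report", generate_report),
        ("notify_advertiser", notify_advertiser),
    ]
    value = query
    for name, activity in steps:
        value = activity(value)  # each step consumes the previous step's output
        record.completed[name] = value  # record the result after every step
    return value


if __name__ == "__main__":
    record = WorkflowRecord(workflow_id="report-123")
    print(ads_reporting_workflow(record, "q1"))
    print(list(record.completed))
```

In this toy version the workflow function calls the activities directly. In the architecture described above, the scheduling, state persistence, and dispatch to worker machines are handled by Temporal Server instead.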

Because the system state is persisted, if any service goes down, even Temporal Server itself, things can be restored to their previous state and resumed. Temporal provides an SDK that interacts with Temporal Server and includes features such as error handling configuration, enabling engineers to focus on core business logic and assume things will run reliably by default. Additionally, because Temporal tracks and persists all the intermediate states of workflows, systems become interactive: callers can query the state of a workflow at any time and decide on the next action based on that information.
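The idea of resuming from persisted state, and of querying that state, can be illustrated with a small checkpointing runner. This is a simplified sketch under stated assumptions, not Temporal's mechanism: `DurableRun` is a hypothetical class that saves each step's result to a JSON file, and the three step functions simulate the services, with the notification step failing during a pretend outage.

```python
import json
import os
import tempfile


class DurableRun:
    """Toy workflow runner that checkpoints each step's result to a JSON file."""

    def __init__(self, path, steps):
        self.path = path
        self.steps = steps  # ordered list of (name, function)

    def _load(self):
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                return json.load(f)
        return {}

    def _save(self, state):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(state, f)

    def query(self):
        """Return the names of the steps that have completed so far."""
        return list(self._load())

    def run(self, initial_input):
        state = self._load()
        value = initial_input
        for name, func in self.steps:
            if name in state:
                value = state[name]  # already done: reuse the saved result
                continue
            value = func(value)
            state[name] = value
            self._save(state)  # checkpoint before moving to the next step
        return value


if __name__ == "__main__":
    calls = []
    outage = {"active": True}

    def fetch(query):
        calls.append("fetch")
        return f"files({query})"

    def report(files):
        calls.append("report")
        return f"report({files})"

    def notify(rep):
        calls.append("notify")
        if outage["active"]:
            raise RuntimeError("notification service down")
        return f"emailed({rep})"

    path = os.path.join(tempfile.mkdtemp(), "run.json")
    run = DurableRun(path, [("fetch", fetch), ("report", report), ("notify", notify)])
    try:
        run.run("q1")
    except RuntimeError:
        print(run.query())  # ['fetch', 'report']
    outage["active"] = False
    print(run.run("q1"))  # emailed(report(files(q1)))
    print(calls)  # ['fetch', 'report', 'notify', 'notify']
```

After the simulated outage ends, the second `run` skips the fetch and report steps and only repeats the notification, so the expensive earlier work is not redone.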

Temporal fits naturally into Snap’s Service Oriented Architecture:

  • It is gRPC based, so it works organically with Snap’s Service Mesh, and teams can onboard easily with built-in AuthN and easy AuthZ support.

  • It improves engineering efficiency and helps teams focus on writing business logic.

  • It helps with application state management, saving engineering resources because teams do not have to write their own queue-plus-database solution.

Temporal at Snap

At Snap Engineering, we are always seeking out innovative ways to help our engineers solve exciting challenges from our rapidly growing business. Snap’s Service Oriented Architecture strategy has enabled us to rapidly scale our service development and enrich our engineering experience. Bringing in Temporal to help with our service orchestration challenges adds to our efforts to make our service deployments both more efficient and more reliable.

We started working with the Temporal open source community at the end of 2019 and have maintained a good cooperative relationship since. Along the way, we have discussed ideas for different use cases with them and contributed back through process feedback, product ideas, code contributions, and issue reports to help make Temporal and its community better.

“As one of the early adopters of Temporal, Snap has been a firm believer in the impact and value of our technology. The scope and size of the problems Snap is solving with Temporal is immense and impressive to say the least. Just knowing how much pain Temporal removes from Snap's orchestration challenges is one of the greatest rewards I can imagine as a co-creator of this technology.”  Maxim Fateev (CEO and Co-Creator of Temporal)

To support more teams at Snap, our internal Temporal cluster is hosted as a multi-tenant environment that runs on top of Snap’s Service Mesh, which provides authN, authZ, and load balancing support. With Service Mesh, team onboarding is as simple as getting access approved and establishing a connection using our vended client libraries, which combine Service Mesh connections and the Temporal SDKs in one.

Teams at Snap are solving exciting challenges with Temporal right now. One noteworthy use case is Snap’s Continuous Integration/Continuous Deployment (CI/CD) pipeline, which helps all engineers deploy their services more efficiently. Using Temporal’s orchestration, our engineers were able to build, in a short time, a reliable solution that connects multiple build systems and deployment services. In this use case, Temporal enabled us to concentrate on the business logic, which shortened our development time, eliminated complexity in the system, and most importantly made it flexible and fast to integrate with other services and extend the functionality of the CI/CD workflows.

At Snap, hundreds of new services and systems are created to meet the demands of our rapidly growing business. Besides adopting Temporal to address our service orchestration challenges, Snap Engineering is always working with an open mind to create and find solutions to other exciting large-scale infrastructure challenges. If you are interested in what we are solving here at Snap, please check out Careers at Snap. We’re hiring!
