Over the past 2 years, Snap Engineering has evolved the way our cloud infrastructure is built, from running a monolith inside of Google App Engine to microservices running in Kubernetes across both Amazon Web Services and Google Cloud. This new architecture has saved Snap millions of dollars via a 65% reduction in compute costs while also reducing latency and increasing reliability for Snapchatters. However, it takes more than sprinkling some Kubernetes on VMs to achieve Snap’s high scale and multi-cloud vision, while complying with our strict security and privacy requirements.
In this post, we’ll cover how Snap expedited this architectural shift through the combination of proven design patterns like Service Mesh, leading open source components like Envoy, and internal orchestration layers that radically simplify service ownership.
Snap’s backend scaled for years via a monolithic app running in Google App Engine. A monolith is a good architecture for accommodating rapid growth in features, engineers, and customers. However, our monolith became challenging as we scaled. Engineers wanted to own their architecture. They wanted clear, explicit relationships between components to limit the blast radius during outages. They wanted to iterate quickly and independently on their data models, by eliminating shared datastores that can’t evolve due to tight couplings across teams. As a company, Snap wanted a viable path to shifting workloads to other cloud providers like AWS. None of the above could be satisfied via the monolith.
So, we proposed a shift to a service-oriented architecture, with Snap’s backend powered by independent, interoperable services with clear contracts. We would enable rich functionality for Snapchatters by allowing services to take dependencies on each other, composing new features out of service-to-service communication. We wanted a universe of self-contained, independently deployable services, aka microservices.
Making a microservice from scratch is no simple task, though. There were many elements of our underlying infrastructure that needed to be considered: network topology, authentication, cloud resource provisioning, deployment, logs, metrics, traffic routing, rate limiting, and staging vs. production environments.
With so much complexity, we needed to find a way to implement this design without risking the experience for Snapchatters. Not only that, it would be a bad use of our engineers’ time if every team had to solve these problems for each new service. We want our engineers delivering differentiating value for our customers, not mired in infrastructure toil.
To enable SOA, we chose to generalize these problems and solve them in a reusable way. We prioritized these design tenets:
Secure by default. Authentication, authorization, and network security should be defaults, not optional, within the platform.
Clear separation of concerns between services’ business logic and the infrastructure. We want loose coupling so each side can iterate independently.
Abstract the differences between cloud providers where we can. We want to minimize deep provider dependencies so it’s feasible to shift services between AWS, GCP, and other cloud providers.
Centralized service discovery and management. We want all service owners to have the same experience for owning a service, regardless of where the service is running.
Minimal friction for creating new services. An intern should be able to stand up a productionized service by lunch time.
We brainstormed a layered architecture to achieve the above. It encompassed networking, identity, provisioning, deployments, separation of business logic from infrastructure via Docker containers, and orchestration via Kubernetes. Service configuration could be centrally managed via a control plane that spanned both cloud providers.
It was an ambitious plan, and we didn’t have years of time or an army of engineers to implement it. We realized we could move faster if we adopted the most viable open source offerings for our core building blocks, rather than writing everything from scratch. Open source alone wasn’t enough, though. We would need to own the implementation to meet Snap’s custom requirements and achieve our design tenets.
Snap engineers began rallying around Envoy as one of these core building blocks. Envoy is an open source service and edge proxy originally built at Lyft.
Envoy was intriguing for several reasons:
Compelling feature set: Envoy supports gRPC and HTTP/2 for upstream and downstream communication, hot restarts on configuration changes, client-side load balancing, and robust circuit-breaking.
Clear separation of data plane and control plane: Envoy on its own is a simple proxy. Config changes can be propagated to Envoys at runtime via a set of dynamic management APIs called xDS. xDS covers the discovery of routes, clusters, listeners, endpoints, and more.
Extensible: Envoy supports pluggable filters, allowing developers to inject their own functionality.
Excellent observability: Envoy offers a broad set of upstream and downstream metrics across latency, connections, requests, retries, circuit-breakers, and much more.
Robust enough for both the edge and internal services: We wanted the same process running at the edge of our network as inside of our network.
Rich ecosystem: Both AWS and Google Cloud are investing heavily around Envoy. As a cloud-native company, this provider support was an encouraging signal.
We decided to build on Envoy as a consistent layer of service-to-service communication for all of Snap’s microservices.
We envisioned each service host running an Envoy sidecar container. All ingress and egress for the service container would flow through Envoy, with the service having no direct interaction with the network. By default, Envoy would enforce TLS and publish metrics on all inbound and outbound traffic. Through this, we could guarantee that all requests between services were secure and observable. Each Envoy would connect to a custom control plane, receiving service discovery and fine-grained traffic management settings over its xDS API.
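To make that separation concrete, here is a minimal sketch of what this pattern looks like from the application’s point of view (illustrative only, not Snap’s actual service code; the ports and handler are assumptions): the business logic listens only on localhost, while the Envoy sidecar owns the network interface, terminates TLS, emits metrics, and forwards traffic to the loopback port.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// The service never touches the external network directly.
	// The Envoy sidecar listens on the host/pod interface, enforces TLS,
	// records metrics, and proxies requests to this loopback-only listener.
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/hello", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello from behind the mesh")
	})

	// Outbound calls also go through the sidecar: the service dials a local
	// egress port and Envoy handles discovery, TLS, retries, and circuit
	// breaking for the upstream dependency, e.g.
	//   http.Get("http://127.0.0.1:9001/friends/v1/list")

	log.Fatal(http.ListenAndServe("127.0.0.1:8080", mux))
}
```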
This design pattern is called a Service Mesh. Matt Klein wrote a great piece on data planes, control planes, and how they combine to form a Service Mesh.
Service Mesh was an exciting idea, but there were a number of hard questions that weren’t yet solved. How do we support Snapchat’s mobile client authentication scheme in Envoy? How should engineers manage their Envoy configurations? How does all this work securely across AWS and Google Cloud? Questions like these drove us towards a layered architecture that we call Snap’s Service Mesh.
Switchboard
Our hub for service configuration is Switchboard, an internal web app. Switchboard provides a single control panel for Snap’s services across cloud providers, regions, and environments. Through Switchboard, service owners can manage their service dependencies, shift traffic between clusters, and drain regions.
Switchboard presents its own simplified configuration model that centers around services. A service has a protocol and basic metadata like owner, email list, and description. These services have clusters which can be in any cloud provider, region, or environment. Each cluster has a cloud native identity. Switchboard services have dependencies and consumers, which are other Switchboard services.
We chose to simplify Switchboard’s configuration model rather than expose Envoy’s full API surface area. If we surfaced the entire Envoy API to Snap engineering teams, we could find ourselves supporting a combinatorial explosion of configurations. By standardizing and hiding as much of the configuration as possible, we simplify the developer experience and make it easier to reason about operations across services. We found that configurability is most useful for service-specific and operational concerns like connect timeouts, healthcheck endpoints, circuit breaker policies, and retry policies.
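As a rough illustration of that model, the configuration a service owner edits looks more like a small set of records than raw Envoy API objects. The field names below are hypothetical, not Switchboard’s actual schema:

```go
package switchboard

import "time"

// Service is the owner-facing configuration unit. Hypothetical schema for
// illustration only; the real Switchboard model is internal to Snap.
type Service struct {
	Name        string
	Protocol    string // e.g. "grpc" or "http2"
	Owner       string
	EmailList   string
	Description string

	Clusters     []Cluster // deployments across providers, regions, environments
	Dependencies []string  // other Switchboard services this service calls
	Consumers    []string  // Switchboard services allowed to call this one

	// The handful of operational knobs that are exposed; the rest of the
	// generated Envoy configuration is standardized.
	ConnectTimeout            time.Duration
	HealthCheckPath           string
	CircuitBreakerMaxRequests uint32
	Retry                     RetryPolicy
}

// Cluster is one deployment of the service in a specific provider, region,
// and environment, with its own cloud-native identity.
type Cluster struct {
	Provider    string // "aws" or "gcp"
	Region      string
	Environment string // "staging" or "production"
	Identity    string
}

// RetryPolicy captures the retry settings a service owner may tune.
type RetryPolicy struct {
	MaxRetries    uint32
	PerTryTimeout time.Duration
}
```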
Configuration changes in Switchboard are saved to DynamoDB. We then take the high-level Switchboard configuration model, expand all services impacted by the configuration change, and regenerate their complete Envoy configuration across clusters, listeners, and routes.
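The regeneration step can be pictured as a pure function from the high-level model to xDS resources: every dependency of an affected service becomes an Envoy cluster plus a route, and the service’s listeners are rebuilt in the same pass. The sketch below is a drastically simplified, hypothetical illustration in the same vein as the schema above; the real pipeline emits full Envoy protobuf resources, not strings.

```go
package switchboard

// EnvoyConfig is a simplified stand-in for the xDS resources (listeners,
// routes, clusters) regenerated for one service's Envoys.
type EnvoyConfig struct {
	Version   string
	Listeners []string
	Routes    map[string]string // route prefix -> upstream cluster
	Clusters  []string
}

// Regenerate rebuilds the complete Envoy configuration for a service from
// its Switchboard-level definition. Hypothetical sketch only.
func Regenerate(service string, dependencies []string, version string) EnvoyConfig {
	cfg := EnvoyConfig{
		Version:   version,
		Listeners: []string{"ingress:8443", "egress:9001"},
		Routes:    map[string]string{},
	}
	for _, dep := range dependencies {
		cluster := dep + "-cluster"
		cfg.Clusters = append(cfg.Clusters, cluster)
		cfg.Routes["/"+dep+"/"] = cluster
	}
	return cfg
}
```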
Every Envoy on the Mesh connects to a regionalized xDS Control Plane via bidirectional gRPC stream. Once the Envoy configuration has been regenerated for a service, the Control Plane sends the updated config to a small subset of Envoys and measures their health over a period of minutes before committing the changes across the Mesh. Consistent rollout and rollback semantics are possible since we’ve centralized the config generation logic for all Envoys through Switchboard.
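That staged rollout can be sketched as a simple control loop: push the new config version to a small canary subset, bake it, check health, and only then commit to the rest of the mesh. This is an illustrative outline, not Snap’s Control Plane code; the subset size, bake time, and health signal are placeholders.

```go
package controlplane

import (
	"context"
	"fmt"
	"time"
)

// Pusher abstracts the xDS stream to a connected Envoy: the Control Plane
// pushes a config version and reads back a health signal derived from that
// Envoy's metrics.
type Pusher interface {
	Push(ctx context.Context, envoyID, configVersion string) error
	Healthy(ctx context.Context, envoyID string) (bool, error)
}

// Rollout canaries a regenerated config on a small subset of Envoys before
// committing it across the mesh. Any canary failure aborts the rollout so
// the remaining Envoys keep their previous configuration.
func Rollout(ctx context.Context, p Pusher, envoys []string, version string, bake time.Duration) error {
	if len(envoys) == 0 {
		return nil
	}
	canaryCount := len(envoys) / 20 // e.g. ~5% of the fleet
	if canaryCount == 0 {
		canaryCount = 1
	}
	canaries, rest := envoys[:canaryCount], envoys[canaryCount:]

	for _, id := range canaries {
		if err := p.Push(ctx, id, version); err != nil {
			return fmt.Errorf("canary push to %s failed: %w", id, err)
		}
	}

	time.Sleep(bake) // observe canary health over a period of minutes

	for _, id := range canaries {
		ok, err := p.Healthy(ctx, id)
		if err != nil || !ok {
			return fmt.Errorf("canary %s unhealthy, aborting rollout of %s", id, version)
		}
	}

	for _, id := range rest {
		if err := p.Push(ctx, id, version); err != nil {
			return fmt.Errorf("push to %s failed: %w", id, err)
		}
	}
	return nil
}
```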
Switchboard does a lot more than just Envoy configuration at this point. Service owners can provision and manage Kubernetes clusters directly from Switchboard. They can generate Spinnaker deployment pipelines with canaries, health checking, and zonal rollouts. Again, we have taken an opinionated approach to both provisioning and deployments, surfacing a handful of fields and standardizing the rest.
Network and API Gateway
We want to minimize the number of services that are exposed to the Internet. This gives a first level of defense in the event of vulnerabilities, similar to what on-premises network topologies achieve with physical firewalls. We designed a shared, internal, regional network for our microservices. Services within the same region can communicate without going over the public Internet, and no external traffic source can communicate directly with the internal network. In each region, only a single system is exposed to the Internet: our API Gateway.
API Gateway is the front door for all requests from the Snapchat client. It runs the same Envoy image that our internal microservices run, and it connects to the same Control Plane. Custom Envoy filters are enabled via our Control Plane. These filters handle Snapchat’s authentication schemes, as well as our rate limiting and load shedding implementations. Once the filter chain is complete, Envoy routes requests to the appropriate microservice via the Service Mesh.
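Conceptually, the gateway’s request path is an ordered chain: authenticate, rate-limit, shed load if saturated, then route into the mesh. The sketch below is a plain Go middleware analogue of that ordering, not Envoy filter code (Snap’s filters run inside Envoy itself); the handler names and checks are assumptions for illustration.

```go
package gateway

import "net/http"

// chain applies middlewares in order, so requests are authenticated and
// rate-limited before any routing decision is made.
func chain(h http.Handler, mws ...func(http.Handler) http.Handler) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

func authenticate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Placeholder for Snapchat's real client authentication scheme.
		if r.Header.Get("Authorization") == "" {
			http.Error(w, "unauthenticated", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Placeholder for per-client rate limiting and load shedding.
		if overLimit(r) {
			http.Error(w, "rate limited", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func overLimit(*http.Request) bool { return false }

// routeToMesh stands in for Envoy's routing step: once the filter chain
// passes, the request is forwarded to the right microservice over the mesh.
var routeToMesh http.Handler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("routed to upstream service"))
})

// Handler is the composed gateway pipeline.
var Handler = chain(routeToMesh, authenticate, rateLimit)
```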
A Diagram Is Worth a Thousand Words
At a high level, the system looks like this:
Engineers across Snap banded together to work on the Service Mesh in early 2018. Thanks to highly engaged customers, rapid iteration, and a heck of a lot of hard work, we brought these systems to production quickly. Today, Snap’s Service Mesh is live in 7 regions across AWS and Google Cloud. We have 300+ production services live on the Mesh, handling 10 million qps of service-to-service requests.
We have an ambitious year ahead of us as we continue to scale rapidly, enrich our developer experience, and hold a high operational bar across the globe. We need great engineers to make this happen! If you made it this far, I encourage you to go one step farther and check out Careers at Snap.