January 28, 2025

Introducing Bento, Snap's ML Platform

Bento is Snap's ML platform, powering personalized experiences for our global community of hundreds of millions of Snapchatters. From training on petabyte-scale datasets to handling a billion predictions per second, Bento scales ML to new heights.

Dive into our architecture, challenges, and solutions for MLOps at scale in this in-depth technical blog post.

Bento Scale - At a Glance

  • Models trained per day: 100s

  • Training compute hours per day: 100K+

  • Largest model size in production: 100+ GB

  • Events processed per day in feature pipelines: 10 trillion

  • Training data size: PBs (~1 PB consumed by a single job)

  • Online feature store volume: 800 TB

  • Models deployed: 500+

  • Predictions per second: >1B

  • Feature store reads: 1 TB/s

Overview

At Snap, ML is instrumental in curating personalized content, formulating friend suggestions, and delivering targeted, personalized ads. To facilitate these diverse ML applications at scale and optimize our model experimentation and productionization lifecycle, we have developed Bento, a single ML platform that manages all major server-side ML workloads. Bento seamlessly handles the end-to-end ML development lifecycle, spanning feature and training data generation, model training and management, and model deployment in an array of configurations. Since Bento's launch in 2020, it has significantly boosted ML engineering efficiency: ML engineers at Snap can now run thousands of different model experiments in a month.

In this article, we provide an in-depth introduction to Bento, delving into its architectural design, various components, and associated practices. We'll examine the complexities involved in maintaining large-scale, real-time ML applications in a production environment and detail the strategies we've implemented to overcome these challenges. To further illuminate the versatility and power of Bento, we'll present a case study exemplifying its role in supporting complex ML applications within Snap.

Designing Bento

Today, ML technology building blocks are widely accessible through open-source software and cloud providers. Such software and services have significantly lowered the entry barrier for developing ML applications. Leveraging these building blocks, we construct our own ML platform to capture these additional benefits:

  • End-to-end experience: We piece together building blocks such as big data technologies, ML frameworks, workflow engines and cloud services to present one centralized, seamless ML dev experience for engineers at Snap.

  • Specialization: We optimize our systems for high scale tasks like ranking and recommendation, where the scale often surpasses the design assumptions of many readily available software and services.

  • Integration: We eliminate the duplicate integration work that multiple product teams would otherwise need to do against our internal tech stack by providing a single ML platform layer to productionize all ML workloads.

  • Support: Our own ML platform engineers can design the system to cater to the unique needs, development styles, and release workflows of ML application teams, and provide more focused support compared to external software and service providers.

Bento’s mission is to accelerate ML adoption and experimentation by providing a polished, scalable and efficient solution. The main challenges we are trying to solve are: (1) ML development throughput, primarily gauged by the quantity of models and experiments each ML engineer creates within a given timeframe, and (2) production scalability, measured by throughput, latency, freshness metrics and operational costs. We concentrate the majority of our platform engineering efforts on developing high-value components that leverage cloud building blocks. These include complex data integrations, real-time ML features, an enhanced ML development experience, and high-performance production inference and retrieval services.

System Architecture

Bento is built on top of a combination of cloud services, open-source software and in-house technologies. The major components in Bento are: 

  • Feature and Training Data Generation

  • Model Training

  • ML Production

  • UI, Control Plane and Monitoring

The following architecture diagram illustrates these components. We will walk through the system architecture in the next few sections.

Feature and Training Data Generation

The source data for ML is stored using a variety of technologies and formats, such as BigQuery tables, Apache Iceberg tables, cloud storage buckets, and various other databases. The first step is to turn this data into features for models to consume.

As illustrated in the following diagram, we subject this data to two primary operations. First, the majority of ML features are generated by aggregating event streams over sliding windows and grouping by various keys. For instance, a video-view feature might be defined as the "number of views in the last 24 hours, grouped by video id". We use Robusta, our custom-built feature platform powered by Apache Spark, to aggregate these types of features. The aggregated features are then (1) stored in an offline Apache Iceberg database for training purposes, and (2) disseminated to an online feature store.
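To make this concrete, the aggregation above can be sketched with plain Apache Spark. This is a minimal illustration rather than Robusta's actual API; the source table, schema, and window parameters are assumptions.

```python
# Minimal sketch of a sliding-window feature aggregation in plain PySpark.
# Table names, schema, and window parameters are illustrative assumptions,
# not Robusta's API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("video_view_features").getOrCreate()

# Hypothetical event stream materialized as an Iceberg table.
views = spark.read.format("iceberg").load("events.video_views")

# "Number of views in the last 24 hours, grouped by video id",
# recomputed on a sliding one-hour hop.
view_counts_24h = (
    views
    .groupBy(F.window("event_time", "24 hours", "1 hour"), "video_id")
    .agg(F.count("*").alias("views_24h"))
)

# Persist offline for training; a separate job would publish the latest
# window to the online feature store.
view_counts_24h.writeTo("features.video_views_24h").append()
```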

The second operation generates training data. Training data is composed of training records, each assembled by pairing a raw event, which contains the label, with aggregated features. In a greatly simplified example of predicting user-video preferences, the raw event might be "user id A likes video id B". Then, this event is enriched with a selected set of user features and video features. Furthermore, training data must undergo transformations such as standardization, range-compression, and indexing before being supplied to a deep learning training job. Internally, this process forms a multi-step data pipeline, executed using a combination of Apache Spark and Google Dataflow. The output of this operation is a set of TFRecord files ready to be consumed by the training job.
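As a highly simplified illustration of the record-assembly step, the snippet below pairs a labeled event with a couple of aggregated features and serializes it as a TFRecord; the feature names and join logic are assumptions, not the real pipeline.

```python
# Sketch: pair a labeled event with aggregated features and serialize it
# as a tf.train.Example for a TFRecord shard. Feature names are assumptions.
import tensorflow as tf

def build_example(event: dict, user_features: dict, video_features: dict) -> bytes:
    feats = {
        "label": tf.train.Feature(
            float_list=tf.train.FloatList(value=[event["label"]])),
        "user_views_24h": tf.train.Feature(
            float_list=tf.train.FloatList(value=[user_features["views_24h"]])),
        "video_views_24h": tf.train.Feature(
            float_list=tf.train.FloatList(value=[video_features["views_24h"]])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feats)).SerializeToString()

# Writing many such records produces the TFRecord shards the training job reads.
with tf.io.TFRecordWriter("part-00000.tfrecord") as writer:
    writer.write(build_example(
        {"label": 1.0}, {"views_24h": 12.0}, {"views_24h": 3400.0}))
```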

Model Training

Model training forms the core of ML platforms, and given the diversity of use cases, we've designed our training platform to be highly adaptable.

A standard model training job generally involves four tasks: (1) training data generation, (2) model training and tuning, (3) model evaluation, and (4) model export. Together, these tasks form a workflow graph. Depending on the ML application, there are different variations of this workflow, such as composite models—where a model combines multiple separately-trained sub-models—and model distillation, where a sophisticated, resource-intensive “teacher” model is used to train a simpler, more cost-efficient “student” model. As flexibility is crucial, we've adopted Kubeflow, an open-source workflow standard for ML applications. Through Kubeflow, we have a variety of workflow templates supporting a wide range of scenarios. Depending on the configuration, we orchestrate the workflows on Google Kubernetes Engine or Google Cloud's Vertex AI Pipelines. The screenshot below shows a training workflow example displayed in the Bento UI, and the sketch that follows illustrates how such a workflow can be expressed.
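As a rough sketch of what such a four-task workflow can look like when written with the Kubeflow Pipelines SDK (v2), consider the following; the component bodies, names, and paths are placeholders, not Bento's internal templates.

```python
# Minimal Kubeflow Pipelines (kfp v2) sketch of the four-task training workflow.
# Component bodies, names, and paths are placeholders, not Bento's templates.
from kfp import dsl

@dsl.component
def generate_training_data(date: str) -> str:
    return f"gs://example-bucket/training_data/{date}"   # hypothetical output path

@dsl.component
def train_model(training_data: str) -> str:
    return "gs://example-bucket/models/candidate"        # hypothetical checkpoint

@dsl.component
def evaluate_model(model: str) -> float:
    return 0.80                                          # placeholder metric

@dsl.component
def export_model(model: str) -> str:
    return "gs://example-bucket/models/exported"         # hypothetical serving artifact

@dsl.pipeline(name="ranking-model-training")
def training_pipeline(date: str = "2025-01-28"):
    data = generate_training_data(date=date)
    trained = train_model(training_data=data.output)
    evaluate_model(model=trained.output)
    export_model(model=trained.output)
```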

We have already discussed the training data generation step in the previous section. The model training job is a Python application running on GPU or TPU hosts. While the training application could involve any user-provided code, to enhance ML development speed—particularly for ranking and recommendation applications—we've structured the application into three composable layers: (1) Core framework, (2) User model code, and (3) Training configuration. 

The Core framework is an internal library that encapsulates Tensorflow and Keras, standardizing many aspects of modeling techniques and components for recommendation use cases, such as deep cross networks and transformers. This structure enables ML engineers to experiment more efficiently with various feature ideas, and promotes code reuse and sharing. ML engineers then express their model by authoring user model code. Finally, the training configuration is a YAML file that outlines how the training job should be executed. For example, it specifies the type of hardware, the location of input and output data, and other runtime options that determine the behavior of the training code. This layered structure of our training code facilitates rapid experimentation by merely modifying the job configuration. In addition, we support model training on GPUs and TPUs; our Core framework provides sufficient abstraction to run the same training code on different hardware types.
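To make the configuration layer concrete, here is a hypothetical training configuration in this spirit, loaded with PyYAML; every field name below is an illustrative assumption rather than Bento's actual schema.

```python
# Hypothetical training configuration; field names are assumptions, not Bento's schema.
import yaml

CONFIG = yaml.safe_load("""
hardware:
  accelerator: gpu                  # or tpu
  num_workers: 8
data:
  train: gs://example-bucket/training_data/2025-01-28/*.tfrecord
  eval: gs://example-bucket/eval_data/2025-01-28/*.tfrecord
output:
  model_dir: gs://example-bucket/models/experiment-42
runtime:
  batch_size: 4096
  learning_rate: 0.001
  max_steps: 200000
""")

# The training entry point reads values like these at startup.
print(CONFIG["runtime"]["batch_size"])
```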

Model training has many hyperparameters such as learning rate, batch size, and other model parameters such as layer dimensions. Part of the ML development process is to experiment with different parameter configurations. To facilitate this, we integrate Bayesian optimization to the workflow. Bayesian optimization suggests better parameter values based on observed metrics, reducing the amount of hand tuning ML engineers need to do.
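The snippet below sketches the idea with scikit-optimize's Gaussian-process optimizer as a stand-in for whatever optimizer the workflow actually integrates; the search space and objective are made up for illustration.

```python
# Sketch of Bayesian hyperparameter search using scikit-optimize's gp_minimize
# as a stand-in for the platform's optimizer. The objective is a stub that would
# normally launch a training job and return a validation metric.
from skopt import gp_minimize
from skopt.space import Integer, Real

search_space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(256, 8192, name="batch_size"),
]

def objective(params):
    learning_rate, batch_size = params
    # Placeholder loss; a real run would train and evaluate a model here.
    return (learning_rate - 0.01) ** 2 + abs(batch_size - 2048) / 1e5

result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
print("best params:", result.x, "best loss:", result.fun)
```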

Once a model is generated, we run an evaluation task to yield model metrics. We upload these metrics, along with those from the training phase, to Tensorboard for visualization. The evaluation results for each model are stored in a database and displayed in the Bento UI.

Following the evaluation task, the Model Export task takes the trained model and prepares it for inference. At Snap, we use different hardware for inference, depending on factors like application, latency requirements, performance characteristics, and resource usage. The Model Export task creates optimized models for different types of hardware. For instance, for models destined for inference on GPUs, we examine the compute graph and determine which operations should be performed on GPUs—such as dense matrix multiplications—and which should be performed on CPUs, like feature parsing and embedding lookup. This optimization is crucial for reducing the serving costs of ranking and recommendation models, because these models exhibit a distinct pattern of dense computation that is compute-bound, along with large embedding lookup tables that are memory-size-bound.
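The placement decision described above can be sketched in plain Tensorflow: pin the memory-bound embedding lookup to the CPU and the dense computation to the GPU. The layer sizes are arbitrary, and this is not the actual export code.

```python
# Sketch of CPU/GPU placement for a ranking-style model: the memory-bound embedding
# lookup stays on CPU, the compute-bound dense layers run on GPU. Sizes are arbitrary.
import tensorflow as tf

VOCAB_SIZE, EMB_DIM = 1_000_000, 64

with tf.device("/CPU:0"):
    embedding_table = tf.Variable(
        tf.random.normal([VOCAB_SIZE, EMB_DIM]), name="item_embeddings")

dense = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

@tf.function
def predict(item_ids: tf.Tensor) -> tf.Tensor:
    with tf.device("/CPU:0"):
        emb = tf.nn.embedding_lookup(embedding_table, item_ids)  # memory-bound
    with tf.device("/GPU:0"):
        return dense(emb)                                        # compute-bound

print(predict(tf.constant([1, 42, 7])))
```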

Last but not least, to maintain prediction accuracy, models must be incrementally updated as new events are recorded. Bento fully automates incremental training. Our orchestration handles the incremental appending, joining, and transformation of new training data, preparing it for incremental training. The incremental training itself is scheduled on Kubeflow. After passing through export and various validations, models are automatically deployed to production. We have online ML monitoring to ensure the quality of models in production, which will be introduced later in this article.

ML Production

In the ML landscape, model production manifests in various forms. It can be embedded into an application backend, run offline to generate static predictions or embeddings, hosted as an RPC service, or even deployed to edge devices. Given Snap's Service-Oriented Architecture, Bento primarily supports RPC for model production, with secondary support for asynchronous and batch model invocations.

Our inference service caters to both Tensorflow and PyTorch models, accommodating a range of use-cases such as computer vision, text understanding, and large-scale, real-time recommendation applications. The production of models for recommendation applications, however, presents distinct challenges. Some of these include:

  1. High fanout: A ranking request consists of user-related features and a large number of documents to score. This requires the inference engine to efficiently hydrate all user and document features.

  2. High predictions per second: Typical ranking-driven application backends, such as those for advertising and content, issue requests every few seconds per user session. Multiplied by the number of documents per request, this amounts to a very high prediction throughput. Generally, the more documents we can score per request, the better the precision and recall we can achieve.

  3. Real-time feature serving: For online recommendation applications, it's crucial to respond as quickly as possible to user engagement events to ensure that the recommendation output remains relevant and timely.

Bento has made specific design choices to overcome these challenges. Following is an overview of the internal architecture of the Inference service.

To address the high-fanout challenge, we employ two strategies based on the use case.

In the first strategy, we divide the online feature store into two parts: (1) a distributed key-value store as a centralized feature store for users, and (2) an instance-local feature store for documents co-located with our proprietary inference engine. Given that a typical ranking request involves scoring many documents relevant to a single user request context, we perform one user feature lookup before forwarding the request to inference. Document features are then hydrated within the inference engine. This local hydration process is highly efficient as it occurs without fanning out to different services, resulting in very substantial latency reduction and cost savings. This approach requires the inference engine instance to have the capacity to host the entire document feature corpus. 
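The difference this makes can be illustrated with a toy sketch: once the document corpus lives in the engine's own memory, hydration reduces to local lookups instead of remote calls. The data structures and feature names below are illustrative, not Bento's internals.

```python
# Toy illustration of instance-local document feature hydration: user features are
# fetched once upstream; document features are local dictionary lookups inside the
# inference engine. Structures and names are illustrative, not Bento's internals.
from typing import Dict, List

LOCAL_DOCUMENT_STORE: Dict[str, Dict[str, float]] = {
    "video_123": {"views_24h": 3400.0, "likes_24h": 120.0},
    "video_456": {"views_24h": 87.0, "likes_24h": 4.0},
}

def hydrate_request(user_features: Dict[str, float],
                    candidate_doc_ids: List[str]) -> List[Dict[str, float]]:
    """Builds one scoring row per candidate: no per-document RPC fanout needed."""
    return [
        {**user_features, **LOCAL_DOCUMENT_STORE.get(doc_id, {})}
        for doc_id in candidate_doc_ids
    ]

rows = hydrate_request({"user_views_24h": 12.0}, ["video_123", "video_456"])
```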

The second strategy involves pushing online document feature hydration upstream. This is where our Retrieval service comes into play. We have designed our retrieval service to function as a "one-stop-shop for large corpus document retrieval." A simplified architecture diagram is as follows:

This strategy includes designing a distributed retrieval system capable of performing Approximate Nearest Neighbor, inverted index, and forward index lookups within each leaf node. This facilitates a flexible, single-pass retrieval and hydration query capable of returning relevant documents and their features. This approach is particularly suitable for use cases involving a very large document corpus.
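As a rough sketch of what single-pass retrieval and hydration inside one leaf looks like, the example below pairs an off-the-shelf ANN library (FAISS, used purely for illustration) with an in-memory forward index; it is not the Retrieval service's implementation.

```python
# Sketch of single-pass retrieval + hydration inside one leaf node: an ANN lookup
# (FAISS here, purely for illustration) followed by a local forward-index lookup.
import faiss
import numpy as np

DIM = 64
doc_embeddings = np.random.rand(10_000, DIM).astype("float32")
doc_ids = [f"video_{i}" for i in range(10_000)]
forward_index = {doc_id: {"views_24h": float(i)} for i, doc_id in enumerate(doc_ids)}

ann_index = faiss.IndexFlatIP(DIM)   # exact inner-product index standing in for ANN
ann_index.add(doc_embeddings)

def retrieve_and_hydrate(query_embedding: np.ndarray, k: int = 100):
    """Returns (doc_id, score, features) without a second hydration round trip."""
    scores, indices = ann_index.search(query_embedding.reshape(1, -1), k)
    return [
        (doc_ids[i], float(s), forward_index[doc_ids[i]])
        for i, s in zip(indices[0], scores[0])
    ]

results = retrieve_and_hydrate(np.random.rand(DIM).astype("float32"), k=5)
```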

Given the high volume of requests and documents to be scored every second, our inference engine is heavily optimized for the Total Cost of Operation. Here are some key optimization techniques we use in Bento:

  1. In-house inference engine optimization: For instance, we use a shared-nothing architecture and carry out careful thread tuning for the Tensorflow runtime to ensure every CPU core is optimally utilized with minimal lock contention. Our build is further optimized for the hardware platform on which the engine is deployed.

  2. Model optimization for hardware: The Model Export step in our process produces various versions of the model targeting different hardware types.

  3. Model deployment optimization: Each inference engine loads one or more models. We fine-tune model co-location to take advantage of their differing performance characteristics. We build different inference fleets of engines on Kubernetes to scale differently for each use case.

  4. Request batching optimization: For efficiency, a gRPC ranking request can contain multiple sub-requests for different models. Our Inference Front End service performs dynamic unbundling/batching and routing to optimize performance (a minimal sketch follows this list).

  5. Data plane optimization: We minimize RPC serialization and deserialization costs by optimizing our inference APIs so that ML features can be fetched and transferred as raw bytes, with deserialization happening only within the Inference engine. Together with our custom-optimized Protobuf deserialization, this halved latency and cut data-plane costs by roughly 10x.
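Referring to the request batching item above, the following is a minimal sketch of regrouping sub-requests by model before invoking the engine; the request structure is an assumption made for illustration.

```python
# Minimal sketch of dynamic unbundling/batching: a ranking RPC carries sub-requests
# for several models, and the front end regroups them so each engine call scores one
# batch per model. The request structure is an illustrative assumption.
from collections import defaultdict
from typing import Dict, List, Tuple

SubRequest = Tuple[str, List[float]]  # (model_name, feature_vector)

def regroup_by_model(sub_requests: List[SubRequest]) -> Dict[str, List[List[float]]]:
    batches: Dict[str, List[List[float]]] = defaultdict(list)
    for model_name, features in sub_requests:
        batches[model_name].append(features)
    return batches

def route(sub_requests: List[SubRequest]) -> None:
    for model_name, batch in regroup_by_model(sub_requests).items():
        # One engine invocation per model with a full batch,
        # instead of one call per sub-request.
        print(f"score {len(batch)} rows with {model_name}")

route([("ctr_model", [0.1, 0.2]), ("cvr_model", [0.3]), ("ctr_model", [0.5, 0.6])])
```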

UI, Control Plane and Monitoring

Finally, to streamline the ML development process, we've created an internal control plane and user interface that manages the entire development lifecycle. Through the Bento UI, ML engineers can:

  1. Explore features and construct datasets.

  2. Edit, launch, and monitor training workflows.

  3. Visualize metrics for trained models and manage experiments.

  4. Deploy models and retrieval indexes to production.

  5. Set up and schedule incremental training.

  6. Monitor features, predictions, and operating costs.

ML engineers can deploy models from our Bento UI and monitor their performance. Behind the scenes, the deployment control plane executes a workflow to reconcile the difference between the desired state and the current running state. 
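A reconciliation step of this kind can be sketched as follows; the state shapes and actions are hypothetical.

```python
# Hypothetical sketch of a deployment reconciliation step: compare the desired set of
# (model, version) deployments against what is currently running and derive actions.
from typing import Dict, List

def reconcile(desired: Dict[str, str], running: Dict[str, str]) -> List[str]:
    actions = []
    for model, version in desired.items():
        if running.get(model) != version:
            actions.append(f"deploy {model}@{version}")
    for model in running:
        if model not in desired:
            actions.append(f"tear down {model}")
    return actions

print(reconcile(
    desired={"ad_ctr": "v42", "friend_rank": "v7"},
    running={"ad_ctr": "v41", "legacy_model": "v3"},
))
```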

ML production systems require meticulous monitoring. While the monitoring of standard microservice software metrics such as availability and throughput is well established, ML monitoring is a relatively new domain. In Bento, we have constructed our own monitoring pipeline to detect anomalies in features and predictions. This includes monitoring:

  • Various statistical measures of predictions, including their mean, maximum, and percentiles

  • Statistics from prediction-time features for active models, including the missing ratio, zero ratio, mean, maximum, minimum, empty list ratio, quantile for dense numerical features, and list size for sparse ID features

  • Discrepancies between online and offline predictions and features

Our ML monitoring is enabled through a series of processes involving data logging, storage, and querying, all of which are built on the Google Cloud Storage and BigQuery platforms. The diagram below provides a detailed representation of how we've constructed this system.
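As an illustration of the kind of statistics listed above, the query below computes a few of them over a logged feature table in BigQuery; the project, table, and column names are hypothetical.

```python
# Sketch of a monitoring query over logged prediction-time features in BigQuery.
# Project, table, and column names are hypothetical; only the statistics mirror
# those listed above.
from google.cloud import bigquery

QUERY = """
SELECT
  feature_name,
  AVG(CAST(value IS NULL AS INT64))        AS missing_ratio,
  AVG(CAST(value = 0 AS INT64))            AS zero_ratio,
  AVG(value)                               AS mean,
  MAX(value)                               AS max,
  MIN(value)                               AS min,
  APPROX_QUANTILES(value, 100)[OFFSET(50)] AS p50
FROM `example_project.ml_monitoring.feature_logs`
WHERE DATE(log_time) = CURRENT_DATE()
GROUP BY feature_name
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(row.feature_name, row.missing_ratio, row.zero_ratio, row.mean)
```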

Case Study: Snapchat Ad Ranking

Snapchat ad ranking aims to serve the right ad to the right user at the right time, selecting from millions of ads in our inventory at any given moment. Serving the right ad, in turn, generates value for our community of advertisers and Snapchatters. Under the hood, a very high-throughput, real-time ad auction is powered by our ML platform, Bento.

This post on our engineering blog provides an overview of the Snapchat ad ranking system, the challenges unique to the online ad ecosystem, and the corresponding ML development cycle.

ML plays a crucial role in the Ad Ranking system, utilizing multiple models trained on a variety of data and deployed in different modes. It faces challenges such as training complex models, maintaining real-time features, improving development speed, ensuring data accuracy, automating updates, and optimizing costs. All of this is efficiently managed with Bento's technology.

Evolution of Bento

Two years ago, Snap doubled down on ML across the company. We established a technical strategy to scale the machine learning capabilities that are critical to our business, from deepening engagement to driving revenue. As a result of this investment, our ranking model size has increased by 20x, training data size by 40x, and feature and prediction volumes by 2-3x. Additionally, we have built up universal user understanding, graph understanding, and content understanding capabilities that help us deliver even better models by leveraging data across all Snap product surfaces. And this is just the beginning.

In 2025, we are growing our team. We're looking for talented people to build the next evolution of Snap's ML Platform. If your passions lie in ML, big data, large-scale distributed systems, or tackling difficult performance challenges, we encourage you to consider a role within our dynamic team!