March 28, 2022

Training Large-Scale Recommendation Models with TPUs

At Snap, we train a large number of deep learning models every day to continuously improve the ad recommendation quality to Snapchatters and provide more value to the advertisers. These ad ranking models have hundreds of millions of parameters and are trained on billions of examples. Training an ad ranking model is a computation-intensive and memory lookup heavy task. It requires a state-of-the-art distributed system and performant hardware to complete the training reliably and in a timely manner. This post describes how we leveraged Google’s Tensor Processing Units (TPU) for fast and efficient training.

Key Takeaways

TPUs can offer much faster training speed and significantly lower training costs for recommendation system models than the CPUs.
TPU hardware is well-supported by TensorFlow, which provides a powerful API to handle large embedding tables and fast lookups.
TPU offers near-linear scaling performance for a small number of cores, but scaling becomes challenging the higher the core count.
Moving from asynchronous to synchronous training helped us have more stable experimentations.

Distributed Training for Ad Ranking Models

Snap's ad platform aims to serve relevant and timely ads to 300M+ users daily while respecting their privacy preferences. We rank millions of ads using highly customized deep neural network (DNN) models that predict the probability of the various conversion events for each user; see Machine Learning for Snap Ad Ranking [9] for further details.

This section gives an overview of the challenges we face in training large DNN models for ads ranking. We then describe the CPU-based system we used before TPU and its limitations.

Training Challenges

The DNN models for ads ranking are trained on billions of examples generated from users' interactions with the ads shown. Training these models is a critical component of the ad ranking team. Improving them requires many offline and online experiments. Our training platform must achieve high training throughput at a low cost to enable fast iterations on our machine learning models.

Another set of challenges is related to the nature of our data. Our dataset includes many categorical features. To handle them, we use embeddings, and thus a large part of our model computation is lookups. Embedding lookups represent discrete id features as dense vectors in a low-dimensional space and are memory intensive operations in nature.

Specialized hardware for neural networks training has usually favored speeding-up large matrix multiplications. This has worked very well for large state-of-the-art Computer Vision or NLP models. By contrast, recommenders have additional requirements such as fast embedding lookups, larger RAM to fit large embedding tables and high memory bandwidth. This makes it more challenging to choose the right accelerator.

Asynchronous Distributed Training

Our first distributed training system used asynchronous training with workers and parameter servers which is fast and scalable. In such a system, each worker fetches its own batch of data that is a subset of the dataset. It then performs the back-propagation and sends the gradients to the parameter servers, where they will be applied. Last, the parameter servers share back the new weights with the worker.

There are a few shortcomings of this approach:

Asynchronous training removes the point of synchronization and thus is faster, but it is also noisier and might affect accuracy.
Choosing the optimal amount of parameter servers is non trivial. It is very easy to select a sub-optimal setup: too many parameter servers can cause large training costs due to low usage; or on the contrary, too few parameter servers can also cause large training costs due to throughput bottlenecks.
It is crucial not to overload a parameter server by making sure that we partition and share out all training variables equally to avoid frequent ids located on the same PS. Finding the correct partition, however, is difficult to achieve.

The PS approach limitations have also been documented and corroborated by recent research [1], and is thus usually not favored.

Over time, our models kept getting larger and larger, with more and more features, thus we actively looked at more specialized hardware to overcome these issues and improve our systems.

Hardware Accelerators Overview

Hardware Evolution for Machine Learning

Over the past decade, there has been a tremendous evolution of hardware accelerators for training large machine learning (ML) models, especially DNN models. Currently, Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) are leading hardware accelerators.

DNN training started on general-purpose multicore Central Processing Units (CPUs). CPU training typically uses a cluster of machines to split the workload and increase training throughput [10].

A GPU is specialized hardware with thousands of arithmetic logic units (ALUs) inside its processor. It can efficiently process large blocks of data in parallel. DNNs typically have massive parallelism on their dominant matrix operations, making GPUs the most popular choice for training DNN models. Over time, GPUs have been enhanced to offer immense computing power to meet the ever-increasing demand for DNN training [11]. GPUs are also suitable for model inference; see Applying GPU to Snap [15].

TPUs are Google's custom-developed application-specific integrated circuits. They achieve massive parallelism on neural network workloads through a systolic array architecture [12] called matrix unit, where thousands of ALUs are directly connected to form a large physical matrix. This domain-specific architecture requires no memory access during the matrix multiplication process; by contrast, GPUs are more general purpose and need to access memory to read operands and store intermediate results [13]. TPUs have emerged as a competitive option for training large DNNs.

Google has introduced four generations of TPUs to date [14], with the latest v4 unveiled in 2020. Currently, only TPU v2 and v3 are publicly available on Google Cloud as Cloud TPU boards (each board has 4 chips, and each chip has 2 cores). Furthermore, TPU boards can be connected via a 2D toroidal mesh network and become a cloud TPU Pod with enormous compute power. A TPU v2 Pod has 512 cores, and a v3 one has 2048 cores. We normally only use a slice of an entire pod by specifying the number of cores to use.

Public Benchmark

In the recent MLPerf v1.0 benchmark results (obtained in June 2021), the fastest training jobs on various model architectures for different ML applications are either using Nvidia’s latest A100 GPU (launched in 2020) or Google TPU v4.

The table below summarizes the training time (in minutes) for four different models using either an Nvidia DGX Pod slice with 64 A100 cards or a TPU v4 Pod slice with 128 cores (since A100 is a single-chip card and each TPU chip has two cores, these two pod slices have an equal number of chips). In particular, TPU v4 has a 1.25x training speedup over A100 on DLRM, an open-source deep-learning recommendation model, which is a good reference for our ranking models.

* Source: https://mlcommons.org/en/training-normal-10/ [7]

The evolution of hardware accelerator use at Snap

Most of the ad ranking model training jobs have been running asynchronously on our distributed system. Each worker machine uses either CPU or a GPU card (e.g., Nvidia T4) to process a different slice of the input dataset.

We recently started the migration of model training to run synchronously on either a cloud TPU v2 or v3 board (with 8 cores) or a slice (usually with 32 cores) of a TPU v2 or v3 Pod. Simultaneously, we are also exploring training models synchronously on Nvidia GPUs, especially the latest A100.

For reference, the table below summarizes the compute power in terms of trillion floating-point operations per second (TFLOPS) for the three accelerators we experimented with on synchronous training, namely, TPU v2 and v3 boards as well as the A100 GPU card. This table also includes the standard hourly cost on Google cloud and the corresponding performance per dollar for these three devices.

* Source: https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus [4]

TPU Integrations Details

Overview

TPUs are well suited for training our ranking models for multiple reasons:

TPUs are natively supported and highly optimized on TensorFlow. TPU has its own customizable APIs for the critical embedding lookup operation. For instance, it can be configured to efficiently handle huge embedding tables by automatically sharding them across all TPU cores, where part of the optimization is attributed to the ultra-fast chip-to-chip interconnect in TPU devices and pod slices. Moreover, TPU can support much larger vocabulary sizes in this mode. Note that we did not test the latest TPU v4 because it is not fully supported on this embedding lookup API yet.
Good scalability from a TPU board with 8 cores to a slice of pod with 32 cores to deliver high throughput and cost efficiency.
TPU enables synchronous training, where the TPU cores train over different slices of input data in sync and aggregate gradients via all-reduce at each step. All-reduce [8] is a collective communication operation that sums the different gradients from all TPU cores and returns the results to each. Each TPU core updates its local model parameters using the same aggregated gradients so that it always has an identical copy of variables after each training step.

However, the uniqueness of TPU hardware and the highly optimized TPU training pipeline on TensorFlow also make the integration challenging (we used the TensorFlow v1 API, integration is expected to be easier with the v2 API, which has better support on TPU training):

TPU has a unique input pipeline for data parallelism and uses an infeed/outfeed queue for transferring data or results between host and TPU cores.
TensorFlow uses its XLA (Accelerated Linear Algebra) compiler to accelerate model training on TPU. However, this compilation step requires that the shapes of all tensors in the model graph are static, i.e., known at graph compilation time. This imposes challenges as the length of the list of categorical features is not static.
TPU also wraps the model training step in a native TensorFlow while loop to amortize its launch cost, making certain TensorFlow operators, TensorBoard visualization, and customized training hooks much more complicated than their counterparts without TPU support.

TPU Embedding Column

The main challenge in TPU training is handling the sparse tensor obtained from the list of categorical features, which has a non-static shape and causes a graph compilation error. Moreover, sparse tensor embedding lookup is only supported through the TPU embedding column API.

After some initial exploration on slicing sparse tensors to obtain static shape, we eventually adopted the TPU embedding column API. This is a native and optimized solution for embedding-based operations, which takes the list of categorical features as input and efficiently performs embedding-based operations. It also offers customizable options, including sharding large embedding tables across multiple cores.

Input Pipeline

TPU needs specialized code and parameters for its unique input pipeline setup to maximize the overall training throughput. In our initial tests, TPU training was much slower than expected. After some debugging, we found that the input pipeline was the bottleneck. We were applying the same TensorFlow Dataset parameters that worked well on our distributed CPU training jobs to the TPU input pipeline, which turned out to be suboptimal.

After extensive benchmarking, we found the optimal parameters that can remove most of the I/O bottleneck and approach the theoretical max performance. The overall training throughput has more than doubled compared to the initial results.

Results & Benchmarks

TPU Benchmarks vs CPU

Setup

The benchmarks were run using Snap’s internal distributed training system on Google Cloud AI Platform, with one of our principal model architecture and features. We evaluated both a standard size model and a heavier one. Both follow the typical architecture found in [9], with the heavier model having 50% more features, larger embedding size, twice larger DNN layers and additional DCN interaction blocks.

The reference CPU training uses:- 56 workers (c2-standard-8, 8 vCPU, 32Gb Memory), 10gbps bandwidth- 10 parameter servers (n2-standard-16, 16 vCPU, 64Gb Memory), 10gbps bandwidth

TPU Benchmark

After proper tuning and optimization, the TPU training results were impressive. They drastically outperformed the CPU-based training on throughput and cost while offering the same accuracy level for our most common models.

Average Relative performance for training with TPU vs CPU.

Thanks to its higher TFLOPS/core, high interconnect between TPU boards, and high memory bandwidth, it appears that TPU can fit our use-case much better overall than our current CPU-based distributed training.

Batch size impact

Batch size is essential to ensure that we saturate the compute capabilities of a device. Increasing the batch size to boost throughput is a common technique, but it can get challenging as very large batch sizes usually exhibit metrics deterioration. In our previous system, we measured that for a batch >4096, the log loss would rapidly degrade, whatever the common techniques we tried to mitigate that [2], [3].

For synchronous training with TPUs, increasing the batch size was relatively resilient. We were able to match baseline accuracy with very large batch size by simply increasing the learning rate the larger the batch size, and applying a warm-up as described in Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour [2]. TPU v2 performance started to saturate at about 131k batch size for our system, and it wouldn’t benefit from getting higher.

Average Relative performance for training with TPU v2 vs CPU baseline, at different batch sizes.

Scaling

TPU also provides the ability to increase the number of cores even further. While we observed near-linear scaling for moving from 8 cores to 32 cores TPU, the 128 cores runs hit a pretty heavy bottleneck and could only provide 5.25x improvements instead of the theoretical 16x. Similar scaling issues could be observed in public benchmarks of DLRM with TPUs [7] and require further investigation.

Average Relative performance for TPU scaling, compared to TPU-8 cores.

Training Stability

We also measured the stability of switching from asynchronous to synchronous training. It is indeed very important for the team to be able to reproduce past and current experiments. If the variance is too high between identical training jobs, it is harder to draw meaningful conclusions from the experimentations, and more challenging for hyperparameter tuning.

Migrating to synchronous training showed a significant decrease in variance between identical training jobs:

Log loss variance for 5 identical training runs.

TPU Benchmarks vs GPU

Finally, we compare TPUs with GPUs. Using GPUs to train recommender systems with TensorFlow had a few challenges. For example, before TF v2.6, there were no ops to support basic GPU embedding lookups. We also had to wait for TF 2.7 to fix a bug with empty sparse id list lookups on GPUs (more details: Improved TF 2.7 Operations for Faster Recommenders with NVIDIA [16]).

The following benchmark compares 4x A100 with TPU v3 32-cores (4x TPU-8 boards), and uses the same global batch size of 131,072. We tried to run the A100 in the most optimized environment, using NVIDIA latest NGC 22.01 with TensorFlow v2.7.

NVIDIA A100 relative performance to TPU v3.

So far, we haven’t matched TPU performance using NVIDIA’s A100. A possible explanation is not using embedding lookups optimization as we did for TPUs. Nvidia has started making good efforts to better support recommendation system model training on GPUs. For example, the HugeCTR library [5] recently released a TF plugin [6] showing 7.9x speed-up thanks to their embedding layers optimization compared to vanilla TensorFlow. We are actively improving our training pipeline to better support GPUs and integrate these optimized solutions.

Summary

In conclusion, while TPUs seem to be working best for Snap’s ads ranking needs today, the hardware accelerator field is very active, and exciting innovations continue to come out every year. Our strategic goal is to continue improving our training system to reach state-of-the-art performance for each major ML hardware (TPU, GPU, CPU) and be able to switch to whatever hardware will provide the highest value in the future.

References

[1] Performance Analysis and Comparison of Distributed Machine Learning Systems

[2] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

[3] Large Batch Training of Convolutional Networks

[4] https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus

[5] https://github.com/NVIDIA-Merlin/HugeCTR

[6] https://developer.nvidia.com/blog/accelerating-embedding-with-the-hugectr-tensorflow-embedding-plugin/

[7] https://mlcommons.org/en/training-normal-10/

[8] https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/

[9] Machine Learning for Snap Ad Ranking

[10] https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af

[11] https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664