GPU Transcoding at Scale

We strive to make Snapchat the fastest way to communicate and share a moment. In doing so, we constantly juggle the trade-offs between quality and application performance when creating, posting, and viewing media, all while optimizing data usage and infrastructure cost.
When a user posts a story in Snapchat, the video goes through a number of processing steps:
  1. The story is transcoded on the device into a resolution and bitrate that ensure both reasonable upload latency and good visual quality.
  2. On the server, the story is further transcoded into a number of variants at different resolutions and bitrates using different codecs; we determine the set of variants based on our forecast of viewer-side capabilities.
  3. When the user's friends and fans view the video, the server chooses the optimal variant based on device capabilities and available bandwidth (a simplified sketch of this selection follows the list).
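To make the last step concrete, here is a minimal sketch of variant selection. The ladder, names, and numbers below are all hypothetical placeholders, not our production logic, which also folds in the capability forecasts from step 2:

```python
from dataclasses import dataclass

@dataclass
class Variant:
    codec: str        # e.g. "h264" or "hevc"
    height: int       # vertical resolution, e.g. 480 for 480p
    bitrate_kbps: int

# Hypothetical variant ladder produced by step 2.
LADDER = [
    Variant("hevc", 720, 1400),
    Variant("hevc", 480, 800),
    Variant("h264", 720, 2000),
    Variant("h264", 480, 1100),
    Variant("h264", 360, 600),
]

def choose_variant(supports_hevc: bool, bandwidth_kbps: int) -> Variant:
    """Pick the highest-quality variant the device can decode and the
    network can sustain (a simplified stand-in for step 3)."""
    candidates = [
        v for v in LADDER
        if (v.codec != "hevc" or supports_hevc)
        and v.bitrate_kbps <= bandwidth_kbps
    ]
    # Fall back to the smallest variant when bandwidth is very low.
    if not candidates:
        return min(LADDER, key=lambda v: v.bitrate_kbps)
    return max(candidates, key=lambda v: (v.height, v.bitrate_kbps))
```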
In this post, we will focus on the second step. Generating the list of variants is a challenging optimization problem. We want to deliver the best video quality while ensuring a smooth playback experience, which means leveraging the most advanced codecs where we can. HEVC (H.265) allows us to deliver the same video quality at a smaller bitrate. However, it is more compute-intensive than H.264, and with hundreds of millions of videos uploaded by Snapchat users every day, it would be prohibitively expensive to apply HEVC transcoding to every video. So we must add compute cost as another dimension of our optimization problem.
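As a back-of-the-envelope illustration of that trade-off (with made-up costs, not our production numbers), the decision comes down to whether the bandwidth saved across a video's predicted views outweighs the extra compute of an HEVC transcode:

```python
def hevc_worth_it(predicted_views: int,
                  video_seconds: float,
                  h264_kbps: float,
                  hevc_bitrate_ratio: float = 0.8,     # HEVC ~20% smaller (illustrative)
                  delivery_cost_per_gb: float = 0.05,  # assumed delivery cost, USD
                  hevc_extra_compute_cost: float = 0.002) -> bool:
    """Return True if the bandwidth saved across all predicted views
    outweighs the extra compute cost of an HEVC transcode."""
    # kbps saved, converted to bytes over the video's duration.
    bytes_saved_per_view = (
        h264_kbps * (1 - hevc_bitrate_ratio) * 1000 / 8 * video_seconds
    )
    gb_saved = predicted_views * bytes_saved_per_view / 1e9
    return gb_saved * delivery_cost_per_gb > hevc_extra_compute_cost
```

A video expected to be viewed millions of times easily clears this bar; a video viewed a handful of times does not, which is why per-video codec selection matters at our scale.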
GPU transcoding
Over the past few years, GPUs have evolved from specialized graphics hardware into general-purpose compute devices powering the AI revolution. For parallelizable workloads, GPUs offer more compute per dollar than CPUs, and it turns out video transcoding can tap into this power too. The NVIDIA Turing architecture achieves both high-performance and high-quality HEVC transcoding. GPU instances are more expensive, but their higher throughput means a lower cost per video. In addition, we can process latency-sensitive workloads faster and provide a better customer experience.
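For a sense of what such a transcode looks like, here is a minimal sketch driving NVENC through ffmpeg, assuming an ffmpeg build with CUDA support (exact filter and preset names vary by ffmpeg version; this is not our production pipeline):

```python
import subprocess

def transcode_hevc_gpu(src: str, dst: str, height: int = 480,
                       bitrate: str = "800k") -> None:
    """Transcode to HEVC with NVENC; decode, scale, and encode all
    happen on the GPU to avoid CPU round trips."""
    subprocess.run([
        "ffmpeg", "-y",
        "-hwaccel", "cuda", "-hwaccel_output_format", "cuda",  # GPU decode
        "-i", src,
        "-vf", f"scale_cuda=-2:{height}",  # GPU scaler (newer ffmpeg accepts -2
                                           # to preserve the aspect ratio)
        "-c:v", "hevc_nvenc",              # NVENC HEVC encoder (Turing-class)
        "-preset", "p5",                   # ffmpeg >= 4.3 preset naming
        "-rc", "vbr",
        "-b:v", bitrate,
        "-c:a", "copy",                    # pass the audio through untouched
        dst,
    ], check=True)
```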
We have tested GPU transcoding on both AWS G4 instances and GCP instances with T4 GPUs attached. Both performed similarly and achieved higher throughput than the CPU instances we use in production. Our transcoding fleet spans both cloud providers, and we leverage whichever best suits the workload.
GPU vs CPU
The following charts compare bitrate and quality for GPU- and CPU-transcoded 480p videos using the HEVC codec. The GPU version achieves the same quality (as measured by VMAF) at a slightly lower bitrate than the CPU version.
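For reference, VMAF scores like the ones in these charts can be produced with ffmpeg's libvmaf filter. A minimal sketch, assuming an ffmpeg build with --enable-libvmaf (the JSON layout shown is libvmaf v2's and may differ in other versions):

```python
import json
import subprocess

def vmaf_score(distorted: str, reference: str, log: str = "vmaf.json") -> float:
    """Compare a transcoded video against its source; higher VMAF is better."""
    subprocess.run([
        "ffmpeg", "-i", distorted, "-i", reference,  # first input is the distorted one
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log}",
        "-f", "null", "-",                           # compute scores, emit no video
    ], check=True)
    with open(log) as f:
        return json.load(f)["pooled_metrics"]["vmaf"]["mean"]
```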
HEVC vs H.264
The following charts show that GPU-transcoded HEVC videos achieve both a bitrate reduction and slightly better quality than H.264.
Here is a visual example: the video on the right is a GPU-transcoded HEVC version, which not only has better quality but also uses a 20% lower bitrate.
Challenges and where we are now
The GPU encoder works differently from software encoders: it lacks many of the parameters we typically rely on to tune bitrate and quality. And because large-scale GPU transcoding is still an emerging field, there are no established best practices, so it took trial and error to get to where we are today. We believe there is still room for further optimization.
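To illustrate the gap, here is a rough mapping from familiar software-encoder knobs to the closest NVENC equivalents. The option names are ffmpeg's, the values are placeholders, and availability depends on the ffmpeg and driver versions, so treat this as illustrative rather than a recipe:

```python
# Approximate x265-to-NVENC knob translations (illustrative values).
X265_TO_NVENC = {
    # x265 "-crf 28" -> constant-quality VBR; "-b:v 0" lets -cq drive quality
    "crf":       ["-rc", "vbr", "-cq", "28", "-b:v", "0"],
    # x265 "-preset slow" -> NVENC preset ladder p1 (fastest) .. p7 (slowest)
    "preset":    ["-preset", "p6"],
    # adaptive quantization, roughly analogous to x265's aq-mode
    "aq":        ["-spatial_aq", "1"],
    # lookahead frames for better rate-control decisions
    "lookahead": ["-rc-lookahead", "20"],
}
```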
So far, we have launched GPU transcoding for Ads videos and some categories of user stories. The cost savings have allowed us to enable HEVC transcoding for a larger set of users than we could have with software encoding. We have also leveraged GPUs to accelerate latency-sensitive workloads, such as stitching series of short videos into long-form videos.
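For the stitching workload, one common approach is ffmpeg's concat demuxer, sketched below. When the clips already share codec, resolution, and timebase, the stitch is a pure stream copy; when they don't, the concat becomes a re-encode, which is where the GPU encoder pays off:

```python
import os
import subprocess
import tempfile

def stitch(clips: list[str], dst: str) -> None:
    """Concatenate short clips into one long-form video without
    re-encoding, via ffmpeg's concat demuxer (clips must share
    codec, resolution, and timebase)."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.writelines(f"file '{clip}'\n" for clip in clips)
        playlist = f.name
    try:
        subprocess.run([
            "ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", playlist, "-c", "copy", dst,  # stream copy: no re-encode
        ], check=True)
    finally:
        os.unlink(playlist)
```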