QUIC at Snapchat
on Thursday, June 24, 2021
At Snapchat, our goal is to make our Camera the fastest way to share a moment. We do not want Snapchatters to face any delays when they are in the moment and want to share with their real friends.
Network requests are on the critical path of using Snapchat. A UI update or disk write takes milliseconds, but a network request can take seconds when error rates are high or the device is constrained. To reduce network latency and errors, we make requests and responses smaller, reduce unnecessary syncs, use global content distribution partners to bring media close to the people who use it, and use an efficient, next-generation network protocol called Quick UDP Internet Connections, or QUIC for short.
How does QUIC help Snapchatters?
Let’s first take a quick look at the network stack before QUIC. Take a Snap send as an example: at the application layer, we put the Snap media into the HTTP2 request payload. Then we use TLS to secure the connection at the security layer, and rely on TCP to split the request into chunks and upload the Snap to the server. However, the TCP+TLS+HTTP2 stack is suboptimal for the mobile network environment. For example, TCP requests will fail if a Snapchatter switches between WiFi and WWAN (cellular). For a user chatting with friends, failure to send a message due to a connection drop can lead to a degraded experience.
QUIC is a transport protocol for the internet, developed by engineers at Google. QUIC is built on top of UDP and is the foundation of HTTP3, which replaces the TCP+TLS+HTTP2 stack. QUIC solves a number of transport-layer and application-layer problems while requiring little or no change from application developers. As the above diagram depicts, QUIC alters neither the low-level operating system network protocols nor high-level HTTP.
Compared to the TCP+TLS+HTTP2 stack, QUIC offers the following advancements:
Multiplexing without head-of-line blocking: For HTTP2 connections, when a TCP packet is lost, no streams on that connection can make forward progress until the packet is retransmitted and received by the far side. This leads to increased latency and a potentially degraded user experience on mobile network connections. QUIC eliminates this stalling for other streams multiplexed over the same connection.
Connection migration across IP addresses: A TCP request will fail if the client IP address changes. A QUIC connection, however, is identified by a randomly generated 64-bit connection ID at the QUIC protocol layer rather than by IP address, so a client using QUIC can continue in-flight requests uninterrupted across a change in IP address, allowing for an undisturbed user experience.
Detection of lost connection: QUIC detects connection loss quickly and avoids long hanging requests.
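To make the head-of-line blocking point concrete, here is a minimal toy sketch, not how any real TCP or QUIC stack is implemented. It assumes each packet carries data for exactly one stream and asks which packets the receiver can hand to the application before the lost packet is retransmitted. The function names, stream names, and packet layout are hypothetical and chosen only for illustration.

```python
def deliverable_tcp(packets, lost):
    """Single ordered byte stream: nothing after the first lost packet
    can be delivered until that packet is retransmitted."""
    delivered = []
    for seq, stream_id in packets:
        if seq in lost:
            break  # every later packet waits, whichever stream it belongs to
        delivered.append((seq, stream_id))
    return delivered


def deliverable_quic(packets, lost):
    """Per-stream ordering: a loss only stalls the stream that owns
    the lost packet; other streams keep making progress."""
    blocked_streams = set()
    delivered = []
    for seq, stream_id in packets:
        if seq in lost:
            blocked_streams.add(stream_id)
            continue
        if stream_id not in blocked_streams:
            delivered.append((seq, stream_id))
    return delivered


# Two streams multiplexed over one connection (hypothetical stream names).
packets = [(0, "chat"), (1, "spotlight"), (2, "chat"), (3, "spotlight"), (4, "chat")]
lost = {1}  # one packet belonging to the "spotlight" stream is lost

tcp_delivered = deliverable_tcp(packets, lost)
quic_delivered = deliverable_quic(packets, lost)
print("TCP-like :", tcp_delivered)
print("QUIC-like:", quic_delivered)
```

In this toy model the TCP-like receiver stalls both streams at the lost packet, while the QUIC-like receiver keeps delivering the stream that did not lose anything.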
QUIC advancements fit well with Snapchat use cases:
Faster connection establishment: At Snapchat, before QUIC, p90 connection setup took up to 300ms. This connection setup latency translates to user waiting latency, and blocks the user from receiving Snaps and viewing Stories. Faster connection on QUIC directly reduces user waiting latency and improves the user experience.
Improved congestion control: At Snap, uploaded media can be as large as 10MB. A better congestion control algorithm improves throughput and reduces latency and error rate, especially for large media.
Multiplexing without head-of-line blocking: Snapchat has rich use cases with short content, including Snaps, Stories, Discover content, etc. Normally there are multiple download streams using the same connection. QUIC eliminates the HTTP2 head-of-line blocking issue, so that, for example, send-message requests no longer block Spotlight requests.
Connection migration across IP addresses: When a Snapchatter is chatting with friends, failure to send a message due to a drop in their WiFi connection can lead to a degraded experience. Connection migration solves this pain point.
Detection of lost connection: A long loading spinner due to a lost connection is disturbing, especially when a Snapchatter is in full screen mode enjoying content. With QUIC, when requests fail due to a lost connection, we can detect and retry while providing a user-friendly UI.
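The connection migration idea described earlier can be sketched as a difference in how a server looks up connection state. This is a simplified illustration under stated assumptions: the class names are hypothetical, the TCP-like server keys state on the client address only (a real TCP connection is keyed on the full source/destination address and port tuple), and the QUIC-like server skips the path validation a real implementation would perform before trusting a new address.

```python
import secrets


class TcpLikeServer:
    """Looks up connection state by the address a packet came from."""

    def __init__(self):
        self.connections = {}

    def open(self, addr):
        self.connections[addr] = {"bytes_received": 0}

    def receive(self, addr, payload):
        state = self.connections.get(addr)
        if state is None:
            return False  # unknown address: the in-flight request fails
        state["bytes_received"] += len(payload)
        return True


class QuicLikeServer:
    """Looks up connection state by a random 64-bit connection ID."""

    def __init__(self):
        self.connections = {}

    def open(self, addr):
        conn_id = secrets.randbits(64)
        self.connections[conn_id] = {"addr": addr, "bytes_received": 0}
        return conn_id

    def receive(self, addr, conn_id, payload):
        state = self.connections.get(conn_id)
        if state is None:
            return False
        state["addr"] = addr  # follow the client to its new address
        state["bytes_received"] += len(payload)
        return True


# Documentation-range addresses standing in for WiFi and cellular.
wifi = ("192.0.2.10", 50000)
cellular = ("198.51.100.7", 50001)

tcp = TcpLikeServer()
tcp.open(wifi)
tcp_ok_before = tcp.receive(wifi, b"part1")
tcp_ok_after = tcp.receive(cellular, b"part2")  # IP changed mid-upload

quic = QuicLikeServer()
conn_id = quic.open(wifi)
quic_ok_before = quic.receive(wifi, conn_id, b"part1")
quic_ok_after = quic.receive(cellular, conn_id, b"part2")

print("TCP-like after IP change :", tcp_ok_after)
print("QUIC-like after IP change:", quic_ok_after)
```

Because the QUIC-like lookup does not depend on the client address, the second half of the upload lands on the same connection state after the switch from WiFi to cellular.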
Adopting QUIC at Snapchat
Snapchat’s client network stack is built on top of the open-source mobile network library Cronet. Snap leverages Cronet not only for QUIC but also for improved observability through rich metrics and logs. We are able to build a cohesive view of client and server network performance.
We chose different protocols based on network performance across countries and platforms. In general, we observed that enabling QUIC improved P90/P99 network latency by 6%-20% and reduced network errors by 3%-8%. The improvements are even larger for the cohort of users with poor network connectivity. Here we showcase three examples.
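For readers unfamiliar with percentile metrics, the sketch below shows one common way to compute P90/P99 latency and a relative improvement between two cohorts. It is an assumption-laden illustration, not Snap's actual metrics pipeline: it uses the nearest-rank percentile definition, and the latency samples are synthetic numbers made up for the example, not measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def relative_improvement(baseline, treatment):
    """Percent reduction of treatment relative to baseline."""
    return (baseline - treatment) * 100 / baseline


# Synthetic request latencies in milliseconds (illustrative only).
baseline_ms = [100, 120, 130, 150, 180, 200, 240, 300, 400, 1000]
quic_ms = [95, 110, 125, 140, 170, 190, 220, 270, 340, 800]

for p in (90, 99):
    before = percentile(baseline_ms, p)
    after = percentile(quic_ms, p)
    print(f"P{p}: {before} ms -> {after} ms "
          f"({relative_improvement(before, after):.1f}% lower)")
```

Tail percentiles such as P90 and P99 are useful here because they reflect the experience of users on the slowest connections, which an average would hide.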
In the first example, we enabled QUIC on our ads service in Oct 2019. We observed improvements in both P90/P99 latency and error rate.
We observed error rate improvements across all error codes, including connection timeout, connection lost, and request timeout. Breaking down the latency improvements further by country and region, we observed larger improvements in countries and regions with relatively worse network quality and greater geographic distance from our services.
In the second example, enabling BBR congestion control on top of QUIC for the client-to-server path also brought significant latency improvements. The improvements grow with larger request payloads.
In the last example, enabling connection migration on Android increased the network request success rate when a WiFi connection is lost by 20%.
Building on the success of our QUIC integration, we will double down on QUIC going forward, including:
Increasing QUIC coverage
Further leveraging QUIC advancements, such as experimenting with BBR v2 and supporting connection migration on iOS.