When a user opens the Snapchat app, the first thing they see is the camera. This is key to ensuring our users can capture the moment and communicate as fast as possible. To measure and optimize this experience, we measure the time from when a user presses the Snapchat app icon on their home screen to when the camera is ready to take a snap. This metric is intended to capture the end-to-end user experience when it comes to startup latency.
This metric is important to Snap for a couple of reasons. First, we want to ensure that our users can capture the moment they want to as quickly as possible. Imagine you are walking along the street when suddenly a bear starts crossing the street a few hundred feet ahead of you. You pull out your cell phone and open Snapchat to capture an image of this amazing moment to share with your friends. If the app and camera take too long to open, the bear may have disappeared and you miss the moment. This would be a frustrating experience, and it illustrates why startup latency matters. Second, we want to ensure that over time Snapchatters have a consistently fast startup experience to camera so they can trust the app to reliably capture the moment.
We measure the time it takes to start the app process, open the camera, and render the viewfinder, preview frame and capture button - getting the app into a state where a user can capture a snap.
To protect the startup path from regressions, we require an architecture that allows us to measure and optimize startup and a detailed understanding of the startup path and the components and code paths involved. We build this picture using a combination of code analysis and systraces to allow us to understand what happens during the app startup flow on each thread. With this overview, we can then add latency metrics for the subcomponent parts of startup - which are the main indicator we use to understand if a regression has been introduced to any part of the startup path.
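As a rough illustration of per-phase latency metrics, the sketch below times named startup phases and flags any phase that grew between two runs. It is a minimal sketch, not Snap's instrumentation: `StartupTrace`, `regressed_phases`, the phase names, and the 5 ms tolerance are all assumptions made for the example.

```python
import time
from contextlib import contextmanager


class StartupTrace:
    """Collects per-phase durations (in milliseconds) for one app launch."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the class can be exercised deterministically.
        self._clock = clock
        self.phases = {}

    @contextmanager
    def phase(self, name):
        """Time the enclosed block and store it under a hypothetical phase name."""
        start = self._clock()
        try:
            yield
        finally:
            self.phases[name] = (self._clock() - start) * 1000.0

    def total_ms(self):
        return sum(self.phases.values())


def regressed_phases(baseline, candidate, tolerance_ms=5.0):
    """Return phases whose duration grew by more than tolerance_ms (assumed threshold)."""
    return {
        name: candidate[name] - baseline.get(name, 0.0)
        for name in candidate
        if candidate[name] - baseline.get(name, 0.0) > tolerance_ms
    }
```

A caller would wrap each stage, for example `with trace.phase("camera_open"): ...`, and compare `trace.phases` against a stored baseline to see which part of startup moved.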
We measure 3 startup ‘types’ - cold start (starting the app process from scratch); warm start (the app process is already running, we need to bring the app to the foreground and load the UI); and hot start (the app is running and was backgrounded by the user, we are now bringing it back to the foreground). A high level overview of the parts that make up cold start can be seen here; warm and hot start are for the most part sub measurements of this flow:
We have invested in a number of areas to protect our users from regressions. Some of these are process investments and some are engineering investments.
On both iOS and Android, we have taken steps to structure the startup path to make understanding it and detecting changes to it as simple as possible.
Within Android, each of our feature components is built as a separate Dagger component. The components are connected through dependencies, forming a component graph. With a graph structure, we can compute the minimum set of components needed for startup and initialize only those. Components are monitored on the startup path to prevent new components from introducing regressions. We distribute component initialization across threads to reduce contention on the main thread. We also push important non-startup feature initialization, such as data syncers, to after startup. By structuring the Android app with the component graph, we minimize the work done during initialization.
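The idea of deriving the minimum startup set from a component graph can be sketched as a reachability walk. This is an illustration in Python rather than Dagger code; the component names, the `startup_closure` and `unexpected_components` helpers, and the allowlist mechanism are assumptions made for the example.

```python
def startup_closure(dependencies, roots):
    """Return every component reachable from the startup roots.

    dependencies maps a component name to the components it depends on.
    """
    needed = set()
    stack = list(roots)
    while stack:
        component = stack.pop()
        if component in needed:
            continue
        needed.add(component)
        stack.extend(dependencies.get(component, ()))
    return needed


def unexpected_components(dependencies, roots, allowlist):
    """Components that ended up on the startup path without being approved."""
    return startup_closure(dependencies, roots) - set(allowlist)


# Hypothetical graph: only what "camera" needs is initialized at startup.
graph = {
    "camera": ["permissions", "config"],
    "config": ["storage"],
    "chat": ["network", "storage"],
    "data_syncer": ["network"],
}
startup_set = startup_closure(graph, ["camera"])
```

In this toy graph, `chat` and `data_syncer` never enter the startup set, which mirrors deferring non-startup features until after startup. A check like `unexpected_components` is one possible way to notice when a new dependency edge pulls an extra component onto the startup path.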
On the iOS side, we have isolated the components that need to be initialized during startup and created an internal DAG (directed acyclic graph) that maps the order in which the dependencies need to be executed. This will allow us to initialize only what is necessary during startup and reach the fastest startup experience on all device models. Even though there are fewer device variants in the iOS ecosystem, we still need a more modularized app to deliver the best performance for all our customers; this iOS Snapchat app modularization work is currently in progress.
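Ordering initialization from a dependency DAG is a topological sort. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to group components into batches, where every component in a batch has all of its dependencies satisfied by earlier batches. The component names and the `initialization_batches` helper are hypothetical and are not taken from the iOS app.

```python
from graphlib import TopologicalSorter


def initialization_batches(dependencies):
    """Group components into ordered batches that respect the dependency DAG.

    dependencies maps each component to the set of components it needs first.
    Components within one batch do not depend on each other.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical startup graph.
order = initialization_batches({
    "camera": {"permissions", "config"},
    "config": {"storage"},
})
```

Because components inside a batch are independent of one another, each batch is also a natural unit for spreading initialization across threads.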
In order to catch additional code that has been added to the startup path, we run post-merge analysis on the symbols (classes/methods/functions) that are called during startup. Comparing the list of symbols called between commits allows us to have a detailed view of what has been added and what could potentially slow down the overall startup process.
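At its simplest, comparing startup symbols between two commits is a set difference. The sketch below assumes each commit's analysis produces a plain list of symbol names, one per line; that format, the helper names, and the sample symbols are assumptions made for illustration.

```python
def load_symbols(lines):
    """Normalize an iterable of text lines into a set of symbol names."""
    return {line.strip() for line in lines if line.strip()}


def diff_startup_symbols(before, after):
    """Report symbols that newly appear on, or disappear from, the startup path."""
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
    }


# Hypothetical symbol lists captured at two commits.
previous_commit = load_symbols(["App.onCreate", "Camera.open", "Config.load"])
current_commit = load_symbols(["App.onCreate", "Camera.open", "Feed.prefetch"])
report = diff_startup_symbols(previous_commit, current_commit)
```

The `added` list is the part worth reviewing first, since every new symbol is extra work that now runs before the camera is ready.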
Automated performance testing is our strongest mechanism to catch regressions before they reach production. A startup performance test consists of launching the app and restarting the app multiple times while gathering startup metrics. We automatically execute these tests on a range of devices in our device lab with different performance profiles and compare results to understand if there is a change in the startup performance. The challenging part of performance testing is that there can be a lot of variance between the iterations of the tests which means that we need enough test runs in order to get statistical significance.
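To make the point about variance concrete, here is one way to ask whether a candidate build is slower than a baseline given noisy samples: a one-sided permutation test on the difference of medians. The source does not say which statistical method is used, so this technique, the function name, and the sample numbers are assumptions chosen for illustration.

```python
import random
from statistics import median


def permutation_p_value(baseline, candidate, iterations=2000, seed=0):
    """Estimate how likely the observed slowdown is under random relabeling.

    Returns a one-sided p-value: small values suggest the candidate's median
    startup time is higher than the baseline's beyond run-to-run noise.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    observed = median(candidate) - median(baseline)
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    at_least_as_large = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        diff = median(pooled[n:]) - median(pooled[:n])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (iterations + 1)
```

With only a handful of iterations per build the p-value stays large even for a real slowdown, which is why enough test runs are needed before a difference can be trusted.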
In order to make this process more efficient, we run startup performance tests at different critical points in the software development lifecycle: when engineers commit new code, periodically post-merge, and release over release as each new release branch is created, to ensure that we did not miss any changes. If a change is identified by these automated tests, the engineer who introduced the commit is notified, as well as the startup team.
Once a new version of Snapchat is released, we have automatic alerts set up to compare the performance of the newer Snapchat version with the previous one. We follow a staged rollout process, progressively rolling out new versions of Snapchat to customers. At every step of this rollout process, automatic processes compare startup metrics with the previous version. If a significant regression is detected, an alert is created and the rollout is paused until an investigation is completed to understand if a hot-fix is required.
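A rollout gate of this kind can be sketched as a percentile comparison between versions. The choice of the 90th percentile, the 5% threshold, and the function names below are assumptions made for the example; the source does not state which statistics or thresholds are used in production.

```python
from statistics import quantiles


def percentile(samples, pct):
    """Return the pct-th percentile (1 to 99) of the samples."""
    # quantiles(n=100) yields 99 cut points; index pct - 1 is the pct-th one.
    return quantiles(samples, n=100, method="inclusive")[pct - 1]


def should_pause_rollout(previous_ms, current_ms, pct=90, max_regression=0.05):
    """True if the new version's startup percentile regressed past the threshold.

    max_regression is a relative increase (0.05 means 5%), an assumed value.
    """
    previous_value = percentile(previous_ms, pct)
    current_value = percentile(current_ms, pct)
    return (current_value - previous_value) / previous_value > max_regression
```

Running the same check at each rollout stage means a regression is caught while only a fraction of users have the new version.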
A number of teams can potentially introduce code changes that impact startup. As a result, it is crucial that startup metrics have visibility across all engineering teams and that the importance of startup is well understood and communicated across engineering.
In order to achieve this, we spend time socializing the importance of startup - highlighting what our startup performance metrics are and getting others familiar with the metric terminology, and highlighting how it directly impacts our users. We use mechanisms such as brown bag sessions, email updates and design reviews around how we plan to measure startup to get more visibility into the startup path. We also review our plans to improve startup which helps to solicit feedback from others and has even led to others suggesting additional ways to improve the metric.
On an ongoing basis, we participate in engineering-wide operational reviews where we highlight performance over the past week and relate these shifts to user engagement. This ensures that startup performance remains top of mind across engineering teams and stakeholders.
We also set a business-level goal around improving startup performance so that this metric is tracked at all levels, and so that we have a mechanism to get fast traction should we see regressions.
Many teams can impact startup latency, and this will sometimes happen as an unexpected side effect of another change. When this happens, we have a number of mechanisms to identify the regression as it is introduced, and we can then work with the engineer who introduced the change to highlight the regression and resolve it. Some teams have a higher likelihood of introducing code that impacts startup performance, and it is important that they pay additional attention to startup metrics as they make changes that may affect the startup path. We meet with these teams on a more regular basis to discuss any upcoming changes that may impact startup, and to discuss the performance improvements the team is making. This knowledge sharing is crucial not only to ensure that changes do not clash, but also to ensure that these teams are co-owners of startup performance and can help us protect it.
Protecting app startup is important to ensure a great and consistent customer experience. In order to effectively protect the startup path, it’s crucial that we continue to invest in tooling and monitoring of the startup path across the entire SDLC (Software Development Life Cycle). Equally important is ensuring that all engineering teams are on board with the importance of app startup and contribute to improving it and protecting it from regressions - if this is a single-team concern, it’s much harder to be successful.