Don't Rewrite Your App, Unless You Have To
on Thursday, May 7, 2020
One year ago, the Snapchat Engineering team launched a new version of our Android app, rewritten from the ground up to be more performant and less bug-prone. Rebuilding Snapchat for Android was a complex and intimidating process. But it was critical for us to build a better app for our global community, and we are glad we did it.
Rewrites can have positive results, but they are challenging. In order to develop our strategy, we researched our app’s performance thoroughly before starting. Once we began, a tremendous amount of coordination and support was needed to keep the project on track, and we used and tested our new app from early on to keep the quality high. In this post, we outline some of our decision making processes and lessons learned from the experience.
The main motivation for our rewrite was to improve app performance. Snapchat is a camera app, optimized for capturing fleeting moments. If the app takes too long to load, the moment can be lost. Like many apps that evolved at a fast pace, Snapchat grew in complexity into a new problem space that it wasn’t originally designed and built for.
The features in our codebase were tightly coupled in a way that hurt flexibility. Particularly troublesome was the lack of a structured way to initialize and schedule work. A lot of our code and data was loaded immediately at startup by features that were not part of the startup path, in a way that was hard to reason about or unwind. So much work was done at startup that the app took 30 to 60 seconds to settle down, and it left a large memory footprint that was carried throughout the app session.
Before pursuing a rewrite, we tried to incrementally streamline our existing codebase. Progress was slow, and each step caused unintended side effects, like deadlocks or corner-case bugs. As an example, we prioritized camera loading speed, delaying all other work until after the camera was ready to capture. But this caused video Snaps to drop frames, because the backlog of delayed work would now run during recording. Improving the old app felt like a game of tug of war, where freeing up resources to improve the performance of one feature would necessarily hurt another.
When thinking about the future, we wanted an app that would:
Be performant. The app would load instantly and feel fast.
Allow quick iteration. The app would be easy to develop on.
Be sustainable. As we added new features, the app would remain fast to use and build.
Deciding to Rewrite
When considering rewriting an app, there’s some fairly common wisdom across the industry: don’t do it! The idea that we could overhaul 5 years of work and re-implement it better on a short timeline seemed optimistic. We would inevitably introduce new bugs, which could take a long time to address. Even if we pulled it off, who is to say the app wouldn’t regress back to its old state afterward? There would be a lot of pent-up demand for launching new features after the rewrite, and launching features fast is what created our issues in the first place.
We pushed ahead with a rewrite because we believed it would be faster and less risky than an incremental refactor. The existing app was complex and interconnected. We believed issues introduced by the rewrite would be easier to find and fix than the ones caused by the current, complex system. Our leadership fully supported this decision, giving us the latitude to plan for a successful rewrite.
To address some of the coming challenges, we adopted the following strategies, which we’ll discuss in more detail next:
Have some ground rules. We wanted to fix a performance problem. It was important we had a plan for how to fix that problem ahead of time, instead of simply hoping the rewrite would be done better.
Focus. Rewriting code is fun, but to ship a rewrite in a reasonable time frame it was important that we narrowed its scope and worked well together.
Adopt an MVP strategy. Since it’s difficult to test a rewrite directly on our community, we made a point of using it ourselves, and keeping the quality high from the start.
The Ground Rules
Snapchat consists of many small apps rolled into one, including camera, chat, memories, photo editing, content consumption, and a map. It opens to the camera, which is resource intensive in both memory and CPU, and includes AR lenses and a lot of heavy media content. Combining these features together into a single app makes for an engaging user experience, but presents a hard engineering challenge.
In the Android operating system, users can have many different apps installed on their phone, and each app is able to load fast and perform smoothly in isolation. We started seeing our Android app as a mini OS, and our features as mini apps running inside of that OS. If each mini app could be made to load fast in isolation, it should be possible to combine them while keeping performance high, without the need to preload features at startup.
As a proof of concept, we built prototype-style standalone apps for our Friends and Discover screens. We then made each of these screens load instantly, by tweaking the feature architecture to have a ready-to-render database schema, load data incrementally, and have flat view hierarchies. We then built a standalone camera page and a small mini-OS layer (which we call “app platform”) to tie them together. Our prototype proved that our solution would work, as the app loaded fast without any preloading.
Our ground rules to enforce this independence were simple. Don’t preload, treat each feature as a standalone app, and make it fast. We found this gave teams a good understanding of what the objective of the rewrite was. Teams were given a choice to either reuse their existing code and adapt it to the new app or start from scratch on the new app. The ground rules gave them a foundation to rethink their code, and most teams chose to rewrite. By reusing suggested components and design patterns, they were able to rebuild their feature quickly with high performance.
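To make the "don’t preload" rule concrete, here is a minimal sketch of how a platform layer might register features without constructing them. It is only an illustration under our own assumptions: the names AppPlatform, Feature, register, and open are hypothetical and are not taken from the Snapchat codebase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class AppPlatformSketch {
    /** A feature ("mini app") that the platform opens on demand. Hypothetical interface. */
    interface Feature {
        String render();
    }

    /** Hypothetical mini-OS layer: it stores factories, never eagerly built instances. */
    static final class AppPlatform {
        private final Map<String, Supplier<Feature>> factories = new HashMap<>();
        private final Map<String, Feature> loaded = new HashMap<>();

        /** Registering is cheap: nothing is constructed here, so startup does no feature work. */
        void register(String name, Supplier<Feature> factory) {
            factories.put(name, factory);
        }

        /** Builds the feature the first time it is opened and reuses it afterwards. */
        Feature open(String name) {
            Feature existing = loaded.get(name);
            if (existing != null) {
                return existing;
            }
            Supplier<Feature> factory = factories.get(name);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown feature: " + name);
            }
            Feature created = factory.get();
            loaded.put(name, created);
            return created;
        }

        boolean isLoaded(String name) {
            return loaded.containsKey(name);
        }
    }

    public static void main(String[] args) {
        List<String> constructed = new ArrayList<>();
        AppPlatform platform = new AppPlatform();
        platform.register("camera", () -> { constructed.add("camera"); return () -> "camera page"; });
        platform.register("friends", () -> { constructed.add("friends"); return () -> "friends page"; });

        // Only the feature that is actually opened gets constructed.
        System.out.println(platform.open("camera").render());
        System.out.println("constructed so far: " + constructed);
    }
}
```

In this sketch, registering the friends feature costs nothing until someone navigates to it, which is the property the ground rules ask for. A real platform layer would also have to deal with lifecycle, threading, and teardown, all of which are left out here.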
Rewriting code is fun, and old code often looks ugly and hard to comprehend. In the process, it’s important to not get carried away into rebuilding more than what you need. This, more than anything, is what we thought could add risk to our timeline and introduce scope creep. We wanted our new app to be much better, but we did not need to solve every single problem at once.
After experimenting with several ideas and brainstorming, it became clear that we could cut down the scope by postponing work in many areas:
We would not add new features. With small exceptions, our Android app feature set was frozen for 6 months during the rewrite.
We would not change the app UI. This turned our rewrite into a pure engineering problem, and also allowed us to do an apples-to-apples comparison.
We would not make any changes to the client-server protocol, unless explicitly needed.
We would not rewrite components of our application that were already isolated and of good quality simply to adopt new languages or libraries.
We would not change our build systems, CI, QA or release processes.
As an example, at the time of the rewrite our app heavily relied on JSON to make network requests. We knew JSON was inefficient and expensive to parse, and wanted to move to a more modern solution. However, doing so would take longer as we needed to change our client-server protocols and endpoints and do a careful migration. Instead, we adopted an intermediary solution where we introduced a centralized network manager API, which hid the usage of JSON as an implementation detail. This pattern of centralizing areas of future improvement behind APIs became widely used.
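The pattern of hiding a wire format behind a central API can be sketched roughly as follows. This is a simplified illustration, not our actual network manager: NetworkManager, WireCodec, Transport, and the "/send" path are hypothetical names, and the JSON encoder handles only flat string maps, escaping only quotes and backslashes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class NetworkManagerSketch {
    /** Wire format hidden behind the manager; feature code never sees it. Hypothetical. */
    interface WireCodec {
        byte[] encode(Map<String, String> fields);
    }

    /** Stand-in for the HTTP layer so that the sketch runs without a network. Hypothetical. */
    interface Transport {
        byte[] post(String path, byte[] body);
    }

    /** Deliberately tiny JSON encoder for flat string maps (illustration only). */
    static final class JsonCodec implements WireCodec {
        @Override
        public byte[] encode(Map<String, String> fields) {
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (!first) {
                    sb.append(',');
                }
                sb.append('"').append(escape(e.getKey())).append("\":\"")
                  .append(escape(e.getValue())).append('"');
                first = false;
            }
            return sb.append('}').toString().getBytes(StandardCharsets.UTF_8);
        }

        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }

    /** The single entry point that features call; swapping the codec changes no feature code. */
    static final class NetworkManager {
        private final WireCodec codec;
        private final Transport transport;

        NetworkManager(WireCodec codec, Transport transport) {
            this.codec = codec;
            this.transport = transport;
        }

        byte[] send(String path, Map<String, String> fields) {
            return transport.post(path, codec.encode(fields));
        }
    }

    public static void main(String[] args) {
        Transport echo = (path, body) -> body; // fake transport that returns the request body
        NetworkManager manager = new NetworkManager(new JsonCodec(), echo);

        Map<String, String> message = new LinkedHashMap<>();
        message.put("to", "alice");
        message.put("text", "hi");

        byte[] sent = manager.send("/send", message);
        System.out.println(new String(sent, StandardCharsets.UTF_8)); // {"to":"alice","text":"hi"}
    }
}
```

Because features depend only on send, a later move away from JSON would be a matter of supplying a different WireCodec rather than touching every call site.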
Another important aspect of focusing was coordination. From the point when the rewrite left the prototyping stage, we had a roadmap involving many feature teams, with internal milestones for when certain features would be ready. We constantly discussed roadblocks and decided whether they were important enough to delay our timelines. Project management was an ongoing process, and our rewrite never had an open-ended timeline.
During this time, our leaders frequently re-emphasized the importance and status of the project to the company. This kept the different teams focused, even as we hit roadblocks and delays. This was necessary for us to stick to the plan, since the project took months, and demand to start building new features again grew.
By keeping our focus, we were able to minimize the time when we had to support two apps in production, as well as the time when we weren’t able to ship new features. It also kept us honest and allowed us to spend our energy solving the crucial performance problem that motivated the rewrite to begin with. Finally, it built confidence within the company that the rewrite would ship, which was important given how many people were dedicated to it.
When building or rebuilding a feature, it’s too easy to focus on feature completeness first, and only then worry about quality. Given the scope of features we were implementing, flipping this order was vital to our success.
From early in our development process, we made a point of having an app that would be stable around a core set of MVP features: communicating with friends, viewing stories, and using the camera. Less than 2 months into our rewrite, many Snap employees were dogfooding the new app, switching to the old one only to access missing features. Any bugs or performance regressions introduced into our new app would be treated as if they were production bugs, and addressed immediately.
The first version of the app we used only allowed sending photos and videos with simple text captions, and sending and receiving simple text messages without effects such as stickers. This allowed us to later catch many performance regressions that were introduced as we added extra functionality to the communication flow. Catching these regressions early on gave us the time to dive deep into the root causes, and invest in better testing strategies and design patterns to make similar problems less likely to occur. If we had waited until features were fully implemented to start using the app, these regressions would have been much more difficult to address.
Once we had enough of our feature set ready, we started testing the app on new users. We listened closely to customers, and learned a lot about which features they valued the most in our app. The learnings allowed us to adjust our rewrite plan as we went.
Our new app is much more performant. We were able to reduce slow cold starts and ANR (Application Not Responding) rates by 60%, frozen frames by 45%, and APK size by 25 MB. It also laid a good foundation for continued performance improvement. Each fixed bug is now seen as an opportunity to add a new test, a new metric, or a new performance test to make it less likely to regress. By starting from scratch and adopting a new mindset, the experience of launching our rewritten app helped solidify a new engineering culture internally which values craftsmanship, performance, and great engineering practices.
Although our rewrite was a one-time project, we continue to refactor our codebase and keep it modern as we go. By having a modular app, we can work independently on different components, rewriting the ones that need to be rewritten, and measuring and improving the overall system for better performance. We can also invest in more platform efforts, to reduce our app size and memory footprint, and to support and integrate new Android features like Kotlin, dynamic delivery, and native development.
Not every project requires a rewrite, and we take the incremental approach for our initiatives whenever possible. But there is no reason to fear a rewrite, if it’s the best solution to the problem. With a good game plan, focus, and an MVP mentality, rewrites can help fast track meaningful changes in an engineering product and culture.