Snapchat uses machine learning to power many features across the app - from sophisticated lenses, to serving relevant and interesting Discover content, to returning relevant search results, to security architecture that keeps our community safe.
In most cases, the effectiveness of machine learning depends on the quality of training data. A machine learning framework, such as TensorFlow, ingests training examples and outputs a model capable of performing a new task. Access to high-quality training data is therefore essential to the success of any machine learning undertaking.
In a traditional machine learning setting, training data is collected and processed in a centralized place. Centralized collection of user data, however, is sometimes at odds with the privacy and security assurances we strive to provide to our users. In some cases, our privacy guidelines may consider certain data to be sensitive. It is exactly for such situations that we devised a Device-Distributed Machine Learning framework that embraces privacy as its core principle, while offering utility benefits comparable to server-side models.
Device-Distributed Machine Learning
Device-Distributed Machine Learning (DDML) is a set of technologies that enable training of privacy-preserving machine learning models directly on client devices, without the need to transmit sensitive information to our servers. In many ways, it is similar to Federated Learning, though the two technologies mainly differ in their privacy guarantees, abuse safeguards and scalability aspects (more technical details in the paper).
Privacy considerations are the main force driving recent distributed machine learning developments. The notion that you can train a machine learning model without collecting any user data is extremely appealing from a privacy perspective, because managing user data, regardless of good intentions and best practices, always carries some modest but unavoidable risk. When operating cloud infrastructure on a massive scale, the possibility of human error can be mitigated, but never fully removed; DDML sidesteps this inherent challenge in a very elegant way: by not collecting the data in the first place.
The differences between classical model training and DDML are shown schematically below. Although the underlying training machinery is the same, the components interact in very different ways.
In a traditional setting (shown on the left), the user’s data crosses the network and is collected centrally for training on the server. In the DDML setting (shown on the right), the user’s data stays on the device; instead, the global machine learning model travels between the server and the devices to be updated. The model does not contain any Personally Identifiable Information (PII). It represents global relationships between the model features and the label (output), updated sequentially by every device that has previously worked on it. In its simplest incarnation, a model is a single array of floats, constantly adjusted up and down to match the desired output.
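To make the "model travels, data stays" idea concrete, here is a minimal sketch of a server that holds nothing but an array of floats and lets devices take it and hand it back in turn. The names ModelServer, check_out, check_in and simulate_device are hypothetical and chosen for illustration; the abuse safeguards and privacy machinery of the real system are not modeled.

```python
class ModelServer:
    """Holds the global model: a flat list of floats and no user data."""

    def __init__(self, num_features):
        self.weights = [0.0] * num_features
        self.updates_applied = 0

    def check_out(self):
        # Hand the device a copy so it cannot mutate the server's state.
        return list(self.weights)

    def check_in(self, new_weights):
        # Accept an updated model only if it has the expected shape.
        if len(new_weights) != len(self.weights):
            raise ValueError("model shape mismatch")
        self.weights = list(new_weights)
        self.updates_applied += 1


def simulate_device(server, nudge):
    """Stand-in for on-device training: shift each weight by a small amount.

    On a real device the adjustment would be derived from local training
    records, which never leave the device.
    """
    weights = server.check_out()
    weights = [w + delta for w, delta in zip(weights, nudge)]
    server.check_in(weights)


server = ModelServer(num_features=2)
simulate_device(server, [0.5, -0.25])   # first device adjusts the model
simulate_device(server, [0.25, 0.25])   # second device builds on that result
print(server.weights, server.updates_applied)
```

Each device starts from whatever the previous device left behind, which is what "sequentially updated" means here.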
DDML Phishing Model
To give a practical example of DDML in action, it is useful to consider the problem of detecting phishy URLs. Stealing users’ login and password information, commonly known as phishing, is an example of one-on-one abuse on the platform. Suppose a user receives a malicious link from a third party. They click on the link and see a site that visually looks like accounts.snapchat.com, asking them to enter their Snapchat username and password. What a less attentive user may fail to realize is that the link actually points to a non-Snap website meant to harvest Snapchat credentials.
We could prevent this attack if we could detect that the URL sent in the chat was phishy. But how do we determine if an arbitrary URL is phishy? This is where DDML proves to be valuable, offering both privacy and security benefits at the same time.
Every time a user receives a URL in chat, the Snapchat app makes a record in its private client-side storage. The record consists of features derived from the URL (its length, number of query parameters, number of digits, etc.), the page (does it have a password box, number of paragraph tags, etc.) and the context (locale, time, country, etc.), as well as prior knowledge of whether this is a known phishy URL or not (the classification label). The true classification of a URL is not always available up front, but Google Safe Browsing, user reports, and manual curation by our customer service after helping with hijacking attempts can provide a robust approximation. When the app has accumulated enough training records, it sends a generic request to the server, asking for a model. Once the model arrives, the app slightly adjusts the model’s weights to reflect the locally observed relationships in the training data. It then adds appropriately calibrated noise to the updated weights to provide local differential privacy guarantees and deletes the feature data from local storage. The only piece of data that leaves the user's device is the same set of weights it received from the server, just slightly adjusted and masked with noise. This same global model can be checked out and updated by another client at a later time.
As the model is updated by different clients, the global DDML model slowly learns to discriminate between phishy and non-phishy URLs/pages, given a set of features. In the future, when we encounter a new URL whose phishing status is unknown, we can estimate how likely it is to lead to a phishing website by running that URL’s features through the model. Features like “does the page have a password box?” or “does it look like accounts.snapchat.com?” are strongly predictive of the phishing status. The model just negotiates how much weight to give to each of them in the final decision-making step.
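The on-device step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: it assumes a logistic regression model, a single clipped gradient step, and Gaussian noise. The function name local_update and the parameters learning_rate, clip and noise_scale are hypothetical, and noise_scale is a placeholder rather than a value calibrated for any particular privacy guarantee.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_update(weights, records, learning_rate=0.1, clip=1.0,
                 noise_scale=0.05, rng=None):
    """Return a slightly adjusted, noise-masked copy of the global weights.

    records is a non-empty list of (features, label) pairs, where features
    is a list of floats as long as weights and label is 0 or 1.
    The list is emptied before returning, mirroring the deletion of
    feature data from local storage.
    """
    rng = rng or random.Random()
    grad = [0.0] * len(weights)
    # Average logistic-loss gradient over the locally stored records.
    for features, label in records:
        score = sum(w * x for w, x in zip(weights, features))
        error = sigmoid(score) - label
        for i, x in enumerate(features):
            grad[i] += error * x / len(records)
    # Clip the gradient norm so one device can only move the model a little.
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    updated = [w - learning_rate * scale * g for w, g in zip(weights, grad)]
    # Mask the adjusted weights with noise (assumed Gaussian in this sketch).
    noisy = [w + rng.gauss(0.0, noise_scale) for w in updated]
    records.clear()  # the training records never leave the device
    return noisy


# Hypothetical usage: two local records with [bias, has_password_box] features.
local_records = [([1.0, 1.0], 1), ([1.0, 0.0], 0)]
new_weights = local_update([0.0, 0.0], local_records)
print(new_weights, local_records)
```

Only the returned list of floats would be sent back to the server; the records it was computed from are gone by the time the function returns.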
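Scoring an unseen URL is then a matter of extracting the same features and passing them through the model. The sketch below covers only the URL-derived features mentioned earlier (length, number of query parameters, number of digits) and assumes a logistic model. The function names url_features and phishing_probability and the example weights are hypothetical; page and context features are left out.

```python
import math
from urllib.parse import parse_qsl, urlsplit


def url_features(url):
    """Turn a URL into a small feature vector (URL-derived features only)."""
    query = urlsplit(url).query
    return [
        1.0,                                                   # bias term
        float(len(url)),                                       # URL length
        float(len(parse_qsl(query, keep_blank_values=True))),  # query params
        float(sum(ch.isdigit() for ch in url)),                # digit count
    ]


def phishing_probability(weights, features):
    """Weighted sum of the features, squashed into a probability."""
    z = sum(w * x for w, x in zip(weights, features))
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Made-up weights for illustration; real weights come out of training.
example_weights = [-3.0, 0.02, 0.3, 0.1]
features = url_features("http://example.com/login?user=1&next=2")
print(phishing_probability(example_weights, features))
```

The learned weights are exactly the "how much weight to give each feature" that the model negotiates during training.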
Phishing takes place in an adversarial setting where attackers constantly adjust their strategies and approaches, so countermeasures must adapt continuously as well. Only flexible machine learning models with strong feature fidelity and anti-cloaking properties available on the client can offer sufficient protection. With state-of-the-art privacy guarantees and reliable features, the DDML phishing detection model comes to the rescue where traditional machine learning approaches would fall short.
There are several advantages to DDML. First and foremost is privacy. We would not have been able to apply a traditional machine learning setup to our phishy URL detection problem, because our privacy principles limit our ability to look at such signals on the server. Decentralization of data and computation also has additional benefits, such as scalability, resilience to failure and lower cost.
DDML does come with disadvantages. Since all data collection, processing and featurization take place on the client, machine learning engineers building a DDML-like solution need to ramp up on mobile development and on the iOS and Android platforms. In addition, one must be mindful of device and network resources and of the types of models that can be used given those constraints.
Distributed machine learning will not completely replace the centralized approach, but for particularly sensitive domains, such as private communication, its adoption will continue to accelerate, extending the scope of its applications.