From Monolith to Multi-Cluster: Snap's Journey to a Resilient Data Orchestration Platform

Flowrida, Snap's workflow orchestration tool, is built on Apache Airflow and is critical for managing data at scale. It supports over 1,000 users across more than 170 teams, handling 3,300 active data pipelines and over 180,000 task instances daily. A significant limitation of open-source Apache Airflow is its lack of multi-cluster resiliency. While vendor-hosted offerings exist, they present trade-offs concerning cost, IAM, and integration at Snap's operational scale. Flowrida initially operated as a monolithic single Kubernetes cluster. Although effective at first, this setup posed substantial challenges as Snap's exabyte-scale data processing needs grew. With Snap's growth, timely data processing and availability have become essential for the core Snapchat app across iOS, Android, and web. Data orchestration is also crucial for A/B testing, analytics, and machine learning use cases like recommendations and ranking. Given the platform's criticality and increasing operational burden, Flowrida was elevated to a Tier 0/1 service at Snap. Since then, by revamping our infrastructure with a multi-cluster setup, Flowrida has reliably maintained 99.95% availability.


The Challenges of a Monolithic Architecture

The single-cluster design led to several issues that hampered Flowrida's operational efficiency and reliability:

  • Lack of high availability and disaster recovery:  Our previous system lacked the ability to guarantee business continuity and the rerouting of critical DAGs in the event of a cluster failure.

  • All-at-once deployments: Configuration changes, patches, and upgrades immediately affected all components across the entire system. This created a massive blast radius during incidents and made testing difficult, as staging environments could not accurately replicate production loads. This kept our MTBF (Mean time between failures) low.

  • Reduced service reliability: We couldn’t isolate DAGs into different clusters, so a single fault could affect every DAG on the platform rather than a small subset.

  • High Mean Time To Recovery (MTTR): In the event of a cluster-level fault, we had no backup cluster to meet data-availability SLAs across hundreds of teams. Even for business-critical DAGs with sensitive SLAs, we had no mechanism to unblock them.

  • Scalability bottleneck: The Metadata Database (Postgres) served as a primary scaling bottleneck. As the number of DAGs increased (with ~40% year-over-year growth), so did the database size, growth rate, and lock times, leading to degraded scheduler health.

These challenges were increasingly visible during periods of operational stress, revealing the urgent need for a more robust and resilient infrastructure.

The Vision: A Tiered Multi-Cluster Infrastructure

To address these limitations, we have transitioned Flowrida to a tiered multi-cluster infrastructure. This architectural shift aims to:

  • Enable incremental deployments: Changes can be rolled out gradually to smaller subsets of DAGs, allowing for thorough verification before broader deployment.

  • Improve long-term scalability: Distributing DAGs across multiple clusters mitigates the Metadata DB bottleneck and prepares Flowrida to handle future demand.

  • Faster Mean Time To Recovery (MTTR): In case of a cluster-level fault, critical DAGs can be quickly migrated to a healthy cluster ensuring Snap’s business continues to run.

While the benefits are clear, transitioning to a multi-cluster environment introduces new complexities:

  • Maintainability: Avoiding the pitfalls of Flowrida V1's single-tenant architecture (50+ clusters managed manually) requires limiting the number of clusters to a manageable few (e.g., 2-3).

  • Increased Dependencies: New services built to support multi-cluster operation introduce new potential points of failure, necessitating high reliability for these components.

  • Customer Experience and URL Namespacing: With DAGs spread across clusters, ensuring a seamless user experience and avoiding fragmented UI access (like in Flowrida V1) is crucial.

To tackle these challenges, Snap has developed several key components and strategies:


1. The DAG Switching Service: The Cornerstone of Multi-Cluster Operation

The DAG Switching Service is fundamental to Flowrida's multi-cluster strategy, enabling both fast incident mitigation during cluster failures and controlled rollouts. Its core goal is to develop a fast (within minutes) and safe method for switching DAGs between clusters, addressing situations like unhealthy clusters, Airflow upgrades, or phased rollouts.

Constraints and Considerations
Before looking at the specific switching approaches, it’s important to recognize several fundamental limitations of Apache Airflow that shaped our design. These constraints are imposed by working within Airflow’s open-source code, and they apply no matter which switching strategy is used:

  • Catchup=True Backfill Problem: When performing a naive switch, the scheduler in the destination cluster will attempt large-scale historical backfills because it interprets the catchup=true setting (used by most Snap DAGs) as indicating "missed" runs. To prevent this, it's essential to copy the most recent terminal run as an "anchor task" using Airflow's backfill mechanism.

  • CLI-Only Operations: Key operations (airflow backfill, dag-processor) are not exposed through Airflow's REST API. This forced us to build Kubernetes Job wrappers using Client-Go to run these commands safely.

  • Duplicate DAG Visibility: Deploying the same DAG to multiple clusters risks confusion and duplicate execution. We enforce global pause/unpause rules to ensure only one cluster actively schedules a DAG.                 

  • Minimal open-source code changes: Our approach mandates minimal, if any, changes to Airflow’s open-source code. Extensive cherry-picking and reapplication of patches with each Airflow version upgrade would undermine the goal of simpler migrations.

Only by addressing these constraints could we build safe, reliable mechanisms for both Force Switch and Continuous Switch.
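The CLI-only constraint and the catchup anchor interact: the switch tooling has to launch a short-lived Kubernetes Job that shells out to `airflow dags backfill` for the anchor run. The sketch below builds such a Job manifest in Python as an illustration (Snap's actual implementation uses Client-Go in Go); the image name, namespace, and labels are assumptions, not Snap's real configuration.

```python
# Hypothetical sketch: build a Kubernetes Job manifest that runs
# `airflow dags backfill` for a single "anchor" run, so that the
# destination scheduler's catchup=True logic does not see a long
# history of "missed" runs. Image and namespace are illustrative.

def build_anchor_backfill_job(dag_id: str, anchor_date: str,
                              namespace: str = "flowrida") -> dict:
    job_name = f"anchor-backfill-{dag_id}".lower().replace("_", "-")
    cmd = [
        "airflow", "dags", "backfill",
        "--start-date", anchor_date,  # single-interval backfill:
        "--end-date", anchor_date,    # start == end == anchor run
        dag_id,
    ]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,  # never blindly retry a failed backfill
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "backfill",
                        "image": "apache/airflow:2.9.2",  # illustrative
                        "command": cmd,
                    }],
                }
            },
        },
    }
```

A controller would submit this manifest through the Kubernetes API (for example `BatchV1Api.create_namespaced_job` in the official Python client) and watch the Job until completion before letting the destination cluster schedule the DAG.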

The DAG Switching Service supports two modes:

  • Force Switch — optimized for speed during incidents.

  • Continuous Switch — optimized for seamless, planned migrations.

Force Switch

The Force Switch mechanism copies the DAG’s scheduling state from a source cluster to a destination cluster. Flowrida then resumes scheduling on the destination cluster from that point forward. This approach bypasses Flowrida’s normal scheduling by orchestrating the switch through direct database writes, REST API calls, and Kubernetes jobs (to trigger Airflow’s CLI).

Use Cases & Pros

  • Best for incidents: Fast recovery when a source cluster is unhealthy.

  • Speed: The DAG is running in the destination cluster as soon as the command is issued.

Cons:

  • Complexity: Requires coordination across multiple systems (DB, APIs, Kubernetes).

  • Task Disruption: Active tasks cannot be carried over. Any in-progress task is killed and retried in the destination cluster. Acceptable in emergencies, but wasteful for planned moves.
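The ordering of a Force Switch matters: the source must stop scheduling before the destination starts, or both clusters could briefly run the same DAG. A minimal sketch of that ordering, assuming hypothetical thin per-cluster client wrappers (pause/unpause via the REST API, state reads/writes via the metadata DB, anchor creation via a Kubernetes Job); the method names here are invented for illustration:

```python
# Hypothetical orchestration of a Force Switch. `source` and `dest`
# stand for thin per-cluster clients; their method names do not
# correspond to a real Flowrida API.

def force_switch(dag_id, source, dest):
    # 1. Pause on the source so no new runs start there. Any in-flight
    #    task is abandoned here and retried on the destination.
    source.pause_dag(dag_id)
    # 2. Copy the most recent terminal run as the "anchor", so the
    #    destination's catchup=True does not trigger a mass backfill.
    anchor = source.latest_terminal_run(dag_id)
    dest.create_anchor_run(dag_id, anchor)
    # 3. Unpause on the destination last; scheduling resumes
    #    immediately, and at no point were both clusters active.
    dest.unpause_dag(dag_id)
```

Unpausing the destination only after the source is paused is what upholds the global "unpaused in at most one cluster" rule during an emergency switch.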

Continuous Switch

The Continuous Switch method is designed for planned migrations. Instead of forcing a transfer, it partitions DAG execution history at the DAG run level:

  • The source cluster completes all ongoing DAG runs.

  • The destination cluster takes over from the very next run.

This handoff is managed by Airflow’s own scheduler, making it simpler and more reliable. We call this the Continuous Switch because the DAG’s schedule is unbroken across clusters.

Key Components

The implementation relies on two lightweight but powerful building blocks:

  1. Airflow dag_policy: A dag_policy is a hook that executes each time a DAG is parsed in Airflow. It gives us access to the in-memory DAG object before Airflow serializes and stores it. We use it to dynamically adjust a DAG’s scheduling behavior during migration.

  2. Kubernetes ConfigMap: A ConfigMap is a simple key–value store in Kubernetes, typically used for lightweight configuration. Here, it stores the migration state (e.g., which DAGs are being switched). ConfigMap is chosen over Airflow Variables because Variables live in Airflow’s metadata database, which would add overhead if queried during every DAG parse. ConfigMaps are cheaper to read and well-suited to our mostly read-heavy workload.
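Wiring the two together takes only a few lines. Below is a hypothetical `dag_policy` (Airflow loads cluster policies from `airflow_local_settings.py`) that reads the migration state from a file-mounted ConfigMap; the mount path, environment variable, and JSON key are assumptions for illustration, not Snap's real configuration.

```python
# Hypothetical dag_policy: on every DAG parse, check the mounted
# migration ConfigMap and stop scheduling DAGs that are mid-switch.
# The mount path / env var are assumptions, not Snap's real setup.
import json
import os

def _dags_being_switched():
    path = os.environ.get("FLOWRIDA_SOURCE_LIST",
                          "/etc/flowrida/migration/source_dags.json")
    try:
        with open(path) as f:
            return set(json.load(f))
    except (FileNotFoundError, ValueError):
        # No ConfigMap mounted (or malformed): no DAGs are migrating.
        return set()

def dag_policy(dag):
    # Called by Airflow for each DAG before serialization. Setting the
    # schedule to None halts new runs on this (source) cluster while
    # active runs are left to finish naturally.
    if dag.dag_id in _dags_being_switched():
        dag.schedule_interval = None
```

Because this hook runs on every parse, the migration list must be cheap to read, which is exactly why a file-mounted ConfigMap is preferred over an Airflow Variable (a metadata-DB query per parse) here.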

Lifecycle of a Continuous Switch

  1. Initiate Switch: The DAG’s ID is added to a “source list” in the source cluster’s ConfigMap.

  2. Stop New Runs in Source: On the next parse, dag_policy reads the ConfigMap; if the DAG is listed, its schedule is set to None. This halts new runs but allows all active runs to finish naturally.

  3. Anchor the Destination Cluster: The destination cluster uses backfill (Airflow’s feature for filling missed runs) to align with the source cluster’s final run.

  4. Seamless Resumption: Once backfill completes, the destination DAG is unpaused. Scheduling resumes from the next expected interval with no disruption.

  5. Cleanup: After the source finishes its active runs, the DAG is paused and removed from the ConfigMap.
    At this point, migration is complete: the DAG runs only on the destination cluster, with no lost tasks or delays.
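Step 1, initiating the switch, amounts to adding the DAG's ID to the source cluster's ConfigMap. A hedged sketch of building that patch body is below; the data key and JSON layout are assumptions, and submission would go through the Kubernetes API (for example `CoreV1Api.patch_namespaced_config_map` in the official Python client).

```python
import json

# Hypothetical: build the strategic-merge patch body that adds `dag_id`
# to the migration "source list" stored in the ConfigMap. The data key
# ("source_dags.json") is invented for illustration.
def build_source_list_patch(current_dags, dag_id):
    # Deduplicate and sort so repeated initiations are idempotent and
    # the ConfigMap diff stays stable across patches.
    updated = sorted(set(current_dags) | {dag_id})
    return {"data": {"source_dags.json": json.dumps(updated)}}
```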

Tradeoffs
  • Pros: Non-disruptive, automatic, and simple, since Airflow’s schedulers do the heavy lifting.

  • Cons: Requires both clusters to be healthy; individual switches take longer than a Force Switch.

2. Metadata Synchronization 

Metadata synchronization is primarily achieved via the Airflow API, which is preferred over direct database writes due to its maintainability and safety, despite being potentially slower. Crucially, commands like backfill and DAG processor runs, which are not available via API, are executed using Kubernetes Jobs submitted via a Client-Go library. To prevent duplicate execution, the system ensures a DAG is unpaused in at most one cluster globally.
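For the pause state specifically, Airflow's stable REST API exposes `PATCH /api/v1/dags/{dag_id}` with an `is_paused` field. The sketch below shows one way to order those calls so a DAG is never unpaused in two clusters at once; the cluster URLs are illustrative, and executing the calls (e.g. with `requests.patch`) is left out.

```python
# Build the sequence of REST calls that enforces the invariant: pause
# the DAG on every other cluster first, then unpause it on exactly one.
# Returns (method, url, json_body) tuples for an HTTP client to run
# in order. Cluster base URLs are illustrative.
def single_active_cluster_calls(dag_id, active_cluster, all_clusters):
    calls = [
        ("PATCH", f"{c}/api/v1/dags/{dag_id}", {"is_paused": True})
        for c in all_clusters
        if c != active_cluster
    ]
    # Unpause last, so there is never a moment with two active clusters.
    calls.append(
        ("PATCH", f"{active_cluster}/api/v1/dags/{dag_id}",
         {"is_paused": False})
    )
    return calls
```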


3. Flowrida Cloud Console UI: A Unified User Experience

To mitigate customer friction caused by URL namespacing, Flowrida has integrated with the Snap Cloud Console. This new UI acts as a unified portal, allowing users to:

  • Search and locate their DAGs across all clusters from a single home page.

  • View DAG details via seamless iFrame integration with Airflow's native UI.

  • Access features like starred DAGs, recent browsing history, SLA Tracker, and lineage views.

This approach avoids extensive changes to the open-source Airflow web server, laying the groundwork for future Flowrida features and ensuring version skew agnosticism between clusters. Existing Flowrida URLs are now redirected to the Cloud Console.


4. Cross-Cluster Communication with ExternalTaskSensor

DAGs often depend on other DAGs running in different clusters, so preserving inter-DAG dependencies is crucial. Our approach has the ExternalTaskSensor check directly with the Airflow API: if a DAG is not found in the current cluster's dagbag (the in-memory collection of all DAGs that a scheduler or webserver process has loaded), an API call is made to the other cluster. This method offers a simplified structure and minimizes middle layers. Earlier solutions, like extending the External Meta API (EMA), were considered but posed reliability concerns due to tight dependencies and stability issues.
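The routing logic inside such a sensor reduces to a simple check. A minimal sketch, assuming a known peer-cluster base URL and Airflow's stable REST API for the remote lookup (the sensor's actual wiring at Snap is not shown in this post):

```python
# Hypothetical routing decision for a cross-cluster ExternalTaskSensor:
# consult the local dagbag first; only if the upstream DAG is absent,
# fall back to the peer cluster's REST API. `peer_url` is illustrative.
def resolve_upstream_check(external_dag_id, local_dag_ids, peer_url):
    if external_dag_id in local_dag_ids:
        # Normal ExternalTaskSensor path: the upstream DAG lives in
        # this cluster, so its state is in the local metadata DB.
        return ("local", None)
    # Remote path: poll the peer cluster's dagRuns endpoint and
    # inspect the state of the run being waited on.
    return ("remote",
            f"{peer_url}/api/v1/dags/{external_dag_id}/dagRuns")
```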

Benefits in Action: The Phoenix Cluster Incident

A significant validation of this multi-cluster strategy occurred in H1 2025. A misconfiguration pushed to the Phoenix cluster (which serves as a production canary cluster for new changes) caused newly scheduled tasks to hang indefinitely and broke SLA alerting, leading to a 3-hour delay in A/B testing.

Without multi-cluster isolation, this incident would have been catastrophic, potentially affecting over 3,000 DAGs across more than 150 teams, requiring manual clearing of tens of thousands of hung tasks, causing company-wide delays, and demanding a massive multi-team effort.

Instead, thanks to the multi-cluster environment, the incident was contained. It drastically reduced our MTTR, requiring only a couple of developer hours for resolution. The blast radius was reduced by at least an order of magnitude, unequivocally validating the strategic decision to implement a multi-cluster architecture.


The Roadmap Ahead

Snap's multi-cluster journey for Flowrida is a strategic investment, and we have an ambitious roadmap ahead. Below are some of the challenges we aim to tackle in the coming quarters.

  1. Reducing total cost of ownership for Flowrida infrastructure: We recently had some impressive wins and learnings here, optimizing our infra with the multi-cluster setup.

  2. Flowrida Feature Improvements: Implement key features like DAG cost attribution and production backfilling.

  3. Dual Persistent Flowrida Clusters and Flowrida Routing Layer: Fully realize the tiered multi-tenant Flowrida clusters with phased rollouts, backed by the Dag Cluster Mapping Service and the unified Flowrida Routing Layer.

The transition to a multi-cluster architecture for Flowrida at Snap represents a significant leap forward in ensuring the platform's reliability, scalability, and operational excellence. By strategically addressing the complexities of distributed workflow orchestration, Snap is building a more resilient and efficient data infrastructure for the future.