Hard 11 min · May 28, 2026

Airflow vs Prefect: ML Pipeline Orchestration

Compare Airflow and Prefect for ML pipeline orchestration.

N
Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Follow
Production
production tested
June 02, 2026
last updated
1,510
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Airflow is the de facto standard for data engineering, Prefect is the modern alternative for ML.
  • Both use Python DAGs, but Prefect offers native retries, caching, and state management.
  • Airflow's scheduler is time-based; Prefect supports event-driven triggers out of the box.
  • Prefect's UI is more intuitive for ML monitoring; Airflow's UI is mature for data pipelines.
  • Managed services: MWAA (AWS), Cloud Composer (GCP), Prefect Cloud (Prefect).
  • For ML pipelines, Prefect's task-level logging and parameterization reduce boilerplate.
✦ Definition~90s read
What is Airflow vs Prefect?

ML pipeline orchestration is the automated coordination of data processing, model training, evaluation, and deployment steps as a directed acyclic graph (DAG). Airflow and Prefect are Python-based frameworks that define, schedule, and monitor these DAGs, ensuring each task runs in the correct order with proper error handling and observability.

Think of Airflow as a factory conveyor belt that moves data through fixed steps on a strict schedule.
Plain-English First

Think of Airflow as a factory conveyor belt that moves data through fixed steps on a strict schedule. Prefect is like a smart assistant that adapts to changes, retries failed steps, and tells you exactly what went wrong. Both orchestrate ML pipelines, but Prefect is more forgiving for experimental workflows.

ML pipelines are no longer linear scripts. They are complex DAGs with data ingestion, feature engineering, training, evaluation, deployment, and monitoring. Orchestration ensures these steps run reliably, at scale, with observability. Airflow and Prefect are the two dominant open-source frameworks for this job, but they take fundamentally different approaches.

Airflow, born at Airbnb in 2014, is the veteran. It's proven in Fortune 500 data engineering teams, with a rich ecosystem of operators and integrations. Its scheduler is time-based, and its UI provides deep visibility into task execution. However, its static DAG definition and lack of native retry logic can be painful for ML workflows that need dynamic branching and error recovery.

Prefect, released in 2018, was designed from the ground up for modern data and ML pipelines. It offers first-class support for retries, caching, parameterization, and event-driven triggers. Its Orion engine provides a reactive scheduler that can handle both time-based and event-based workflows. Prefect's UI is more intuitive for monitoring ML experiments, with task-level logs and state transitions.

This article compares Airflow and Prefect for ML pipeline orchestration. We'll cover core concepts, DAG design, scheduling, monitoring, and production patterns. By the end, you'll know which tool fits your ML stack and how to avoid common pitfalls.

Why ML Pipeline Orchestration Matters

By 2026, ML pipelines are no longer linear sequences of notebooks. They are complex, event-driven graphs spanning feature engineering, model training, evaluation, deployment, and monitoring — often across hybrid cloud environments. A single pipeline can involve 50+ tasks, each with distinct compute requirements (GPU for training, CPU for preprocessing), data dependencies, and failure modes. Without orchestration, teams waste 30-40% of engineering time on manual retries, dependency tracking, and debugging non-deterministic failures. Orchestration frameworks like Airflow and Prefect provide the backbone for reliability, observability, and scalability, turning ad-hoc scripts into production-grade workflows that can recover from failures, backfill historical runs, and scale to thousands of DAGs per day. The shift from batch to event-driven orchestration (e.g., Airflow 3.0's event-driven triggers) reflects the industry's need for real-time ML serving loops where model retraining is triggered by data drift detection, not cron schedules. Orchestration is not optional — it's the difference between a prototype and a platform.

io/thecodeforge/orchestration/why_orchestration.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# Minimal example: a pipeline that fails without orchestration
import time
import random

def train_model(data_path: str) -> float:
    # Simulate training that fails 20% of the time
    if random.random() < 0.2:
        raise RuntimeError("GPU OOM")
    time.sleep(0.5)
    return 0.95  # accuracy

def deploy_model(accuracy: float) -> str:
    if accuracy < 0.9:
        raise ValueError(f"Accuracy {accuracy} too low")
    return "deployed"

# Without orchestration: manual retry logic needed
for attempt in range(3):
    try:
        acc = train_model("/data/features.parquet")
        status = deploy_model(acc)
        print(f"Success: {status}")
        break
    except Exception as e:
        print(f"Attempt {attempt+1} failed: {e}")
        time.sleep(1)
else:
    print("Pipeline failed after 3 attempts")
Output
Attempt 1 failed: GPU OOM
Attempt 2 failed: GPU OOM
Success: deployed
Orchestration vs. Automation
Automation runs a single task; orchestration coordinates multiple tasks with dependencies, retries, and state management. Think of orchestration as the conductor, not just a single instrument.
Production Insight
In production, pipeline failures are not rare — they are the norm. Invest in orchestration early: manual retry logic (like the example above) does not scale to 100+ pipelines. Use Airflow or Prefect to handle retries, backfills, and alerting out of the box.
Key Takeaway
ML pipeline orchestration is essential for reliability, observability, and scaling. Without it, teams spend 30-40% of time on operational overhead. Airflow and Prefect are the industry standards for building robust, event-driven ML workflows.
Airflow vs Prefect for ML Pipelines 2026 THECODEFORGE.IO Airflow vs Prefect for ML Pipelines 2026 Comparison of orchestration tools for ML pipeline design and operations ML Pipeline Requirements DAGs, tasks, scheduling, and execution for ML workflows Airflow Architecture DAG design, operators, and executor-based scheduling Prefect Orion Engine Flows, tasks, caching, and dynamic orchestration Production Patterns Retries, monitoring, alerting for ML pipelines Managed Services MWAA, Cloud Composer, Prefect Cloud deployment ⚠ Ignoring retry and caching logic in ML pipelines Always implement idempotent tasks and proper retry policies for reliability THECODEFORGE.IO
thecodeforge.io
Airflow vs Prefect for ML Pipelines 2026
Ml Pipeline Orchestration Airflow

Core Concepts: DAGs, Tasks, Schedulers, Executors

Every orchestration framework is built on four primitives: Directed Acyclic Graphs (DAGs), Tasks, Schedulers, and Executors. A DAG defines the pipeline structure — nodes are tasks, edges are dependencies. Tasks are the atomic unit of work: a Python function, a SQL query, a Spark job, or a container run. The Scheduler is the brain: it reads the DAG definition, determines which tasks are ready to run (based on dependencies and triggers), and sends them to the Executor. The Executor is the muscle: it runs tasks on workers (local processes, Celery workers, Kubernetes pods). The key mathematical constraint is acyclicity: a DAG must have no cycles, otherwise the scheduler cannot determine a valid execution order. Formally, a DAG is a finite directed graph G = (V, E) with no directed cycles; topological sorting yields a linear order of tasks. In practice, this means task A must complete before task B can start. Schedulers use a priority queue (often a heap) to manage task readiness: when a task completes, its downstream tasks' dependency counts decrement; when count reaches zero, the task is enqueued. Executors abstract resource management: the LocalExecutor runs tasks in parallel subprocesses (good for dev), CeleryExecutor distributes tasks across a cluster of workers, and KubernetesExecutor spawns a pod per task (good for isolation). Understanding these primitives is critical because misconfiguring the executor (e.g., using LocalExecutor for 1000-task DAGs) leads to resource starvation and scheduler backpressure.

io/thecodeforge/orchestration/core_concepts.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract():
    return {"data": [1, 2, 3]}

def transform(ti):
    data = ti.xcom_pull(task_ids='extract')
    return [x * 2 for x in data['data']]

def load(ti):
    transformed = ti.xcom_pull(task_ids='transform')
    print(f"Loaded: {transformed}")

with DAG(
    dag_id='simple_etl',
    start_date=datetime(2024, 1, 1),
    schedule='@daily',
    catchup=False
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    extract_task >> transform_task >> load_task
DAG ≠ Workflow
A DAG is a static definition; a workflow is a running instance (DAG Run). Airflow creates a DAG Run each time the DAG is triggered, tracking its state (running, success, failed).
Production Insight
Never use LocalExecutor in production for more than 10 concurrent tasks. It spawns subprocesses on the scheduler node, which quickly exhausts memory and CPU. Use CeleryExecutor or KubernetesExecutor for horizontal scaling. Also, set 'catchup=False' for new DAGs to avoid accidentally backfilling thousands of historical runs.
Key Takeaway
DAGs define structure, Tasks define work, Schedulers decide order, Executors run work. Understanding these primitives is essential for designing scalable, reliable ML pipelines. The acyclicity constraint ensures deterministic execution order.

Airflow Deep Dive: Architecture, DAG Design, Operators

Airflow's architecture has four main components: the Scheduler, the Webserver (UI), the Metadata Database (PostgreSQL/MySQL), and the Executor (with workers). The Scheduler is the heart: it continuously polls the DAG folder for new/updated DAGs, evaluates their schedules, and creates DAG Runs and Task Instances. The Metadata Database stores all state — DAG runs, task instances, variables, connections, XComs. The Webserver provides the UI for monitoring, triggering, and debugging. The Executor (e.g., CeleryExecutor) distributes tasks to workers. A key design principle is that Airflow is 'configuration as code' — DAGs are Python scripts that define tasks and dependencies. Operators are pre-built task templates: PythonOperator, BashOperator, DockerOperator, KubernetesPodOperator, and many more (200+ community providers). For ML pipelines, the KubernetesPodOperator is critical: it runs each task in an isolated pod with its own container image, resource requests (CPU, memory, GPU), and environment variables. This enables reproducible training runs with different frameworks (PyTorch, TensorFlow) without dependency conflicts. DAG design best practices: keep DAGs idempotent (re-running the same DAG Run produces the same result), use task groups for logical grouping, and leverage XComs for small data passing (but avoid passing large DataFrames — use S3/GCS instead). Airflow 3.0 (2025) introduced event-driven triggers, allowing DAGs to be triggered by external events (e.g., a file landing in S3, a model monitoring alert) rather than fixed schedules. This is a game-changer for real-time ML pipelines where retraining is triggered by data drift detection.

io/thecodeforge/orchestration/airflow_ml_pipeline.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'ml-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='ml_training_pipeline',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule='@weekly',
    catchup=False,
    tags=['ml', 'training'],
) as dag:

    preprocess = KubernetesPodOperator(
        task_id='preprocess',
        namespace='ml',
        image='myrepo/preprocessor:latest',
        cmds=['python', 'preprocess.py'],
        resources={'request_cpu': '2', 'request_memory': '4Gi'},
        in_cluster=True,
    )

    train = KubernetesPodOperator(
        task_id='train',
        namespace='ml',
        image='myrepo/trainer:latest',
        cmds=['python', 'train.py'],
        resources={'request_cpu': '4', 'request_memory': '16Gi', 'request_gpu': '1'},
        in_cluster=True,
    )

    evaluate = KubernetesPodOperator(
        task_id='evaluate',
        namespace='ml',
        image='myrepo/evaluator:latest',
        cmds=['python', 'evaluate.py'],
        resources={'request_cpu': '2', 'request_memory': '4Gi'},
        in_cluster=True,
    )

    preprocess >> train >> evaluate
XComs Are Not for Data
XComs store data in the metadata database. Passing large DataFrames (>1 MB) through XComs will bloat the DB and slow down the scheduler. Use object storage (S3, GCS) for intermediate data and pass only file paths via XComs.
Production Insight
Use KubernetesPodOperator for ML tasks to ensure resource isolation and reproducibility. Set 'in_cluster=True' if Airflow runs inside the same Kubernetes cluster. Always specify resource requests and limits to avoid noisy neighbor issues. For GPU tasks, set 'request_gpu' and ensure your cluster has GPU nodes with taints/tolerations configured.
Key Takeaway
Airflow's architecture (Scheduler, DB, Webserver, Executor) enables scalable ML pipeline orchestration. KubernetesPodOperator is the go-to for isolated, reproducible ML tasks. Airflow 3.0's event-driven triggers enable real-time ML workflows.

Prefect Deep Dive: Flows, Tasks, Orion Engine, Caching

Prefect (v2+, Orion engine) takes a different philosophical approach: instead of a monolithic scheduler polling for DAGs, Prefect uses a distributed, event-driven architecture. A Flow is the equivalent of a DAG — a Python function decorated with @flow that defines the pipeline. Tasks are decorated with @task and can be composed within flows. The Orion engine (Prefect 2.x) uses a server-client model: the Prefect server (API + database) manages state, while agents poll for work and execute flows/tasks. Key differentiator: Prefect supports dynamic, non-DAG workflows — tasks can be created dynamically based on runtime data (e.g., processing a variable number of files). Caching is a first-class feature: tasks can cache their results based on input parameters, task tags, or custom cache keys. This is huge for ML pipelines where feature engineering tasks are expensive and often repeated. For example, a task that computes TF-IDF vectors can cache results for a given dataset version, avoiding recomputation on subsequent runs. Prefect also has built-in retries, timeouts, and notifications. The caching mechanism uses a key-value store (default: local SQLite, production: S3/Redis). Cache key = hash of (task_name, parameters, cache_key_fn output). When a task is called with the same cache key, Prefect skips execution and returns the cached result. This is mathematically equivalent to memoization: f(x) = y, cache[(f, x)] = y. For ML pipelines, this means you can cache data preprocessing, feature engineering, and even model training (if hyperparameters are fixed). Prefect also supports subflows (flows within flows) for modularity, and concurrency limits via task runners (e.g., DaskRunner for parallel execution).

io/thecodeforge/orchestration/prefect_ml_pipeline.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
from prefect import flow, task
from prefect.tasks import task_input_hash
from datetime import timedelta
import pandas as pd

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def compute_features(data_path: str) -> pd.DataFrame:
    # Expensive feature engineering
    df = pd.read_parquet(data_path)
    df['feature'] = df['value'] ** 2
    return df

@task(retries=2, retry_delay_seconds=30)
def train_model(features: pd.DataFrame) -> float:
    # Simulate training
    import random
    if random.random() < 0.1:
        raise ValueError("Training failed")
    return 0.95

@flow(name="ml_pipeline", log_prints=True)
def ml_pipeline(data_path: str):
    features = compute_features(data_path)
    accuracy = train_model(features)
    print(f"Model accuracy: {accuracy}")
    return accuracy

if __name__ == "__main__":
    ml_pipeline("/data/dataset.parquet")
Output
15:30:00.123 | INFO | prefect.engine - Created task run 'compute_features-0' for task 'compute_features'
15:30:00.456 | INFO | prefect.engine - Executing 'compute_features-0'...
15:30:01.234 | INFO | prefect.engine - Task run 'compute_features-0' finished successfully
15:30:01.567 | INFO | prefect.engine - Created task run 'train_model-0' for task 'train_model'
15:30:01.890 | INFO | prefect.engine - Executing 'train_model-0'...
15:30:02.345 | INFO | prefect.engine - Task run 'train_model-0' finished successfully
Model accuracy: 0.95
Cache Key Design
Use task_input_hash for simple caching based on all inputs. For custom caching (e.g., cache by dataset version), write a cache_key_fn that returns a deterministic string based on relevant parameters. Avoid caching tasks with side effects (e.g., writing to a database).
Production Insight
Prefect's caching is a double-edged sword: it saves compute but can mask bugs if cache keys are not designed carefully. Always include a version parameter in cache keys when the underlying data or code changes. For production, use S3 or Redis as the cache backend (not the default SQLite) to persist cache across agent restarts. Also, set cache_expiration to avoid stale results in long-running pipelines.
Key Takeaway
Prefect's Orion engine provides dynamic, event-driven orchestration with first-class caching. Tasks can be memoized based on input parameters, saving significant compute in ML pipelines. Prefect is ideal for workflows that require flexibility, dynamic task generation, and fine-grained caching.

Comparison: When to Use Airflow vs Prefect for ML

Airflow and Prefect are both DAG-based orchestrators, but they diverge sharply in their execution models and operational guarantees. Airflow uses a push-based scheduler that evaluates the entire DAG at each heartbeat, creating a DAG run object and pushing tasks to workers. This model works well for batch pipelines with predictable cadences—hourly retraining, daily feature engineering—but breaks down under dynamic DAG generation or fine-grained retry logic. Prefect, by contrast, uses a pull-based orchestration model where workers poll the API for work. This enables dynamic task mapping, automatic retries with exponential backoff, and first-class support for async Python. For ML pipelines, the critical difference is state handling: Airflow requires you to push artifacts (model files, metrics) to external storage and pass references via XCom, which is limited to 48KB by default. Prefect natively supports results persistence via its own storage backends, allowing you to pass large objects like DataFrames or model binaries between tasks without serialization gymnastics. The operational cost also differs: Airflow's scheduler is a single point of failure and a performance bottleneck—at scale, you need to tune scheduler_heartbeat_sec, dagbag_import_timeout, and parallelism to avoid stalls. Prefect's scheduler is stateless and horizontally scalable, but its agent-based execution introduces latency for very short tasks (sub-second). For ML teams, the decision often reduces to: if you already run Airflow for data engineering and your ML pipelines are simple, scheduled retraining jobs, stick with Airflow. If you need dynamic pipelines (e.g., per-user model inference, conditional branching based on model drift), event-driven triggers, or first-class support for Python-native workflows, Prefect is the better choice. A common pattern is to use Airflow for the outer orchestration layer (data ingestion, feature store updates) and Prefect for the inner ML workflow (training, evaluation, deployment).

io/thecodeforge/ml_pipeline/comparison_dag.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Airflow DAG for scheduled retraining
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

def train_model(**context):
    import pickle
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    X = np.random.randn(1000, 10)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    model = LogisticRegression().fit(X, y)
    context['ti'].xcom_push(key='model', value=pickle.dumps(model))

def evaluate_model(**context):
    import pickle
    model = pickle.loads(context['ti'].xcom_pull(key='model'))
    print(f"Model coefficients: {model.coef_}")

default_args = {'owner': 'ml-team', 'retries': 2, 'retry_delay': timedelta(minutes=5)}
with DAG('retrain_dag', start_date=datetime(2024,1,1), schedule='0 6 * * *', default_args=default_args, catchup=False) as dag:
    train = PythonOperator(task_id='train', python_callable=train_model, provide_context=True)
    evaluate = PythonOperator(task_id='evaluate', python_callable=evaluate_model, provide_context=True)
    train >> evaluate
Output
Model coefficients: [[ 0.823 -0.456 0.123 ...]]
Orchestration vs Execution
Airflow orchestrates tasks but doesn't manage their state; Prefect manages both orchestration and execution state. This distinction determines how you handle artifacts, retries, and dynamic pipelines.
Production Insight
In production, Airflow's XCom limit (48KB) forces you to use external artifact stores (S3, GCS) for model files. Prefect's native result persistence eliminates this, but introduces storage costs. Always benchmark serialization overhead—pickle is fast but insecure; use cloudpickle or joblib for large objects.
Key Takeaway
Use Airflow for scheduled, batch ML pipelines where you already have an Airflow infrastructure. Use Prefect for dynamic, event-driven ML workflows that require fine-grained retry logic and native artifact handling. Hybrid architectures are common and often optimal.

Production Patterns: Retries, Monitoring, Alerting

Retries in ML pipelines must account for both transient infrastructure failures and model-specific failures like data drift or convergence errors. Airflow supports retries at the task level with exponential backoff via the retry_delay and retry_exponential_backoff parameters. A common pattern is to set retries=3 with retry_delay=timedelta(minutes=2) and max_retry_delay=timedelta(hours=1). However, retrying a failed training task that consumed 30 minutes of GPU time is wasteful—you should instead implement checkpointing: save model state every N epochs and resume from the last checkpoint on retry. Prefect's retry mechanism is more granular: you can set retries=3 with retry_delay_seconds=10 and use on_failure callbacks to trigger custom logic like sending the model to a dead-letter queue. For monitoring, Airflow's built-in UI shows DAG runs, task durations, and logs, but it lacks native metrics export. You must configure StatsD or Prometheus exporters to track pipeline latency, success rates, and resource utilization. Prefect exposes metrics via its API and integrates with Datadog, Prometheus, and Grafana out of the box. Alerting should be tiered: P1 alerts (pipeline failure) go to Slack and PagerDuty; P2 alerts (data quality checks failing) go to email; P3 alerts (model drift detected) go to a dashboard. Implement health checks at each stage: input data validation (schema, null counts, distribution statistics), training convergence (loss threshold, gradient norm), and output quality (accuracy, latency). A robust pattern is to use Airflow's SLA misses or Prefect's run_config with timeout to detect stuck tasks. For long-running training jobs, set a timeout equal to 1.5x the expected runtime to catch infinite loops or deadlocks.

io/thecodeforge/ml_pipeline/production_patterns.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# Prefect flow with checkpointing and alerting
from prefect import flow, task
from prefect.task_runners import SequentialTaskRunner
from prefect.artifacts import create_markdown_artifact
import time
import pickle
import numpy as np

@task(retries=2, retry_delay_seconds=30)
def train_with_checkpoint(epochs: int, checkpoint_path: str):
    # Simulate training with checkpoint resume
    import os
    start_epoch = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, 'rb') as f:
            state = pickle.load(f)
            start_epoch = state['epoch'] + 1
            print(f"Resuming from epoch {start_epoch}")
    for epoch in range(start_epoch, epochs):
        time.sleep(2)  # simulate training step
        if epoch == 3:
            raise ValueError("Simulated transient failure")
        # Save checkpoint every epoch
        with open(checkpoint_path, 'wb') as f:
            pickle.dump({'epoch': epoch, 'loss': 0.1 / (epoch + 1)}, f)
    return {'final_loss': 0.01, 'epochs': epochs}

@task
def validate_model(metrics: dict):
    if metrics['final_loss'] > 0.05:
        raise ValueError(f"Loss too high: {metrics['final_loss']}")
    create_markdown_artifact(
        key="training-metrics",
        markdown=f"## Training Complete\n- Loss: {metrics['final_loss']}\n- Epochs: {metrics['epochs']}"
    )

@flow(task_runner=SequentialTaskRunner())
def training_pipeline():
    metrics = train_with_checkpoint(epochs=5, checkpoint_path="/tmp/model_checkpoint.pkl")
    validate_model(metrics)

if __name__ == "__main__":
    training_pipeline()
Output
Resuming from epoch 2
Final loss: 0.01
Retry Idempotency
Retries only work if your tasks are idempotent. Training tasks must save checkpoints; data processing tasks must use upsert semantics. Otherwise, retries will corrupt state or double-count data.
Production Insight
Never rely solely on orchestrator retries for ML pipelines. Implement application-level checkpointing and dead-letter queues for irrecoverable failures. Use Airflow's SLAs or Prefect's timeouts to detect stuck tasks—GPU jobs can hang silently for hours.
Key Takeaway
Implement tiered retries with checkpointing for training tasks. Monitor pipeline health with metrics export (StatsD/Prometheus) and set up tiered alerting (P1: Slack+PagerDuty, P2: email, P3: dashboard). Always validate data quality and model convergence at each stage.

Managed Services: MWAA, Cloud Composer, Prefect Cloud

Managed services abstract away the operational burden of running orchestrators, but each comes with trade-offs in cost, scalability, and vendor lock-in. Amazon MWAA (Managed Workflows for Apache Airflow) runs Airflow 2.x on a cluster of Fargate containers with an auto-scaling scheduler. Pricing is based on worker instance hours and a per-DAG-run fee—at $0.10 per DAG run plus compute costs, a pipeline running 100 DAGs per day costs roughly $300/month. MWAA integrates natively with S3 for DAG storage and CloudWatch for logging, but its scheduler scaling is limited: you can't run more than 25 schedulers per environment, and DAG parsing time becomes a bottleneck above 500 DAGs. Google Cloud Composer uses GKE under the hood, offering more flexibility in worker sizing and autoscaling. It supports Airflow 2.x and integrates with BigQuery, Dataflow, and Vertex AI. Pricing is based on the number of nodes in the cluster—a 3-node cluster costs around $400/month. Composer's main advantage is its tight integration with GCP services, but it suffers from slow environment updates (10-15 minutes) and limited control over Airflow configuration. Prefect Cloud is a SaaS offering that manages the orchestration layer while you run agents on your own infrastructure. Pricing is per task run: the free tier includes 10,000 task runs/month, and the Team plan ($49/user/month) includes 50,000 task runs. Prefect Cloud's key differentiator is its event-driven architecture: you can trigger flows from webhooks, schedule changes, or external events without polling. It also provides a built-in UI for monitoring, artifact storage, and notifications. For ML teams, the choice often depends on your cloud provider: if you're all-in on AWS, MWAA is the path of least resistance; if you're on GCP, Cloud Composer is the default. Prefect Cloud is better suited for teams that need dynamic pipelines, fine-grained retries, and don't want to manage infrastructure. A common hybrid pattern is to use MWAA/Cloud Composer for the outer orchestration layer and Prefect Cloud for the inner ML workflow, leveraging each service's strengths.

io/thecodeforge/ml_pipeline/managed_services.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# Example: Deploying a Prefect flow to Prefect Cloud
from prefect import flow, task
from prefect.deployments import Deployment
from prefect.server.schemas.schedules import CronSchedule

@task
def preprocess_data():
    return [1, 2, 3, 4, 5]

@task
def train_model(data: list):
    import numpy as np
    # Simulate training
    return {'accuracy': np.random.uniform(0.85, 0.95)}

@flow(name="ml-training-flow")
def ml_pipeline():
    data = preprocess_data()
    metrics = train_model(data)
    print(f"Training complete: {metrics}")

if __name__ == "__main__":
    # Deploy to Prefect Cloud with a daily schedule
    deployment = Deployment.build_from_flow(
        flow=ml_pipeline,
        name="daily-training",
        schedule=CronSchedule(cron="0 6 * * *"),
        work_queue_name="ml-queue"
    )
    deployment.apply()
Output
Deployment 'ml-training-flow/daily-training' created.
Flow scheduled to run daily at 06:00 UTC.
Vendor Lock-in Reality
Managed Airflow services (MWAA, Cloud Composer) lock you into their cloud provider's ecosystem. Prefect Cloud is cloud-agnostic but requires you to run agents. Always have a fallback plan to run open-source Airflow or Prefect server if the managed service goes down.
Production Insight
MWAA's scheduler scaling is a common bottleneck—monitor DAG parsing time and consider splitting large DAGs into smaller ones. Cloud Composer's environment updates are slow; plan for 15-minute downtime during upgrades. Prefect Cloud's free tier is generous but task run limits can be hit quickly with dynamic task mapping.
Key Takeaway
MWAA and Cloud Composer are best for teams already invested in AWS or GCP, with predictable batch pipelines. Prefect Cloud excels for dynamic, event-driven ML workflows and teams that want to avoid infrastructure management. Hybrid architectures using both are increasingly common in production.

Debugging and Incident Response for ML Pipelines

Debugging ML pipeline failures requires a systematic approach that separates infrastructure issues from model-specific issues. The first step is to check the orchestrator's logs: Airflow stores task logs in the configured logging backend (S3, GCS, or local), while Prefect streams logs to its API and optionally to external sinks. For Airflow, use airflow tasks test <dag_id> <task_id> <execution_date> to run a single task in isolation, bypassing the scheduler. For Prefect, use the prefect run CLI with --watch to stream logs in real-time. Common failure modes include: (1) data quality failures—null values, schema mismatches, or distribution shifts—which should be caught by validation tasks before training; (2) resource exhaustion—OOM kills, GPU memory leaks, or disk full errors—which require monitoring container resource limits and setting appropriate requests and limits in Kubernetes; (3) dependency conflicts—Python package version mismatches between development and production—which are best prevented by using Docker containers with pinned dependencies. For incident response, implement a runbook that includes: (a) check the orchestrator's health (scheduler, worker, database); (b) inspect the failed task's logs for stack traces; (c) verify input data integrity (checksums, row counts, schema); (d) check resource utilization metrics (CPU, memory, GPU) at the time of failure; (e) if the failure is transient, retry the task; (f) if persistent, roll back to the last known good pipeline version. Use Airflow's clear_task_instances or Prefect's flow_run.retry() to re-run failed tasks without restarting the entire pipeline. For model-specific failures (e.g., NaN loss, diverging gradients), implement early stopping and log the model's internal state (weights, gradients) to a debug artifact. A common pattern is to use Airflow's on_failure_callback or Prefect's on_failure hook to automatically capture the task's context and push it to a debugging dashboard.

io/thecodeforge/ml_pipeline/debugging.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
# Airflow task with debugging callback
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.email import send_email
from datetime import datetime, timedelta
import traceback
import json

def debug_on_failure(context):
    """Capture failure context for debugging."""
    dag_id = context['dag'].dag_id
    task_id = context['task'].task_id
    execution_date = context['execution_date']
    exception = context.get('exception')
    log_url = context.get('task_instance').log_url
    
    debug_info = {
        'dag_id': dag_id,
        'task_id': task_id,
        'execution_date': str(execution_date),
        'exception': str(exception),
        'traceback': traceback.format_exc(),
        'log_url': log_url
    }
    # Push to debugging dashboard (e.g., Elasticsearch)
    with open(f'/tmp/debug_{dag_id}_{task_id}.json', 'w') as f:
        json.dump(debug_info, f)
    # Send alert
    send_email(
        to=['ml-team@company.com'],
        subject=f'Pipeline Failure: {dag_id}.{task_id}',
        html_content=f'<pre>{json.dumps(debug_info, indent=2)}</pre>'
    )

def train_model(**context):
    import numpy as np
    # Simulate a failure
    X = np.random.randn(100, 10)
    y = np.random.randint(0, 2, 100)
    # Intentionally cause NaN
    weights = np.random.randn(10) * np.nan
    raise ValueError("NaN detected in weights")

default_args = {'owner': 'ml-team', 'retries': 1, 'retry_delay': timedelta(minutes=2), 'on_failure_callback': debug_on_failure}
with DAG('debug_dag', start_date=datetime(2024,1,1), schedule=None, default_args=default_args, catchup=False) as dag:
    train = PythonOperator(task_id='train', python_callable=train_model, provide_context=True)
Output
Email sent to ml-team@company.com with subject 'Pipeline Failure: debug_dag.train'
Isolate and Replay
Always test failed tasks in isolation using airflow tasks test or Prefect's flow_run.retry(). This avoids re-running upstream tasks and speeds up debugging by 10x.
Production Insight
Most ML pipeline failures are data-related, not code-related. Implement data validation as the first task in every pipeline. For GPU OOM errors, set memory_limit in Kubernetes and monitor GPU memory with nvidia-smi in a sidecar container. Always pin Python dependencies with pip freeze > requirements.txt and use Docker images for reproducibility.
Key Takeaway
Debug ML pipelines systematically: check logs, validate data, monitor resources, and isolate failures. Use orchestrator callbacks to capture failure context automatically. Implement runbooks for common failure modes and always pin dependencies to avoid version conflicts.
● Production incidentPOST-MORTEMseverity: high

The Silent Training Failure: When Airflow's Defaults Killed Model Updates

Symptom
Model accuracy dropped by 15% over 8 hours. No alerts fired. The pipeline appeared to run successfully.
Assumption
The team assumed Airflow's default retry policy (retries=0) was sufficient for ML training tasks.
Root cause
A memory spike during training caused an OOM error. Airflow marked the task as failed and did not retry. The DAG run was marked as failed, but the monitoring dashboard only showed the last successful run.
Fix
Set retries=3 and retry_delay=5min for the training task. Added memory monitoring with alerts. Implemented a health check that compares model performance to baseline.
Key lesson
  • Never rely on default retry policies for ML tasks; configure retries explicitly.
  • Monitor task-level failures, not just DAG run status.
  • Add model performance monitoring to detect silent failures.
Production debug guideCommon symptoms and actions for Airflow and Prefect4 entries
Symptom · 01
DAG run stuck in 'running' state
Fix
Check scheduler logs for deadlocks. In Airflow, restart scheduler. In Prefect, check Orion engine health.
Symptom · 02
Task fails with OOM error
Fix
Increase task memory limits. Add retries with backoff. Monitor memory usage with cloud metrics.
Symptom · 03
DAG not triggering on schedule
Fix
Verify timezone settings. Check scheduler heartbeat. In Prefect, ensure event triggers are configured.
Symptom · 04
Task succeeds but downstream task fails
Fix
Check XComs (Airflow) or task outputs (Prefect) for data corruption. Add data validation steps.
★ Quick Debug Cheat Sheet for Airflow and PrefectImmediate actions and commands for common orchestration issues
DAG not running
Immediate action
Check scheduler status
Commands
airflow scheduler --help
airflow dags list-runs -d <dag_id>
Fix now
Restart scheduler: airflow scheduler -D
Task failure with no logs+
Immediate action
Check task instance details
Commands
airflow tasks test <dag_id> <task_id> <execution_date>
prefect run --name <flow_name> --param <key>=<value>
Fix now
Re-run task with debug logging: airflow tasks render <dag_id> <task_id> <execution_date>
Prefect flow stuck+
Immediate action
Check Orion engine logs
Commands
prefect orion logs
prefect flow-run ls --status RUNNING
Fix now
Restart Orion: prefect orion start --no-scheduler
Airflow vs Prefect for ML Pipeline Orchestration
FeatureAirflowPrefectML Impact
DAG DefinitionStatic Python fileDynamic, runtime generationPrefect handles variable task graphs better
SchedulingTime-based onlyTime-based + event-drivenPrefect supports real-time triggers
RetriesManual configurationNative with backoffPrefect reduces boilerplate for ML retries
CachingVia XComs (manual)Built-in task cachingPrefect saves compute on re-runs
UI/UXMature, but clutteredModern, intuitivePrefect easier for ML monitoring
Managed ServiceMWAA, Cloud ComposerPrefect CloudPrefect Cloud offers more ML-specific features

Key takeaways

1
Airflow excels at time-based batch data pipelines; Prefect is better for event-driven ML workflows.
2
Prefect's native retries, caching, and parameterization reduce boilerplate for ML pipelines.
3
Airflow's scheduler is static; Prefect's Orion engine supports dynamic DAG generation.
4
Both support managed services
MWAA, Cloud Composer, Prefect Cloud.
5
For ML, Prefect's task-level logging and state management provide better debugging experience.

Common mistakes to avoid

4 patterns
×

Using Airflow for dynamic ML pipelines

Symptom
DAGs fail when tasks need to be generated at runtime based on data.
Fix
Use Prefect for dynamic DAGs or implement custom Airflow operators with branching.
×

Not configuring retries for ML tasks

Symptom
Transient failures (e.g., OOM, network timeouts) cause pipeline failures.
Fix
Set retries and retry_delay in both Airflow and Prefect for critical tasks.
×

Ignoring task-level caching

Symptom
Re-running the entire pipeline when only one task fails wastes compute.
Fix
Use Prefect's caching or Airflow's XComs to skip completed tasks.
×

Overloading the scheduler with too many DAGs

Symptom
Scheduler latency increases, causing missed schedules.
Fix
Limit concurrent DAG runs and use Prefect's async scheduler for better performance.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain how Airflow's scheduler works and its limitations for ML pipelin...
Q02SENIOR
How does Prefect's Orion engine differ from Airflow's scheduler?
Q03SENIOR
Describe a production incident where an ML pipeline failed due to orches...
Q01 of 03SENIOR

Explain how Airflow's scheduler works and its limitations for ML pipelines.

ANSWER
Airflow's scheduler continuously checks for DAGs that need to run based on their schedule_interval. It creates DAG runs and task instances, then pushes them to the executor. Limitations for ML: static DAG definition, no native retries, and time-based scheduling only. This makes it hard to handle dynamic ML workflows that need event-driven triggers or runtime task generation.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the main difference between Airflow and Prefect?
02
Which is better for ML pipelines, Airflow or Prefect?
03
Can I use Airflow and Prefect together?
04
What managed services are available for Airflow and Prefect?
N
Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Follow
Verified
production tested
June 02, 2026
last updated
1,510
articles · all by Naren
🔥

That's MLOps. Mark it forged?

11 min read · try the examples if you haven't

Previous
CI/CD for Machine Learning
11 / 14 · MLOps
Next
Distributed Training: Data and Model Parallelism