Easy 12 min · May 28, 2026

Data and Model Versioning with DVC: Production MLOps

Master DVC for data and model versioning in production ML.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Last updated: June 02, 2026
Quick Answer
  • DVC brings Git-like version control to datasets and ML models, storing pointers in Git and data in remote storage.
  • It enables reproducible ML pipelines by versioning data, code, and model artifacts together.
  • DVC pipelines define stages (e.g., prepare, train, evaluate) with dependencies and outputs, cached and tracked.
  • Experiments can be run and compared using dvc exp run and dvc exp show, tracking metrics and hyperparameters.
  • Remote storage (S3, GCS, SSH) decouples large files from Git, keeping repos lean.
  • DVC integrates with CI/CD for automated model retraining and deployment in production.
Definition
What is Data and Model Versioning with DVC?

DVC (Data Version Control) is an open-source tool that extends Git to handle large files, datasets, and ML models. It uses a Git-like interface to version data and model artifacts, storing pointers in Git and actual content in remote storage (S3, GCS, SSH, etc.). DVC also provides pipeline functionality to define and execute reproducible ML workflows.

Plain-English First

Think of DVC as a librarian for your ML project. Git tracks your code like a table of contents, but DVC tracks the actual heavy books (datasets and models) by storing a lightweight card (a hash pointer) in Git and the book itself in a warehouse (remote storage). When you need to reproduce an experiment, DVC fetches the exact books from the warehouse, so you never lose track of which version of data produced which result.

Reproducibility is the bottleneck. A model that passed validation on Monday can silently degrade by Friday—not because the code changed, but because someone updated a CSV, overwrote a Parquet file, or swapped a feature column without telling anyone. Data Version Control (DVC) solves this by extending Git's versioning semantics to datasets and model binaries, which Git was never designed to track.

DVC is a pipeline orchestrator, not just a versioning tool. You define stages—data prep, feature engineering, training, evaluation—as a directed acyclic graph (DAG). It caches intermediate outputs and only re-executes stages whose inputs changed. In production ML where a single retraining run spans hours, this caching is the difference between a five-minute sanity check and a full rebuild.

This article skips the tutorials you've already read. We'll walk through real project setup, remote storage configuration, experiment tracking and comparison, and CI/CD integration for automated retraining. We'll also cover production debugging: what to do when a pipeline fails mid-run, how to roll back a model to a known-good checkpoint, and how to avoid cache corruption or remote storage drift.

By the end, you'll have a production-grade mental model for data and model versioning. You'll understand why DVC is the default choice for MLOps reproducibility and how to use it without ceremony.

Why Data and Model Versioning Matter in Production ML

In production machine learning, the model is only half the story. The other half is the data it was trained on, the features it used, and the exact environment it ran in. Without versioning, you cannot reproduce a single experiment, audit a model's lineage, or roll back a bad deployment. The cost of this oversight is high: industry surveys regularly report that a large share of ML projects never reach production, and data and model management is a commonly cited reason. Versioning is not a nice-to-have; it is the foundation of MLOps.

Consider a typical scenario: you train a model on a dataset collected in Q1, deploy it in Q2, and by Q3 the data distribution has shifted. Without versioned data, you cannot tell whether the performance drop is due to a code change, a data drift, or a model decay. Versioning allows you to pin every experiment to a specific dataset snapshot, a specific model artifact, and a specific code commit. This is the reproducibility guarantee that separates ad-hoc science from engineering.

Moreover, regulatory compliance increasingly demands traceability. The EU AI Act, for instance, imposes record-keeping, technical documentation, and data governance obligations on high-risk AI systems. In practice, this means you need to be able to answer: which data was used to train this model? Which hyperparameters? Which evaluation metrics? Versioning is the only scalable way to provide these answers.

Versioning also enables collaboration. When multiple data scientists work on the same project, they need to share datasets and models without stepping on each other's toes. Git alone cannot handle large binary files (datasets, model weights). DVC fills this gap by treating data and models as first-class citizens in your version control system, allowing you to track, share, and revert them just like code.

Finally, versioning is critical for continuous training pipelines. When you retrain a model on new data, you need to know exactly which data version triggered the retraining, which model version was produced, and whether the new model is better than the previous one. Without versioning, you are guessing. DVC provides the hooks to automate this, making it a core component of any production ML pipeline.

Reproducibility is not optional
If you cannot reproduce a model's results from scratch, you do not have a model; you have a guess. Versioning is the only way to guarantee reproducibility.
Production Insight
In production, always version your data and models before you even think about deploying. Start with a simple DVC setup; you can always add complexity later. Retrofitting versioning onto a live project costs far more than doing it from day one, because every untracked dataset and model has to be reconstructed after the fact.
Key Takeaway
Data and model versioning is the bedrock of MLOps. It enables reproducibility, auditability, collaboration, and continuous training. Without it, production ML is fragile and non-compliant.
[Figure: DVC Data & Model Versioning Pipeline, from dataset versioning to CI/CD automated retraining. Steps: (1) DVC init and remote: install, init, configure remote storage; (2) dvc add dataset/model: track large files with .dvc files; (3) dvc.yaml pipeline: define stages, dependencies, and outputs; (4) dvc exp run and compare: execute experiments, compare metrics; (5) CI/CD integration: automate retraining with DVC in CI. Common mistake: forgetting to push the DVC cache to the remote; always run dvc push after dvc add to avoid data loss.]

DVC Fundamentals: Git for Data

DVC (Data Version Control) is an open-source tool that extends Git to handle large files and directories. It does not replace Git; it works alongside it. The core idea is simple: instead of storing large binary files directly in Git (which would bloat the repository), DVC stores them in a separate remote storage (like S3, GCS, or a local directory) and keeps a small pointer file in Git. This pointer file is a text file that contains the hash of the data file, which acts as a version identifier.

When you run dvc add data.csv, DVC computes the MD5 hash of the file, moves the file into the DVC cache (by default .dvc/cache), links or copies it back into the workspace, and creates a .dvc file (e.g., data.csv.dvc) that contains the hash. It also adds data.csv to .gitignore so that Git never tracks the data itself. The .dvc file is then committed to Git. When you run git checkout on a different branch, you get the .dvc file for that branch, and dvc checkout then restores the matching data file from the local cache; if that version is not cached yet, dvc pull fetches it from remote storage first. This is the "Git for data" analogy: Git tracks code changes, DVC tracks data changes.
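To make the pointer-and-cache idea concrete, here is a minimal sketch in plain Python. It is not DVC's implementation: the cache layout, the function names (`add_to_cache`, `checkout`), and the sample data are assumptions chosen for illustration. It only shows why a content hash is enough to store two versions of a file and restore either one.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(path: Path, cache_dir: Path) -> str:
    """Copy `path` into the cache under its hash and return the hash (the pointer)."""
    md5 = file_md5(path)
    target = cache_dir / md5[:2] / md5[2:]  # hypothetical layout keyed by hash
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copy2(path, target)
    return md5


def checkout(md5: str, cache_dir: Path, dest: Path) -> None:
    """Restore the content identified by `md5` from the cache into `dest`."""
    shutil.copy2(cache_dir / md5[:2] / md5[2:], dest)


def demo_file_versions() -> tuple[str, str, str]:
    """Track two versions of a file, then roll the workspace back to the first."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "data.csv"

        data.write_text("feature1,label\n1.0,0\n")
        v1 = add_to_cache(data, cache)

        data.write_text("feature1,label\n1.0,0\n3.0,1\n")
        v2 = add_to_cache(data, cache)

        checkout(v1, cache, data)
        return v1, v2, data.read_text()


if __name__ == "__main__":
    v1, v2, restored = demo_file_versions()
    print("versions differ:", v1 != v2)
    print("restored content:", repr(restored))
```

The two hashes play the role of the two committed versions of data.csv.dvc: whichever one Git hands you determines which content gets restored.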

DVC also supports versioning of directories, models, and even entire datasets. The hash is computed recursively for directories, so any change in any file within the directory results in a new hash. This makes it easy to version entire training datasets or model artifacts. The hash is stored in the .dvc file, which is a YAML file that can be diffed and merged like any other text file.
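The directory case can be sketched the same way. The example below hashes a sorted manifest of relative paths and per-file hashes, so editing any nested file changes the directory hash. The manifest format and the name `dir_hash` are assumptions for illustration, not DVC's actual on-disk format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dir_hash(root: Path) -> str:
    """Hash a directory via a sorted manifest of (relative path, file hash) entries."""
    manifest = [
        {
            "relpath": p.relative_to(root).as_posix(),
            "md5": hashlib.md5(p.read_bytes()).hexdigest(),
        }
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def demo_dir_hash() -> tuple[bool, bool]:
    """Return (hash stable when nothing changes, hash changes after one edit)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "data"
        (root / "train").mkdir(parents=True)
        (root / "train" / "a.csv").write_text("1,2\n")
        (root / "train" / "b.csv").write_text("3,4\n")

        before = dir_hash(root)
        unchanged = dir_hash(root) == before

        (root / "train" / "b.csv").write_text("3,5\n")  # edit one nested file
        changed = dir_hash(root) != before
        return unchanged, changed


if __name__ == "__main__":
    print(demo_dir_hash())
```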

One of the key advantages of DVC is that it is language-agnostic and works with any ML framework. You can use it with PyTorch, TensorFlow, scikit-learn, or even custom scripts. It also integrates with cloud storage providers, making it easy to share data across teams. The remote storage can be S3, GCS, Azure Blob, SSH, or even a local network drive.

DVC also provides a pipeline feature (dvc stage add and dvc repro; the older dvc run command was removed in DVC 3.0) that allows you to define and execute ML pipelines with automatic caching of intermediate results. This is a powerful feature for reproducibility and efficiency, but the core versioning functionality is what makes DVC indispensable for production ML.

io/thecodeforge/dvc_fundamentals.py · PYTHON
# A conceptual walkthrough that drives the DVC CLI through subprocess.
# (DVC also ships a Python API, dvc.api, for reading tracked data.)
# Assumes git and dvc are installed and git user.name/user.email are configured.
import os
import subprocess


def run(*cmd):
    """Run a command and raise immediately if it fails, instead of failing silently."""
    subprocess.run(cmd, check=True, capture_output=True, text=True)


# Simulate a project directory
os.makedirs('my_project', exist_ok=True)
os.chdir('my_project')

# Initialize Git and DVC
run('git', 'init')
run('dvc', 'init')

# Create a sample dataset
with open('data.csv', 'w') as f:
    f.write('feature1,feature2,label\n1.0,2.0,0\n3.0,4.0,1\n')

# Add the dataset to DVC (writes data.csv.dvc and adds data.csv to .gitignore)
run('dvc', 'add', 'data.csv')

# Commit the .dvc pointer file to Git
run('git', 'add', 'data.csv.dvc', '.gitignore')
run('git', 'commit', '-m', 'add dataset v1')

# Now, modify the dataset
with open('data.csv', 'a') as f:
    f.write('5.0,6.0,0\n')

# Add the new version; only the hash recorded in data.csv.dvc changes
run('dvc', 'add', 'data.csv')
run('git', 'add', 'data.csv.dvc')
run('git', 'commit', '-m', 'add dataset v2')

print('DVC workflow completed. Two versions of data.csv are tracked.')
Output
DVC workflow completed. Two versions of data.csv are tracked.
DVC cache is your local data store
The DVC cache (.dvc/cache) stores all versions of your data locally. This allows you to switch between versions without re-downloading from remote, as long as the data is in cache.
Production Insight
Always configure a remote storage (S3, GCS) for your DVC cache. Local cache is fine for development, but in production, you need a shared remote so that all team members and CI/CD pipelines can access the same data versions.
Key Takeaway
DVC stores large files outside Git, using pointer files with hashes. It works with Git, not against it. The cache and remote storage enable efficient versioning and sharing of data and models.

Setting Up DVC: Installation, Initialization, and Remote Storage

Setting up DVC is straightforward. First, install it via pip: pip install dvc. For cloud storage support, install the appropriate extra, e.g., pip install dvc[s3] for AWS S3, pip install dvc[gs] for Google Cloud Storage, or pip install dvc[azure] for Azure Blob Storage. For local development, the base installation is sufficient.

Once installed, navigate to your project directory (an existing Git repository) and run dvc init. This creates a .dvc directory that holds the configuration and, later, the cache, along with a .dvcignore file. The cache is kept out of Git by .dvc/.gitignore. Commit the initialization to Git: git add .dvc .dvcignore && git commit -m "initialize DVC".

Next, configure a remote storage. This is where DVC will push and pull data. For example, to add an S3 bucket as a remote: dvc remote add myremote s3://my-bucket/dvc-store. You can also set a default remote: dvc remote default myremote. For local testing, you can use a local directory: dvc remote add myremote /tmp/dvc-remote. This is useful for development but not for production.

After adding the remote, push your data to it with dvc push. This uploads the cached files tracked by DVC to the remote. Conversely, dvc pull downloads data from the remote into your local cache and checks it out into the workspace. This is how you share data across team members or CI/CD pipelines.

DVC also supports multiple remotes, which is useful for hybrid cloud setups or for separating training data from model artifacts. You can configure credentials via environment variables (e.g., AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS) or via DVC's local, Git-ignored configuration file (.dvc/config.local, written by dvc remote modify --local). Do not put secrets in .dvc/config, because that file is committed to Git. For production, always use environment variables or a secrets manager, never hardcoded credentials.
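A small preflight check can catch missing credentials before a push or pull fails halfway through a job. The sketch below is an assumption about how you might wire that up: the helper names (`missing_credentials`, `preflight`) are hypothetical, and the variable list assumes an S3 remote authenticated with static keys rather than an IAM role.

```python
import os

# Environment variables an S3 remote typically needs when static keys are used.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    source = os.environ if env is None else env
    return [name for name in required if not source.get(name)]


def preflight(env=None) -> str:
    """Summarize whether the environment is ready for dvc push / dvc pull."""
    missing = missing_credentials(env)
    if missing:
        return "missing: " + ", ".join(missing)
    return "ok"


if __name__ == "__main__":
    # Checks the real process environment; the values themselves are never printed.
    print(preflight())
```

Passing a dictionary instead of reading os.environ keeps the check easy to test and avoids touching real secrets in unit tests.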

Finally, verify your setup. Running dvc status compares the workspace against the hashes recorded in your .dvc files and dvc.lock, and dvc status --cloud compares the local cache against the default remote. A clean result from both means everything is in order. You are now ready to version data and models.

io/thecodeforge/dvc_setup.sh · BASH
# Install DVC with S3 support
pip install "dvc[s3]"   # quoted so shells such as zsh do not expand the brackets

# Initialize DVC in your project
dvc init

# Add a remote storage (S3 bucket)
dvc remote add myremote s3://my-ml-bucket/dvc-store

# Set default remote
dvc remote default myremote

# Push existing data to remote
dvc push

# Verify setup
dvc status
Output
Data and pipelines are up to date.
Use a dedicated remote for each project
Avoid sharing a single remote across multiple projects. Use separate paths or buckets to prevent accidental overwrites and to simplify access control.
Production Insight
In production, always use a remote that supports versioning (e.g., S3 with versioning enabled). This provides an additional safety net against accidental data loss. Also, configure lifecycle policies to clean up old versions if needed.
Key Takeaway
DVC setup is quick: install, init, add remote, push. Use cloud storage for production, local for development. Always verify with dvc status.

Versioning Datasets and Models with `dvc add` and `dvc checkout`

The core workflow for versioning data and models with DVC revolves around two commands: dvc add and dvc checkout. dvc add registers a file or directory with DVC, computes its hash, moves it to the cache, and creates a .dvc file. This .dvc file is a lightweight pointer that you commit to Git. dvc checkout restores the file from the cache based on the hash in the .dvc file.

To version a dataset, simply run dvc add data/train.csv. This creates data/train.csv.dvc. You then commit this .dvc file to Git. When you want to switch to a different version of the dataset, you check out the corresponding Git commit (which contains the old .dvc file) and then run dvc checkout. DVC restores the correct version from the local cache; if that version is not cached, run dvc pull to fetch it from the remote.

For models, the process is identical. After training, you can run dvc add models/model.pkl to version the model artifact. This is especially useful for tracking which model was deployed in production. You can also version entire directories, such as data/ or models/, with a single dvc add command.

One common pitfall is forgetting to run dvc checkout after switching branches. This leaves your working directory with stale data. To avoid this, run dvc install once per clone; it sets up Git hooks, including a post-checkout hook that runs dvc checkout for you. Otherwise, make it a habit to run dvc checkout after every git checkout. Note that dvc update is not a general refresh command: it only applies to data brought in with dvc import or dvc import-url. To get data that a teammate has pushed, git pull their commit and then run dvc pull.
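The stale-data problem is easy to detect mechanically: hash the file in the workspace and compare it with the hash recorded in the pointer. The sketch below illustrates that check in plain Python. The function `workspace_state` and its three labels are hypothetical, not DVC output.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a (small) file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def workspace_state(pointer_md5: str, data_path: Path) -> str:
    """Classify the workspace file relative to the hash recorded in the pointer."""
    if not data_path.exists():
        return "missing"
    return "in sync" if md5_of(data_path) == pointer_md5 else "stale"


def demo_stale_check() -> list[str]:
    """Simulate a branch switch that changes the pointer but not the data."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"
        data.write_text("a,b\n1,2\n")
        pointer_on_main = md5_of(data)

        # Pretend `git checkout` swapped in a pointer that records other content.
        pointer_on_branch = hashlib.md5(b"a,b\n9,9\n").hexdigest()

        states = [
            workspace_state(pointer_on_main, data),    # data matches pointer
            workspace_state(pointer_on_branch, data),  # pointer moved, data did not
        ]
        data.unlink()
        states.append(workspace_state(pointer_on_branch, data))
        return states


if __name__ == "__main__":
    print(demo_stale_check())
```

The middle case is the dangerous one: nothing fails, the file is simply the wrong version until dvc checkout runs.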

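DVC also supports dvc diff, which compares two Git revisions (or a revision and the workspace) and lists the tracked files that were added, modified, or deleted. It does not show line-level content changes; to see how a pointer's recorded hash changed, use git diff on the .dvc file. Together these are invaluable for auditing what changed between model versions.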

For large datasets, dvc add can be slow because it hashes every file and transfers it into the cache. You can speed this up by placing the cache on a fast filesystem (e.g., an SSD, configured with dvc cache dir) or by using dvc add --no-commit to skip the cache transfer and running dvc commit later. However, for production, always use the default behavior to ensure consistency.

io/thecodeforge/dvc_versioning.sh · BASH
# Version a dataset
dvc add data/train.csv

# Commit the pointer file to Git
git add data/train.csv.dvc .gitignore
git commit -m "add training dataset v1"

# Later, switch to a different version
git checkout <commit-hash>
dvc checkout

# Version a model after training
dvc add models/model.pkl
git add models/model.pkl.dvc
git commit -m "add model v1"

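# Compare two versions
dvc diff HEAD~1 HEAD                         # lists tracked files added, modified, or deleted
git diff HEAD~1 HEAD -- data/train.csv.dvc   # shows how the recorded hash changed
Output (illustrative git diff; hashes shortened)
diff --git a/data/train.csv.dvc b/data/train.csv.dvc
index abc123..def456 100644
--- a/data/train.csv.dvc
+++ b/data/train.csv.dvc
@@ -1,4 +1,4 @@
 outs:
-- md5: 0987654321fedcba
+- md5: 1234567890abcdef
   hash: md5
   path: train.csv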
Always run `dvc checkout` after `git checkout`
Git only tracks the pointer file, not the actual data. If you forget dvc checkout, your working directory will have stale or missing data, leading to silent errors.
Production Insight
Automate data retrieval in your CI/CD pipeline. For example, in a GitHub Actions workflow, add a step that runs dvc pull (which fetches from the remote and then checks out into the workspace) after checking out the code. This ensures that every pipeline run uses the correct data and model versions.
Key Takeaway
Use dvc add to version datasets and models, creating pointer files for Git. Use dvc checkout to restore specific versions. Always run checkout after switching branches. Automate this in CI/CD.

Building Reproducible ML Pipelines with `dvc.yaml` and `dvc repro`

DVC pipelines are the foundation of reproducible ML workflows. A dvc.yaml file defines stages that together form a directed acyclic graph (DAG); each stage declares its dependencies (code, data, config) and its outputs (metrics, models, plots). Unlike Makefiles, which compare timestamps, or plain shell scripts, DVC records a content hash for every dependency, so a change in any input invalidates the stage that consumes it and the stages downstream of it. The dvc repro command then executes only the affected stages and reuses cached results for the rest. This is critical when your pipeline has 50+ stages and training takes hours.

For example, a train stage might depend on src/train.py, data/processed.csv, and a parameter in params.yaml, and output models/model.pkl and metrics/accuracy.json. If you tweak a hyperparameter in params.yaml, dvc repro re-runs the training stage and anything that depends on its outputs, but not data preprocessing or feature engineering. This granular caching can cut iteration time from hours to minutes.

Under the hood, DVC keeps all artifacts in a content-addressable cache, so identical content is stored once and any recorded output can be restored by its hash. Getting the same output from the same inputs still requires deterministic code. For non-deterministic steps (e.g., GPU ops with floating-point variance), pin library versions and seed random generators.

The pipeline itself is defined in dvc.yaml; dvc.lock is the manifest that records the hashes of every input and output from the last run. Commit both dvc.yaml and dvc.lock to Git to freeze the exact state of every run. Anyone can then git clone your repo, run dvc pull to fetch the data, and run dvc repro to rebuild the same model.
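The invalidation rule described above can be sketched in a few lines of Python. This is a simplified model, not DVC's scheduler: the stage table, the hash values, and the function `stages_to_run` are illustrative assumptions, and the sketch treats every re-run stage as producing changed outputs, whereas DVC compares the actual output hashes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-stage pipeline mirroring a prepare/train/evaluate layout.
STAGES = {
    "prepare": {"deps": ["data/raw.csv", "src/prepare.py"],
                "outs": ["data/processed.csv"]},
    "train": {"deps": ["data/processed.csv", "src/train.py", "params.yaml"],
              "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["models/model.pkl", "src/evaluate.py"],
                 "outs": ["metrics/accuracy.json"]},
}


def stages_to_run(stages, locked, current):
    """Return, in execution order, the stages whose inputs changed directly
    or whose upstream stage has to be re-run."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # For each stage, the set of stages it depends on.
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    dirty = []
    for name in TopologicalSorter(graph).static_order():
        external = [d for d in stages[name]["deps"] if d not in producer]
        changed_dep = any(current.get(d) != locked.get(d) for d in external)
        upstream_dirty = any(up in dirty for up in graph[name])
        if changed_dep or upstream_dirty:
            dirty.append(name)
    return dirty


# Hashes recorded at the last run (the role dvc.lock plays), shortened.
LOCKED = {"data/raw.csv": "a1", "src/prepare.py": "b1", "src/train.py": "c1",
          "params.yaml": "d1", "src/evaluate.py": "e1"}

if __name__ == "__main__":
    tweaked = dict(LOCKED, **{"params.yaml": "d2"})  # hyperparameter edited
    print(stages_to_run(STAGES, LOCKED, tweaked))
```

Editing params.yaml marks train and evaluate for re-execution while prepare is skipped, which is the behaviour the paragraph above describes.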

io/thecodeforge/dvc.yaml · YAML
# dvc.yaml (pipeline definition)
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/processed.csv
  train:
    # ${train.lr} is filled in from params.yaml, so the command and the
    # tracked parameter can never disagree
    cmd: python src/train.py --lr ${train.lr}
    deps:
      - data/processed.csv
      - src/train.py
    params:
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics/accuracy.json:
          cache: false

# params.yaml
# train:
#   lr: 0.001

# Run: dvc repro train
# With nothing changed, DVC skips both stages.
# If params.yaml changes train.lr to 0.01, dvc repro re-runs train.
Output
Stage 'prepare' didn't change, skipping
Stage 'train' didn't change, skipping
Data and pipelines are up to date.
To force re-run: dvc repro --force train
Pipeline DAG vs. Script Order
DVC automatically resolves the execution order from the DAG. You don't need to specify stage order manually—just define dependencies and outputs.
Production Insight
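Use dvc.lock in your CI pipeline to verify that no unintended changes occurred: after dvc pull, run dvc status and fail the job if it reports changed stages. If a data scientist modifies a script without re-running the pipeline and committing the updated dvc.lock, CI then fails. This prevents silent drift.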
Key Takeaway
dvc.yaml + dvc.lock = executable, versioned pipeline definition. dvc repro ensures only changed stages run, saving hours of compute. Commit both files to Git for full reproducibility.

Running and Comparing Experiments with `dvc exp`

DVC experiments (dvc exp) provide a lightweight way to run and compare multiple model iterations without polluting your Git history with dozens of branches. Each experiment is a snapshot of the workspace: code, data, hyperparameters, and metrics. You can run an experiment with dvc exp run --set-param train.lr=0.01, which executes the pipeline in dvc.yaml with the overridden parameter. DVC stores the result as a hidden Git reference rather than as a branch or a regular commit, and dvc exp show lists it. That command displays a table of metrics (e.g., accuracy, loss) and parameters across experiments, making it easy to spot the best performer.

Several Git-like operations are available: dvc exp diff shows changes in metrics and params between experiments, dvc exp apply <exp_name> restores an experiment's results into the workspace, and dvc exp branch <exp_name> <branch_name> turns an experiment into a regular Git branch. For hyperparameter sweeps, dvc exp run --queue adds runs to a queue instead of executing them immediately; the queued runs can then be executed, several at a time, on the local machine. Search libraries such as Optuna or Hyperopt can generate the parameter sets you queue.

The key advantage over manual tracking is that every experiment is reproducible: create a branch from the experiment, git checkout that branch, and run dvc repro to rebuild the model. This is invaluable when you need to revisit a model from three months ago for an audit or to debug a production issue. For long-running training jobs where you want to monitor convergence without waiting for completion, log metrics during training with the companion DVCLive library.
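Once the experiment table is exported, model selection is ordinary data handling. The sketch below picks the best run and computes deltas against a baseline. The rows reuse the illustrative numbers from this article's own example table; the list-of-dicts shape and the helper names are assumptions, not the format DVC emits.

```python
# Rows shaped like a simplified experiment table (values are illustrative).
EXPERIMENTS = [
    {"name": "workspace", "accuracy": 0.92, "loss": 0.35, "train.lr": 0.001},
    {"name": "exp_lr_001", "accuracy": 0.94, "loss": 0.28, "train.lr": 0.01},
    {"name": "exp_bs_64", "accuracy": 0.91, "loss": 0.38, "train.lr": 0.001},
]


def best_experiment(rows, metric="accuracy", higher_is_better=True):
    """Return the row with the best value for `metric`."""
    pick = max if higher_is_better else min
    return pick(rows, key=lambda row: row[metric])


def metric_deltas(rows, baseline_name, metric):
    """Return {experiment name: metric minus baseline metric} for non-baseline rows."""
    baseline = next(r for r in rows if r["name"] == baseline_name)[metric]
    return {
        r["name"]: round(r[metric] - baseline, 4)
        for r in rows
        if r["name"] != baseline_name
    }


if __name__ == "__main__":
    print(best_experiment(EXPERIMENTS)["name"])
    print(metric_deltas(EXPERIMENTS, "workspace", "accuracy"))
```

Keeping the selection rule in code, rather than eyeballing the table, makes the promotion decision repeatable and reviewable.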

io/thecodeforge/dvc_exp_example.sh · BASH
# Run an experiment with a different learning rate
dvc exp run --set-param train.lr=0.01 --name exp_lr_001

# Run another with different batch size
dvc exp run --set-param train.batch_size=64 --name exp_bs_64

# Compare all experiments
dvc exp show
# Output:
# ─────────────────────────────────────────────────────────
# Experiment                accuracy   loss   train.lr
# ─────────────────────────────────────────────────────────
# workspace                 0.92       0.35   0.001
# exp_lr_001                0.94       0.28   0.01
# exp_bs_64                 0.91       0.38   0.001
# ─────────────────────────────────────────────────────────

# Apply the best experiment
dvc exp apply exp_lr_001
Output
Changes for experiment 'exp_lr_001' have been applied to your current workspace.
Experiment Naming Convention
Use descriptive names like lr_0.01_bs_64 instead of auto-generated hashes. This makes dvc exp show readable and helps during model selection.
Production Insight
In production, never apply an experiment directly to the main branch. Always create a Git branch from the experiment, run a full CI pipeline, and then merge. This prevents accidental overwrites of the production model.
Key Takeaway
dvc exp run creates isolated, reproducible experiments. Use dvc exp show to compare metrics across runs. Promote the best experiment with dvc exp apply. This replaces messy Git branching for hyperparameter tuning.

Integrating DVC into CI/CD for Automated Retraining

Automated retraining is a core MLOps practice, and DVC fits naturally into CI/CD pipelines. The typical flow: a CI trigger (e.g., new data pushed to S3, or a scheduled cron job) runs a job that pulls the latest data with dvc pull, executes dvc repro to retrain the model, and pushes the updated model and metrics back to the DVC remote with dvc push. This keeps the production model trained on the freshest data. For example, in GitHub Actions you can define a workflow that runs daily: dvc pull from a cloud remote (S3, GCS, or Azure Blob), dvc repro to retrain, and dvc push to upload the new model.

The key is to treat the committed dvc.lock as the source of truth. A successful run updates dvc.lock, and the job must commit that file for the new model to become the recorded version; if the pipeline fails, the committed lock file is unchanged and the old model stays in production.

You can also integrate model validation: after dvc repro, run a check that the new model's accuracy meets a threshold (e.g., 0.95). If it does not, the CI job fails, preventing a degraded model from being deployed. The dvc metrics diff command compares the new metrics against those of the previous commit, which lets the job reject or roll back a run whose performance dropped.

For multi-branch workflows, you can use DVC experiments in CI: run dvc exp run on a feature branch, compare metrics with the main branch using dvc metrics diff main, and only merge if the new model is better. This combines well with a model registry such as MLflow. Because the data, code, and model are all tracked, you can always trace a production model back to the exact training run.
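The validation gate can live in a small Python script instead of shell arithmetic, which makes the rule easier to test. The sketch below is one possible shape, under stated assumptions: the metrics file is JSON with an `accuracy` key, the previous accuracy is supplied by the caller, and the function name `validate` and the 0.95 threshold are illustrative.

```python
import json
import tempfile
from pathlib import Path


def validate(metrics_file: Path, previous_accuracy: float,
             threshold: float = 0.95) -> tuple[bool, str]:
    """Accept the new model only if it meets the threshold and does not regress."""
    accuracy = float(json.loads(metrics_file.read_text())["accuracy"])
    if accuracy < threshold:
        return False, f"accuracy {accuracy:.2f} below threshold {threshold:.2f}"
    if accuracy < previous_accuracy:
        return False, f"accuracy {accuracy:.2f} regressed from {previous_accuracy:.2f}"
    return True, f"accuracy {accuracy:.2f} accepted"


def demo_gate() -> list[tuple[bool, str]]:
    """Run the gate against three hypothetical (new, previous) accuracy pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        metrics = Path(tmp) / "accuracy.json"
        results = []
        for new_acc, prev_acc in [(0.96, 0.94), (0.93, 0.94), (0.96, 0.97)]:
            metrics.write_text(json.dumps({"accuracy": new_acc}))
            results.append(validate(metrics, prev_acc))
        return results


if __name__ == "__main__":
    for ok, message in demo_gate():
        print("PASS" if ok else "FAIL", message)
```

In a CI job, the script would exit with a non-zero status whenever the first element of the result is False, so the push step never runs for a rejected model.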

io/thecodeforge/ci_cd_dvc.yml · YAML
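# .github/workflows/retrain.yml
name: Daily Retrain
on:
  schedule:
    - cron: '0 6 * * *'  # daily at 06:00 UTC
  workflow_dispatch:

jobs:
  retrain:
    runs-on: ubuntu-latest
    env:
      # Credentials come from repository secrets, never from the repo itself
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Install DVC
        run: pip install "dvc[s3]"
      - name: Pull latest data
        run: dvc pull data/raw.dvc
      - name: Reproduce pipeline
        run: dvc repro
      - name: Validate model
        run: |
          # Read the metric straight from the file written by the pipeline
          ACC=$(jq -r '.accuracy' metrics/accuracy.json)
          if (( $(echo "$ACC < 0.95" | bc -l) )); then
            echo "Model accuracy $ACC below threshold 0.95"
            exit 1
          fi
          echo "Model accuracy $ACC above threshold 0.95. Pipeline succeeded."
      - name: Push new model
        run: dvc push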
Output
Model accuracy 0.96 above threshold 0.95. Pipeline succeeded.
Data Pull Security
Never hardcode cloud credentials in CI. Use secrets (e.g., AWS_ACCESS_KEY_ID) and configure DVC remotes with environment variables.
Production Insight
Add a manual approval step before deploying the new model to production. Even with automated retraining, a human-in-the-loop prevents catastrophic failures from data drift or silent bugs.
Key Takeaway
DVC + CI/CD enables automated retraining on schedule or data triggers. Use dvc metrics diff for validation and dvc push to update the model artifact. Always validate model quality before deployment.

Production Debugging: Common Pitfalls and How to Avoid Them

Even with DVC, production ML systems fail. The pitfalls below account for most incidents.

Data drift. The distribution of input data changes over time, causing model performance to degrade. dvc data status reports which tracked files changed relative to the last commit and the cache, but it does not measure drift. Combine DVC with monitoring tools like Evidently or WhyLabs.

Cache corruption. If the DVC cache directory (.dvc/cache) is damaged by disk errors or concurrent writes, dvc checkout and dvc repro can end up working with bad data. Keep a remote (e.g., S3) as the source of truth so that a damaged cache can be deleted and rebuilt with dvc pull, and enable the remote's verify option (dvc remote modify <name> verify true) so that downloaded files are re-hashed.

Environment drift. The Python environment or system libraries change between training and inference. Pin dependencies in requirements.txt or a Conda environment file and list that file as a dependency of the relevant stages in dvc.yaml, so that a changed environment invalidates them. dvc freeze can lock a stage to prevent accidental re-runs, but it is rarely needed; relying on dvc.lock is usually enough.

Large file handling. DVC links files from the cache into the workspace. By default it tries reflinks and falls back to copying, which duplicates every large file on filesystems without reflink support. Set cache.type to hardlink or symlink in .dvc/config to avoid the copies. Hardlinks only work within a single filesystem, so prefer symlink when the cache lives on another volume or on a network filesystem (NFS).

Non-determinism. The most insidious bug is non-reproducibility caused by non-deterministic operations (e.g., GPU ops, unseeded random number generators). Always call random.seed(42) and torch.manual_seed(42), or the equivalent for your framework, and record the seed as a parameter in params.yaml. DVC cannot fix non-determinism; it can only version what goes in and what comes out. If you suspect it, commit the results of one run, re-run the same commit with dvc repro --force, and compare with dvc metrics diff. If the metrics differ, you have a non-deterministic step that needs fixing.
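The run-twice-and-compare check for non-determinism can be rehearsed without a real model. In the sketch below, `train` is a hypothetical stand-in for a training run whose result depends on random draws; the point is only that a fixed seed makes repeated runs identical.

```python
import random


def train(seed=None) -> float:
    """Stand-in for a training run: returns a pseudo-accuracy driven by random draws.

    With seed=None the generator is seeded from system entropy, so repeated
    runs will generally disagree.
    """
    rng = random.Random(seed)
    noise = sum(rng.random() for _ in range(1000)) / 1000
    return round(0.9 + 0.05 * noise, 6)


def is_reproducible(seed, runs: int = 3) -> bool:
    """Return True if `runs` repeated trainings with the same seed give one result."""
    results = {train(seed) for _ in range(runs)}
    return len(results) == 1


if __name__ == "__main__":
    print("seeded runs identical:", is_reproducible(42))
    print("unseeded runs identical:", is_reproducible(None))
```

The same pattern applies to a real pipeline: repeat the run on one commit, collect the metric, and treat more than one distinct value as a bug to track down.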

io/thecodeforge/debug_non_deterministic.py · PYTHON
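# Non-deterministic training (bad): no seed is set, so results differ on every run.
import random

import numpy as np

# Fixed version (good): seed every random number generator the training code uses.
SEED = 42  # mirror this value in params.yaml so DVC tracks it
random.seed(SEED)
np.random.seed(SEED)
# If you train with PyTorch, also call torch.manual_seed(SEED).

# In params.yaml:
# seed: 42

# In dvc.yaml (under the train stage):
# params:
#   - seed

# Run twice and compare:
# dvc repro && git commit -am "baseline run"
# dvc repro --force && dvc metrics diff --all
# If metrics differ, check for unseeded randomness.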
Output
Path          Metric    Old   New   Change
metrics.json  accuracy  0.92  0.92  0.00    # Good: deterministic
DVC is a Version Tracker, Not a Debugger
DVC ensures you can reproduce a run, but it doesn't tell you why the model failed. Use DVC to isolate the change, then use traditional debugging tools (logging, profiling) to find the root cause.
Production Insight
Set up a monitoring dashboard that tracks DVC metrics over time. If accuracy drops by more than 5% compared to the previous commit, trigger an alert. This catches data drift and model degradation early.
Key Takeaway
Common DVC pitfalls: data drift, cache corruption, environment drift, non-determinism. Mitigate with monitoring, a remote you can re-pull from, pinned dependencies, and seeded randomness. DVC versions artifacts; it does not explain failures, so debug the code, not the tool.
Production incident · POST-MORTEM · severity: high

The Silent Data Drift: How a Missing DVC Remote Caused a Production Model to Fail

Symptom
Production model accuracy dropped from 92% to 72% over two weeks, but no code changes were made.
Assumption
The team assumed that dvc pull always fetches the latest data from the remote, but they had not set up a remote storage at all—they were using local cache only.
Root cause
The DVC remote was never configured. Each machine had only its own local cache, which is not shared. When a new data scientist joined, dvc pull had no remote to fetch from, so they ended up with an older copy of the dataset instead of the latest one. The model was retrained on that stale data.
Fix
1) Configured an S3 bucket as the default remote: dvc remote add -d prod s3://ml-bucket/dvc. 2) Ran dvc push to upload the latest data. 3) Updated CI/CD to always run dvc pull from the remote before training. 4) Added a validation step to compare data hash between training and production.
Key lesson
  • Always configure a remote storage for DVC in team projects; local cache is not shared.
  • Validate that dvc pull fetches the expected data by checking the hash in dvc.lock.
  • Automate data version checks in CI/CD to catch drift early.
Production debug guide
Common issues and actions to resolve them quickly.
Symptom · 01
dvc pull fails with 'cache not found'
Fix
Check remote storage URL and credentials. Run dvc remote list and verify access. Use dvc cache dir to ensure local cache path is correct.
Symptom · 02
Pipeline re-runs all stages unexpectedly
Fix
Check dvc.lock for changed hashes. Run dvc status to see which dependencies changed. Ensure all dependencies are listed in dvc.yaml.
Symptom · 03
dvc repro produces different results on different machines
Fix
Verify that all dependencies (code, data, configs) are versioned. Check for non-deterministic operations (e.g., random seed not set). Use dvc diff to compare pipeline states.
Symptom · 04
Disk space full due to DVC cache
Fix
Run dvc gc -w to remove unused cache files. Consider moving cache to a larger volume or using a cloud cache. Set up automatic cleanup with cron.
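Garbage collection boils down to a set difference: anything in the cache that no pointer references can go. The sketch below models that with a flat directory of files named by hash. The layout and the function `collect_garbage` are assumptions for illustration; real cleanup should go through dvc gc.

```python
import tempfile
from pathlib import Path


def collect_garbage(cache_dir: Path, referenced: set[str],
                    dry_run: bool = True) -> tuple[list[str], int]:
    """Find cache entries that nothing references and report the bytes they hold.

    Entries are deleted only when dry_run is False.
    """
    removed, freed = [], 0
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_file() and entry.name not in referenced:
            removed.append(entry.name)
            freed += entry.stat().st_size
            if not dry_run:
                entry.unlink()
    return removed, freed


def demo_gc():
    """Build a tiny fake cache, preview the cleanup, then perform it."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        (cache / "aaa").write_bytes(b"x" * 10)
        (cache / "bbb").write_bytes(b"x" * 20)  # no pointer references this one
        (cache / "ccc").write_bytes(b"x" * 30)
        referenced = {"aaa", "ccc"}

        preview = collect_garbage(cache, referenced, dry_run=True)
        count_after_preview = len(list(cache.iterdir()))

        collect_garbage(cache, referenced, dry_run=False)
        remaining = sorted(p.name for p in cache.iterdir())
        return preview, count_after_preview, remaining


if __name__ == "__main__":
    print(demo_gc())
```

Defaulting to a dry run is deliberate: deleting cache entries is only safe when the referenced set is complete, which is also why dvc gc asks you to state its scope explicitly.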
DVC Quick Debug Cheat Sheet
Immediate actions for the most common DVC production issues.
Data not found on pull
Immediate action
Check remote config
Commands
dvc remote list
dvc config --list
Fix now
Re-add remote: dvc remote add -d myremote s3://bucket/path
Pipeline not re-running changed stage
Immediate action
Check dependency hashes
Commands
dvc status
grep md5 dvc.lock
Fix now
Force repro: dvc repro --force <stage>
Cache corruption
Immediate action
Verify cache integrity
Commands
dvc cache dir
dvc status -c
Fix now
After confirming the data exists on the remote, clear the cache and re-pull: rm -rf .dvc/cache && dvc pull
DVC vs. Alternatives for Data and Model Versioning
| Feature | DVC | Git LFS | MLflow | Weights & Biases |
| --- | --- | --- | --- | --- |
| Data versioning | Yes, with remote storage | Yes, but Git-centric | No (only model artifacts) | No (only model artifacts) |
| Pipeline orchestration | Yes, with caching | No | No | No |
| Experiment tracking | Yes, built-in | No | Yes, rich UI | Yes, rich UI |
| Model registry | No (can integrate) | No | Yes | Yes |
| Remote storage flexibility | S3, GCS, SSH, Azure, etc. | Git server only | S3, GCS, etc. (for artifacts) | S3, GCS, etc. (for artifacts) |

Key takeaways

1. DVC decouples large data/model files from Git by storing content-addressable pointers in Git and actual data in remote storage.
2. DVC pipelines enable reproducible ML workflows with automatic caching and incremental re-execution based on dependency changes.
3. Experiments with DVC allow you to run, track, and compare multiple training runs with different hyperparameters and data versions.
4. Remote storage configuration is critical for team collaboration and production deployments; always use versioned buckets.
5. Integrating DVC into CI/CD pipelines enables automated model retraining and deployment with full traceability.

Common mistakes to avoid

4 patterns
×

Not configuring remote storage properly

Symptom
Team members cannot pull data; dvc pull fails with authentication errors or missing files.
Fix
Set up remote storage with dvc remote add -d myremote s3://mybucket/dvcstore and ensure credentials are configured via environment variables or IAM roles.
×

Ignoring DVC cache size and cleanup

Symptom
Disk space fills up quickly; dvc gc takes too long or removes needed files.
Fix
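Regularly run dvc gc -w to remove cache files the current workspace does not reference, after dvc push, so that data needed by other branches or commits can still be pulled from the remote. Use a separate cache directory on a large volume and set up cron jobs for cleanup.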
×

Not versioning pipeline dependencies correctly

Symptom
Pipeline re-runs all stages even when only code changed; or doesn't re-run when data changes.
Fix
Explicitly list all dependencies in dvc.yaml (code files, data files, configs). Use dvc repro to automatically detect changes and re-run only affected stages.
×

Using DVC without Git commits

Symptom
DVC pointer files (.dvc) are not tracked in Git; team members cannot reproduce the exact state.
Fix
Always commit .dvc files, dvc.yaml, and dvc.lock to Git. Run git add and git commit after dvc add or dvc repro to ensure traceability. (The older dvc run command has been replaced by dvc stage add plus dvc repro.)
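For the cache-size mistake above, measuring the cache before cleaning helps decide when cleanup is actually needed. The sketch below is a hypothetical helper, not part of DVC: it sums file sizes under a cache directory (the path printed by dvc cache dir) and compares the total against a budget chosen by the team.

```python
import os


def dir_size_bytes(root):
    """Total size in bytes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def cache_over_budget(cache_dir, budget_bytes):
    """Return True when the cache directory exceeds the given byte budget."""
    return dir_size_bytes(cache_dir) > budget_bytes
```

A scheduled job could call `cache_over_budget` and only then trigger dvc push followed by dvc gc, rather than running garbage collection unconditionally.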
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain how DVC ensures reproducibility in ML pipelines.
Q02 · SENIOR
How would you integrate DVC into a CI/CD pipeline for automated model re...
Q03 · JUNIOR
What is the difference between `dvc add` and `dvc stage add` (formerly `dvc run`)?
Q01 of 03 · SENIOR

Explain how DVC ensures reproducibility in ML pipelines.

ANSWER
DVC ensures reproducibility by versioning data, code, and model artifacts together. It stores content-addressable hashes of data files in Git (via .dvc files) and the actual data in a cache/remote. Pipelines defined in dvc.yaml specify dependencies and outputs; dvc repro checks hashes of all dependencies and re-runs only stages where inputs changed. Given the same Git commit, and provided the data was pushed to the remote, you can restore the exact dataset and model artifacts. Bit-identical retraining additionally requires deterministic code, such as fixed seeds and pinned library versions.
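The hash-based re-execution described in the answer can be illustrated with a toy model. This is a simplified sketch of the idea, not DVC's implementation: stage definitions, file contents, and the lock are plain dictionaries, all names are hypothetical, and any stage downstream of a re-run stage is conservatively marked for re-run as well.

```python
import hashlib


def md5_of(data):
    """Hex md5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def stages_to_rerun(stages, files, lock):
    """Return stage names whose dependencies differ from the lock.

    stages: dict of name -> {"deps": [...], "outs": [...]}, in pipeline order
    files:  dict of path -> current file content as bytes
    lock:   dict of name -> {dep_path: md5 recorded at the last run}
    """
    dirty_outs = set()
    rerun = []
    for name, spec in stages.items():
        recorded = lock.get(name, {})
        changed = False
        for dep in spec["deps"]:
            if dep in dirty_outs:
                # Produced by a stage that will re-run, so treat as changed.
                changed = True
            elif dep not in files or md5_of(files[dep]) != recorded.get(dep):
                changed = True
        if changed:
            rerun.append(name)
            dirty_outs.update(spec["outs"])
    return rerun
```

With a prepare stage feeding a train stage, editing only the training script marks just train for re-run, while changing the raw data marks both.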
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
How does DVC differ from Git LFS?
02
Can I use DVC without a remote storage?
03
How do I roll back to a previous model version with DVC?
04
Does DVC support experiment tracking like MLflow?