Missing tzinfo in Feature Stores — 2 Weeks of Silent Skew
Model predictions drifted nightly because of a 7-hour timezone offset between the Spark and Python pipelines.
- A feature store is a centralised system that stores, manages, and serves ML features for both training and serving.
- Dual-store architecture: offline store (batch, historical) and online store (low-latency, real-time).
- Point-in-time correctness ensures training data doesn't leak future information.
- Feast (open-source) vs Tecton (managed) differ in serving latency and cost.
- Biggest production mistake: ignoring timezone handling across offline and online pipelines.
- Materialisation latency is the single most critical operational metric; monitor it like CPU.
Imagine your school cafeteria prepares chopped vegetables every morning and stores them in labeled containers so every chef can grab exactly what they need without re-chopping the same carrots ten times. A feature store is that prep kitchen for machine learning — it pre-computes the derived facts about your data (like 'how many purchases did this user make in the last 7 days?') and stores them so every model, every team, and every experiment can reuse the same trusted numbers instantly. Without it, every data scientist re-chops the same carrots differently, and your models quietly disagree about what 'last 7 days' even means.
Every ML team eventually hits the same wall. You have ten models in production, each computing 'user average order value' slightly differently — one uses a 30-day window, one uses 28, one forgot to exclude refunds. The numbers diverge silently. A model that aced staging starts misbehaving in production because the training pipeline computed features one way and the serving pipeline computed them another. Nobody notices until revenue drops. Feature stores exist to break this cycle, and by 2026 they're no longer optional infrastructure — they're the foundation of any ML platform serious about reliability at scale.
The core problem feature stores solve is deceptively simple to state but brutally hard to fix without them: the same feature must be computed identically at training time and at serving time, across every team that uses it, forever. This is called training-serving skew, and it silently corrupts model performance more often than bad algorithms do. Alongside skew, you have the duplication problem — ten teams writing ten slightly-different Spark jobs to compute the same customer lifetime value feature — and the discovery problem, where a new data scientist has no idea what signals already exist and reinvents the wheel for six weeks.
By the end of this article you'll understand how a feature store's dual-store architecture works under the hood, why point-in-time correctness is the hardest problem it solves, how to write production-grade feature definitions using Feast, where Tecton and Hopsworks make different architectural trade-offs, and exactly which production mistakes will silently wreck your models even after you've adopted a feature store. This is the article your future self wishes existed the first time you debugged a skew issue at 2am.
What is a Feature Store?
A feature store is a system that separates feature computation from model training and inference. It provides two APIs: one for writing features (typically batch or streaming pipelines) and one for reading features (low-latency for serving, high-throughput for training). The key mental model: you don't ship feature code with your model; you ship a feature reference. The model asks the feature store at runtime for the feature values it needs, and the store guarantees they were computed exactly the same way as during training.
That abstraction sounds clean but introduces a runtime dependency. If the feature store is down during inference, your model returns nulls or errors — silent degradation. Teams often discover this only after a production incident where the Redis cluster goes down and every model starts returning zeros. The lesson: always cache critical feature values with a local fallback for high-throughput paths.
Here's another thing that catches people off guard: the feature store becomes the single source of truth, but it also becomes a single point of failure. You need to design for that. Use circuit breakers in your inference pipeline — if the online store times out after 100ms, fall back to a local cache or a static default. Don't let a Redis outage take down your entire prediction service.
- You register the feature once (definition + computation logic).
- Any model that needs it just asks by name.
- The store handles versioning, deprecation, and consistency.
- But if the registry goes down, nothing can resolve features.
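The fallback described above can be sketched in a few lines. This is a minimal illustration, not a production breaker: `FeatureClient`, its parameters, and the `purchases_7d` default are hypothetical names, and the sketch assumes the `fetch` callable enforces its own timeout (for example 100ms) and raises when the online store does not answer.

```python
import time


class FeatureClient:
    """Online lookup with a simple circuit breaker, local cache and static defaults."""

    def __init__(self, fetch, defaults, max_failures=3, cooldown_s=30.0):
        self.fetch = fetch                # callable: entity_id -> dict of features
        self.defaults = defaults          # static values used as a last resort
        self.max_failures = max_failures  # consecutive failures before opening
        self.cooldown_s = cooldown_s      # how long the breaker stays open
        self.failures = 0
        self.open_until = 0.0             # breaker is open while now < open_until
        self.cache = {}                   # last known good values per entity

    def get(self, entity_id):
        if time.monotonic() >= self.open_until:
            try:
                values = self.fetch(entity_id)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    # Stop calling the store for a while instead of piling on.
                    self.open_until = time.monotonic() + self.cooldown_s
                    self.failures = 0
            else:
                self.failures = 0
                self.cache[entity_id] = values
                return values, "online"
        if entity_id in self.cache:
            return self.cache[entity_id], "cache"
        return dict(self.defaults), "default"


def broken_store(entity_id):
    # Stand-in for an online store that timed out.
    raise TimeoutError("online store did not answer in time")


client = FeatureClient(broken_store, defaults={"purchases_7d": 0.0})
values, source = client.get("user_42")
```

Returning the source alongside the values lets the prediction service log how often it is serving cached or default features, which is the signal that turns silent degradation into a visible one.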
Dual-Store Architecture: Offline and Online Stores
Every production feature store ships two distinct storage engines. The offline store handles large-scale historical data for training — think Parquet files in S3, BigQuery tables, or Delta Lake partitions. It's optimised for bulk reads and point-in-time joins. The online store serves features at low latency for model inference — usually Redis, DynamoDB, or Cassandra. A materialisation pipeline runs periodically (or continuously) to copy feature values from the offline store to the online store, ensuring the online store has the latest values. The magic is that the same feature definition compiles to two different execution plans: one for Spark (batch) and one for a lightweight streaming job (Flink or Kafka Streams). Feast implements this with a Python SDK that generates SQL for offline and uses Redis for online. Tecton adds a managed materialisation orchestrator with built-in skew detection.
But don't assume materialisation is free. Each run reads from the offline source, transforms, and writes to the online store. If your offline store is Parquet files that require a full scan every time, materialisation becomes a costly Spark job. Feast's incremental materialisation helps, but only if your offline source supports row-level timestamps. Without that, you're re-processing the entire dataset every cycle. That's where teams burn budget.
Another trap: choosing the wrong online store for your latency requirements. Redis gives you sub-millisecond reads but has limited throughput under high concurrent access. If you need thousands of features per request, consider DynamoDB with DAX for caching. Tecton uses DynamoDB by default, but you can swap in ElastiCache for Redis. Test with your real feature vector size — at 100 features per entity, Redis pipeline reads can still hit 5ms p99. At 1000 features, that jumps to 20ms.
- Offline = warehouse: carries everything, but slow for point lookups.
- Online = checkout counter: only holds what you need right now, but fast.
- Materialisation = restocking the counter from the warehouse.
- If the restocker (materialisation pipeline) is slow or broken, the counter runs out of stock.
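To make the restocking analogy concrete, here is a toy sketch of incremental materialisation and a lag measurement. It is not how Feast or Tecton implement it: the in-memory `online_store` dict, the `(entity_id, event_ts, value)` row shape, and both function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def materialise(offline_rows, online_store, since):
    """Upsert the newest value per entity, reading only rows newer than `since`."""
    written = 0
    for entity_id, event_ts, value in offline_rows:
        if event_ts <= since:
            continue  # already handled by an earlier run
        current = online_store.get(entity_id)
        if current is None or event_ts > current[0]:
            online_store[entity_id] = (event_ts, value)
            written += 1
    return written


def materialisation_lag(online_store, now):
    """Age of the freshest value currently held in the online store."""
    newest = max(ts for ts, _ in online_store.values())
    return now - newest


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
offline = [
    ("u1", t0, 10.0),
    ("u1", t0 + timedelta(hours=1), 12.0),
    ("u2", t0 + timedelta(hours=2), 5.0),
]
online = {}
materialise(offline, online, since=t0 - timedelta(days=1))
lag = materialisation_lag(online, now=t0 + timedelta(hours=3))
```

The `since` filter only works because every row carries its own timestamp, which is the row-level requirement mentioned above. The lag value is the number worth alerting on.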
Point-in-Time Correctness: The Hardest Problem Feature Stores Solve
You're training a model to predict whether a user will churn tomorrow. You need features computed 'as of' the prediction time, with no future information allowed. A naive SQL join on the entity key brings in every purchase for that user, including ones that happened after the prediction timestamp. That's data leakage: the model trains on information it will never have at serving time. Feature stores solve this with point-in-time correctness: when you request training data, you provide a list of entity IDs and timestamps. For each entity and timestamp, the offline store returns the feature value most recently computed at or before that timestamp. Feast implements this with a temporal join that uses the feature's timestamp column to look back from each entity timestamp. The algorithm is essentially: for each (entity, time) row, find the feature row with the maximum timestamp <= that time, and return exactly one feature row per entity row. This is computationally expensive because it requires shuffling data and handling time-bound windows. Tecton improves performance by pre-chunking feature data into sorted merge trees.
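The "max timestamp <= that time" rule can be shown on a small in-memory example. This is a sketch of the rule itself, not of Feast's or Tecton's implementation; `point_in_time_join` and the tuple layouts are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime, timezone


def point_in_time_join(entity_rows, feature_rows):
    """For each (entity_id, ts), pick the latest feature value with feature_ts <= ts."""
    history = defaultdict(list)
    for entity_id, feature_ts, value in feature_rows:
        history[entity_id].append((feature_ts, value))
    for rows in history.values():
        rows.sort(key=lambda r: r[0])

    joined = []
    for entity_id, ts in entity_rows:
        rows = history.get(entity_id, [])
        timestamps = [r[0] for r in rows]
        idx = bisect_right(timestamps, ts)  # count of rows with feature_ts <= ts
        value = rows[idx - 1][1] if idx > 0 else None
        joined.append((entity_id, ts, value))
    return joined


def day(d, year=2024, month=1):
    return datetime(year, month, d, tzinfo=timezone.utc)


features = [("u1", day(1), 3), ("u1", day(5), 7), ("u1", day(10), 9)]
entities = [("u1", day(6)), ("u1", day(31, 2023, 12)), ("u2", day(6))]
training_rows = point_in_time_join(entities, features)
```

For the request at 6 January the join returns the value from 5 January and ignores the one from 10 January, which is exactly the future information a naive join would leak. Entities with no earlier feature row get `None` rather than a later value.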
Here's the gotcha many teams miss: point-in-time joins assume your feature table has a timestamp column that accurately reflects when the feature was computed. If your batch pipeline sets the timestamp to the current time instead of the event time, you've broken the correctness guarantee. Always use event time, not pipeline processing time.
Another silent issue: NULL timestamps in the entity DataFrame. Feast silently drops rows with NULL timestamps — no warning, no error. Your training dataset shrinks mysteriously. Always add a validation step before calling get_historical_features to ensure no NULLs in the timestamp column.
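A validation step of this kind can be very small. The sketch below works on a list of dicts to stay dependency-free; with a pandas entity DataFrame the same check would be a null test on the timestamp column. The function name and the `event_timestamp` key are assumptions.

```python
from datetime import datetime, timezone


def validate_entity_rows(rows, ts_key="event_timestamp"):
    """Fail loudly if any entity row has no timestamp, instead of losing it later."""
    bad = [i for i, row in enumerate(rows) if row.get(ts_key) is None]
    if bad:
        raise ValueError(
            f"{len(bad)} of {len(rows)} entity rows have no {ts_key}; "
            f"first offending indexes: {bad[:10]}"
        )
    return rows


entity_rows = [
    {"user_id": 1, "event_timestamp": datetime(2024, 1, 6, tzinfo=timezone.utc)},
    {"user_id": 2, "event_timestamp": None},
]

try:
    validate_entity_rows(entity_rows)
    validation_error = None
except ValueError as exc:
    validation_error = str(exc)
```

Raising before the historical retrieval call turns a mysteriously smaller training set into an explicit failure with row indexes attached.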
Feature Definitions, Transformation, and Serving with Feast
Feast is the most widely adopted open-source feature store. It uses a declarative YAML/Python configuration to define features, sources, and entities. The lifecycle: (1) define feature views in code, (2) apply them to the Feast registry (a metadata store), (3) materialise features from offline to online, (4) serve features via a gRPC endpoint or the Python SDK. Transformations can be defined as user-defined functions that run during materialisation or as SQL templates. Feast also supports on-demand transformations, which compute features during inference from raw values, but beware: they run on every request and can add milliseconds of latency. The canonical pattern is to precompute derived features offline and materialise them, and to reserve on-demand computation for lightweight logic that genuinely depends on request-time inputs.
A common misstep: using on-demand transforms for calculations that could be precomputed. For example, computing a ratio like 'purchases per session' on the fly adds 2-3ms per request. At 1000 QPS, that's 2-3 seconds of extra CPU time for every second of wall-clock time, or roughly two to three cores kept busy doing nothing else. That is not sustainable. Push that ratio into the feature view and materialise it. Only use on-demand transforms for truly runtime-specific logic like model-specific normalisation.
Also watch out: a default local Feast setup is file-based, with a single registry file and SQLite behind the local online store, and neither handles concurrent writers well. Under concurrent apply operations you can hit database is locked errors or have one apply overwrite another. Move to a PostgreSQL-backed registry before you hit 10+ concurrent applies. Tecton handles this with a managed backend, but if you're on Feast, this is a hard requirement for any team larger than a handful of data scientists.
Production Gotchas: Skew, Duplication, and Data Quality
Even with a feature store, three silent killers remain. First, training-serving skew: the feature definition in the registry looks identical, but underlying implementations diverge. Example: the offline store uses Spark SQL's DATEDIFF, while the online store uses Python's date arithmetic — rounding differences creep in. Second, feature duplication: two teams define 'user_ltv' with different lookback windows. The feature store registry doesn't prevent this unless you enforce naming conventions in CI. Third, data quality: missing keys, null features, and timestamp misalignment. A null feature passed to the model may be interpreted as 0 or left as NaN, silently biasing predictions. Prevent these with: (1) a feature validation suite run during CI, (2) a skew dashboard that compares offline and online feature distributions daily, (3) a feature ownership matrix that maps each feature to a responsible team.
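The DATEDIFF example is easy to reproduce. The sketch below assumes the offline side counts calendar-date boundaries (how a date-based DATEDIFF behaves) while the online side uses elapsed time via `timedelta.days`; the function names and timestamps are made up for illustration.

```python
from datetime import datetime


def days_between_calendar(end, start):
    """Date-only difference: counts calendar-date boundaries crossed."""
    return (end.date() - start.date()).days


def days_between_elapsed(end, start):
    """Whole 24-hour periods elapsed, which is what timedelta.days reports."""
    return (end - start).days


last_purchase = datetime(2024, 1, 1, 23, 30)
request_time = datetime(2024, 1, 2, 0, 30)

offline_value = days_between_calendar(request_time, last_purchase)
online_value = days_between_elapsed(request_time, last_purchase)
skewed = offline_value != online_value
```

One hour separates the two timestamps, yet the calendar version reports one day and the elapsed version reports zero. Both are defensible definitions of 'days since last purchase', which is why a daily comparison of offline and online distributions catches this and a code review usually does not.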
One more gotcha: features that are constant during training but vary during serving. If a feature had no variance in the training window (e.g., 'is_weekend' for a dataset collected only on weekdays), the model never learns how predictions should respond to it, and whatever weight it ends up with is arbitrary. In serving, that feature flips every weekend, which can cause wild prediction swings. Always check for constant features before training.
Another that bites: silently changing feature semantics. Someone updates the definition of 'user_ltv' from 30-day to 60-day lookback, but forgets to re-materialise the online store. The training pipeline picks up the new definition (because it reads from offline), but the serving pipeline still serves the old value (because the online store is stale). You now have backward skew — the model sees newer features during training than during serving. This is actually worse than forward skew because it doesn't trigger obvious alarms. Monitor for changes in the feature registry and alert on materialisation lag after a definition update.
- Check feature variance before training.
- Flag features with variance < threshold.
- Consider excluding or re-engineering such features.
- This is a silent performance killer — no error, just bad predictions.
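The variance check in the list above fits in a few lines of standard-library Python. The threshold value, the column names, and the choice to flag columns with fewer than two non-null values are assumptions to adapt to your data.

```python
from statistics import pvariance


def low_variance_features(columns, threshold=1e-9):
    """Return the names of features whose population variance is below threshold."""
    flagged = []
    for name, values in columns.items():
        present = [float(v) for v in values if v is not None]
        # Too few observations to judge is treated as suspicious as well.
        if len(present) < 2 or pvariance(present) < threshold:
            flagged.append(name)
    return flagged


training_columns = {
    "is_weekend": [0, 0, 0, 0, 0],      # dataset collected only on weekdays
    "purchases_7d": [1, 4, 0, 2, 7],
}
flagged = low_variance_features(training_columns)
```

Here only `is_weekend` is flagged. Running this as a pre-training gate gives the missing error message that the list above warns about.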
Feature Registry Governance and CI/CD
A feature store's registry is the single source of truth for what features exist, how they're defined, and who owns them. Without governance, the registry becomes a dumping ground. Feast keeps a simple registry (a file by default, or a SQL database such as PostgreSQL) and manages it through a CLI. Tecton enforces workspace-based isolation and approval flows. For production, you need at least: (1) versioned feature definitions via Git, (2) automated validation tests that run on feature changes (check for dependency cycles, missing timestamps, type mismatches), (3) a staging environment where new feature views are validated before promotion to production. Many teams skip the staging step and apply feature changes directly to prod, and then a broken feature view corrupts the registry for all teams.
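As a rough idea of what such a validation test might look like in CI, the sketch below checks plain-dict feature view descriptions for duplicate names, missing timestamps, missing owners, and unknown types. The dict schema, `ALLOWED_TYPES`, and the function name are hypothetical and not a Feast or Tecton format; dependency-cycle detection is left out for brevity.

```python
ALLOWED_TYPES = {"int64", "float32", "float64", "string", "bool"}


def validate_feature_views(views):
    """Return a list of human-readable problems; an empty list means the change passes."""
    errors = []
    seen = set()
    for view in views:
        name = view.get("name", "<unnamed>")
        if name in seen:
            errors.append(f"{name}: duplicate feature view name")
        seen.add(name)
        if not view.get("timestamp_field"):
            errors.append(f"{name}: missing timestamp_field")
        if not view.get("owner"):
            errors.append(f"{name}: missing owner")
        for feature, dtype in view.get("features", {}).items():
            if dtype not in ALLOWED_TYPES:
                errors.append(f"{name}.{feature}: unknown type {dtype!r}")
    return errors


views = [
    {"name": "user_stats", "owner": "growth", "timestamp_field": "event_timestamp",
     "features": {"purchases_7d": "int64"}},
    {"name": "user_ltv", "owner": "finance", "timestamp_field": "",
     "features": {"ltv_30d": "decimal"}},
]
errors = validate_feature_views(views)
```

A CI job would fail the build whenever `errors` is non-empty, so a broken view is rejected at review time rather than after it has reached the shared registry.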
Another key practice: register a 'feature deprecation' lifecycle. Old features accumulate because no one removes them. Define a deprecation policy: mark a feature as deprecated, then set a TTL, then delete. Automated cleanup jobs can run weekly.
Also think about access control: who can modify the registry? In Feast, there's no built-in RBAC — anyone with write access to the repo can apply. Tecton provides workspace-level permissions. If you're on Feast, consider wrapping the feast apply command in a CI pipeline that enforces reviews. Use a service account for deployments, not individual developer credentials.
The Timezone Betrayal: How a Missing tzinfo Caused Two Weeks of Silent Skew
- Always pin timezone handling in the very first ETL step — never rely on defaults.
- Add a cross-pipeline timestamp consistency check in your monitoring.
- Treat timezone as a critical data quality dimension, not a mundane config detail.
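A cross-pipeline timestamp check along these lines can surface a dropped tzinfo. The scenario is hypothetical: one pipeline keeps timezone-aware UTC timestamps, the other writes naive wall-clock time at UTC-7, and the checker wrongly assumes naive values are UTC. The function names and the fixed offset are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone of the pipeline that drops tzinfo (fixed UTC-7 offset).
LOCAL = timezone(timedelta(hours=-7))


def to_utc(ts, assumed_tz):
    """Attach assumed_tz to naive timestamps, then normalise everything to UTC."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=assumed_tz)
    return ts.astimezone(timezone.utc)


def timestamp_skew(offline_ts, online_ts, assumed_tz=timezone.utc):
    """Absolute difference between two pipelines' timestamps for the same event."""
    return abs(to_utc(offline_ts, assumed_tz) - to_utc(online_ts, assumed_tz))


event = datetime(2024, 3, 1, 9, 0, tzinfo=LOCAL)   # 09:00 at UTC-7 is 16:00 UTC
offline_ts = event.astimezone(timezone.utc)        # pipeline A keeps tzinfo
online_ts = event.replace(tzinfo=None)             # pipeline B drops tzinfo

skew = timestamp_skew(offline_ts, online_ts)                      # naive read as UTC
fixed = timestamp_skew(offline_ts, online_ts, assumed_tz=LOCAL)   # zone pinned explicitly
alert = skew > timedelta(minutes=5)
```

With the default assumption the same event appears seven hours apart in the two pipelines, and pinning the zone brings the skew to zero. Alerting on any skew above a small tolerance is the consistency check the second bullet asks for.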
Common mistakes to avoid
- Memorising syntax before understanding the concept
- Skipping practice and only reading theory
- Ignoring timezone handling in feature pipelines
- Using on-demand transformations for heavy computations
- Applying feature changes directly to production without staging
- Not validating entity DataFrame for NULL timestamps
Interview Questions on This Topic
What is a feature store, and what problem does it solve in MLOps?