Data Lake vs Warehouse — Why 40% of Queries Fail
Schema-on-read without metadata catalogs caused $2M in inconsistent query results.
- Data Lake stores raw data in native format (schema-on-read), Data Warehouse transforms and stores structured data (schema-on-write)
- Lake uses cheap object storage (S3, ADLS) with compute engines like Spark; Warehouse uses columnar MPP engines with built-in indexing
- Lake queries are slower without careful partitioning; Warehouse queries are fast but cost more per TB stored
- Lake excels for ML and exploratory analysis; Warehouse excels for BI and operational reporting
- Biggest mistake: assuming one replaces the other — they solve different problems
Imagine your company's data is like a city's water supply. A data warehouse is like a bottled water factory — every drop is filtered, labelled, and sorted before it hits the shelf. A data lake is like a giant reservoir — everything flows in raw: rainwater, river water, runoff. The factory gives you perfectly safe, instant-drink water. The reservoir gives you everything, but you need equipment to make it drinkable. Neither is better — it depends on whether you know exactly what you need today, or whether you're still figuring that out.
Every fast-growing company hits the same wall: their relational database can't keep up with the volume, variety, and velocity of data they're generating. Clickstreams, IoT sensors, application logs, third-party API feeds — data arrives in dozens of shapes and speeds simultaneously. Choosing the wrong storage architecture at this point costs millions in re-platforming and months of engineer time. This isn't a theoretical problem; it's the exact inflection point where companies like Airbnb, Netflix, and Uber had to make hard architectural calls.
Data warehouses and data lakes were invented to solve different sides of this problem. A warehouse optimises for answering known questions reliably and fast — 'What were Q3 revenue figures by region?' A lake optimises for storing everything first and deciding what questions to ask later — essential when your data scientists don't yet know what signals predict churn, or when regulations require you to retain raw event logs for seven years. The confusion arises because modern tooling (Databricks, Snowflake, BigQuery) has blurred the lines, making it feel like you have to pick one. You usually don't — but you have to understand both deeply before you can compose them intelligently.
By the end of this article you'll be able to explain the internal mechanics of both architectures, articulate exactly why schema-on-read vs schema-on-write is the central design decision, debug the most expensive production mistakes teams make, and design a composable lakehouse architecture that gives you the best of both. You'll walk away with a framework you can actually use in a system design interview or a Monday morning architecture meeting.
The Schema-on-Write vs Schema-on-Read Decision
The single most important architectural difference between a data warehouse and a data lake is when the schema is applied. In a warehouse, data must conform to a predefined schema before it's loaded — this is schema-on-write. Every row is validated, transformed, and stored in structured tables (often star or snowflake schemas). In a data lake, data is ingested in its raw format (JSON, CSV, Parquet, Avro, images, etc.) without transformation — schema is applied only at query time, hence schema-on-read.
This difference has profound implications. Schema-on-write ensures data quality, consistency, and fast query performance at the cost of upfront ETL effort and inflexibility when new data sources arrive. Schema-on-read gives you agility — you can land any data shape immediately — but shifts the transformation burden to every query, often causing slower reads and requiring heavy compute resources for large datasets.
Production reality: teams that pick a warehouse for uncertain, exploratory workloads spend months building pipelines that break as soon as the business asks a new question. Teams that pick a lake for operational reporting end up with data scientists frustrated by 30-minute queries.
- Warehouse = factory: water filtered, labeled, bottled before storage. Drinking is instant but preparation is slow.
- Lake = reservoir: all water stored raw. You need a treatment plant (compute) each time you drink.
- Factory is perfect when you know what you'll need daily; reservoir is better when you don't know future water uses.
- Combining both (lakehouse) gives you the reservoir but also a small on-site bottling line for frequent needs.
Storage Internals: Columnar vs Object Storage
Data warehouses use columnar storage internally (e.g., Vertica, Redshift, BigQuery, Snowflake). Columnar storage compresses data by column (same data type) achieving 10x compression vs row-oriented storage. It also enables vectorized execution: only the columns referenced in a query are read, drastically reducing I/O. Indexes, materialized views, and pre-aggregates are typical.
Data lakes run on object storage (S3, ADLS, GCS). Files are stored in partitions (e.g., year/month/day). Common formats: Parquet (columnar), Avro (row-oriented), ORC. Parquet is essential for performance because it's columnar and supports predicate pushdown. Without careful partitioning, a simple query can scan terabytes of data.
Production insight: Warehouse storage costs are ~$20-30/TB/month (compressed). Lake storage is ~$2-3/TB/month (uncompressed raw). But warehouse compute costs are lower per query because of aggressive optimization. Always model total cost: storage + compute over a year.
Query Performance and Concurrency: MPP vs Spark
Warehouse engines (Redshift, Snowflake, BigQuery, Azure Synapse) are massively parallel processing (MPP) databases. They distribute data and compute across hundreds of nodes, optimize query plans globally, and maintain result caches. Queries on curated data return in seconds or sub-seconds, even on terabytes. Concurrency is handled through virtual warehouses or clustering.
Data lakes rely on compute engines like Spark, Presto/Trino, Dremio, or Athena. Spark performs well for ETL and complex transformations (joins, aggregations) but cold query start (JVM initialization) adds seconds. Presto/Trino is faster for ad-hoc queries but still depends on data layout. Without data compaction, too many small files kill performance (each file open incurs overhead).
Production gotcha: warehouse query performance degrades under high concurrency if concurrency scaling isn't properly configured. Lake query performance degrades if data is not optimal — small files, no partitioning, column misalignment.
Production Gotchas: When Data Lakes Become Swamps
The term 'data swamp' describes a data lake with poor governance, inconsistent schema, missing metadata, and lack of data lineage. Common symptoms: 'Which table has the customer data?' — 3 answers from 3 teams. Data scientists spend 80% of their time finding and cleaning data instead of analyzing.
Root causes: no schema enforcement on ingestion, no data catalog (AWS Glue, Apache Atlas), no retention policies, no data quality checks, and siloed team ownership. A raw data dump without metadata is useless.
Fixes: implement a Medallion architecture (Bronze for raw, Silver for cleaned, Gold for aggregated). Use table formats like Delta Lake, Apache Iceberg, or Apache Hudi that provide ACID transactions and schema evolution. Integrate with a metadata catalog. Enforce data quality rules (e.g., Great Expectations) on the Silver layer.
- Bronze layer = raw capture (dirty water, but original). Always keep it untouched for reprocessing.
- Silver layer = cleaned, deduplicated, validated (filtered water). Ready for analysis.
- Gold layer = aggregated, business-specific (bottled water). Used by dashboards and ML models.
- Without layering, you can't distinguish between original and transformed data. That's the swamp.
The Lakehouse Architecture: Merging Both Worlds
The lakehouse is an architectural pattern that combines the flexibility and low-cost storage of a data lake with the ACID transactions, schema enforcement, and query performance of a data warehouse. Modern implementations use table formats (Delta Lake, Apache Iceberg, Apache Hudi) on top of object storage, with compute engines that understand these formats (Spark, Presto, Dremio, Snowflake's iceberg support, Databricks SQL).
Key features: ACID transactions (concurrent reads/writes), schema evolution, time travel (query historical versions), and unified batch/streaming. The lakehouse eliminates the need for separate ETL to move data between lake and warehouse — you can query raw data directly or create curated views.
But it's not magic. The lakehouse still requires careful data engineering: proper partitioning, file compaction, vacuuming old versions, and choosing the right engine for the workload. It's 'the best of both worlds' only if you treat it with the same rigor as a warehouse.
Decision Framework: How to Choose (or Combine)
When designing a data platform, ask these questions in order: 1. Do we know exactly what queries and reports we need? If yes, start with a warehouse (Snowflake, Redshift, BigQuery). If no, start with a lake. 2. Do we need to support ML model training or exploratory data science on raw data? If yes, you need a lake (or lakehouse). 3. Is data latency critical (under 1 minute)? Warehouse with streaming or lakehouse with streaming layer. 4. Do we have a small team (under 5 data engineers)? Warehouse is easier to manage. Lake requires more infrastructure expertise. 5. Will we need both known reporting and unknown exploration? Use a lakehouse: bronze/silver/gold with warehouse-like query layer.
Most mature organizations end up with a hybrid: a warehouse for governed, curated data (BI dashboards, operational reports) and a lake for raw ingestion, data science, and archival storage. The lakehouse pattern is the convergence point.
The Data Swamp That Cost $2M
- Data lakes require governance from day one, not after data grows.
- Always use a metadata catalog (e.g., AWS Glue, Hive Metastore) to track what data exists.
- Convert raw ingestion to a columnar format (Parquet) with careful partitioning.
- Implement data quality checks at the ingestion layer, not just at query time.
Key takeaways
Common mistakes to avoid
5 patternsTreating the data lake as a dumping ground without governance
Using a warehouse for exploratory data science on raw data
Choosing a lake because storage is cheap, ignoring compute costs
Skipping partitioning in the data lake
Assuming one data platform can serve all use cases equally well
Interview Questions on This Topic
Explain schema-on-read vs schema-on-write and when you'd use each.
Frequently Asked Questions
That's Databases in Design. Mark it forged?
5 min read · try the examples if you haven't