GLOSSARY

Data Lake

A data lake stores raw structured and unstructured data on object storage with open formats — Parquet, Iceberg, Delta — for flexible downstream use.

Quick answer
A data lake is a storage layer that holds raw structured, semi-structured, and unstructured data in open file formats (Parquet, Avro, JSON, ORC) on low-cost cloud object storage, typically Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. Query engines sit on top of the files; the lake is not natively queryable without tooling such as Trino, Spark, or a lakehouse query engine.
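As a minimal sketch of that "engine on top of files" idea, the Python snippet below scans a Parquet prefix on object storage with PyArrow. The bucket path, column names, and filter value are placeholders, and reading from S3 assumes credentials are already configured in the environment.

```python
import pyarrow.dataset as ds

# The lake itself is only files on object storage; a library or query engine
# supplies the table abstraction at read time. The bucket, prefix, and column
# names below are placeholders, and S3 access assumes configured credentials.
events = ds.dataset("s3://example-bucket/events/", format="parquet")

# Project two columns and push the filter down to the Parquet files instead
# of loading every file into memory.
purchases = events.to_table(
    columns=["user_id", "event_ts"],
    filter=ds.field("event_type") == "purchase",
)
print(purchases.num_rows)
```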

WHAT IT IS

Data lakes typically run on object storage (Amazon S3, Azure Data Lake Storage, Google Cloud Storage) with open file formats (Parquet, ORC, Avro) and a table format layer (Apache Iceberg, Delta Lake, Hudi) that adds ACID transactions, schema evolution, and time travel. The combination of lake storage plus table format is commonly called a lakehouse.
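To make the table-format layer concrete, here is a hedged PySpark sketch assuming a Spark session already configured for Delta Lake. The path, columns, and version number are illustrative placeholders; the same ideas apply to Iceberg and Hudi with their own syntax.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes a Spark session already configured with the Delta Lake extensions;
# the path and columns are placeholders for illustration.
spark = SparkSession.builder.appName("table-format-sketch").getOrCreate()
path = "s3://example-bucket/lake/orders"

# ACID write: the append is committed atomically, so concurrent readers never
# see a half-written batch of files.
orders = spark.createDataFrame(
    [(1001, "EUR", 42.50)], ["order_id", "currency", "amount"]
)
orders.write.format("delta").mode("append").save(path)

# Schema evolution: Delta can merge a new column into the table schema on write.
(orders.withColumn("channel", F.lit("web"))
    .write.format("delta").mode("append")
    .option("mergeSchema", "true").save(path))

# Time travel: read the table as it existed at an earlier committed version.
snapshot = spark.read.format("delta").option("versionAsOf", 0).load(path)
snapshot.show()
```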

HOW IT WORKS

Where a data warehouse enforces schema on write, a lake defers schema to read. That flexibility is valuable for ML training data, event streams, clickstream logs, IoT, and audio/video — but without cataloging and lineage a lake can become a 'data swamp' nobody trusts.
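The difference is easiest to see in code. The sketch below declares a schema only at query time over raw JSON that was landed with no enforcement at all; the path and field names are placeholders, and a running Spark session is assumed.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Schema on read: the raw JSON landed in the lake as-is; structure is imposed
# only now, at query time. The path and field names are placeholders.
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("ts", TimestampType()),
])

clicks = (
    spark.read
    .schema(clickstream_schema)                     # the shape we expect...
    .json("s3://example-bucket/raw/clickstream/")   # ...over files written with no enforcement
)
clicks.groupBy("page").count().show()
```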

WHEN TO USE

Adopt a data lake or lakehouse when workloads are large, semi-structured, or ML-driven; when storage cost in a warehouse is prohibitive; or when open formats and vendor independence are architectural priorities.

RELATED QUESTIONS

What is a data lake?
A data lake is a storage layer that holds raw structured, semi-structured, and unstructured data in open file formats (Parquet, Avro, JSON, ORC) on low-cost cloud object storage, typically Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. Query engines sit on top of the files; the lake is not natively queryable without them.
How is a data lake different from a data warehouse?
A warehouse stores structured, modeled data optimized for fast analytical queries. A lake stores raw files cheaply, including data you do not yet know how to use. Warehouses are tuned for BI and analytics; lakes are tuned for data science, ML training, and long-term retention.
What is a lakehouse?
A lakehouse combines the economics of a lake with the query performance and governance of a warehouse, usually via open table formats like Apache Iceberg, Delta Lake, or Apache Hudi. Databricks, Snowflake, and BigQuery have all converged on lakehouse patterns; a short PyIceberg sketch after these questions shows the table format in use.
When should a company add a data lake?
When raw event data volumes exceed what a warehouse can store economically, when machine-learning teams need access to raw source files, or when long-term retention and compliance require immutable storage. Many mid-market companies do not need a separate lake — a warehouse plus cloud storage is sufficient.
How does NUUN Digital use data lakes?
We treat lakes as the cheap, durable layer and push query workloads into warehouse or lakehouse engines on top. We avoid the trap of landing data in a lake without a path to activation — a data lake that no one queries is a data swamp.
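As a hedged illustration of the lakehouse pattern described in the questions above, the sketch below reads an Apache Iceberg table through a catalog with PyIceberg. The catalog name, endpoint, table identifier, and filter are assumptions for illustration, not references to any specific deployment.

```python
from pyiceberg.catalog import load_catalog

# The catalog name, REST endpoint, and table identifier are placeholders.
# PyIceberg resolves the table's metadata and data files on object storage
# and exposes them as a queryable table.
catalog = load_catalog("lakehouse", type="rest", uri="http://localhost:8181")
orders = catalog.load_table("analytics.orders")

# Plan a filtered scan and materialize it locally for inspection.
df = orders.scan(row_filter="order_date >= '2024-01-01'").to_pandas()
print(df.head())
```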
