Data Lake Explained
Why Data Lake?
With the advent of big data, enterprises needed new ways to store, process and analyze data. Traditional data warehouses could not handle such a huge volume and variety of data. Data lakes became popular because they provided a cost-effective and technologically feasible way to meet these big data challenges.
What is Data Lake?
A data lake is an architecture for collecting, storing, processing, analyzing and consuming all the data that flows into an organization. A data lake stores data in its raw (as-is) or native format.
It can also be considered a repository built from the data coming in from disparate data sources.
Pentaho CTO James Dixon coined the term data lake:
“If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.”
A data lake can be imagined as a huge grid with billions of rows and columns, where each cell may contain a different type of data. One cell can contain a document, another a photograph, and yet another a paragraph or a single word of text. No matter where the data came from, it is simply stored in a cell.
How is a data lake different from a data warehouse?
A data lake and a data warehouse are very different because each is optimized for a different purpose. There is a significant difference in the way data is stored, processed and analyzed in each of them.
(Note: DL – Data lake, DW – Data warehouse)
Data: DW stores data that is highly structured, processed and transformed. DL stores data in raw format, and that data can be structured, semi-structured or unstructured.
Storage: DW follows a ‘schema on write’ approach, where the data structure is decided before the data is ingested. DL follows a ‘schema on read’ (late binding) approach, where any form of data can be ingested and the schema is decided when the data is read.
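The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names and the helper functions `write_to_warehouse` and `read_from_lake` are hypothetical and do not belong to any particular product.

```python
import json

# Hypothetical raw records, as they might arrive from different sources.
RAW_LINES = [
    '{"user": "ann", "amount": "12.50"}',
    '{"user": "bob", "amount": "7", "channel": "web"}',
]

# Assumed fixed table schema for the warehouse side.
WAREHOUSE_COLUMNS = ("user", "amount")


def write_to_warehouse(line):
    """Schema on write: validate and shape the record before storing it."""
    record = json.loads(line)
    unexpected = set(record) - set(WAREHOUSE_COLUMNS)
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    return {"user": str(record["user"]), "amount": float(record["amount"])}


def read_from_lake(lines, fields):
    """Schema on read: lines are stored untouched, fields are picked at query time."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# The lake keeps every line exactly as it was received.
lake = list(RAW_LINES)

# Two different "schemas" applied to the same raw data while reading.
amounts = [float(r["amount"]) for r in read_from_lake(lake, ["amount"])]
channels = [r["channel"] for r in read_from_lake(lake, ["channel"])]
```

In this sketch the second record would be rejected by `write_to_warehouse` because of its extra `channel` field, while the lake accepts it and lets each reader decide which fields matter.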
Processing: DW can process structured data, whereas DL can process data in any format (structured, semi-structured and unstructured). DW follows ETL (extract, transform and load): the structures are predefined, and data is loaded into them after transformation. A top-down approach is used.
DL follows ELT (extract, load and transform): the data is loaded first and then transformed as needed. A bottom-up approach is used.
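The ordering difference between ETL and ELT can be shown with a small Python sketch. The source rows, the field names and the `transform` rule are assumptions made up for this example; real pipelines would use dedicated tooling.

```python
def extract():
    # Hypothetical source rows; field names are illustrative only.
    return [
        {"id": 1, "name": " Ann ", "country": "in"},
        {"id": 2, "name": "Bob", "country": "US"},
    ]


def transform(row):
    # Example cleanup rule: trim names and upper-case country codes.
    return {
        "id": row["id"],
        "name": row["name"].strip(),
        "country": row["country"].upper(),
    }


def etl(warehouse):
    """Extract, transform, then load: only shaped rows are stored."""
    for row in extract():
        warehouse.append(transform(row))


def elt(lake):
    """Extract and load as-is: transformation is deferred."""
    lake.extend(extract())


warehouse, lake = [], []
etl(warehouse)
elt(lake)

# With ELT, the transformation runs later, when a consumer needs it.
cleaned_on_demand = [transform(row) for row in lake]
```

Both paths can arrive at the same cleaned rows, but only the lake still holds the original, untransformed values for consumers who need a different transformation.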
Consumption: Since DW has a predefined structure, only predetermined questions can be answered, and because data is aggregated, visibility into the lowest levels is lost. DL offers far more ways of querying data, as data is stored at the lowest granularity.
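A minimal Python sketch of why granularity matters, assuming some hypothetical sales events (the field names and values are invented for illustration):

```python
from collections import defaultdict

# Hypothetical raw sales events kept at the lowest granularity.
events = [
    {"day": "2024-01-01", "store": "A", "product": "tea", "amount": 4.0},
    {"day": "2024-01-01", "store": "A", "product": "coffee", "amount": 6.0},
    {"day": "2024-01-01", "store": "B", "product": "tea", "amount": 5.0},
]


def total_by(rows, key):
    """Sum the amount of all rows, grouped by the given field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


# A warehouse-style summary, pre-aggregated by store.
sales_by_store = total_by(events, "store")

# A new question that the raw events can still answer.
sales_by_product = total_by(events, "product")
```

Once only `sales_by_store` is kept, the per-product totals can no longer be derived from it; the raw events keep that option open.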
Cost: Building a DL is comparatively cheap because it typically relies on open source technologies such as Hadoop and NoSQL databases.
Flexibility: DL can adapt to change easily because data is stored in raw format and it is ‘schema on read’. Changes in DW require considerable time and effort.
Data Lake Architecture
Data Lake Zones
Courtesy: DZone
- Transient Zone – This layer is usually the landing layer where data is pulled or pushed from different source systems. It is a temporary zone where data is kept before it is inserted into the data lake.
- Raw Zone – This layer stores data ingested from multiple sources in raw (native or as-is) format.
- Trusted Zone – This layer stores data after data quality rules have been applied to the raw data.
- Refined Zone – This layer stores manipulated and enriched data.
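The flow through the four zones can be sketched as a small in-memory pipeline in Python. Everything here is an assumption made for illustration: the function names, the sample payloads, the quality rules and the `tier` enrichment are hypothetical, and a real data lake would use distributed storage rather than Python lists.

```python
import json


def land(payloads):
    """Transient zone: hold incoming payloads temporarily."""
    return list(payloads)


def to_raw(transient):
    """Raw zone: persist payloads unchanged, in their native format."""
    return list(transient)


def passes_quality_rules(record):
    # Assumed rules: a customer id is present and the amount is a non-negative number.
    amount = record.get("amount")
    return (
        record.get("customer_id") is not None
        and isinstance(amount, (int, float))
        and amount >= 0
    )


def to_trusted(raw):
    """Trusted zone: keep only records that parse and pass the quality rules."""
    trusted = []
    for payload in raw:
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # unparseable payloads stay in the raw zone only
        if isinstance(record, dict) and passes_quality_rules(record):
            trusted.append(record)
    return trusted


def to_refined(trusted):
    """Refined zone: enrich trusted records with a derived field."""
    return [
        {**record, "tier": "large" if record["amount"] >= 100 else "small"}
        for record in trusted
    ]


incoming = [
    '{"customer_id": 1, "amount": 120}',
    '{"customer_id": null, "amount": 30}',
    "not json",
    '{"customer_id": 2, "amount": 15.5}',
]

transient = land(incoming)
raw = to_raw(transient)
transient.clear()  # the transient zone is emptied once data reaches the lake
trusted = to_trusted(raw)
refined = to_refined(trusted)
```

Note that the raw zone still holds all four payloads, including the malformed ones, so the quality rules can be changed and re-run later without going back to the source systems.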
Benefits to Enterprise
A data lake acts as central storage for data and helps enterprises gain meaningful business insights.
It eliminates the need to query multiple siloed systems to get relevant information.
Maintenance, monitoring and security become comparatively easy because data is managed at a central location.
All kinds of data (structured, semi-structured or unstructured) can be stored in the data lake.
It is ‘schema on read’, which allows any kind of query to be designed on the raw data.
Conclusion
Data lakes serve as a single repository for storing vast amounts of data in its native format. Used together, a data warehouse and a data lake complement each other and can deliver accelerating returns.