Thursday, 21 September 2017

Data Lake Explained

Why Data Lake?

With the advent of big data, enterprises needed new ways to store, process and analyze data. Traditional data warehouses were not capable of handling such a huge volume and variety of data. Data lakes became popular because they provided a cost-effective and technologically feasible way to meet these big data challenges.

What is Data Lake?

A data lake is an architecture for collecting, storing, processing, analyzing and consuming all the data that flows into an organization. A data lake stores data in its raw (as-is) or native format.

It can also be considered as a repository created from the data coming in from disparate data sources.

Pentaho CTO James Dixon coined the term ‘data lake’:

“If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state”

A data lake can be imagined as a huge grid with billions of rows and columns, where each cell may contain a different type of data. One cell can contain a document, another a photograph, and yet another a paragraph or a single word of text. No matter where the data came from, it is simply stored in a cell.

How is a data lake different from a data warehouse?

A data lake and a data warehouse are very different because they are optimized for different purposes. There is a significant difference in the way data is stored, processed and analyzed in each of them.

(Note: DL – Data lake, DW – Data warehouse)

Data: DW stores data that is highly structured, processed and transformed. DL stores data in raw format, and the data can be structured, semi-structured or unstructured.

Storage: DW follows a ‘schema on write’ approach, where the data structure is decided before the data is ingested. DL follows a ‘schema on read’ (late binding) approach, where any form of data can be ingested and the schema is decided when the data is read.
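The difference between ‘schema on write’ and ‘schema on read’ can be illustrated with a minimal Python sketch. Everything here is hypothetical and only for illustration: the SCHEMA fields, the in-memory lists standing in for the warehouse table and the lake store, and the function names are assumptions, not part of any real product.

```python
import json

# Hypothetical schema for illustration: field name -> expected type.
SCHEMA = {"order_id": int, "amount": float}

def write_to_warehouse(table, record):
    """Schema on write: validate the record before it is stored."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} does not match the schema")
    table.append({field: record[field] for field in SCHEMA})

def write_to_lake(store, raw_line):
    """Schema on read: store the raw text as-is, with no validation."""
    store.append(raw_line)

def read_from_lake(store, fields):
    """Apply a schema only at read time; missing fields become None."""
    for raw_line in store:
        record = json.loads(raw_line)
        yield {field: record.get(field) for field in fields}

warehouse_table = []
lake_store = []

write_to_warehouse(warehouse_table, {"order_id": 1, "amount": 9.5})
write_to_lake(lake_store, '{"order_id": 1, "amount": 9.5}')
# A record with a different shape is still accepted by the lake.
write_to_lake(lake_store, '{"order_id": 2, "note": "gift"}')

# The reader, not the writer, decides which fields matter.
rows = list(read_from_lake(lake_store, ["order_id", "note"]))
```

In this sketch the warehouse rejects any record that does not fit the schema, while the lake accepts everything and leaves interpretation to whoever reads the data.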

Processing: DW can process structured data, whereas DL can process data in any format (structured, semi-structured and unstructured). DW follows ETL (extract, transform and load): the structures are predefined, and data is loaded into those structures after transformation. A top-down approach is used.

DL follows ELT (extract, load and transform): the data is loaded first and then transformed as needed. A bottom-up approach is used.
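The ordering difference between ETL and ELT can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the sample rows, the ‘name,age’ format and the function names are hypothetical, and plain lists stand in for the warehouse and the lake.

```python
def extract():
    # Hypothetical source rows; in practice these come from source systems.
    return [" Alice,30 ", "Bob,notanumber", "Carol,41"]

def transform(rows):
    """Parse raw 'name,age' lines, dropping rows whose age is not an integer."""
    parsed = []
    for row in rows:
        name, _, age = row.strip().partition(",")
        if age.isdigit():
            parsed.append({"name": name, "age": int(age)})
    return parsed

# ETL: transform first, then load only the structured result.
warehouse = []
warehouse.extend(transform(extract()))

# ELT: load the raw rows first, transform later and only when needed.
lake = []
lake.extend(extract())
on_demand_view = transform(lake)
```

Note that the malformed row never reaches the warehouse, but it is still present in the lake, where a different transformation could make use of it later.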

Consumption: Since DW has a predefined structure, only pre-determined questions can be answered, and because the data is aggregated, visibility into the lowest levels is lost. DL allows data to be queried in many different ways, as the data is stored at the lowest granularity.
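The effect of aggregation on consumption can be shown with a small Python sketch. The sales events and field names below are made up for illustration only.

```python
from collections import defaultdict

# Hypothetical raw sales events, stored at the lowest granularity.
events = [
    {"day": "Mon", "store": "A", "product": "pen", "amount": 2.0},
    {"day": "Mon", "store": "B", "product": "ink", "amount": 5.0},
    {"day": "Tue", "store": "A", "product": "pen", "amount": 3.0},
]

def total_by(rows, key):
    """Sum 'amount' grouped by the given field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)

# Warehouse-style: only the pre-determined aggregate (sales per day) is kept.
daily_totals = total_by(events, "day")

# Lake-style: raw events are kept, so a new question can be answered later.
product_totals = total_by(events, "product")
```

If only daily_totals had been stored, the per-product question could not be answered, because the product detail is no longer available once the data has been aggregated by day.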

Cost: The cost of building a DL is comparatively low, as it typically uses open source technologies such as Hadoop and NoSQL databases.

Flexibility: DL can adapt to change easily, as data is stored in raw format and the schema is applied on read. Changes in DW require considerable time and effort.

Data Lake Architecture

Data Lake Zones

Courtesy: DZone

Transient Zone – This layer is usually the landing layer, where data is pulled or pushed from different source systems. It is a temporary zone where data is kept before it is inserted into the data lake.
Raw Zone – This layer stores data ingested from multiple sources in raw (native or as-is) format.
Trusted Zone – This layer stores data after data quality rules have been applied to the raw data.
Refined Zone – This layer stores manipulated and enriched data.
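
The flow of data through the four zones can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the two quality rules, the customer_id and amount fields, the is_large_order enrichment and the in-memory lists are all hypothetical, and a real data lake would use distributed storage rather than Python lists.

```python
import json

def land(transient, payload):
    """Transient zone: temporary landing area for incoming payloads."""
    transient.append(payload)

def ingest(transient, raw):
    """Raw zone: move payloads in unchanged, then empty the transient zone."""
    raw.extend(transient)
    transient.clear()

def apply_quality_rules(raw):
    """Trusted zone: keep only records that pass the (hypothetical) rules."""
    trusted = []
    for payload in raw:
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # rule 1: payload must be valid JSON
        if isinstance(record, dict) and record.get("customer_id") is not None:
            trusted.append(record)  # rule 2: customer_id must be present
    return trusted

def enrich(trusted):
    """Refined zone: add a derived field to each trusted record."""
    return [{**r, "is_large_order": r.get("amount", 0) > 100} for r in trusted]

transient, raw = [], []
land(transient, '{"customer_id": 7, "amount": 250}')
land(transient, '{"amount": 12}')
land(transient, 'not json')
ingest(transient, raw)
trusted = apply_quality_rules(raw)
refined = enrich(trusted)
```

The raw zone keeps all three payloads untouched, including the invalid one, while the trusted and refined zones contain only the record that passed the quality rules.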

Benefits to Enterprise

A data lake acts as a central store of data and helps enterprises derive meaningful business insights.

It eliminates the need to query multiple systems existing in silos to get relevant information.

Maintenance, monitoring and security become comparatively easy as data is managed at a central location.

All kinds of data (structured, semi structured or unstructured) can be stored in the data lake.

Because it is ‘schema on read’, any kind of query can be designed on the raw data.

Conclusion

Data lakes serve as a single repository for storing vast amounts of data in native format. A data warehouse and a data lake can complement each other, and using them together can deliver greater returns than either alone.

Friday, 15 September 2017

Planning a cloud based Hadoop Environment

The first step in starting a project is to plan the Hadoop cluster. The Hadoop cluster needs to be planned carefully based on the current and anticipated future needs. There are many aspects in designing the cluster:

Hadoop distribution selection

Hardware Selection

Selection of storage

Determine data volume and number of machines

1. Hadoop distribution Selection

There are multiple vendors in this space; the most prominent ones are Cloudera and Hortonworks.

The following criteria can help in choosing between them:

Cost of license

Professional Services

Training

Integration with end user tools

Interoperability with other systems

Security and data protection

Customer preference

2. Hardware Selection

Important parameters to consider are the type of workload, disk space, I/O bandwidth, computational power and memory.

Type of workload

The type of workload gives us an idea of what type of machine to use. The following types of workload can be considered:

Courtesy: Cloudera

Balanced Workload

These are workloads that are distributed equally across the various job types (CPU bound, disk I/O bound, or network I/O bound). A balanced configuration is a good default for unknown or evolving workloads.