Thursday 14 September 2017

Designing BigData Solution (Part 2)

Designing BigData Architecture (Part 2)

In part1 of Designing BigData Architecture we looked at various activities involved in planning BigData architecture. This article covers each of the logical layers in architecting the BigData Solution

Data Source Identification
Source profiling is one of the most important steps in deciding the architecture. It involves identifying the different source systems and categorizing them based on their nature and type.
Data sources can be of different types: Legacy systems (Mainframes, ERP, CRM), operational database, transactional database, Smart devices, Social Media websites, Data providers etc.

Points to be considered while profiling the data sources:

  •  Identify the internal and external sources systems
  • High Level Guesstimate for the amount of data ingested from each source
  • Identify the mechanism used to get data – push or pull
  • Determine the type of data source – Database, File, web service, streams etc
  • Determine the type of data – structured, semi structured or unstructured

Data Ingestion Strategy and Acquisition
Data ingestion in the Hadoop world means ELT (Extract, Load and Transform) as opposed to ETL (Extract Transform Load) in case of traditional warehouses. Different ingestion tools and strategy have to be selected for various types of data.

Points to be considered:

  • Determine the frequency at which data would be ingested from each source
  • Is there a need to change the semantics of the data append replace etc.
  • Is there any data validation or transformation required before ingestion (Pre-processing)
  • Segregate the data sources based on mode of ingestion – Batch or real-time

Batch Data Ingestion
Batch data ingestion may happen at a regular interval (once or twice) a day,   
depending on customer requirement. Usually it is a csv file, text file, JSON file or    
RDBMS.
Technologies used:
Flume : Used to ingest flat files, logs etc.
Sqoop: Used to ingest data from RDBMS’s.

Real time Data ingestion
Real-time means near to zero latency and access to information whenever it is required. Stream is a constantly created data that is fit for real-time ingestion and processing.
Ex: Data coming in from sensors, Social Media etc.

Technology used:
Apache storm – It’s a Distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, for real-time processing.
Apache Kafka - Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable.
Flume - Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
Spark streaming - Spark Streaming allows reusing the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. 

Data Storage
Big data storage should be able to store large amount of data of any type and should be able to scale on need basis. We should also consider the number of IOPS (Input output operations per second) that it can provide. Hadoop distributed file system is the most commonly used storage framework in BigData world, others are the NoSql data stores – MongoDB, HBase, Cassandra etc. One of the salient features of Hadoop storage is its capability to scale, self-manage and self-heal.

There are 2 kinds of analytical requirements that storage can support:
Synchronous – Data is analysed in real-time or near real-time, the storage should be optimized for low latency.
Asynchronous – Data is captured, recorded and analysed in batch.          

Hadoop Storage is best suited for Batch and asynchronous analytics requirements, NoSql are better suited for scenarios where real-time analytics and flexible schema are required.
There are multiple storage options available like SAS/SATA solid-state drives (SSDs) PCI Express (PCIe) card-based solid-state etc., PCIe are most popular as their implementation offers the lowest latency.

Things to consider while planning storage methodology:
  • Type of data  (Historical or Incremental)
  • Format of data ( structured, semi structured and unstructured)
  • Compression requirements
  • Frequency of incoming data
  • Query pattern on the data
  • Consumers of the data


Data Processing
With the advent of BigData, the amount of data being stored and processed has increased multifold.
Earlier frequently accessed data was stored in Dynamic RAMs but now due to the sheer volume its being stored on multiple disks on a number of machines connected via network. Instead of bringing the data to processing, processing is taken closer to data which significantly reduce the network I/O.
Processing methodology is driven by business requirements. It can be categorized into Batch , real-time or Hybrid based on the SLA.

Batch Processing – Batch is collecting the input for a specified interval of time and running transformations on it in a scheduled way. Historical data load is a typical batch operation
Technology Used: MapReduce, Hive, Pig

Real-time Processing – Real-time processing involves running transformations as and when data is acquired.
Technology Used: Impala, Spark, spark SQL, Tez, Apache Drill

Hybrid Processing – It’s a combination of both batch and real-time processing needs.
Best example would be lambda architecture.

Data Consumption
This layer consumes the output provided by processing layer. Different users like administrator, Business users, business analyst, customers, vendor, partners etc. can consume data in different format. Output of analysis can be consumed by recommendation engine or business processes can be triggered based on the analysis.
Different forms of data consumption are:
Export Data sets – There can be requirements for third party data set generation. Data sets can be generated using hive export or directly from HDFS.

Reporting and visualization – Different reporting and visualization tool scan connect to Hadoop using JDBC/ODBC connectivity to hive. Tools – Microstrategy, Talend, Pentaho, Jaspersoft, Tableau etc. Adhoc or scheduled reports can be created.

Data Exploration – Data scientist can build models and perform deep exploration in a sandbox environment. Sandbox can be a separate cluster (Recommended approach) or a separate schema within same cluster that contains subset of actual data.

Adhoc Querying – Adhoc or Interactive querying can be supported by using Hive, Impala or spark SQL.


Conclusion
The two biggest challenges in designing BigData Architecture are:
·         Dynamics of use case: There a number of scenarios as illustrated in the article which need to be considered while designing the architecture – form and frequency of data, Type of data, Type of processing and analytics required. 
·         Myriad of technologies: Proliferation of tools in the market has led to a lot of confusion around what to use and when, there are multiple technologies offering similar features and claiming to be better than the others.

To be nimble, flexible and fluid in our approach to a solution we need to use these technologies in a way that will adapt to the changing business and technology dynamics.


No comments:

Post a Comment

Bigdata Testing – What to do? “All code is guilty, until proven innocent.” – Anonymous This blog is first in the series ...