Designing BigData Architecture (Part 2)
In Part 1 of Designing BigData Architecture we looked at the various activities involved in planning a BigData architecture. This article covers each of the logical layers in architecting a BigData solution.
Data Source Identification
Source profiling is one of the most important steps in deciding the architecture. It involves identifying the different source systems and categorizing them based on their nature and type.
Data sources can be of different types: legacy systems (mainframes, ERP, CRM), operational databases, transactional databases, smart devices, social media websites, data providers, etc.
Points to be considered while profiling the data sources:
- Identify the internal and external source systems
- Make a high-level guesstimate of the amount of data ingested from each source
- Identify the mechanism used to get the data – push or pull
- Determine the type of data source – database, file, web service, streams, etc.
- Determine the type of data – structured, semi-structured or unstructured
Data Ingestion Strategy and Acquisition
Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to ETL (Extract, Transform, Load) in the case of traditional warehouses. Different ingestion tools and strategies have to be selected for different types of data.
Points to be considered:
- Determine the frequency at which data will be ingested from each source
- Is there a need to change the semantics of the data (append, replace, etc.)?
- Is any data validation or transformation required before ingestion (pre-processing)?
- Segregate the data sources based on the mode of ingestion – batch or real-time
Batch Data Ingestion
Batch data ingestion may happen at a regular interval (once or twice a day), depending on customer requirements. The source is usually a CSV file, text file, JSON file or an RDBMS.
Technologies used:
Flume: Used to ingest flat files, logs, etc.
Sqoop: Used to ingest data from RDBMSs.
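As a rough illustration, the hedged sketch below performs a daily batch ingest with PySpark: it reads a CSV drop from a landing directory and appends it to HDFS as Parquet. The paths and column names are hypothetical; in practice Sqoop or Flume would often do this job.

# Minimal batch-ingestion sketch using PySpark (paths and columns are hypothetical)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-batch-ingest").getOrCreate()

# Read the day's CSV drop from a landing directory
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/landing/orders/2016-01-01/*.csv"))

# Land the data on HDFS as Parquet, appending to the existing data set
(orders.write
 .mode("append")
 .partitionBy("order_date")
 .parquet("hdfs:///data/raw/orders"))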
Real-time Data Ingestion
Real-time means near-zero latency and access to information whenever it is required. A stream is constantly created data that is fit for real-time ingestion and processing. Ex: data coming in from sensors, social media, etc.
Technologies used:
Apache Storm – A distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data for real-time processing.
Apache Kafka – A distributed publish-subscribe messaging system designed to be fast, scalable and durable.
Flume – A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streaming event data.
Spark Streaming – Allows reusing the same code for batch processing, joining streams against historical data, or running ad-hoc queries on stream state.
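A minimal sketch of real-time ingestion, assuming a hypothetical Kafka topic named sensor-events consumed with Spark's streaming API (the broker addresses and HDFS paths are also assumptions):

# Sketch: consume a Kafka topic with Spark streaming and land raw events on HDFS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-ingest").getOrCreate()

# Subscribe to the (hypothetical) sensor-events topic
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "sensor-events")
          .load())

# Kafka delivers key/value as binary; keep the value as a string payload
payload = events.selectExpr("CAST(value AS STRING) AS event")

# Continuously append raw events to HDFS, with checkpointing for reliability
(payload.writeStream
 .format("parquet")
 .option("path", "hdfs:///data/raw/sensor_events")
 .option("checkpointLocation", "hdfs:///checkpoints/sensor_events")
 .start()
 .awaitTermination())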
Data Storage
Big data storage should be able to store a large amount of data of any type and should be able to scale on a need basis. We should also consider the number of IOPS (input/output operations per second) that it can provide. The Hadoop Distributed File System is the most commonly used storage framework in the BigData world; others are the NoSQL data stores – MongoDB, HBase, Cassandra, etc. One of the salient features of Hadoop storage is its capability to scale, self-manage and self-heal.
There are two kinds of analytical requirements that storage can support:
Synchronous – Data is analysed in real time or near real time; the storage should be optimized for low latency.
Asynchronous – Data is captured, recorded and analysed in batch.
Hadoop storage is best suited for batch and asynchronous analytics requirements; NoSQL stores are better suited for scenarios where real-time analytics and a flexible schema are required.
There are multiple storage options available, such as SAS/SATA solid-state drives (SSDs) and PCI Express (PCIe) card-based solid-state drives. PCIe is the most popular, as its implementation offers the lowest latency.
Things to consider while planning the storage methodology:
- Type of data (historical or incremental)
- Format of data (structured, semi-structured or unstructured)
- Compression requirements
- Frequency of incoming data
- Query pattern on the data
- Consumers of the data
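To make a few of these considerations concrete, the hedged sketch below stores an incremental, semi-structured feed as compressed, partitioned Parquet on HDFS; the format, codec and paths are illustrative assumptions, not prescriptions.

# Sketch: storage decisions expressed in code - columnar format, compression, partitioning
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-layout").getOrCreate()

# An incremental, semi-structured feed landed earlier in the raw zone
events = spark.read.json("hdfs:///data/raw/events")

(events.write
 .mode("append")                      # incremental data: append rather than overwrite
 .option("compression", "snappy")     # compression requirement: a fast, widely used codec
 .partitionBy("event_date")           # partition on the column most queries filter by
 .parquet("hdfs:///data/curated/events"))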
Data Processing
With the advent of BigData, the amount of data being stored and processed has increased multifold. Earlier, frequently accessed data was stored in dynamic RAM, but now, due to the sheer volume, it is stored on multiple disks across a number of machines connected via a network. Instead of bringing the data to the processing, the processing is taken closer to the data, which significantly reduces the network I/O.
The processing methodology is driven by business requirements. It can be categorized into batch, real-time or hybrid based on the SLA.
Batch Processing – Batch processing means collecting the input for a specified interval of time and running transformations on it in a scheduled way. A historical data load is a typical batch operation.
Technologies used: MapReduce, Hive, Pig
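A scheduled batch transformation might look like the hedged sketch below, which aggregates one day of orders with Spark SQL; the database, table and column names are hypothetical, and the same logic could equally be written as a Hive or Pig job.

# Sketch: a scheduled batch aggregation over a full day's data using Spark SQL
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-revenue")
         .enableHiveSupport()
         .getOrCreate())

daily_revenue = spark.sql("""
    SELECT order_date, product_id, SUM(amount) AS revenue
    FROM raw.orders
    WHERE order_date = '2016-01-01'
    GROUP BY order_date, product_id
""")

# Persist the result for downstream consumers (reports, exports, etc.)
daily_revenue.write.mode("overwrite").saveAsTable("curated.daily_revenue")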
Real-time Processing – Real-time processing involves running transformations as and when data is acquired.
Technologies used: Impala, Spark, Spark SQL, Tez, Apache Drill
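As a rough sketch of real-time processing, the example below counts readings per sensor over one-minute windows as the stream arrives; the topic, brokers and payload layout are the same hypothetical ones used in the ingestion sketch above.

# Sketch: a windowed aggregation computed as data arrives
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("sensor-rollup").getOrCreate()

readings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "sensor-events")
            .load()
            .selectExpr("CAST(key AS STRING) AS sensor_id", "timestamp"))

# Count readings per sensor over one-minute windows
counts = (readings
          .groupBy(window(col("timestamp"), "1 minute"), col("sensor_id"))
          .count())

# Print updated counts to the console as new data arrives
(counts.writeStream
 .outputMode("update")
 .format("console")
 .start()
 .awaitTermination())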
Hybrid Processing – A combination of both batch and real-time processing needs; the best example is the Lambda architecture.
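A minimal sketch of the Lambda idea: a batch view holds the periodically recomputed history, a speed-layer view covers only recent data, and a query merges the two. The table names are assumptions for illustration.

# Sketch: serve a query by merging a batch view with a speed-layer view
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lambda-merge")
         .enableHiveSupport()
         .getOrCreate())

batch_view = spark.table("curated.daily_revenue")     # complete, recomputed history
speed_view = spark.table("realtime.revenue_recent")   # data since the last batch run

# The serving query sees the union of both views
merged = batch_view.unionByName(speed_view)
merged.groupBy("product_id").sum("revenue").show()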
Data Consumption
This layer consumes the output provided by the processing layer. Different users like administrators, business users, business analysts, customers, vendors, partners, etc. can consume the data in different formats. The output of the analysis can be consumed by a recommendation engine, or business processes can be triggered based on the analysis.
Different forms of data consumption are:
Export Data Sets – There can be requirements for third-party data set generation. Data sets can be generated using Hive export or directly from HDFS.
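One hedged way to produce such an extract is sketched below: a curated Hive table is written out as CSV on HDFS for a third party to pull. The locations and columns are hypothetical; a Sqoop export or a plain HDFS copy are equally valid routes.

# Sketch: generate a third-party extract as CSV files on HDFS
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partner-export")
         .enableHiveSupport()
         .getOrCreate())

extract = spark.sql("SELECT order_date, product_id, revenue FROM curated.daily_revenue")

(extract.coalesce(1)                  # single output file for an easy hand-off
 .write
 .option("header", "true")
 .mode("overwrite")
 .csv("hdfs:///exports/partner/daily_revenue"))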
Reporting and Visualization – Different reporting and visualization tools can connect to Hadoop using JDBC/ODBC connectivity to Hive. Tools: MicroStrategy, Talend, Pentaho, Jaspersoft, Tableau, etc. Ad-hoc or scheduled reports can be created.
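Reporting tools make this connection through their JDBC/ODBC drivers; the hedged sketch below hits the same HiveServer2 endpoint programmatically using the PyHive library (host, port and table names are assumptions).

# Sketch: query Hive over the HiveServer2 endpoint that BI tools reach via JDBC/ODBC
from pyhive import hive

conn = hive.Connection(host="hiveserver.example.com", port=10000, database="curated")
cursor = conn.cursor()

cursor.execute("SELECT product_id, SUM(revenue) FROM daily_revenue GROUP BY product_id")
for product_id, revenue in cursor.fetchall():
    print(product_id, revenue)

conn.close()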
Data Exploration – Data scientists can build models and perform deep exploration in a sandbox environment. The sandbox can be a separate cluster (the recommended approach) or a separate schema within the same cluster that contains a subset of the actual data.
Ad-hoc Querying – Ad-hoc or interactive querying can be supported using Hive, Impala or Spark SQL.
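For example, an analyst might run an interactive query like the hedged sketch below from a notebook or Spark shell session (the table and columns are hypothetical).

# Sketch: an ad-hoc, interactive Spark SQL query
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("adhoc-query")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    SELECT sensor_id, COUNT(*) AS readings
    FROM curated.events
    WHERE event_date >= '2016-01-01'
    GROUP BY sensor_id
    ORDER BY readings DESC
    LIMIT 20
""").show()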
Conclusion
The two biggest challenges in designing BigData architecture are:
- Dynamics of the use case: There are a number of scenarios, as illustrated in this article, which need to be considered while designing the architecture – the form and frequency of the data, the type of data, and the type of processing and analytics required.
- Myriad of technologies: The proliferation of tools in the market has led to a lot of confusion around what to use and when; multiple technologies offer similar features and claim to be better than the others.
To be nimble, flexible and fluid in our approach to a solution, we need to use these technologies in a way that adapts to the changing business and technology dynamics.