Designing BigData Solution (Part 1)
Designing a BigData architecture is a complex task in itself, given the volume, variety and velocity at which data is generated and consumed. Keeping pace with the speed of technology innovation, the competing products in the market and their fitment for a given problem poses a great challenge for a BigData architect.
This article is the first in a series focused on illustrating the components of a BigData architecture, the myths around them and how they differ from the way they have been handled traditionally.
Planning
Analyse the Business Problem
The first step is to look at the business problem objectively and identify whether it is genuinely a BigData problem. Sheer volume or cost may not be the deciding factor; multiple criteria such as velocity, variety, challenges with the current system and time taken for processing should be considered.
Common Use cases:
· Data archival / data offload – Despite the cumbersome process and long SLAs for retrieving data from tape, it is still the most commonly used backup method because cost limits the amount of active data maintained in current systems. Hadoop, by contrast, facilitates storing huge amounts of data spanning several years (as active data) at a very low cost.
· Process offload – Offload jobs that consume expensive MIPS cycles or extensive CPU cycles on the current systems.
· Data lake implementation – Data lakes help in storing and processing massive amounts of data.
· Unstructured data processing – BigData technologies can natively store and process any amount of unstructured data. An RDBMS can also store unstructured data as a BLOB or CLOB, but it does not provide native processing capabilities for it (see the PySpark sketch after this list).
· Data warehouse modernization – Integrate the capabilities of BigData and the data warehouse to increase operational efficiency.
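As a rough illustration of native unstructured-data processing, the following minimal PySpark sketch (the HDFS path and the "ERROR" filter are purely illustrative assumptions, not part of any specific design) reads raw log files straight off HDFS and processes them in place, something an RDBMS holding the same files as BLOBs could not do natively.

    # Minimal PySpark sketch: process raw, schema-less files directly on the cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unstructured-demo").getOrCreate()

    # Read raw log files from an illustrative HDFS location (no schema required).
    logs = spark.read.text("hdfs:///data/raw/app_logs/*.log")   # hypothetical path

    # Simple native processing: count lines containing the word "ERROR".
    error_lines = logs.filter(logs.value.contains("ERROR")).count()
    print(f"ERROR lines: {error_lines}")

    spark.stop()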
Vendor Selection
Vendor selection for the Hadoop distribution is, most of the time, guided by the client, depending on their preferences, the vendor's market share or an existing partnership. The vendors for Hadoop distributions include Cloudera, Hortonworks, MapR and BigInsights (Cloudera and Hortonworks being the prominent ones). As far as capabilities are concerned, all are similar, with small nuances in terms of cost and the services they offer.
Deployment Strategy
The deployment strategy decides whether an on-premise, a cloud-based or a mixed deployment is required. Each of these has its own pros and cons.*
An on-premise solution tends to be more secure (at least in the customer's mind :) and this is typically a concern for BFS and healthcare customers) as data doesn't leave the premises. However, hardware procurement and maintenance cost considerably more money, effort and time.
A cloud-based solution is a cost-effective, pay-as-you-go model which provides a lot of flexibility in terms of scalability and eliminates procurement and maintenance overhead (prominent cloud vendors are AWS and Rackspace).
A mixed deployment strategy gives us the best of both worlds and can be planned to retain PII data on premise and the rest in the cloud (a routing sketch follows below).
* The deployment strategy is purely based on the use case and the customer's security requirements.
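To make the mixed deployment concrete, here is a minimal sketch (the field names, paths and PII attribute list are hypothetical assumptions) of routing records so that PII lands on the on-premise cluster while everything else goes to cloud storage.

    # Hypothetical routing rule for a mixed (on-premise + cloud) deployment.
    PII_FIELDS = {"ssn", "account_number", "date_of_birth"}   # illustrative PII attributes

    def route(record: dict) -> str:
        """Return the target store: on-premise HDFS for PII, cloud object storage otherwise."""
        has_pii = any(field in record for field in PII_FIELDS)
        return "hdfs://onprem-secure/pii/" if has_pii else "s3://example-datalake/raw/"

    # Usage: the first record carries PII and stays on premise, the second goes to the cloud.
    print(route({"ssn": "xxx-xx-1234", "zip": "10001"}))
    print(route({"page": "/home", "clicks": 3}))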
Capacity Planning
Capacity planning plays a pivotal role in hardware and infrastructure sizing. Important factors to be considered are (a rough sizing sketch follows this list):
· Data volume for the one-time historical load
· Daily data ingestion volume
· Retention period of the data
· HDFS replication factor, based on the criticality of the data
· Time period for which the cluster is sized (typically 6 months to 1 year), after which the cluster is scaled horizontally based on requirements
· Multi-datacentre deployment
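A back-of-the-envelope sizing calculation tying these factors together might look like the sketch below; every figure is an assumption for illustration, not a recommendation.

    # Rough HDFS capacity estimate from the factors above (all numbers illustrative).
    historical_tb      = 200     # one-time historical load, in TB
    daily_ingest_tb    = 0.5     # daily ingestion volume, in TB
    retention_days     = 365     # retention period
    replication_factor = 3       # HDFS replication factor
    sizing_months      = 12      # horizon before scaling horizontally
    headroom           = 1.25    # ~25% extra for temporary/intermediate data

    logical_tb  = historical_tb + daily_ingest_tb * min(retention_days, sizing_months * 30)
    raw_hdfs_tb = logical_tb * replication_factor * headroom
    print(f"Logical data: {logical_tb:.0f} TB, raw HDFS capacity needed: {raw_hdfs_tb:.0f} TB")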
Infrastructure Sizing
Infrastructure sizing is based on the capacity planning above and decides the type of hardware required: the number of machines, CPU, memory, HDD, network and so on. It also involves deciding the number of clusters/environments required; typically we may have Development, QA, Production and DR.
Important factors to be considered (a node-count sketch follows this list):
· Type of processing: memory-intensive or I/O-intensive
· Type of disk
· Number of disks per machine
· Memory size
· HDD size
· Number of CPUs and cores
· Data retained and stored in each environment (e.g. Dev may hold 30% of Prod)
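Continuing the illustrative example, a simple data-node count can be derived from the raw capacity estimated earlier; the per-node specifications below are assumptions only.

    # Derive an approximate data-node count from raw HDFS capacity (illustrative specs).
    raw_hdfs_tb       = 1425   # raw capacity from the earlier sizing sketch
    disks_per_node    = 12     # number of disks per machine
    disk_size_tb      = 4      # size of each disk, in TB
    hdfs_usable_ratio = 0.8    # reserve ~20% per node for OS, logs and non-HDFS use

    usable_per_node_tb = disks_per_node * disk_size_tb * hdfs_usable_ratio
    data_nodes = -(-raw_hdfs_tb // usable_per_node_tb)   # ceiling division
    print(f"~{usable_per_node_tb:.1f} TB usable per node -> roughly {int(data_nodes)} data nodes")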
Backup and Disaster Recovery Planning
Backup and disaster recovery is a very important part of planning and involves the following considerations (a simple RPO check is sketched after this list):
· The criticality of the data stored
· RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements
· Active-Active or Active-Passive disaster recovery mechanism
· Multi-datacentre deployment
· Backup interval (can be different for different types of data)
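A trivial way to sanity-check the backup interval against the RPO per data category is sketched below; the categories and hours are assumptions for illustration only.

    # Check that each (hypothetical) data category's backup interval meets its RPO.
    rpo_hours    = {"transactional": 1, "clickstream": 24, "archive": 168}    # assumed RPOs
    backup_hours = {"transactional": 0.5, "clickstream": 12, "archive": 168}  # assumed schedules

    for category, rpo in rpo_hours.items():
        interval = backup_hours[category]
        status = "OK" if interval <= rpo else "VIOLATES RPO"
        print(f"{category}: backup every {interval}h vs RPO {rpo}h -> {status}")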
Part 2 will explain each of the logical steps in designing a BigData architecture.