Thursday, 14 September 2017

Designing BigData Solution (Part 1)

Designing a BigData architecture is a complex task in itself, given the volume, variety and velocity at which data is generated and consumed. Keeping pace with rapid technology innovation, the competing products in the market and their fitment for a given problem is a further challenge for a BigData architect.
This article is the first in a series focused on illustrating the components of a BigData architecture, the myths around them and how they differ from the way data has been handled traditionally.

Planning

Analyse the Business problem
The first step is to look at the business problem objectively and identify whether it is a BigData problem or not. Sheer volume or cost alone may not be the deciding factor; multiple criteria such as velocity, variety, challenges with the current system and the time taken for processing should be considered.

Common Use cases:
·         Data archival / data offload – Despite the cumbersome process and long SLAs for retrieving data from tapes, tape remains the most commonly used backup method because cost limits the amount of active data maintained in the current systems. Hadoop, by contrast, facilitates storing huge amounts of data spanning several years as active data at a very low cost (see the sketch after this list).
·         Process offload – Offload jobs that consume expensive MIPS cycles or extensive CPU cycles on the current systems.
·         Data lake implementation – Data lakes help in storing and processing massive amounts of data.
·         Unstructured data processing – BigData technologies provide capabilities to store and process any amount of unstructured data natively. RDBMSs can also store unstructured data as BLOBs or CLOBs but do not provide native processing capabilities.
·         Data warehouse modernization – Integrate the capabilities of BigData and the data warehouse to increase operational efficiency.
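
As a minimal illustration of the data archival / offload use case, here is a sketch that pushes aged files from a local staging directory into HDFS using the standard hdfs dfs -put command; the directory paths and the one-year cutoff are hypothetical assumptions, not part of any specific setup.

import subprocess
from pathlib import Path
from datetime import datetime, timedelta

# Hypothetical locations: local staging area and the HDFS archive directory.
LOCAL_ARCHIVE_DIR = Path("/data/staging/archive")
HDFS_ARCHIVE_DIR = "/archive/offloaded"
CUTOFF = datetime.now() - timedelta(days=365)  # offload files older than a year

def offload_old_files():
    """Copy files older than the cutoff into HDFS using `hdfs dfs -put`."""
    for path in LOCAL_ARCHIVE_DIR.glob("*"):
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if modified < CUTOFF:
            # -f overwrites a file of the same name already present in HDFS.
            subprocess.run(
                ["hdfs", "dfs", "-put", "-f", str(path), HDFS_ARCHIVE_DIR],
                check=True,
            )
            print(f"Offloaded {path.name} to {HDFS_ARCHIVE_DIR}")

if __name__ == "__main__":
    offload_old_files()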


Vendor Selection
Vendor selection for the Hadoop distribution is often guided by the client, depending on their preferences, the vendor's market share or existing partnerships. The main Hadoop distribution vendors are Cloudera, Hortonworks, MapR and IBM BigInsights (Cloudera and Hortonworks being the prominent ones). As far as capabilities are concerned, all are similar, with small nuances in terms of cost and the services they offer.

Deployment Strategy
The deployment strategy decides whether an on-premises, a cloud-based or a hybrid deployment is required. Each of these has its own pros and cons. *

An on-premises solution tends to be more secure (at least in the customer's mind :) and this is typically a concern for BFS and healthcare customers) as data doesn't leave the premises. However, hardware procurement and maintenance cost considerably more money, effort and time.

A cloud-based solution is a cost-effective, pay-as-you-go model that provides a lot of flexibility in terms of scalability and eliminates procurement and maintenance overhead. (Prominent cloud vendors are AWS and Rackspace.)

A hybrid deployment strategy gives us the best of both worlds and can be planned to retain PII data on premises and the rest in the cloud.
* The deployment strategy is purely based on the use case and the customer's security requirements.

Capacity Planning
Capacity planning plays a pivotal role in hardware and infrastructure sizing. Important factors to be considered are listed below, followed by a rough sizing sketch:
·         Data volume for the one-time historical load
·         Daily data ingestion volume
·         Retention period of data
·         HDFS Replication factor based on criticality of data
·         Time period for which the cluster is sized (typically 6 months to 1 year), after which the cluster is scaled horizontally based on requirements
·         Multi-datacentre deployment
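
A back-of-the-envelope sketch of how these factors translate into a raw storage estimate is given below; all of the input figures (historical load, daily ingestion, retention, replication factor, overhead and headroom) are hypothetical placeholders, not recommendations.

# Rough HDFS storage estimate from the capacity-planning factors above.
# All input figures are hypothetical placeholders.

historical_load_tb = 40.0      # one-time historical load
daily_ingest_tb = 0.5          # daily ingestion volume
retention_days = 365           # retention period
replication_factor = 3         # HDFS replication factor
intermediate_overhead = 0.25   # scratch/temp space for processing (25%)
growth_headroom = 0.20         # headroom for the sizing period (e.g. one year)

raw_data_tb = historical_load_tb + daily_ingest_tb * retention_days
replicated_tb = raw_data_tb * replication_factor
with_overhead_tb = replicated_tb * (1 + intermediate_overhead)
required_capacity_tb = with_overhead_tb * (1 + growth_headroom)

print(f"Logical data      : {raw_data_tb:.1f} TB")
print(f"After replication : {replicated_tb:.1f} TB")
print(f"Required capacity : {required_capacity_tb:.1f} TB")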

Infrastructure sizing
Infrastructure sizing is based on the capacity plan and decides the type of hardware required: the number of machines, CPU, memory, HDD, network and so on.
It also involves deciding the number of clusters/environments required. Typically we may have Development, QA, Prod and DR.

Important factors to be considered (an indicative node-count sketch follows the list):
·         Type of processing – memory-intensive or I/O-intensive
·         Type of disk
·         Number of disks per machine
·         Memory size
·         HDD size
·         Number of CPUs and cores
·         Data retained and stored in each environment (e.g. Dev may hold 30% of Prod data)
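
Continuing the hypothetical numbers from the capacity-planning sketch, the snippet below derives an indicative data-node count from assumed per-node disk figures; it illustrates the arithmetic only and is not vendor-specific sizing guidance.

# Indicative data-node count from the storage estimate above.
# Per-node figures are assumptions for illustration only.

required_capacity_tb = 1000.0  # storage estimate from the capacity-planning sketch (rounded)
disks_per_node = 12            # number of disks per machine
disk_size_tb = 4.0             # HDD size per disk
hdfs_usable_fraction = 0.80    # leave ~20% for OS, logs and non-HDFS use

usable_per_node_tb = disks_per_node * disk_size_tb * hdfs_usable_fraction
data_nodes = -(-required_capacity_tb // usable_per_node_tb)  # ceiling division

print(f"Usable storage per node : {usable_per_node_tb:.1f} TB")
print(f"Indicative data nodes   : {int(data_nodes)}")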

Backup and Disaster Recovery planning
Backup and disaster recovery is a very important part of planning and involves the following considerations (a cross-cluster backup sketch follows the list):
·         The criticality of data stored
·         RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements.
·         Active-active or active-passive disaster recovery mechanism
·         Multi-datacentre deployment
·         Backup interval (can be different for different types of data)
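
As a minimal sketch of how a periodic HDFS-to-HDFS backup could be driven, the snippet below invokes Hadoop's standard DistCp tool to copy a directory to a DR cluster; the cluster addresses, paths and six-hour interval are hypothetical and should be derived from the RPO/RTO requirements above.

import subprocess
import time

# Hypothetical source and DR cluster locations.
SOURCE = "hdfs://prod-nn:8020/data/critical"
TARGET = "hdfs://dr-nn:8020/backup/critical"
BACKUP_INTERVAL_SECONDS = 6 * 60 * 60  # align with the RPO for this data set

def run_backup():
    """Copy the source directory to the DR cluster using DistCp.

    -update copies only files that are missing or changed on the target.
    """
    subprocess.run(
        ["hadoop", "distcp", "-update", SOURCE, TARGET],
        check=True,
    )

if __name__ == "__main__":
    while True:
        run_backup()
        time.sleep(BACKUP_INTERVAL_SECONDS)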


Part 2 will explain each of the logical steps in designing a BigData architecture.
