Friday 15 September 2017

Planning a cloud based Hadoop Environment

The first step in starting a project is to plan the Hadoop cluster. The cluster needs to be planned carefully based on current and anticipated future needs. There are several aspects to designing the cluster:
Hadoop distribution selection
Hardware Selection
Selection of storage
Determine data volume and number of machines

1.    Hadoop Distribution Selection

There are multiple vendors in this space; the most prominent ones are Cloudera and Hortonworks.
The following features can be used to compare and pick one of them:
Cost of license
Professional Services
Training
Integration with end user tools
Interoperability with other systems
Security and data protection
Customer preference

2.    Hardware Selection

Important parameters to consider are the type of workload, disk space, I/O bandwidth, computational power, and memory.

Type of workload

The type of workload gives us an idea of what type of machine we should use. The following types of workloads can be considered:


Courtesy: Cloudera

Balanced Workload
Workloads that are distributed equally across the various job types (CPU bound, Disk I/O bound, or Network I/O bound). This is a good default configuration for unknown or evolving workloads.
Recommended Configuration:
T2, M4, and M3 AWS instance types are general-purpose instances; T2 provides a baseline level of CPU performance with the ability to burst above the baseline.


Compute Intensive
Workloads that are CPU bound and require a large number of CPU cycles and large amounts of memory to store in-process data. (Ex: natural language processing or HPCC workloads)
Example: Clustering/Classification, Complex text mining, Natural-language processing, Feature extraction
Recommended Configuration:
C4 and C3 AWS instance types are compute-optimized instances.

Memory Intensive
Workloads that are memory bound and require a huge amount of memory.
Recommended Configuration:
X1 and R4 AWS instance types are optimized for large-scale, enterprise-class, in-memory applications.

I/O Intensive
Workloads that are bound by the I/O capacity of the cluster.
For this type of workload, more disks should be used per machine.
Example: Indexing, Grouping, Data importing and exporting, Data movement and transformation
Recommended Configuration:
I3 AWS instance types are storage-optimized instances that provide Non-Volatile Memory Express (NVMe) SSD-backed instance storage, optimized for low latency, very high random I/O performance, and high sequential read throughput, delivering high IOPS at a low cost.

Unknown or evolving workload patterns
Workloads where the pattern may not be known in advance.
Recommended Configuration:
We should start with the balanced-workload configuration.
We can either install Hadoop on EC2 machines or use the Amazon EMR service. Amazon EMR can quickly provision hundreds or thousands of instances, automatically scale to match compute requirements, and shut the cluster down when the job is complete. This is very useful if you have variable or unpredictable processing requirements.
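As a rough illustration, the snippet below sketches how an EMR cluster could be provisioned with the boto3 Python SDK. The instance types, counts, region, release label, and S3 log bucket are illustrative assumptions rather than recommendations from this post.

import boto3

# Sketch only: provisions a small, auto-terminating EMR cluster for a
# balanced workload; all names and sizes below are illustrative assumptions.
emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="hadoop-poc-cluster",                  # hypothetical cluster name
    ReleaseLabel="emr-5.8.0",                   # an EMR release current in 2017
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",      # balanced (general-purpose) master
        "SlaveInstanceType": "m4.xlarge",       # balanced worker nodes
        "InstanceCount": 4,                     # 1 master + 3 core nodes
        "KeepJobFlowAliveWhenNoSteps": False,   # shut the cluster down when done
        "TerminationProtected": False,
    },
    LogUri="s3://my-emr-logs/",                 # hypothetical S3 log bucket
    JobFlowRole="EMR_EC2_DefaultRole",          # default EMR instance profile
    ServiceRole="EMR_DefaultRole",              # default EMR service role
)
print("Started cluster:", response["JobFlowId"])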

Master hardware selection

Master processes tend to be RAM-hungry but low on disk space consumption. The NameNode and JobTracker are also rather adept at producing logs on an active cluster, so plenty of space should be reserved on the disk or partition on which logs will be stored. NameNode disk requirements are modest in terms of storage, as all metadata must fit in memory (a rough heap estimate follows the specification list below).
Cloudera recommended specifications for NameNode/JobTracker/Standby NameNode nodes:
4–6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for Journal node)
2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
64-128GB of RAM
Bonded Gigabit Ethernet or 10Gigabit Ethernet
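To make the "metadata must fit in memory" point concrete, the back-of-the-envelope sketch below uses the commonly cited rule of thumb of roughly 1 GB of NameNode heap per million HDFS objects (files plus blocks). The data volume, block size, and blocks-per-file figures are illustrative assumptions, not values from this post.

# Rough NameNode heap estimate; all inputs below are illustrative assumptions.
total_data_tb = 300                 # assumed active data volume (pre-replication)
block_size_mb = 128                 # default HDFS block size
blocks = total_data_tb * 1024 * 1024 / block_size_mb
files = blocks / 1.5                # assume ~1.5 blocks per file on average
objects_millions = (blocks + files) / 1e6

heap_gb = objects_millions * 1.0    # rule of thumb: ~1 GB heap per million objects
print(f"~{objects_millions:.1f}M objects -> roughly {heap_gb:.0f} GB of NameNode heap")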

Worker hardware selection

Worker node hardware is selected based on the type of workload. 20-30% of the machine's raw disk capacity needs to be reserved for temporary data (a quick per-node estimate follows the specification list below).

Cloudera recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:
12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
64-512GB of RAM
Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed)
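As a quick illustration of the 20-30% temporary-data reservation mentioned above, the sketch below estimates the usable HDFS capacity of one worker node. The disk count, disk size, and 25% reservation are illustrative assumptions.

# Per-node usable capacity sketch; disk layout and reservation are assumptions.
disks_per_node = 12            # e.g. 12 x 2 TB JBOD
disk_size_tb = 2
temp_reservation = 0.25        # reserve 25% of raw capacity for temporary data

raw_tb = disks_per_node * disk_size_tb
usable_tb = raw_tb * (1 - temp_reservation)
print(f"Raw: {raw_tb} TB, usable for HDFS data: {usable_tb} TB per worker node")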

3.    Selection of storage

There are different types of storage that Amazon provides:
Instance store – These are local stores attached to the instance and are temporary; their lifecycle is tied to the instance. Since they are ephemeral, regular backups need to be taken.
Elastic Block Store (EBS) – These are persistent volumes, external to the instance and independent of the instance lifecycle. These volumes are automatically replicated within their Availability Zone to protect against component failure, offering high availability and durability.
EBS provides volumes of two major categories: SSD-backed storage volumes and HDD-backed storage volumes (a short provisioning sketch follows this list).
HDD-backed volumes should be used for large-block sequential workloads.
SSD-backed volumes should be used for random small-block workloads.

Simple Storage Service (S3) – S3 is an external object store that can be accessed via a simple API; bandwidth depends on the instance type. Amazon S3 supports cross-region replication to automatically copy objects across S3 buckets in different AWS Regions asynchronously, providing a disaster recovery solution for business continuity.
Amazon Glacier – Data that requires encrypted archival storage with infrequent read access and a long recovery time objective (RTO) can be stored in Amazon Glacier more cost-effectively.
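As a short illustration of the EBS volume categories above, the boto3 sketch below provisions an st1 (throughput-optimized HDD) volume for large-block sequential workloads and a gp2 (general-purpose SSD) volume for random small-block workloads. The region, Availability Zone, and sizes are illustrative assumptions.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed region

# HDD-backed volume for large-block sequential I/O (e.g. bulk data storage)
hdd_volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # assumed Availability Zone
    Size=2000,                       # GiB, illustrative
    VolumeType="st1",                # throughput-optimized HDD
)

# SSD-backed volume for random small-block I/O (e.g. service metadata)
ssd_volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=200,                        # GiB, illustrative
    VolumeType="gp2",                # general-purpose SSD
)

print("Created volumes:", hdd_volume["VolumeId"], ssd_volume["VolumeId"])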

4.    Determine data volume and number of machines

Data volume

Total Data volume consists of two things:
Historical Data – Existing data residing in current systems that needs to be migrated into the Bigdata solution.
Daily Incremental Data – Data ingested into the Bigdata solution as a part of daily ETL.
We should also account for the year-on-year growth percentage.

 

Data retention period

The data retention period is the length of time for which we want to retain data in an active state before archiving it. It differs across industries based on multiple factors such as compliance and regulatory needs, reporting requirements, etc.
With the advent of Bigdata systems and data storage becoming cheaper by the day, it is now possible to store multiple years of data as active data rather than archiving it on tape. Tape archival and management is cumbersome and requires longer SLAs for fetching the data back.


Data Volume Estimation


Yearly data volume estimation:
Current data volume = Historical data + (Daily incremental data * 365)
Data volume after replication = Current data volume * Replication factor
Buffer = 35-40% of the current data volume
Total data = Data volume after replication + Buffer
Note: The default replication factor is 3; it can be changed based on the criticality of the data.
We should start small and scale horizontally as the data volume grows. Initial cluster sizing can be done for one year, and estimates for subsequent years can be derived from the year-on-year growth (a worked sizing sketch follows below).
This sizing applies to Hadoop (HDFS); for HBase, this calculation would not apply.
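The sketch below turns the estimation steps above into a small Python calculation. The historical volume, daily increment, buffer percentage, and per-node usable capacity are illustrative assumptions.

# Yearly sizing sketch following the formulas above; inputs are assumptions.
def yearly_cluster_size_tb(historical_tb, daily_incremental_tb,
                           replication_factor=3, buffer_pct=0.40):
    """Return the total raw storage (TB) needed for one year of data."""
    current = historical_tb + daily_incremental_tb * 365
    replicated = current * replication_factor
    buffer = current * buffer_pct        # 35-40% of the current data volume
    return replicated + buffer

# Example: 50 TB of historical data, 100 GB ingested per day
total_tb = yearly_cluster_size_tb(historical_tb=50, daily_incremental_tb=0.1)

# Rough worker-node count, assuming ~18 TB of usable HDFS capacity per node
# (see the per-node estimate in the worker hardware section)
nodes = -(-total_tb // 18)               # ceiling division
print(f"Total storage: {total_tb:.1f} TB, approx. {int(nodes)} worker nodes")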

