Planning a Cloud-Based Hadoop Environment
The first step in starting a project is to plan the Hadoop cluster. The cluster needs to be planned carefully based on current and anticipated future needs. There are several aspects to designing the cluster:
1. Hadoop distribution selection
2. Hardware selection
3. Selection of storage
4. Determining data volume and number of machines
1. Hadoop Distribution Selection
There are multiple vendors in this space; the most prominent are Cloudera and Hortonworks. The following criteria can help in picking one of them:
- Cost of license
- Professional services
- Training
- Integration with end-user tools
- Interoperability with other systems
- Security and data protection
- Customer preference
2. Hardware Selection
Important parameters to consider are the type of workload, disk space, I/O bandwidth, computational power, and memory.
Type of workload
The type of workload gives us an idea of what kind of machine we should use. The following types of workloads can be considered:
(Courtesy: Cloudera)
Balanced Workload
Workloads that are distributed equally across the various job types (CPU bound, disk I/O bound, or network I/O bound). This is a good default configuration for unknown or evolving workloads.
Recommended Configuration: T2, M4, and M3 general-purpose AWS instance types, which provide a baseline level of CPU performance with the ability to burst above the baseline.
Compute Intensive
Workloads that are CPU bound and require a large number of CPU cycles and large amounts of memory to store in-process data (e.g., natural language processing or HPCC workloads).
Examples: clustering/classification, complex text mining, natural-language processing, feature extraction.
Recommended Configuration: C4 and C3 AWS instance types are compute-optimized instances.
Memory Intensive
Workloads that are memory bound and require a large amount of memory.
Recommended Configuration: X1 and R4 AWS instance types are optimized for large-scale, enterprise-class, in-memory applications.
I/O Intensive
Workloads that are bound by the I/O capacity of the cluster. For this type of workload, more disks per machine should be used.
Examples: indexing, grouping, data importing and exporting, data movement and transformation.
Recommended Configuration: I3 AWS instance types are storage-optimized instances that provide Non-Volatile Memory Express (NVMe) SSD-backed instance storage, optimized for low latency, very high random I/O performance, and high sequential read throughput, and deliver high IOPS at a low cost.
Unknown or evolving workload patterns
Workloads where the pattern may not be known in advance.
Recommended Configuration: Start with the balanced-workload configuration.
We can either install Hadoop on EC2 machines or use the Amazon EMR service. Amazon EMR can quickly provision hundreds or thousands of instances, automatically scale to match compute requirements, and shut the cluster down when the job is complete. This is very useful if you have variable or unpredictable processing requirements.
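As a minimal sketch of this approach, the snippet below provisions a small, transient EMR cluster with boto3. The cluster name, region, EMR release label, key pair, instance types, and counts are illustrative assumptions, not recommendations:

import boto3

# Minimal sketch: provision a small, transient EMR cluster for a balanced
# workload. Region, release label, key pair, and counts are placeholders.
emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='hadoop-poc-cluster',                 # assumed cluster name
    ReleaseLabel='emr-5.30.0',                 # pick a current EMR release
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Hive'}],
    Instances={
        'InstanceGroups': [
            {'Name': 'Master', 'InstanceRole': 'MASTER',
             'InstanceType': 'm4.xlarge', 'InstanceCount': 1},
            {'Name': 'Workers', 'InstanceRole': 'CORE',
             'InstanceType': 'm4.xlarge', 'InstanceCount': 4},
        ],
        'Ec2KeyName': 'my-key-pair',           # placeholder key pair
        'KeepJobFlowAliveWhenNoSteps': False,  # shut down after the job
    },
    JobFlowRole='EMR_EC2_DefaultRole',         # default EMR roles assumed
    ServiceRole='EMR_DefaultRole',
)
print('Cluster id:', response['JobFlowId'])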
Master hardware selection
Master processes tend to be RAM-hungry but low on disk space consumption. The NameNode and JobTracker are also rather adept at producing logs on an active cluster, so plenty of space should be reserved on the disk or partition on which logs will be stored. NameNode disk requirements are modest in terms of storage, as all metadata must fit in memory.
Cloudera recommended specifications for NameNode/JobTracker/Standby NameNode nodes:
- 4-6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for the JournalNode)
- 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
- 64-128GB of RAM
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet
Worker hardware selection
Worker node hardware is selected based on the type of workload. 20-30% of each machine's raw disk capacity needs to be reserved for temporary data; a rough per-node capacity sketch follows the specification list below.
Cloudera recommended specifications for DataNode/TaskTracker nodes in a balanced Hadoop cluster:
- 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
- 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
- 64-512GB of RAM
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet (the greater the storage density, the higher the network throughput needed)
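To illustrate how the temporary-data reservation affects usable space, the sketch below assumes 12 x 2TB disks per worker and a 25% reservation, both picked from the ranges above rather than prescribed values:

# Rough per-worker capacity estimate; disk count, disk size, and the
# temporary-data reservation are assumptions drawn from the ranges above.
disks_per_node = 12
disk_size_tb = 2.0
temp_reservation = 0.25          # 20-30% of raw capacity kept for temp data

raw_capacity_tb = disks_per_node * disk_size_tb
usable_hdfs_tb = raw_capacity_tb * (1 - temp_reservation)
print(f"Usable HDFS capacity per worker: {usable_hdfs_tb:.1f} TB")  # ~18 TB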
3. Selection of Storage
Amazon provides different types of storage:
- Instance store – Local storage attached to the instance; it is temporary, and its lifecycle is tied to the instance. Since these volumes are ephemeral, regular backups need to be taken.
- Elastic Block Store (EBS) – Persistent storage external to the instance and independent of the instance lifecycle. These volumes are automatically replicated within their Availability Zone to protect against component failure, offering high availability and durability. EBS provides two major categories of volumes: SSD-backed and HDD-backed. HDD-backed volumes should be used for large-block sequential workloads, and SSD-backed volumes for random small-block workloads (see the sketch after this list).
- Simple Storage Service (S3) – An external data store that can be accessed via a simple API; bandwidth depends on the instance. Amazon S3 supports cross-region replication to automatically and asynchronously copy objects across S3 buckets in different AWS Regions, providing a disaster recovery solution for business continuity.
- Amazon Glacier – Data that requires encrypted archival storage with infrequent read access and a long recovery time objective (RTO) can be stored in Amazon Glacier more cost-effectively.
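As a small illustration of the HDD-versus-SSD guidance for EBS, the sketch below creates one throughput-optimized HDD (st1) volume and one general-purpose SSD (gp2) volume with boto3; the region, Availability Zone, and sizes are placeholder assumptions:

import boto3

# Illustration only: st1 (HDD) suits large-block sequential I/O,
# gp2 (SSD) suits random small-block I/O. AZ and sizes are placeholders.
ec2 = boto3.client('ec2', region_name='us-east-1')

sequential_volume = ec2.create_volume(
    AvailabilityZone='us-east-1a',
    Size=2000,                 # GiB; throughput-optimized HDD
    VolumeType='st1',
)

random_io_volume = ec2.create_volume(
    AvailabilityZone='us-east-1a',
    Size=500,                  # GiB; general-purpose SSD
    VolumeType='gp2',
)

print(sequential_volume['VolumeId'], random_io_volume['VolumeId'])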
4. Determine Data Volume and Number of Machines
Data volume
The total data volume consists of two things:
- Historical data – Existing data residing in current systems that needs to be migrated into the big data solution.
- Daily incremental data – Data ingested into the big data solution as part of the daily ETL.
We should also account for the year-on-year growth percentage.
Data retention period
The data retention period is the length of time for which we want to keep data in an active state before archiving it. It differs across industries based on multiple factors such as compliance and regulatory needs, reporting requirements, etc.
With the advent of big data systems and data storage becoming cheaper by the day, it is now possible to keep multiple years of data active rather than archiving it on tape. Tape archival and management was cumbersome and required longer SLAs for fetching the data back.
Data Volume Estimation
Yearly data volume estimation:
- Current data volume = Historical data + (Daily incremental data * 365)
- Data volume after replication = Current data volume * Replication factor
- Buffer = 35-40% of the current data volume
- Total data = Data volume after replication + Buffer
Note: The replication factor is 3 by default; it can be changed based on the criticality of the data.
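A small worked example of these formulas (all input volumes are illustrative assumptions, and the per-worker usable capacity is taken from the earlier hardware sketch, likewise an assumption):

import math

# Worked example of the yearly sizing formulas; all inputs are placeholders.
historical_tb = 50.0            # existing data to migrate (TB), assumed
daily_incremental_tb = 0.1      # daily ETL ingest (TB), assumed
replication_factor = 3          # HDFS default
buffer_fraction = 0.40          # 35-40% of the current data volume

current_volume = historical_tb + daily_incremental_tb * 365
after_replication = current_volume * replication_factor
buffer = buffer_fraction * current_volume
total_data_tb = after_replication + buffer

# Rough machine count, assuming ~18 TB of usable HDFS space per worker
# (see the per-node sketch in the hardware section).
usable_per_worker_tb = 18.0
workers_needed = math.ceil(total_data_tb / usable_per_worker_tb)

print(f"Total first-year capacity: {total_data_tb:.1f} TB")
print(f"Approximate worker nodes:  {workers_needed}")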
We should start small and scale horizontally as the data volume grows. Initial cluster sizing can be done for one year, and estimates for subsequent years can be derived from the year-on-year growth. This sizing applies to Hadoop; for HBase this calculation would not apply.