Thursday, 7 June 2018

Bigdata Testing – What to do?

“All code is guilty, until proven innocent.” – Anonymous

This is the first post in a series on what to keep in mind while doing Bigdata testing.

Why is testing so important?

Testing is an integral part of the software development life cycle. It ensures that the software works as per the specifications and is of high quality. Testing is required for a software application or product to perform effectively.

Testing in the Bigdata world is even more complicated due to the size and variety of data.

Why is Bigdata testing complex?

The 3 Vs of Bigdata that make it so powerful also make Bigdata testing very complex, and the technology landscape adds to the challenge:

Volume: The volume of data is so huge that it is next to impossible to test the entire data set.

Variety: Sources can have data coming in different formats – structured, semi-structured and unstructured.

Velocity: The rate of ingestion can vary with the type of source (batch, real time, near real time, etc.). With the need to ingest data in real time, replicating these testing scenarios is challenging.

Technology Landscape: There are a number of open source technologies present in the market for data ingestion, processing and analytics. This increases the learning curve for the tester and also increases the time and effort required for testing.

Bigdata testing Methodology

Bigdata testing can be primarily divided into 3 areas:

Data validation testing (Preprocessing)
Business logic validation testing (Processing)
Output validation testing (Consumption)

This blog talks about the different things that should be tested in the data validation testing phase.

Data Validation Testing

Data validation testing ensures that the right data has been ingested into the system. It consists of testing the data ingestion pipeline and the data storage.

Data Ingestion

Prominent Bigdata technologies used for data ingestion are Flume, Sqoop, Spark, Kafka, NiFi, etc.

Multi-Source Integration – There is a need to integrate data from multiple sources into the ingestion pipeline. These sources would have different volumes of data getting generated at different times.

Things to test:

What is the mechanism of data ingestion from each source? (Is data pulled or pushed from each source?)
What is the format in which data is received from each source?
Has input been checked for consistency with a minimum/maximum range?
Is the data secure while getting transferred from source to target?
Who has access to the source systems (RDBMS tables or SFTP folders)?
Are the files (in case of SFTP) getting archived or purged after copy, based on the pre-decided frequency?
Are audit logs getting generated?
Are ingestion error logs getting generated?

Integrity Check – Data integrity can be checked via multiple methods like record count, file size, checksum validation etc.

Things to test:

What is the size of the file transferred?
Is the number of columns in the source and target matching?
Is the sequence of columns correct?
Are the column data types matching?
Is the record count matching for both source and target?
Are the checksums for source and target matching?
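The integrity checks above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the source and target are both available as CSV bytes; the function names (file_checksum, profile_csv, compare_profiles) and the sample data are hypothetical. Column data types are not covered here, and a byte-level checksum only matches when the target keeps the same file format as the source.

```python
import csv
import hashlib
import io


def file_checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of the given bytes."""
    return hashlib.new(algorithm, data).hexdigest()


def profile_csv(data: bytes) -> dict:
    """Collect size, column sequence, record count and checksum for a CSV payload."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    header, records = rows[0], rows[1:]
    return {
        "size_bytes": len(data),
        "columns": header,  # a list, so the column order is compared too
        "record_count": len(records),
        "checksum": file_checksum(data),
    }


def compare_profiles(source: dict, target: dict) -> list:
    """Return the names of the checks that do not match."""
    return [check for check in source if source[check] != target[check]]


# Hypothetical source extract and two target copies.
source_bytes = b"id,name,amount\n1,alice,10\n2,bob,20\n"
good_target = source_bytes
bad_target = b"id,amount,name\n1,10,alice\n"  # swapped columns, one row lost

print(compare_profiles(profile_csv(source_bytes), profile_csv(good_target)))
print(compare_profiles(profile_csv(source_bytes), profile_csv(bad_target)))
```

An empty list means every check passed; otherwise the list names the checks that failed for that file.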

Change Data Capture (CDC) – For incremental data loads, all the scenarios of insert, update and delete of records/fields need to be tested.

Things to test:

Are the new records getting inserted?
Are the records deleted at the source marked as deleted at the target?
If history is maintained, are new versions getting created for the updated records? If not, are the records getting overwritten with the latest version?
Are audit logs getting generated?
Are error logs getting generated?
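One way to test the insert, update and delete scenarios is to compare two snapshots of the source with the target after the incremental load. The sketch below assumes a target that overwrites updated records (no history) and flags deletes with an is_deleted field; verify_incremental_load and the field names are hypothetical, not part of any CDC tool.

```python
def verify_incremental_load(source_before, source_after, target):
    """Compare two source snapshots with the target after an incremental load.

    source_before / source_after map a record key to its value.
    target maps a record key to {"value": ..., "is_deleted": bool}.
    Returns a list of human-readable issues (empty when the load is correct).
    """
    issues = []
    # Every record still present at the source must be active and current.
    for key, value in source_after.items():
        row = target.get(key)
        if row is None:
            kind = "insert" if key not in source_before else "record"
            issues.append(f"{key}: {kind} missing in target")
        elif row["is_deleted"]:
            issues.append(f"{key}: active at source but flagged deleted")
        elif row["value"] != value:
            issues.append(f"{key}: target does not hold the latest value")
    # Every record removed at the source must be flagged as deleted.
    for key in sorted(source_before.keys() - source_after.keys()):
        row = target.get(key)
        if row is None or not row["is_deleted"]:
            issues.append(f"{key}: deleted at source but not flagged in target")
    return issues


# Hypothetical load: record 2 updated, record 3 deleted, record 4 inserted.
before = {1: "a", 2: "b", 3: "c"}
after = {1: "a", 2: "b2", 4: "d"}

target_ok = {
    1: {"value": "a", "is_deleted": False},
    2: {"value": "b2", "is_deleted": False},
    3: {"value": "c", "is_deleted": True},
    4: {"value": "d", "is_deleted": False},
}
target_stale = {
    1: {"value": "a", "is_deleted": False},
    2: {"value": "b", "is_deleted": False},
    3: {"value": "c", "is_deleted": False},
}

print(verify_incremental_load(before, after, target_ok))
for issue in verify_incremental_load(before, after, target_stale):
    print(issue)
```

If history is maintained instead, the same idea applies, but the check on updates would look for a new version rather than an overwritten value.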

Data Quality – Quality of data is one of the most important aspects of data ingestion. Data quality rules are defined by functional SMEs after discussion with business users. These rules are specific to the source system.

Things to test:

Are any of the key fields Null?
Are there any duplicate records?
Have all the data quality rules been applied?
Is the total number of good and bad records matching the number of records ingested from each source?
Are the good records getting passed on to the next layer?
Are the bad records getting stored for future reference?
Can the bad records be traced back to the source?
Can the reason a record was marked as bad be traced?
Can we trace the good and bad records to the respective data loads?
Can the good and bad records be cleaned up for a particular load?
Are error/audit logs getting generated?
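Several of these checks (null keys, duplicates, rule application, good plus bad equals total, traceability of bad records) can be illustrated together. The sketch below is an assumption-laden example: split_good_bad, the rule name amount_in_range and the 0 to 1000 range are all made up for illustration, since real rules come from the functional SMEs.

```python
def split_good_bad(records, key_field, rules):
    """Split records into good and bad lists.

    rules maps a rule name to a predicate that returns True for a valid record.
    Each bad record keeps its source line number and the reasons it failed,
    so it can be traced back to the source.
    """
    good, bad, seen_keys = [], [], set()
    for line_no, record in enumerate(records, start=1):
        reasons = []
        key = record.get(key_field)
        if key is None or key == "":
            reasons.append("null_key")
        elif key in seen_keys:
            reasons.append("duplicate_key")
        else:
            seen_keys.add(key)
        reasons += [name for name, rule in rules.items() if not rule(record)]
        if reasons:
            bad.append({"source_line": line_no, "reasons": reasons, "record": record})
        else:
            good.append(record)
    return good, bad


# Hypothetical rule: amount must be present and within an agreed range.
rules = {
    "amount_in_range": lambda r: r.get("amount") is not None and 0 <= r["amount"] <= 1000,
}

records = [
    {"id": 1, "amount": 50},
    {"id": None, "amount": 20},   # null key
    {"id": 1, "amount": 30},      # duplicate key
    {"id": 2, "amount": 5000},    # out of range
    {"id": 3, "amount": 10},
]

good, bad = split_good_bad(records, "id", rules)

# Reconciliation: good + bad must equal the number of records ingested.
assert len(good) + len(bad) == len(records)
for entry in bad:
    print(entry["source_line"], entry["reasons"])
```

Keeping the source line and the failed rule names alongside each bad record is what makes the traceability questions above answerable later.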

Data Storage

Data ingested from multiple sources can be stored on premise or in the cloud, based on the deployment model chosen:

On premise – Hadoop Distributed File System (HDFS), S3 or NoSQL data store.

Cloud based – S3, Glacier, HDFS, NoSQL, RDS, etc.

Things to test:

Common across storage:

What is the file format of the ingested data?
Is the data getting compressed as per the decided compression format?
Is the integrity of data maintained after the compression?
Is the data getting archived to correct location?
Is the data getting archived at the decided time frame?
Is the data getting purged at the decided time frame?
Is the data stored in the file system or NoSQL store accessible?
Is the archived data accessible?
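The question about data integrity after compression can be checked by decompressing the stored copy and comparing checksums with the original. A minimal sketch, assuming gzip is the agreed compression format (the actual format depends on the project) and using hypothetical helper names:

```python
import gzip
import hashlib


def sha256(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_compressed_copy(original: bytes, compressed: bytes) -> bool:
    """True when decompressing the stored copy gives back the original bytes."""
    return sha256(gzip.decompress(compressed)) == sha256(original)


# Hypothetical ingested file and its compressed copy in storage.
original = b"id,name\n1,alice\n2,bob\n" * 1000
stored = gzip.compress(original)

print("compressed smaller than original:", len(stored) < len(original))
print("integrity maintained:", verify_compressed_copy(original, stored))
```

For large files the checksum would normally be computed in chunks rather than by loading everything into memory.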

Simple Storage Service (S3):

Is there a requirement to encrypt data in S3? If yes, is the data getting encrypted correctly?
Is there a requirement for versioning data in S3? If yes, is the data getting versioned?
Are rules set to move data to RRS (Reduced Redundancy Storage) or Glacier? Is the data moving according to the policy?
Is all the access to buckets tested as per bucket policy and ACLs that have been set?
Is the data in S3 encrypted?
Can S3 data be accessed from the nodes in the cluster?

How to Test?

The most common ways to test data are the following:

Data Comparison – The simplest way is to compare the source and target files for any differences.
Minus Query – The purpose of minus queries is to run source-minus-target and target-minus-source queries over all the data, making sure the extraction process did not produce duplicate data and that all unnecessary columns are removed before loading the data for validation.
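A minus query can be tried out on a small scale with Python's built-in sqlite3 module. SQLite spells the operator EXCEPT (Oracle uses MINUS), and the table names, columns and rows below are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER, name TEXT, amount INTEGER);
    CREATE TABLE target (id INTEGER, name TEXT, amount INTEGER);
    INSERT INTO source VALUES (1, 'alice', 10), (2, 'bob', 20), (3, 'carol', 30);
    INSERT INTO target VALUES (1, 'alice', 10), (2, 'bob', 25), (4, 'dave', 40);
""")


def minus(conn, left, right):
    """Return the rows present in `left` but not in `right`.

    Table names are interpolated directly, so they must be trusted values.
    """
    query = f"SELECT * FROM {left} EXCEPT SELECT * FROM {right} ORDER BY 1"
    return conn.execute(query).fetchall()


# Rows missing from, or different in, the target.
print("source minus target:", minus(conn, "source", "target"))
# Rows that exist only in the target.
print("target minus source:", minus(conn, "target", "source"))
```

Both result sets are empty only when the two tables hold exactly the same distinct rows. Because EXCEPT works on distinct rows, a record count check is still needed to catch duplicates.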

Neither of these methods is easy to use in the Bigdata world, as the volume is huge. Hence we have to use different methods:

Data Sampling – The entire data set cannot be tested record by record as the volume is huge, so data needs to be tested by sampling.
It’s very important to select the correct sample of data to ensure we have tested all scenarios. Samples should be selected to cover the maximum number of variations in the data.
Basic Data Profiling – Another method to ensure that the data has been ingested correctly is data profiling, for example count of records ingested, aggregation on certain columns, minimum and maximum value ranges, etc.
Running queries to check data correctness – Queries like grouping on certain columns or fetching the top and bottom 5 records can help in checking whether the correct number of records has been ingested for each scenario.
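Sampling and basic profiling can be sketched as follows. Stratified sampling is used here as one possible way to cover the variations in the data; it is an assumption of this example, not the only option. The field names source and amount and the helper names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, strata_field, per_stratum, seed=42):
    """Pick up to `per_stratum` random records for every value of `strata_field`.

    Sampling per value ensures rare variations are not missed, which a plain
    random sample over the whole data set could easily do.
    """
    rng = random.Random(seed)  # fixed seed keeps the test sample reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[strata_field]].append(record)
    sample = []
    for value in sorted(groups):
        rows = groups[value]
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample


def profile(records, numeric_field, group_field):
    """Basic profile: record count, min, max, sum and count per group."""
    values = [record[numeric_field] for record in records]
    counts = defaultdict(int)
    for record in records:
        counts[record[group_field]] += 1
    return {
        "count": len(records),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "by_group": dict(counts),
    }


# Hypothetical ingested data: 90 records from "web" and 10 from "batch".
records = [
    {"source": "web" if i % 10 else "batch", "amount": i} for i in range(100)
]

sample = stratified_sample(records, "source", per_stratum=5)
print(len(sample), sorted({r["source"] for r in sample}))
print(profile(records, "amount", "source"))
```

The same profile computed on the source side gives the numbers to compare against.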

Part 2 of the blog will cover the details of business logic validation testing and data consumption.

Part 3 of the blog will cover aspects of performance and security testing.

Thursday, 21 September 2017

Data Lake Explained

Why Data Lake?

With the advent of Bigdata, enterprises needed new ways to store, process and analyze data. The existing traditional data warehouses were not capable of handling such a huge volume and variety of data. Data lakes became popular because they provided a cost-effective and technologically feasible way to meet these big data challenges.

What is Data Lake?

A data lake is an architecture that allows collecting, storing, processing, analyzing and consuming all the data that flows into an organization. A data lake stores data in its raw (as-is) or native format.

It can also be considered as a repository created from the data coming in from disparate data sources.

Pentaho CTO James Dixon coined the term data lake:

“If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state”

A data lake can be imagined as a huge grid with billions of rows and columns, where each cell of the grid may contain a different type of data. Thus, one cell can contain a document, another a photograph, and yet another a paragraph or a single word of text. No matter where the data came from, it will just be stored in a cell.

How is a data lake different from a data warehouse?

A data lake and a data warehouse are very different, as each is optimized for a different purpose. There is a significant difference in the way the data is stored, processed and analyzed in each of them.

(Note: DL – Data lake, DW – Data warehouse)

Data: DW stores data which is highly structured, processed and transformed. DL stores data in raw format, which can be structured, semi-structured or unstructured.

Storage: DW follows a ‘schema on write’ approach where the data structure is decided before ingesting the data. DL follows a ‘schema on read’ (late binding) approach where any form of data can be ingested and the schema is decided while reading the data.

Processing: DW has the capability to process structured data, whereas DL has the capability to process data in any format (structured, semi-structured and unstructured). DW follows ETL (extract, transform and load): the structures are predefined and data is loaded into those predefined structures after transformation. A top-down approach is used.

DL follows ELT (extract, load and transform): the data is loaded first and then transformed based on need. A bottom-up approach is used.

Consumption: Since DW has a predefined structure, only pre-determined questions can be answered, and because data is aggregated, visibility into the lowest levels is lost. DL offers endless possibilities for querying data in different ways, as data is stored at the lowest granularity.

Cost: The cost of building a DL is low compared to a DW, as it uses open source technologies like Hadoop, NoSQL, etc.

Flexibility: DL can adapt to change easily as data is stored in raw format and it is ‘schema on read’. Changes in DW would require considerable time and effort.
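The ‘schema on write’ versus ‘schema on read’ distinction from the comparison above can be illustrated with a toy example. Plain Python lists stand in for the warehouse and the lake, and the schema, field names and functions are hypothetical; real systems enforce this in the storage or query engine.

```python
import json

# Hypothetical warehouse schema, fixed before any data is loaded.
SCHEMA = {"id": int, "name": str}


def write_with_schema(store, record):
    """Schema on write: reject the record unless it matches SCHEMA."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    store.append(record)


def write_raw(store, payload):
    """Schema on read: keep the payload exactly as it arrived."""
    store.append(payload)


def read_with_schema(store, fields):
    """Apply a schema only while reading; missing fields become None."""
    for payload in store:
        document = json.loads(payload)
        yield {field: document.get(field) for field in fields}


warehouse, lake = [], []

write_with_schema(warehouse, {"id": 1, "name": "alice"})
try:
    write_with_schema(warehouse, {"id": 2, "device": "mobile"})
except ValueError as error:
    print("warehouse rejected record:", error)

# The lake accepts both shapes; the reader decides which fields matter.
write_raw(lake, '{"id": 1, "name": "alice", "clicks": 7}')
write_raw(lake, '{"id": 2, "device": "mobile"}')
print(list(read_with_schema(lake, ["id", "device"])))
```

The second reader could ask for a completely different set of fields from the same raw payloads, which is the flexibility the comparison refers to.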

Data Lake Architecture

Data Lake Zones

Courtesy: DZone

Transient Zone – This layer is usually the landing layer where data is pulled or pushed from different source systems. It’s a temporary zone where data is kept before it’s inserted into the data lake.
Raw Zone – This layer stores data ingested from multiple sources in raw (native or as-is) format.
Trusted Zone – This layer stores data after applying data quality rules on the raw data.
Refined Zone – This layer stores manipulated and enriched data.
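The flow of data through the four zones can be sketched with in-memory lists. This is only a toy model: run_zones, the validity rule and the amount_band enrichment are hypothetical, and in a real data lake each zone would be a separate storage location.

```python
def run_zones(incoming, is_valid, enrich):
    """Move records through transient, raw, trusted and refined zones."""
    transient = list(incoming)   # landing area for pulled or pushed data
    raw = list(transient)        # stored as-is, nothing filtered or changed
    transient.clear()            # the transient zone is only temporary
    trusted = [record for record in raw if is_valid(record)]  # quality rules applied
    refined = [enrich(record) for record in trusted]          # enriched data
    return {
        "transient": transient,
        "raw": raw,
        "trusted": trusted,
        "refined": refined,
    }


# Hypothetical incoming records, quality rule and enrichment.
incoming = [
    {"id": 1, "amount": 10},
    {"id": None, "amount": 5},   # fails the quality rule
    {"id": 2, "amount": 30},
]


def is_valid(record):
    return record["id"] is not None


def enrich(record):
    band = "high" if record["amount"] >= 20 else "low"
    return {**record, "amount_band": band}


zones = run_zones(incoming, is_valid, enrich)
for name, records in zones.items():
    print(name, len(records))
```

Note that the raw zone still holds the record that failed the quality rule, so it can be reprocessed if the rules change.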

Benefits to Enterprise

A data lake acts as a central store of data and helps enterprises get meaningful business insights.

It eliminates the need to query multiple systems existing in silos to get relevant information.

Maintenance, monitoring and security become comparatively easy as data is managed at a central location.

All kinds of data (structured, semi structured or unstructured) can be stored in the data lake.

It is ‘schema on read’ and hence allows any kind of query to be designed on raw data.

Conclusion

Data lakes serve as a single repository to store vast amounts of data in its native format. A data warehouse used in conjunction with a data lake can deliver accelerating returns, as the two complement each other.

Friday, 15 September 2017

Planning a cloud based Hadoop Environment

The first step in starting a project is to plan the Hadoop cluster. The Hadoop cluster needs to be planned carefully based on the current and anticipated future needs. There are many aspects in designing the cluster:

Hadoop distribution selection

Hardware Selection

Selection of storage

Determine data volume and number of machines

1. Hadoop distribution Selection

There are multiple vendors in this space; the most prominent ones are Cloudera and Hortonworks.

Below are the features based on which we can pick one of them:

Cost of license

Professional Services

Training

Integration with end user tools

Interoperability with other systems

Security and data protection

Customer preference

2. Hardware Selection

Important parameters to consider are type of workload, disk space, I/O bandwidth, computational power, memory, etc.

Type of workload

The type of workload gives us an idea of what type of machine we should use. Following are the types of workloads that can be looked at:

Courtesy: Cloudera

Balanced Workload

Workloads that are distributed equally across the various job types (CPU bound, Disk I/O bound, or Network I/O bound). This is a good default configuration for unknown or evolving workloads.