Thursday, 23 March 2017

Introduction to 'BIG DATA'

This post gives a basic introduction to Big Data.

Introduction to Big Data

The term ‘Big Data’ refers to collections of datasets characterized by huge volume, high velocity, and wide variety, and that keep growing day by day. Traditional data management systems struggle to process data at this scale. To address these management and processing challenges, the Apache Software Foundation introduced a framework called Hadoop.

Hadoop

Hadoop is an open-source framework for storing and processing Big Data in a distributed environment. Two of its core modules are MapReduce and the Hadoop Distributed File System (HDFS).

· MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
· HDFS: The Hadoop Distributed File System is the storage layer of the Hadoop framework, used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
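
To make the MapReduce model more concrete, here is a minimal single-machine sketch of a word count in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are assumptions made up for illustration. On a real cluster, the map and reduce steps would run in parallel on different nodes, with the input read from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Illustrative only: here every line is mapped in one process,
    # whereas a cluster would map separate input splits on separate nodes.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input.
    sample = ["big data needs big clusters", "hadoop stores big data"]
    print(word_count(sample))
```

The three phases are kept as separate functions because that separation is what lets a framework distribute the work: map calls are independent of each other, and each key can be reduced independently once the shuffle has grouped its values.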

The Hadoop ecosystem also includes sub-projects (tools) such as Sqoop, Pig, and Hive that complement the Hadoop modules.

· Sqoop: It is used to import and export data between HDFS and relational databases (RDBMS).
· Pig: It is a procedural language platform used to write scripts for MapReduce operations.
· Hive: It is a platform used to write SQL-like scripts for MapReduce operations.

Note: There are various ways to execute MapReduce operations:

· The traditional approach, using a Java MapReduce program, for structured, semi-structured, and unstructured data.
· The scripting approach, using Pig, to process structured and semi-structured data.
· The query approach, using Hive and the Hive Query Language (HiveQL or HQL), to process structured data.
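
As a rough illustration of why a SQL-type query can be expressed as MapReduce operations, consider a query of the form SELECT dept, COUNT(*) FROM employees GROUP BY dept. The sketch below mimics that query in plain Python with a map, shuffle, and reduce step. The table, its columns, and its rows are hypothetical, and this is only a conceptual model, not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of a table employees(name, dept); the data is made up.
EMPLOYEES = [
    ("asha", "sales"),
    ("ben", "engineering"),
    ("chen", "sales"),
    ("dara", "engineering"),
    ("eli", "support"),
]


def mapper(row):
    """Map: emit (dept, 1), using the GROUP BY column as the key."""
    _name, dept = row
    return dept, 1


def count_by_dept(rows):
    """Conceptual equivalent of: SELECT dept, COUNT(*) ... GROUP BY dept."""
    groups = defaultdict(list)
    for key, value in map(mapper, rows):
        groups[key].append(value)  # shuffle: collect values per key
    # Reduce: COUNT(*) becomes a sum over the emitted ones.
    return {dept: sum(ones) for dept, ones in groups.items()}


if __name__ == "__main__":
    print(count_by_dept(EMPLOYEES))
```

The GROUP BY column plays the role of the MapReduce key and the aggregate function plays the role of the reducer, which is the intuition behind writing a short query instead of a full Java program for structured data.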
