Thursday, 23 November 2017

Thesis on Big Data - Topics and Research Areas

What is Big Data?
Big Data refers to large volumes of data, structured or unstructured, that require new technologies and techniques to handle. The organised form of data is known as structured data, while the unorganised form is known as unstructured data. The data sets in big data are so large and complex that we cannot handle them using traditional application software. There are frameworks like Hadoop designed specifically for processing big data. These techniques are also used to extract useful insights from data through predictive analytics and user behaviour analytics. You can explore big data fundamentals further while working on a thesis in Big Data. Big Data is defined by three Vs:
    Volume – It refers to the amount of data that is generated. The data can be low-density or high-volume, structured or unstructured, or data of unknown value. Data of unknown value is converted into useful information using technologies like Hadoop. The volume can range from terabytes to petabytes.
    Velocity – It refers to the rate at which data is generated. The data is received at an unprecedented speed and must be acted upon in a timely manner. It also requires real-time evaluation and action in the case of Internet of Things (IoT) applications.
    Variety – Variety refers to the different formats of data. It may be structured, unstructured, or semi-structured, and can be audio, video, text, or email. Additional processing is required to derive meaning from such data and to support its metadata.
In addition to these three Vs, the following Vs are also defined for big data:
    Value – Each form of data has some value which needs to be discovered. There are qualitative and quantitative techniques for deriving meaning from data, and extracting that value often requires new discoveries and techniques.

    Variability – Another dimension of big data is the variability of data, i.e. the flow of data can be high or low, and there are challenges in managing this changing flow.

Big Data Hadoop
Hadoop is an open-source framework for processing and storing big data. Hadoop makes use of simple programming models to process big data in a distributed environment across clusters of computers. Hadoop provides storage for large volumes of data along with advanced processing power. It also gives the ability to handle many tasks and jobs concurrently.

Big Data Hadoop Architecture
HDFS is the main storage component of the Hadoop architecture. It stands for Hadoop Distributed File System. It is used to store large amounts of data across multiple machines. MapReduce is another component of the architecture: data is processed in a distributed manner across multiple machines. The YARN component manages data processing resources such as CPU and memory. Resource Manager and Node Manager are the elements of YARN, and they work as master and slave: the Resource Manager is the master and assigns resources to the slave, i.e. the Node Manager, and the Node Manager signals the master when it starts work. Choosing Big Data Hadoop for your thesis will be a plus point for you.
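As a rough illustration of the MapReduce model (not of Hadoop's actual API), the following plain R sketch counts words by splitting the work into a map step, a shuffle step, and a reduce step. The sample text is made up for illustration, and everything runs locally in a single R session.

# Map step: split each line into individual words (conceptually, (word, 1) pairs)
lines <- c("big data needs hadoop", "hadoop stores big data")
mapped <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# Shuffle step: group identical words together
grouped <- split(mapped, mapped)

# Reduce step: count the occurrences of each word
counts <- sapply(grouped, length)
print(counts)

On a real cluster, the map and reduce functions run in parallel on many nodes, with YARN's Resource Manager and Node Managers allocating the CPU and memory for each task.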

Importance of Hadoop in Big Data
Hadoop is essential when working with big data. Its importance is highlighted in the following points:
    Processing of huge chunks of data – With Hadoop, we can process and store huge amounts of data, mainly data from social media and Internet of Things (IoT) applications.
    Computation power – The computation power of Hadoop is high, as it can process big data quickly using a distributed processing model.
    Fault tolerance – Hadoop provides protection against data loss and hardware failure. If a node in the distributed model goes down, the other nodes continue to function, and multiple copies of the data are stored.
    Flexibility – As much data as you require can be stored using Hadoop, and there is no requirement to preprocess the data before storing it.
    Low Cost – Hadoop is an open-source framework and free to use. It runs on commodity hardware to store large quantities of data.
    Scalability – The system can be grown easily just by adding nodes as the requirements grow. Minimal administration is required.

R Programming Language
R is an open-source programming language and software environment for statistical analysis, graphical representation, and reporting. The R language is extensively used by statisticians and data miners for data analysis and for developing statistical software. Robert Gentleman and Ross Ihaka are the two authors of the language, and it is named ‘R’ after the first letter of their first names.
The source code of the R software environment is written mainly in C, Fortran, and R itself. R is a GNU package and is freely available under the GNU General Public License.

Working with Big Data in R
The R language has been around for more than 20 years, but it has gained attention recently because of its capacity to handle Big Data. R provides a series of packages and an environment for statistical computation on Big Data. The Programming with Big Data in R (pbdR) project was developed a few years ago and is mainly used for data profiling and distributed computing. R packages and functions are available to load data from almost any source.
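As a minimal sketch of the kind of workflow involved, the snippet below loads a data set into R and produces basic descriptive statistics; the file name is hypothetical and stands in for whatever source you are analysing.

# Load a (hypothetical) CSV file of sensor readings and summarise it
readings <- read.csv("sensor_readings.csv")
str(readings)       # structure of the data set: column names and types
summary(readings)   # basic descriptive statistics for every column
# For files too large for read.csv, data.table::fread() is a much faster loader.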
Hadoop is a Big Data technology for handling large amounts of data. R and Hadoop can be integrated for Big Data analytics.

How to integrate R with Hadoop?
Data scientists use R packages and R scripts for data processing. To run these packages and scripts on Hadoop, they would normally have to be rewritten in Java or another language that implements the Hadoop MapReduce API. What is needed instead is a way to run software written in R directly against data kept in Hadoop's distributed storage. Following are some of the methods to integrate R with Hadoop:
1.      RHADOOP – It is the most commonly used solution for integrating R with Hadoop. This analytics solution allows users to take data directly from the HBase database and the HDFS file system, and it offers the advantages of simplicity and low cost. It is a collection of five packages for managing and analyzing data with R (a short rmr2 sketch is given after this list). Following are the five packages:
·         rhbase – This package provides database management functions for HBase from within R.
·         rhdfs – This package provides connectivity to the Hadoop Distributed File System (HDFS).
·         plyrmr – This package provides data manipulation operations on large data sets.
·         ravro – This package allows users to read and write Avro files from HDFS.
·         rmr2 – This package provides Hadoop MapReduce functionality in R, so analyses can be written as map and reduce functions over data stored in Hadoop.
2.      RHIPE – It is an acronym for R and Hadoop Integrated Programming Environment. It is an R library that gives users the ability to run MapReduce jobs from within R. It provides a data-distribution scheme and integrates well with Hadoop.
3.      R and Hadoop Streaming – Hadoop Streaming makes it possible to run MapReduce jobs using any executable script that reads data from standard input and writes results to standard output, acting as a mapper or a reducer. Hadoop Streaming can therefore run R scripts directly (a sample R mapper is sketched after this list).
4.      RHIVE – It is based on installing R on workstations and connecting to data in Hadoop. RHive is the R package for launching Hive queries. It has functions to retrieve metadata from Apache Hive such as database names, column names, and table names, and it also makes R's libraries and algorithms available for data stored in Hadoop. The main advantage is that operations can be parallelized.
5.      ORCH – It is an acronym for Oracle R Connector for Hadoop. It allows users to write and test MapReduce programs from R without any need to learn a new programming language.
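For method 1 (RHADOOP), the sketch below uses the rmr2 package to group the numbers 1 to 100 by their last digit and sum each group. It assumes RHadoop is installed; setting the backend to "local" runs everything on a single machine for testing, while a real cluster would use the "hadoop" backend instead.

library(rmr2)

rmr.options(backend = "local")     # run locally for testing; use "hadoop" on a cluster

ints <- to.dfs(1:100)              # write the sample data to the (local or HDFS) file system

result <- mapreduce(
  input  = ints,
  map    = function(k, v) keyval(v %% 10, v),    # key = last digit of each number
  reduce = function(k, vv) keyval(k, sum(vv))    # sum the values for each key
)

from.dfs(result)                   # read the (key, sum) pairs back into R

On a real cluster the same code runs unchanged; rmr2 translates the map and reduce functions into Hadoop Streaming jobs behind the scenes.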
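For method 3 (R and Hadoop Streaming), the sketch below is a word-count mapper written as a plain R script. It reads lines from standard input and emits tab-separated (word, 1) pairs on standard output, which is the contract Hadoop Streaming expects from a mapper; the script name is illustrative.

#!/usr/bin/env Rscript
# wordcount_mapper.R - emit one "word<TAB>1" line per word read from stdin
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  for (w in words[words != ""]) {
    cat(w, "\t1\n", sep = "")
  }
}
close(con)

The script would be submitted with the Hadoop Streaming jar (for example, hadoop jar hadoop-streaming.jar -input <in_dir> -output <out_dir> -mapper wordcount_mapper.R, together with a matching reducer script); the exact jar path and directory names depend on your Hadoop installation.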
