Apache Hadoop

Apache Hadoop is a software framework for the distributed processing of large datasets. Hadoop provides a distributed file system and a MapReduce implementation.

A special computer acts as a "name node". This computer stores the file system metadata, i.e. information about where data is located in the cluster.
Hadoop allows map and reduce functions to be written in Java.
Hadoop also provides bindings (Hadoop Streaming and Hadoop Pipes) so that map and reduce functions can be written in other languages, e.g. C++, Python, Perl, etc.


MapReduce - Hadoop MapReduce is a software framework for easily writing applications that process big amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:

  • The map task - this is the first task, which takes input data and converts it into a set of data, where individual elements are broken down into (key, value) tuples.
  • The reduce task - this task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
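The two tasks above can be sketched as a minimal single-machine word count in Python (a real Hadoop job would implement Mapper and Reducer classes in Java; the function names here are illustrative):

```python
from collections import defaultdict

def map_task(document):
    """Map phase: break the input into individual (word, 1) tuples."""
    return [(word, 1) for word in document.split()]

def reduce_task(mapped_tuples):
    """Reduce phase: combine the tuples into a smaller set of
    (word, count) pairs. Always runs after the map task."""
    counts = defaultdict(int)
    for word, n in mapped_tuples:
        counts[word] += n
    return dict(counts)

# The classic word-count job:
tuples = map_task("big data big clusters")
print(reduce_task(tuples))  # {'big': 2, 'data': 1, 'clusters': 1}
```

In a real cluster, many map tasks run in parallel on different blocks of input, and the framework shuffles their tuples by key before the reduce tasks combine them.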
HDFS (Hadoop Distributed File System) - HDFS is based on the Google File System and provides a distributed file system that is designed to run on large clusters of small computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture, where the master consists of a single name node that manages the file system metadata, and one or more slave data nodes store the actual data.
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system.
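The master/slave split can be illustrated with a toy Python model (all names, paths, and block contents below are made up for illustration): the name node holds only metadata, while the data nodes hold the actual bytes.

```python
# Name node: file system metadata only (which blocks make up a file).
name_node = {"/logs/app.log": ["blk_1", "blk_2"]}

# Name node: which data nodes hold a replica of each block.
block_locations = {
    "blk_1": ["datanode1", "datanode2"],
    "blk_2": ["datanode2", "datanode3"],
}

# Data nodes (slaves): the actual block contents.
data_nodes = {
    "datanode1": {"blk_1": b"first block of app.log..."},
    "datanode2": {"blk_1": b"first block of app.log...",
                  "blk_2": b"rest of app.log"},
    "datanode3": {"blk_2": b"rest of app.log"},
}

def read_file(path):
    """A client asks the name node for the block list, then reads each
    block directly from one of the data nodes that holds a replica."""
    data = b""
    for block in name_node[path]:
        node = block_locations[block][0]  # pick the first replica
        data += data_nodes[node][block]
    return data
```

Because each block is replicated on several data nodes, the loss of one machine does not lose data; the client simply reads another replica.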

Apache Hive - Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Apache Hive supports analysis of large data sets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 file system. It provides an SQL-like language called HiveQL.
Currently, there are four file formats supported in Hive: text file, SequenceFile, ORC, and RCFile.

Features -
  • Different storage types, such as plain text, RCFile, HBase, ORC, and many others.
  • Metadata is stored in an RDBMS.
  • Built-in user-defined functions to manipulate dates, strings, and other data-mining tools.
  • SQL-like queries (HiveQL), which are implicitly converted into MapReduce jobs.
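The last feature can be sketched in Python: a HiveQL GROUP BY query (the table and column names below are hypothetical) compiles to roughly a map step that emits key/value tuples and a reduce step that sums per key.

```python
from collections import defaultdict

# Hypothetical HiveQL query:
#   SELECT page, COUNT(*) FROM visits GROUP BY page;
# Rough map/reduce plan Hive would generate for it:

visits = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]

mapped = [(row["page"], 1) for row in visits]  # map: emit (page, 1)

groups = defaultdict(int)
for page, n in mapped:                         # shuffle + reduce: sum per key
    groups[page] += n

print(dict(groups))  # {'/home': 2, '/docs': 1}
```

The user only writes the declarative query; Hive plans and submits the MapReduce job automatically.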
Hadoop as a Service - Hadoop is an important technology for many big data projects and applications, and Hadoop as a Service has grown to satisfy the need created by this situation.
There are many such services:
  • Aleron - promotes a range of big data services, including Hadoop-focused offerings.
  • CenturyLink - the cloud service provider offers Hadoop blueprints.
  • CSC - the large integrator and MSP offers a Big Data Platform as a Service (BDPaaS).
  • HP Cloud - provides an elastic cloud computing and cloud storage platform to analyze and index large data volumes in the hundreds of petabytes in size.
  • IBM BigInsights on Cloud - provides Hadoop as a service on IBM's SoftLayer global cloud infrastructure with a bare-metal design.
  • Qubole - its main focus is Hadoop as a service.
  • Tieto - introduced a big data PaaS platform in 2012.

