Hadoop is an open-source framework overseen by the Apache Software Foundation, built for storing and processing huge data sets on clusters of commodity hardware. It is not well suited for small data sets. Hadoop can run on cheap hardware.
- Hadoop has its own file system, HDFS (Hadoop Distributed File System). We store data in HDFS and process it using MapReduce.
- Google designed the Google File System (GFS) and MapReduce for its own internal use and published papers describing them, but did not release the implementations.
- Yahoo! later backed the open-source implementation of these ideas as HDFS and MapReduce within the Hadoop project.
- Hadoop was created by Doug Cutting to store and process Big Data, with an elephant as its logo.
- HDFS and MapReduce are the core components of Hadoop.
We now live in a data-driven world, with enormous demands on storage and processing. Data that exceeds the capacity of existing storage and processing systems is called Big Data.
- Examples of data-generating sources include hospital records, sensors, CCTV cameras, social networks, online shopping, airlines, etc.
- We must have enough storage capacity for all of this data; the excess cannot simply be ignored. If our local system lacks capacity, we can store the data on external disks, in the cloud, etc.
- Initially, programs were processor-bound. If our data grows rapidly, we can keep it in data centers (on servers or in the cloud) and, when the data is needed, fetch it from the data center and process it locally.
- Volume, velocity, and variety of data together define Big Data.
Fundamentals of Hadoop:
HDFS and MapReduce are the core components of Hadoop.
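To make the MapReduce side concrete, here is a minimal pure-Python sketch of the classic word-count job. It only simulates the map, shuffle, and reduce phases in one process; a real Hadoop job would implement these as Java Mapper/Reducer classes running across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, as a Hadoop mapper would for each input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort step does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word, as a Hadoop reducer would.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data needs big storage", "hadoop stores big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

The key idea is that mappers never need to see each other's input, which is what lets Hadoop run them in parallel on different blocks of the file.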
The Hadoop Distributed File System is a file system specially designed to store huge data sets on clusters of commodity hardware with a streaming access pattern: we write a file once and read it any number of times without changing its contents.
HDFS uses a default block size of 64 MB; a 128 MB block size can also be used. Unlike a normal hard disk, if a file does not fill its last block, the remaining space is not wasted; it can be occupied by some other file. A Hadoop administrator takes care of HDFS.
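The block arithmetic can be sketched as follows, assuming the 64 MB default: a file is cut into fixed-size blocks, and only the final partial block is smaller.

```python
import math

BLOCK_SIZE_MB = 64  # HDFS default block size (Hadoop 1.x)

def split_into_blocks(file_size_mb):
    # Return the number of HDFS blocks and the size actually
    # stored in the last (possibly partial) block.
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB
    return num_blocks, last_block_mb

blocks, last = split_into_blocks(200)
print(blocks, last)  # a 200 MB file -> 4 blocks; the last stores only 8 MB
```

The last block occupies only 8 MB on disk, not a full 64 MB, which is why the "free space is not wasted" claim holds.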
Hadoop has five core services:
- Name Node
- Secondary Name Node
- Job Tracker
- Data Node
- Task Tracker
Master services can talk to each other internally, and slave services can talk to other slave services. The NameNode is the master service for the DataNodes; the JobTracker is the master service for the TaskTrackers. The NameNode does not interact with TaskTrackers, and likewise the JobTracker does not interact with DataNodes.
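The master/slave pairings above can be sketched as a small lookup table (the service names follow the list above; the function name is just for illustration):

```python
# Hadoop 1.x master/slave pairings, one master per layer.
MASTER_OF = {
    "DataNode": "NameNode",       # storage layer
    "TaskTracker": "JobTracker",  # processing layer
}

def can_interact(master, slave):
    # A master service talks only to its own slave services.
    return MASTER_OF.get(slave) == master

print(can_interact("NameNode", "DataNode"))     # True
print(can_interact("NameNode", "TaskTracker"))  # False
```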
A normal file system uses blocks of about 4 KB each, whereas HDFS uses 64 MB or 128 MB blocks; this cuts down the number of blocks to track and seek through, which overcomes the slowness of processing large files.
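A quick calculation shows why the large block size matters for a 1 GB file:

```python
GB_IN_KB = 1024 * 1024  # 1 GB expressed in KB
file_kb = 1 * GB_IN_KB  # a 1 GB file

blocks_4kb = file_kb // 4            # normal file system, 4 KB blocks
blocks_64mb = file_kb // (64 * 1024)  # HDFS, 64 MB blocks

print(blocks_4kb)   # 262144 blocks to track
print(blocks_64mb)  # 16 blocks to track
```

Tracking 16 blocks instead of 262,144 is also what keeps the NameNode's metadata small, as described next.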
The metadata holds the details about what is stored on the DataNodes: file names, block names, which blocks hold which data, and file sizes. The DataNodes hold the actual storage, while the metadata is data about that data. Because HDFS blocks are large, there are few of them per file, so the metadata stays small and is easy to maintain.
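A toy sketch of that metadata might look like the structure below. The layout, block IDs, and node names are illustrative assumptions, not the NameNode's real data structures; the point is that the NameNode stores only this small map, never the file bytes themselves.

```python
# Assumed sketch: one 130 MB file split into three 64 MB blocks,
# each replicated on three DataNodes.
metadata = {
    "/logs/access.log": {
        "size_mb": 130,
        "blocks": {
            "blk_0001": ["datanode1", "datanode2", "datanode3"],
            "blk_0002": ["datanode2", "datanode3", "datanode4"],
            "blk_0003": ["datanode1", "datanode3", "datanode4"],
        },
    },
}

def locate_blocks(path):
    # A client asks the NameNode where a file's blocks live,
    # then reads the actual bytes directly from those DataNodes.
    return metadata[path]["blocks"]

print(len(locate_blocks("/logs/access.log")))  # 3
```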
Next, read about MapReduce.