HDFS Architecture and Hive in Hadoop

HDFS Architecture

When a client wants to store a file containing huge data using Hadoop, the following flow occurs:

Initially, the client submits the file to the cluster, where the Name Node splits it, according to its size, into a number of blocks of 64 MB each (the default block size).

When the client asks the Name Node for the system names on which to store the file, the Name Node responds with the help of its Meta Data.

Now the client can directly approach those systems (Data Nodes) to store the split blocks.

HDFS keeps three replicas of each block, so if one system goes down there are still two backup copies on the other two systems. Once the data is stored correctly on all three systems, the Data Nodes send an acknowledgement back to the client.
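As a rough sketch of this flow from the command line, a file can be copied into HDFS and its blocks and replicas inspected with fsck; the file and directory names below are only assumptions for illustration:

$ hadoop fs -put /home/user/sales.log /data/sales.log
$ hadoop fsck /data/sales.log -files -blocks -locations
$ hadoop fs -setrep 3 /data/sales.log

The fsck output lists each 64 MB block of the file and the Data Nodes holding its replicas; -setrep explicitly sets the replication factor, which defaults to three.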

All the Data Nodes send a block report and a heartbeat to the Name Node at short, regular intervals. The block report tells the Name Node which blocks and files each Data Node holds; the heartbeat tells it that the Data Node is still alive and processing.

If one of the Data Nodes fails to send its heartbeat, the Name Node assumes that Data Node is dead, removes it from the Meta Data, and within seconds chooses another Data Node to hold the missing replicas.
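The Name Node's current view of live and dead Data Nodes can be checked with an admin command; this is just an illustrative check, not part of the normal write flow:

$ hadoop dfsadmin -report

The report shows the cluster capacity along with each Data Node's status and the time of its last contact (heartbeat).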

Now the data is stored. If we want to process the stored data, Map Reduce is used.

Data Nodes are commodity hardware, whereas the Name Node must be highly reliable, because if the Meta Data is lost, HDFS will never work.

What is Hive in Hadoop?

Hadoop supports both structured and unstructured data. Hive is open source software that supports the processing of structured data. Hive provides HiveQL, a query language originally developed at Facebook.

Hive's primitive datatypes include TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE and STRING; MAP, ARRAY and STRUCT are its collection datatypes. In Hive, every table is a directory.
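As a small sketch of how the collection datatypes can be declared (the table and column names here are only assumptions for illustration):

hive> create table employee_details(
        id int,
        name string,
        skills array<string>,
        phone map<string,string>,
        address struct<city:string, pin:int>)
      row format delimited
      fields terminated by '\t'
      collection items terminated by ','
      map keys terminated by ':';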

Differences between SQL and HiveQL:

  • SQL: data can be added row by row with the INSERT command. HiveQL: Hadoop is designed to handle huge data at a time, so there is no INSERT command.
  • SQL: the UPDATE command can be used. HiveQL: in HDFS, once data is written it cannot be changed, so there is no UPDATE.
  • SQL: the DELETE command can be used. HiveQL: DELETE cannot be used.
Creating Hive Tables:

Hive tables can be created in two ways:

1) Managed Tables / Internal Tables

2) External Tables

1) Managed Tables / Internal Tables:

  • $ hive (enter)
  • hive> create table emp(id int, name string, salary float)
  • row format delimited
  • fields terminated by '\t';

We have explicitly specified the row format, as shown above.

In the delimited format, '\n' takes you to the next line, i.e. each row of data ends with a newline.

In SQL, we must first create the table and then insert data into it, but in HiveQL we can have the data first and map a table onto it later, and vice versa.
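For example, a tab-separated text file matching the emp table above might look like this (the values are purely illustrative):

100	arun	25000.0
101	priya	32000.0
102	kumar	40000.0

Each field is separated by a tab and each row ends with a newline, exactly as declared in the row format.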

Loading data into Hive table:

Data can be loaded in two ways:

  • From the local file system
  • From HDFS

Loading Data from the local file system:

hive> load data local inpath '<file path>' into table <table name>;
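For instance, assuming the sample file above was saved as /home/user/emp.txt (an assumed path), it can be loaded with:

hive> load data local inpath '/home/user/emp.txt' into table emp;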

Loading Data from HDFS:

hive> load data inpath '<file path>' into table <table name>;
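Similarly, if the file already sits in HDFS (here /user/hadoop/emp.txt is only an assumed path), it can be loaded and then verified:

hive> load data inpath '/user/hadoop/emp.txt' into table emp;
hive> select * from emp;

Note that when loading from HDFS the file is moved into the table's warehouse directory, whereas loading from the local file system copies it.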

Warehouse – the data that is loaded into a table is stored under the warehouse directory.

MetaStore – keeps the metadata of the table, such as its schema and location.
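To see where a table's data actually lives, the table details can be inspected from the Hive shell; by default the warehouse directory is /user/hive/warehouse in HDFS:

hive> describe formatted emp;
$ hadoop fs -ls /user/hive/warehouse/emp

describe formatted shows the table's Location, and the HDFS listing confirms that the managed table emp is simply a directory holding the loaded files.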

2) External Tables

How to create an external table?

  • hive> create external table empl(id int, name string, depid int, depname string)
  • row format delimited
  • fields terminated by '\t'
  • location '/c/hadoop'; -> here c is a directory under the root of HDFS

If an internal table is created, it is created as a directory under the warehouse; an external table is not created under the warehouse, and the specified external location is used instead.
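This difference also shows up when the tables are dropped; a short sketch using the tables created above:

hive> drop table emp;
hive> drop table empl;

Dropping the managed table emp removes both its metadata and its data under the warehouse directory, while dropping the external table empl removes only the metadata; the data at /c/hadoop stays in HDFS.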
