Reducers will combine all the key-value pairs output of Mappers. The data which is present between Mappers and Reducers is Intermediate data. (Key-value pairs)
In Intermediate data, Values can be duplicate but not Keys. If there is any duplicate keys, it will undergo two phases shuffling and sorting.
Sorting cannot be done with duplicate keys .So in Shuffling phase, all the values are combined associated to single identical key. After shuffling phase there won’t be any duplicate key.
All the box classes by default have writable comparable interface. In this interface one key is compared with other key.So automatically all the keys are compared with each other and sorting is done. Basically all unique keys will compare each other and gives output in some sorting order because of Box classes. Now this key-value pair is given to Reducers. Reducers will work once for each key-value pair at a time.
Number of key-value pairs from Record Reader is equal to Number of Mappers.
Number of key-value pairs after sorting is equal to Number of Reducers.
By default, Hadoop framework has given Identity Reducer.We can over write our own reducer through reducer code. For shuffling and sorting our own reducer code is required otherwise identity reducer comes to role and there is only sorting, not shuffling.
|Mappers||No sorting & shuffling|
|Identity reducer||Sorting,No shuffling|
|Own Reducer||Both Sorting,shuffling|
Now reducers will work on key-value pairs and give final output to Record Writer. Record writer writes only one key-value pair in the output files.
RDBMS stores and process structured data.If there is huge structured data,it is impossible to RDBMS to store and process data. Sqoop is an interface/tool helps in importing or exporting huge data between RDBMS and HDFS. It is built on Map Reduce.
- By default, sqoop is working with 4 mappers and not working with reducers.
- We can import entire table by using sqoop
- We can import part of the table with “where” clause or with “columns” clause
- We can import all tables from a specific database
- We can set/specify number of mappers on our own using the command –m <no. of mappers>
- We can export table from HDFS to RDBMS.
Difference between External and Internal Table?
|Internal Table||External Table|
|Internal table emp will be created as:
Emp(file) – File having data When it is loaded from local system or HDFS
|External table emp will be created as:
/External – root directory
Emp(f) – Data will be loaded directly.
|When we drop the table, there will be no data file.
Data loss is there.
|When we drop the table Empl,there will be data file Emp. No data loss.|
|One file can be referred by one table.||One file can be referred by any number of tables.
Ex: Department table.
|Individual tables||Global usage tables|