What is a combiner in Hadoop and why do we need it?
As we know, the number of mappers equals the number of input splits. Each mapper emits key-value pairs for its split, and the reducers combine the key-value pairs output by all the mappers.
Hadoop is designed for huge data sets, not small ones. With huge data sets, the mappers emit a large volume of intermediate data, and all of it has to travel over the network to the reducers. That traffic slows the reducers down, so they may not produce the required output in time and performance drops.
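- Combiners were introduced to improve application performance by reducing network traffic. A combiner is a mini reducer: its code is often the same as the reducer's, which is safe when the reduce operation is commutative and associative (a sum or a maximum, for example). A combiner runs on the output of each mapper, so conceptually there are as many combiners as mappers, although Hadoop treats the combiner as an optimization and may run it zero, one, or several times per mapper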
- Each combiner works independently on the output of its own mapper. It takes the key-value pairs from that mapper, groups them by key locally, and emits a smaller, aggregated set of pairs. Once the combiners have finished, their output is shuffled and sorted, and the reducers merge it into the final result
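The effect of a combiner on traffic can be illustrated with a small word-count simulation. This is only a sketch in plain Java: it uses no Hadoop classes, and the names `CombinerSketch`, `map`, and `sumByKey` are hypothetical, chosen for illustration. It assumes the reduce operation is a sum, so the same function can serve as both combiner and reducer.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a word-count combiner; no Hadoop classes are used.
class CombinerSketch {

    // Map phase: emit (word, 1) for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Sum the values per key. Because addition is commutative and
    // associative, this one function acts as combiner and as reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] splits = {"a b a a", "b b c"}; // one split per mapper
        List<Map.Entry<String, Integer>> sentToReducers = new ArrayList<>();
        int rawPairs = 0;
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapped = map(split);
            rawPairs += mapped.size();
            // Combiner step: aggregate locally before anything is "sent".
            sentToReducers.addAll(sumByKey(mapped).entrySet());
        }
        System.out.println("pairs without combiner: " + rawPairs);
        System.out.println("pairs with combiner:    " + sentToReducers.size());
        System.out.println("final counts:           " + sumByKey(sentToReducers));
    }
}
```

In this toy input the two mappers emit 7 pairs in total, but only 4 pairs cross to the reducers after local combining, and the final counts are unchanged.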
What is a Partitioner in Hadoop (MapReduce)?
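- A partitioner decides which reducer receives each intermediate key-value pair. A well-chosen partitioner improves performance by balancing the load across reducers, and makes the output easier to read because related keys end up in the same output file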
- The mappers emit key-value pairs, which are intermediate data. This intermediate data is shuffled and sorted, then passed to the reducers as their input key-value pairs
- HashPartitioner is the default partitioner. It assigns each key-value pair to a reducer based on the hash code of the key, so pairs with the same key always reach the same reducer, but you cannot choose which reducer handles a particular key
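- We can also write our own partitioner to send specific key-value pairs to specific reducers, either by implementing the Partitioner interface (old org.apache.hadoop.mapred API) or by extending the Partitioner abstract class (new org.apache.hadoop.mapreduce API)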
- Configure the partitioner class in the driver code with conf.setPartitionerClass(MyPartitioner.class); so that the job uses it
- To set the number of reducers, use conf.setNumReduceTasks(numberOfReducers);
- Each reducer writes one output file, and the five-digit file names (part-00000 to part-99999) cover up to 1 lakh (100,000) reducers
- With a custom partitioner, configuring fewer reducers than the partitions it returns makes the job throw an exception, while configuring more reducers than needed leaves the extra reducers writing empty files
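The partitioning rules above can be sketched in plain Java without any Hadoop dependency. `hashPartition` uses the same arithmetic as Hadoop's HashPartitioner (clear the sign bit of the hash code, then take the remainder by the reducer count). Everything else here is hypothetical and for illustration only: the class name `PartitionerSketch`, the alphabetical rule in `customPartition`, and `checkPartition`, which merely mimics the kind of failure a job hits when a partition number has no matching reducer.

```java
import java.util.Locale;

// Plain-Java sketch of partitioning logic; no Hadoop classes are used.
class PartitionerSketch {

    // Same arithmetic as HashPartitioner: non-negative hash modulo reducers.
    static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical custom rule: keys starting with a-m go to reducer 0,
    // n-z to reducer 1, and everything else to reducer 2.
    static int customPartition(String key) {
        if (key.isEmpty()) {
            return 2;
        }
        char first = Character.toLowerCase(key.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;
        }
        if (first >= 'n' && first <= 'z') {
            return 1;
        }
        return 2;
    }

    // Stand-in for the job failure when too few reducers are configured:
    // a partition number must be smaller than the reducer count.
    static void checkPartition(int partition, int numReduceTasks) {
        if (partition < 0 || partition >= numReduceTasks) {
            throw new IllegalArgumentException(
                "Illegal partition " + partition + " for " + numReduceTasks + " reducers");
        }
    }

    // Output file name for a reducer index, zero-padded to five digits.
    static String partFileName(int reducer) {
        return String.format(Locale.ROOT, "part-%05d", reducer);
    }

    public static void main(String[] args) {
        int reducers = 3; // the custom rule above needs exactly three
        for (String key : new String[] {"apple", "zebra", "42"}) {
            int partition = customPartition(key);
            checkPartition(partition, reducers);
            System.out.println(key + " -> " + partFileName(partition)
                + " (hash partitioning would pick reducer "
                + hashPartition(key, reducers) + ")");
        }
    }
}
```

With this rule, running the job with only two reducers would fail for the key "42", because partition 2 has no reducer. Running it with four reducers would work, but reducer 3 would receive no keys and write an empty file.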
The four basic input file formats in Hadoop (MapReduce) are:
- Text Input Format
- Key Value Text Input Format
- Sequence File Input Format
- Sequence File As Text Input Format
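The difference between the first two formats lies in how each line of a text file becomes a key-value pair for the mapper. TextInputFormat uses the byte offset of the line as the key and the line itself as the value, while KeyValueTextInputFormat splits each line at the first separator (a tab by default) into key and value. The sketch below imitates that behaviour in plain Java, assuming UTF-8 text with "\n" line endings. The class and method names are hypothetical, and no Hadoop classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of how two text input formats form records.
class InputFormatSketch {

    // Like TextInputFormat: key = byte offset of the line start,
    // value = the line contents. Each record is {key, value}.
    static List<String[]> textRecords(String fileContents) {
        List<String[]> records = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.add(new String[] {Long.toString(offset), line});
            // Advance past the line's bytes plus the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Like KeyValueTextInputFormat with the default tab separator:
    // key = text before the first tab, value = the rest of the line.
    // A line without a tab becomes the key, with an empty value.
    static String[] keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String file = "id1\tfoo\nid2\tbar";
        for (String[] record : textRecords(file)) {
            String[] kv = keyValueRecord(record[1]);
            System.out.println("offset key=" + record[0]
                + " | tab-split key=" + kv[0] + ", value=" + kv[1]);
        }
    }
}
```

The two SequenceFile formats read Hadoop's binary key-value files instead of plain text. SequenceFileAsTextInputFormat additionally converts the stored keys and values to text.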