Need an implementation of Hadoop MapReduce RecordReader?

From the Apache documentation on the Hadoop MapReduce InputFormat interface:

" [L] ogical splits based on input size are not enough for many applications because the boundaries of the records must be respected. In such cases, the application must also implement a RecordReader which is responsible for maintaining the boundaries of the records and presents a record-oriented representation of the logical InputSplit to a separate task."

Is WordCount an example of an application in which logical splits based on input size are not enough? If so, where is the RecordReader implementation found in its source code?

2 answers

Input splits are logical references to the data. If you look at the InputSplit API, you will see that it knows nothing about record boundaries. One mapper is started for each input split, and the mapper's map() method is invoked once for each record (in WordCount, once for each line in the file).
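For concreteness, here is roughly the WordCount mapper (close to the TokenizerMapper inner class in the example shipped with Hadoop; treat it as a sketch rather than the exact source). Notice that it never reasons about where a record starts or ends; it simply receives one record per map() call:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One mapper instance handles one input split; map() is called once per
    // record, and with the default TextInputFormat a record is one line.
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // 'value' holds one complete record (one line); the framework,
            // not the mapper, decided where that record begins and ends.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }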

But how does the mapper know where the record boundaries are?

This is where your quote from the Hadoop MapReduce InputFormat documentation comes in:

the application must also implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task

Every mapper is associated with an InputFormat. That InputFormat holds the information about which RecordReader to use. Look at the InputFormat API and you will find that it knows about the input splits and about which record reader to use. If you want to learn a little more about input splits and record readers, you should read this answer.

The RecordReader defines what the record boundaries are; the InputFormat defines which RecordReader is used.
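To make that division of labour concrete, this is the approximate shape of the new-API org.apache.hadoop.mapreduce.InputFormat class, paraphrased from the Hadoop API docs (check the source for the exact declarations):

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public abstract class InputFormat<K, V> {

        // Carves the input into logical InputSplits, typically by size and
        // location only; this step knows nothing about record boundaries.
        public abstract List<InputSplit> getSplits(JobContext context)
                throws IOException, InterruptedException;

        // Supplies the RecordReader that turns one logical split into a
        // sequence of (key, value) records, respecting record boundaries.
        public abstract RecordReader<K, V> createRecordReader(
                InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException;
    }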

WordCount does not specify an InputFormat, so it defaults to TextInputFormat, which uses a LineRecordReader and presents each line as a separate record. And this is your source code.
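As a sketch, a WordCount driver could make that default explicit like this (TokenizerMapper is sketched earlier; IntSumReducer stands in for the usual summing reducer and is not shown):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);  // sketched earlier
            job.setReducerClass(IntSumReducer.class);   // usual summing reducer (not shown)
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // WordCount normally omits this line: TextInputFormat (and with it
            // LineRecordReader) is the default, which is why no RecordReader
            // appears anywhere in the example's own source.
            job.setInputFormatClass(TextInputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }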


[L]ogical splits based on input size are not enough for many applications because record boundaries must be respected.

This means that, for an example file such as

 abcde fghij klmno 

and we want each line to be a record. When logical splits are based on input size alone, the file may well end up divided into two splits, such as:

 abcde fg 

and

 hij klmno 

Without the RecordReader, fg and hij would be treated as parts of different records; clearly, this is not what most applications want.

To answer your question: in WordCount it does not matter exactly where the record boundaries are, but there is a chance that the same word would be divided across different logical splits. Therefore, logical splits based on size alone are not enough for WordCount.

Every MapReduce mapper respects record boundaries; otherwise, it would not be very useful.


You cannot see a RecordReader implementation in the WordCount example because it uses the default RecordReader and the default InputSplit provided by the framework.

If you want to see how they are implemented, you can find them in the Hadoop source code.
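As a rough guide to what you would find there, every subclass of org.apache.hadoop.mapreduce.RecordReader (LineRecordReader included) implements the methods below. The class name and method bodies here are illustrative placeholders, not Hadoop's actual code:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Skeleton of a line-oriented RecordReader; the real logic that handles
    // records straddling split boundaries lives in Hadoop's LineRecordReader.
    public class SketchLineRecordReader extends RecordReader<LongWritable, Text> {

        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Open the file and seek to the start of the split. If the split
            // does not start at byte 0, skip the partial first line: the
            // previous split's reader reads that record to completion.
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // Read the next full line, even if it runs past the end of this
            // split; put its byte offset in 'key' and its text in 'value'.
            return false; // placeholder: return true while records remain
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() { return 0.0f; } // fraction of split consumed

        @Override
        public void close() throws IOException {
            // Close the underlying stream.
        }
    }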

For more information about record readers and how they work, see: https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/

