Input splits are logical references to the data. If you look at the API, you will see that it knows nothing about record boundaries. For each input split, a mapper is started, and for each record the mapper's map() is called (in WordCount, once per line of the file).
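For illustration, here is a minimal sketch of a WordCount-style mapper (modeled on the standard Hadoop tutorial; the class name TokenizerMapper is just illustrative). map() receives one record at a time, and with the default input format each record is one line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the line within the file
        // value = the line itself, i.e. one record as delivered by the RecordReader
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}
```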
But how does the mapper know where the record boundaries are?
This is where your quote from the Hadoop MapReduce InputFormat interface comes in:
the application has to also implement a RecordReader, on whom lies the responsibility to respect record boundaries and present a record-oriented view of the logical InputSplit to the individual task
Every mapper is associated with an InputFormat. That InputFormat holds the information about which RecordReader to use. Look at the API and you will find that it knows about input splits and about which record reader to use. If you want to know a little more about input splits and record readers, you should read this answer.
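For reference, the org.apache.hadoop.mapreduce.InputFormat API boils down to two abstract methods, paraphrased below (Javadoc and annotations omitted): one produces the logical splits, the other supplies the RecordReader for a split.

```java
// Paraphrased outline of the InputFormat class; not something you would normally write yourself.
package org.apache.hadoop.mapreduce;

import java.io.IOException;
import java.util.List;

public abstract class InputFormat<K, V> {

  // Produces the logical splits of the input for the job;
  // nothing here knows about record boundaries.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Supplies the RecordReader that turns one split into a sequence of records.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
```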
The RecordReader defines what the record boundaries are; the InputFormat defines which RecordReader is used.
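If you wanted to make that choice explicit in a job, a minimal driver sketch could look roughly like this (TokenizerMapper is the mapper sketched above and IntSumReducer is the usual summing reducer from the WordCount example, assumed to exist; the setInputFormatClass call is redundant because TextInputFormat is already the default):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);  // the mapper sketched earlier
        job.setReducerClass(IntSumReducer.class);   // standard WordCount sum reducer, assumed

        // Explicit for clarity: TextInputFormat -> LineRecordReader -> one record per line.
        job.setInputFormatClass(TextInputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```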
WordCount does not specify an InputFormat, so it defaults to TextInputFormat, which uses a LineRecordReader and treats each line as a separate record. And here is the source code.
[L]ogical splits based on input-size is insufficient for many applications since record boundaries are to be respected.
This means that, for an example file such as

    abcde
    fghij
    klmno

where we want each line to be a record, logical splits based purely on input size could produce two splits such as:

    abcde
    fg

and

    hij
    klmno
Without a RecordReader, fg and hij would be treated as separate records; clearly, that is not what most applications want.
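To make that concrete, here is a small self-contained simulation (plain Java, not the actual Hadoop LineRecordReader; the helper name readLinesForSplit is made up) of the rule the line reader follows: a split that does not start at byte 0 skips the partial line it lands in, and every split reads past its own end to finish the last line it started. Applied to the splits above, fghij stays in one piece:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLineDemo {

    static List<String> readLinesForSplit(String file, int splitStart, int splitEnd) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;

        // Splits after the first skip to the next newline: the partial line they
        // land in belongs to the previous split, which reads across the boundary.
        if (splitStart != 0) {
            while (pos < file.length() && file.charAt(pos - 1) != '\n') {
                pos++;
            }
        }

        // Emit every line that *starts* inside this split, even if it ends past splitEnd.
        while (pos < file.length() && pos < splitEnd) {
            int eol = file.indexOf('\n', pos);
            if (eol == -1) eol = file.length();
            records.add(file.substring(pos, eol));
            pos = eol + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String file = "abcde\nfghij\nklmno\n";
        // Size-based splits that cut the second line in half:
        System.out.println(readLinesForSplit(file, 0, 8));   // [abcde, fghij]
        System.out.println(readLinesForSplit(file, 8, 18));  // [klmno]
    }
}
```

Together the two rules guarantee that every line is read by exactly one mapper, no matter where the size-based split boundaries fall.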
To answer your question: in WordCount it does not really matter where the record boundaries fall, but there is a chance that the same word would be split across different logical splits. That is why size-based logical splits are not sufficient for WordCount either.
Every map in MapReduce respects record boundaries; otherwise it would not be very useful.