Records and Record Boundaries in Hadoop

Suppose I have one input file, and for this file three blocks are created in HDFS. Assume I have three data nodes and each data node stores one block. If I have 3 input splits, then 3 mappers will run in parallel, each processing data locally on the corresponding data node. Each mapper receives its input as key-value pairs via the InputFormat and RecordReader. In this scenario I am using TextInputFormat, where each record is a full line of text from the file.

The question is: what happens if a record is broken across the end of the first block?

1) How does Hadoop read the full record in this scenario?

2) Does data node 1 contact data node 2 to get the full record?

3) What happens if data node 2 starts processing the data and finds an incomplete record in the first line?

3 answers
  • Hadoop will continue reading past the end of the first block until an EOL character or EOF is reached.
  • Data nodes do not interact with each other outside of data replication (when instructed by the name node). Instead, the HDFS client will read the data from node 1 and then from node 2.
  • Some examples to clarify:
    • If you have a single-line record spanning a 300 MB file with a 128 MB block size, mappers 2 and 3 will start reading from file offsets 128 MB and 256 MB respectively. They will both scan ahead looking for the next EOL and only start producing records from that point. In this example, both mappers will actually process 0 records.
    • A 300 MB file with two lines of 150 MB each and a 128 MB block size: mapper 1 processes the first line, finding its EOL character in block 2. Mapper 2 starts at the 128 MB offset (block 2), scans ahead and finds the EOL character at offset 150 MB, then continues scanning, finds EOF after block 3, and processes that data (the second line). Mapper 3 starts at offset 256 MB (block 3), scans forward, and hits EOF before any EOL character, therefore processing 0 records.
    • A 300 MB file with 6 lines, each 50 MB long:
      • mapper 1 - offset 0 → 128 MB, lines 1 (0 → 50), 2 (50 → 100), 3 (100 → 150)
      • mapper 2 - offset 128 MB → 256 MB, lines 4 (150 → 200), 5 (200 → 250), 6 (250 → 300)
      • mapper 3 - offset 256 MB → 300 MB, 0 lines
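The arithmetic in that last example can be checked with a short standalone sketch (plain Java, not Hadoop code; the class and method names below are invented for illustration). The rule it models: a split effectively owns every line that starts inside its byte range, because a non-first split always throws away its first line while the previous split reads one line past its own end.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch, not Hadoop code: which lines does each split process?
// A split owns every line whose starting byte offset falls inside
// [splitStart, splitEnd), because a non-first split throws away its first
// line while the previous split reads one line past its own end.
public class SplitSimulator {
    // lineStarts: byte offset where each line begins; returns 1-based line numbers
    static List<Integer> linesForSplit(long[] lineStarts, long splitStart, long splitEnd) {
        List<Integer> owned = new ArrayList<>();
        for (int i = 0; i < lineStarts.length; i++) {
            if (lineStarts[i] >= splitStart && lineStarts[i] < splitEnd) {
                owned.add(i + 1);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        long MB = 1L << 20;
        // 300 MB file, 6 lines of 50 MB each, splits of 128/128/44 MB
        long[] lineStarts = {0, 50 * MB, 100 * MB, 150 * MB, 200 * MB, 250 * MB};
        long[][] splits = {{0, 128 * MB}, {128 * MB, 256 * MB}, {256 * MB, 300 * MB}};
        for (int m = 0; m < splits.length; m++) {
            System.out.println("mapper " + (m + 1) + " -> lines "
                    + linesForSplit(lineStarts, splits[m][0], splits[m][1]));
        }
    }
}
```

Running it reproduces the assignment in the answer: mapper 1 gets lines 1-3, mapper 2 gets lines 4-6, and mapper 3 gets none, because no line starts at or after the 256 MB offset.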

Hope that helps

  • Hadoop will perform a remote read from node 2 to get the rest of the record
  • Yes
  • From what I understand, node 2 will ignore the incomplete record

If you have Hadoop: The Definitive Guide, take a look at page 246 (in the latest edition), which discusses this exact issue (albeit rather briefly, unfortunately).


In the source code of the LineRecordReader.java constructor I found these comments:

    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;

From this I believe (not confirmed) that Hadoop reads one extra line for each split (at the end of the current split, it reads the first line belonging to the next split), and for every split except the first, the first line is thrown away. So no line record is lost or left incomplete.
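That behavior can be modeled with another short standalone sketch (plain Java; `BoundaryModel` and `readSplit` are invented names, not Hadoop APIs): skip the first line when `start != 0`, then keep emitting lines until the read position has passed the split end. This means the reader emits one line beyond its boundary, which is exactly the line the next split discards.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of LineRecordReader's boundary handling (a sketch, not the
// real Hadoop class). A reader for split [start, end):
//   1. skips its first (possibly partial) line when start != 0, and
//   2. keeps emitting lines until its position has passed end, so it reads
//      one line beyond the split boundary.
public class BoundaryModel {
    static List<String> readSplit(String file, int start, int end) {
        int pos = start;
        if (start != 0) { // throw away the first line; the previous split read it
            int nl = file.indexOf('\n', pos);
            pos = (nl == -1) ? file.length() : nl + 1;
        }
        List<String> records = new ArrayList<>();
        while (pos < end && pos < file.length()) { // stop once past the split end
            int nl = file.indexOf('\n', pos);
            records.add(file.substring(pos, nl == -1 ? file.length() : nl));
            pos = (nl == -1) ? file.length() : nl + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // The record "bravo" straddles a split boundary placed at byte 8.
        String file = "alpha\nbravo\ncharlie\n";
        System.out.println(readSplit(file, 0, 8));  // split 1 reads past its end
        System.out.println(readSplit(file, 8, 20)); // split 2 drops its first line
    }
}
```

Split 1 emits "alpha" and the whole of "bravo" even though "bravo" crosses byte 8, while split 2 skips ahead past the partial "bravo" and emits only "charlie", so every record is processed exactly once.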

