Hadoop: reading a few lines at a time

I have a file in which each set of four lines represents a record.

For example, the first four lines represent record 1, the next four represent record 2, and so on.

How can I ensure that the mapper receives these four lines at a time?

In addition, I want Hadoop to split the file on record boundaries (line numbers that are multiples of four), so that records do not get spread across multiple splits.

How can I do that?

2 answers

A few approaches, some dirtier than others:


The right way

You may need to define your own RecordReader, InputSplit, and InputFormat. Depending on what you are trying to do, you may be able to reuse some of the existing ones. You will probably have to write your own RecordReader to define the key/value pairs, and you will probably have to write your own InputSplit to help enforce the boundary.
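For illustration only, here is a minimal sketch of what such a four-line RecordReader might look like with the newer org.apache.hadoop.mapreduce API, delegating the actual line reading to LineRecordReader (the class name FourLineRecordReader is made up for this example):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Illustrative sketch: groups every four consecutive lines into one value.
    public class FourLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            StringBuilder record = new StringBuilder();
            int linesRead = 0;
            while (linesRead < 4 && lineReader.nextKeyValue()) {
                if (linesRead == 0) {
                    // Use the byte offset of the record's first line as the key.
                    key.set(lineReader.getCurrentKey().get());
                } else {
                    record.append('\n');
                }
                record.append(lineReader.getCurrentValue().toString());
                linesRead++;
            }
            if (linesRead == 0) {
                return false; // no more lines in this split
            }
            value.set(record.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }

Note that this only handles grouping the lines; making the splits themselves fall on four-line boundaries still requires work in the InputFormat/InputSplit, as described above.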


Another correct way that may not be possible

The above task is fairly involved. Do you have control over your data set? Can you pre-process it in some way (either as it arrives or at rest)? If so, you should strongly consider transforming your data set into something that Hadoop can read easily out of the box.

Something like:

    ALine1
    ALine2                ALine1;ALine2;ALine3;ALine4
    ALine3      -->
    ALine4                BLine1;BLine2;BLine3;BLine4
    BLine1
    BLine2
    BLine3
    BLine4
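As a sketch only (not part of the original answer), a trivial standalone pre-processing pass that joins every four lines into one semicolon-delimited line could look like this; the class name JoinFourLines and the argument handling are just placeholders:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    // Illustrative pre-processing step: joins every four input lines into one
    // semicolon-delimited output line, so each record fits on a single line.
    public class JoinFourLines {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
                StringBuilder record = new StringBuilder();
                int count = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    if (count > 0) {
                        record.append(';');
                    }
                    record.append(line);
                    if (++count == 4) {
                        out.println(record);
                        record.setLength(0);
                        count = 0;
                    }
                }
                // Note: a trailing partial record (fewer than four lines) is dropped here.
            }
        }
    }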

Down and dirty

Do you have control over the size of your data? If you manually split your data on block boundaries, you can force Hadoop not to care about records spanning splits. For example, if your block size is 64 MB, write your files out in 60 MB chunks.

Once you don't have to worry about splits breaking records, you can do something dirty: in your map function, add each new key/value pair to a list object. If the list has 4 elements in it, do your processing, emit something, and then clear the list. Otherwise, don't emit anything and move on without doing anything.

The reason you need to split the data manually is that otherwise you are not guaranteed that an entire 4-line record will be handed to the same map task.
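A minimal sketch of that buffering idea, using the newer org.apache.hadoop.mapreduce API (the class name, output types, and the semicolon join are illustrative choices, not prescribed by the answer):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative mapper: buffers lines and emits one record per four lines.
    // Only safe if a whole 4-line record is guaranteed to reach the same map task.
    public class FourLineBufferMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final List<String> buffer = new ArrayList<>();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            buffer.add(value.toString());
            if (buffer.size() == 4) {
                // Process the complete record, emit it, then clear the buffer.
                context.write(NullWritable.get(), new Text(String.join(";", buffer)));
                buffer.clear();
            }
            // Otherwise: emit nothing and wait for the remaining lines of the record.
        }
    }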


Another way (simple, but inefficient in some cases) is to override FileInputFormat#isSplitable(). Then the input files are not split, and each file is processed by a single map.

    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
        }
    }
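To use it, you would set it as the job's input format; a sketch with the older org.apache.hadoop.mapred API that the class above extends (MyJob is a placeholder driver class):

    // Sketch only: register the non-splittable input format on the job configuration.
    JobConf conf = new JobConf(MyJob.class);
    conf.setInputFormat(NonSplittableTextInputFormat.class);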

And, as orangeoctopus said:

In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.

This has some overhead, for the following reasons:

  • The processing time of the largest file drives the completion time of the job.
  • A lot of data may be transferred between data nodes.
  • The cluster is not utilized properly, since # of maps = # of files.

** The above code is from Hadoop: The Definitive Guide.

