A few approaches, some dirtier than others:
The right way
You will probably need to define your own RecordReader, InputSplit, and InputFormat. Depending on exactly what you are trying to do, you may be able to reuse some of the existing implementations of those three. You will likely have to write your own RecordReader to define the key/value pairs, and you will likely also have to write your own InputSplit to help define the record boundaries.
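As a rough, hedged sketch of that approach (the class names FourLineInputFormat and FourLineRecordReader are made up for illustration), a custom InputFormat can delegate to Hadoop's LineRecordReader and glue every four physical lines into one logical record:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class FourLineInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new FourLineRecordReader();
        }

        // Simplest (but less parallel) way to avoid a record straddling a split:
        // treat each file as a single split. A real implementation might instead
        // compute splits that always start on a record boundary.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        public static class FourLineRecordReader extends RecordReader<LongWritable, Text> {
            private final LineRecordReader lineReader = new LineRecordReader();
            private final LongWritable key = new LongWritable();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException {
                lineReader.initialize(split, context);
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                StringBuilder record = new StringBuilder();
                for (int i = 0; i < 4; i++) {
                    if (!lineReader.nextKeyValue()) {
                        return false; // out of input (or a trailing partial record)
                    }
                    if (i == 0) {
                        key.set(lineReader.getCurrentKey().get()); // offset of first line
                    } else {
                        record.append(';');
                    }
                    record.append(lineReader.getCurrentValue().toString());
                }
                value.set(record.toString());
                return true;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
            @Override public void close() throws IOException { lineReader.close(); }
        }
    }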
Another correct way that may not be possible
The above task is quite involved. Do you have control over your data set? Can you pre-process it in some way (either as it arrives or while it is at rest)? If so, you should strongly consider trying to transform your data set into something Hadoop can read easily out of the box.
Something like:
    ALine1
    ALine2
    ALine3
    ALine4   ->   ALine1;ALine2;ALine3;ALine4
    BLine1
    BLine2
    BLine3
    BLine4   ->   BLine1;BLine2;BLine3;BLine4
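For example, a small stand-alone pre-processing step (outside Hadoop) could collapse every four input lines into one semicolon-delimited line; the class name and file arguments below are placeholders:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    public class CollapseRecords {
        public static void main(String[] args) throws Exception {
            // args[0] = raw input with 4-line records, args[1] = one-record-per-line output
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 PrintWriter out = new PrintWriter(args[1])) {
                String line;
                StringBuilder record = new StringBuilder();
                int count = 0;
                while ((line = in.readLine()) != null) {
                    if (count > 0) {
                        record.append(';');
                    }
                    record.append(line);
                    if (++count == 4) {
                        out.println(record);   // emit one complete record per line
                        record.setLength(0);
                        count = 0;
                    }
                }
            }
        }
    }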
Down and dirty
Do you have control over the size of your data? If you manually split your data at block boundaries, Hadoop never has to worry about a record spanning two splits. For example, if your block size is 64 MB, write your files in 60 MB chunks.
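One hedged way to do that outside Hadoop is a small writer that rolls over to a new part file once it approaches 60 MB, but only ever between complete 4-line records; the file names and the exact 60 MB limit here are just illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    public class SplitOnRecordBoundary {
        private static final long LIMIT = 60L * 1024 * 1024; // stay under a 64 MB block

        public static void main(String[] args) throws Exception {
            int part = 0;
            long written = 0;
            PrintWriter out = new PrintWriter("part-" + part);
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                int lineInRecord = 0;
                while ((line = in.readLine()) != null) {
                    out.println(line);
                    written += line.length() + 1;
                    if (++lineInRecord == 4) {        // a full record has been written
                        lineInRecord = 0;
                        if (written >= LIMIT) {       // roll over only between records
                            out.close();
                            out = new PrintWriter("part-" + ++part);
                            written = 0;
                        }
                    }
                }
            } finally {
                out.close();
            }
        }
    }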
Once you no longer have to worry about records breaking across splits, you can do something dirty: in your map function, add each new key/value pair to a list object. Once the list has 4 elements in it, do your processing, emit something, and clear the list. Otherwise, emit nothing and move on without doing anything.
The reason you have to split the data manually is that you are not guaranteed that an entire 4-line record will be handed to the same map task.
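A minimal sketch of that buffering mapper, assuming the data has been pre-split so that no record crosses a split boundary (the class name FourLineBufferMapper and the emitted key/value are illustrative):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FourLineBufferMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final List<String> buffer = new ArrayList<>();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            buffer.add(value.toString());
            if (buffer.size() == 4) {
                // All four lines of the record are present: process and emit.
                String record = String.join(";", buffer);
                context.write(new Text(buffer.get(0)), new Text(record));
                buffer.clear();
            }
            // Otherwise: emit nothing and wait for the remaining lines.
        }
    }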
Donald Miner