Parsing PDF files in Hadoop MapReduce

I need to parse PDF files that are stored in HDFS in a MapReduce program in Hadoop. So I get a PDF file from HDFS as an input split, and it needs to be parsed and passed to the Mapper class. To implement the InputFormat, I went through this link. How can these input splits be parsed and converted to text format?

2 answers

Processing PDF files in Hadoop can be done by extending the FileInputFormat class. Let the class extending it be WholeFileInputFormat. In the WholeFileInputFormat class, you override the getRecordReader() method. Each PDF will then be received as an individual input split, and these individual splits can be parsed to extract the text. This link provides a clear example of how to extend FileInputFormat.
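A minimal sketch of that idea, using the older org.apache.hadoop.mapred API that getRecordReader() belongs to. It is adapted from the well-known whole-file pattern; the class names (WholeFileInputFormat, WholeFileRecordReader) are illustrative, not the exact code behind the link:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    // Never split a PDF: each file must reach a single mapper intact.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

// Emits exactly one record per file: the entire file content as a BytesWritable.
class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
    }

    @Override
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) {
            return false;
        }
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override public NullWritable createKey() { return NullWritable.get(); }
    @Override public BytesWritable createValue() { return new BytesWritable(); }
    @Override public long getPos() { return processed ? split.getLength() : 0; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
}
```

Set this as the job's input format with conf.setInputFormat(WholeFileInputFormat.class); the mapper then receives the raw bytes of one whole PDF per call.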


It depends on your splits. I think (though I may be wrong) that you will need each PDF as a whole in order to parse it. There are Java libraries for this, and Google knows where they are.

Given that, you will need an approach where you have the file as a whole when you want to parse it. Assuming you want to do that in the mapper, you need a reader that feeds whole files to the mapper. You could write your own record reader to do this, or perhaps one already exists. For example, the reader could pass the name of each PDF as the key and its content as the value into the mapper.
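As an illustration, here is a hypothetical mapper that consumes the NullWritable/BytesWritable pairs produced by the WholeFileInputFormat sketch above (a custom reader could just as well pass the file name as the key, as suggested). It assumes Apache PDFBox 2.x as the parsing library, but any Java PDF library would do:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextMapper extends MapReduceBase
        implements Mapper<NullWritable, BytesWritable, Text, Text> {

    @Override
    public void map(NullWritable key, BytesWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // value holds the raw bytes of one whole PDF file.
        try (PDDocument doc = PDDocument.load(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()))) {
            // Extract the plain text and emit it; the key here is a
            // placeholder, since this input format does not carry the file name.
            String text = new PDFTextStripper().getText(doc);
            output.collect(new Text("pdf"), new Text(text));
        }
    }
}
```

From here the extracted text can be tokenized or processed like any other text input in the rest of the job.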
