Parsing PDF files in Hadoop MapReduce

I need to parse PDF files that are stored in HDFS in a MapReduce program in Hadoop. So I get a PDF file from HDFS as an input split, and it needs to be parsed and passed to the Mapper class. To implement the InputFormat, I went through this link. How can these input splits be parsed and converted to text format?

2 answers

Processing PDF files in Hadoop can be done by extending the FileInputFormat class. Let the class extending it be WholeFileInputFormat. In the WholeFileInputFormat class, you override the getRecordReader() method. Each PDF will then be received as an individual input split, and these individual splits can be parsed to extract the text. This link provides a clear example of how to extend FileInputFormat.
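A minimal sketch of that idea, using the older org.apache.hadoop.mapred API that getRecordReader() belongs to. It is adapted from the well-known whole-file pattern; the class names (WholeFileInputFormat, WholeFileRecordReader) are illustrative, not the exact code behind the link:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    // Never split a PDF: each file must reach a single mapper intact.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

// Emits exactly one record per file: the entire file content as a BytesWritable.
class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
    }

    @Override
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) {
            return false;
        }
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override public NullWritable createKey() { return NullWritable.get(); }
    @Override public BytesWritable createValue() { return new BytesWritable(); }
    @Override public long getPos() { return processed ? split.getLength() : 0; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
}
```

Set this as the job's input format with conf.setInputFormat(WholeFileInputFormat.class); the mapper then receives the raw bytes of one whole PDF per call.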


It depends on your splits. I think (though I may be wrong) that you will need each PDF as a whole in order to parse it. There are Java libraries for this, and Google knows where they are.

Given that, you will need an approach where you have the file as a whole when you want to parse it. Assuming you want to do that in the mapper, you need a reader that feeds whole files to the mapper. You could write your own record reader to do this, or perhaps one already exists. For example, the reader could pass the name of each PDF as the key and its content as the value into the mapper.
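As an illustration, here is a hypothetical mapper that consumes the NullWritable/BytesWritable pairs produced by the WholeFileInputFormat sketch above (a custom reader could just as well pass the file name as the key, as suggested). It assumes Apache PDFBox 2.x as the parsing library, but any Java PDF library would do:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextMapper extends MapReduceBase
        implements Mapper<NullWritable, BytesWritable, Text, Text> {

    @Override
    public void map(NullWritable key, BytesWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // value holds the raw bytes of one whole PDF file.
        try (PDDocument doc = PDDocument.load(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()))) {
            // Extract the plain text and emit it; the key here is a
            // placeholder, since this input format does not carry the file name.
            String text = new PDFTextStripper().getText(doc);
            output.collect(new Text("pdf"), new Text(text));
        }
    }
}
```

From here the extracted text can be tokenized or processed like any other text input in the rest of the job.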
