Is advanced PDF parsing feasible with current software?

We have a project we hope to implement, and it requires us to work with PDF files (unfortunately) and analyze their contents. Over the past few days we have been reading up on various libraries and have tried a few of them.

Even so, we still do not know whether we will be able to complete such a task. Basically, each page of our PDF document will contain 6-7 questions, possibly with images, and each question has 5 multiple-choice answers. We need to segment out each question and then further segment the multiple-choice answers that belong to it.
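To make the segmentation step concrete, here is a minimal sketch of what we have in mind once the text of a page has already been extracted. It assumes a layout that is purely hypothetical: questions start a line with a number such as `1.` or `2)`, and answers start a line with a letter `A` to `E` followed by `.` or `)`. The names `segment_page` and `split_blocks` are made up for illustration, and images are not handled at all.

```python
import re

# Assumed markers (hypothetical, adjust to the real documents):
# a question starts a line with "1." or "1)", an answer with "A." or "A)".
QUESTION_START = re.compile(r"^\s*(\d+)[.)]\s+", re.MULTILINE)
ANSWER_START = re.compile(r"^\s*([A-E])[.)]\s+", re.MULTILINE)


def split_blocks(text, pattern):
    """Split text at every match of pattern.

    Returns the text before the first match, plus a list of
    (label, body) pairs, one per match.
    """
    matches = list(pattern.finditer(text))
    head = text[:matches[0].start()] if matches else text
    blocks = []
    for i, match in enumerate(matches):
        # A block runs until the next marker, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        blocks.append((match.group(1), text[match.end():end].strip()))
    return head.strip(), blocks


def segment_page(page_text):
    """Turn the plain text of one page into a list of question records."""
    _, questions = split_blocks(page_text, QUESTION_START)
    result = []
    for number, body in questions:
        # Inside one question, everything before the first answer is the stem.
        stem, answers = split_blocks(body, ANSWER_START)
        result.append(
            {"number": int(number), "stem": stem, "answers": dict(answers)}
        )
    return result
```

Calling `segment_page` on the text of a page would give one dictionary per question, with the answers keyed by letter. This only works if the extracted text preserves line order and the markers survive extraction, which is exactly the part we are unsure about.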

We found PDFBox (Java) and PDFMiner (Python) to be the most reliable libraries for parsing PDFs, but I personally think that building a reliable system that satisfies our requirements will be difficult. The question is not so much which library is best, but rather whether such tasks are feasible at all, and whether requirements this advanced are currently being met in the world of PDF parsing.

Of course, I am open to any other advice (image processing, cropping software, manual cropping, etc.) that can help us complete our task.

Example of the question format (6 of these per page):

[Image: question format]
