You can also take a look at PDFMiner, another PDF parser written in Python.
One useful feature of PDFMiner is that you can control how it groups text fragments during extraction, by setting the spacing thresholds between characters, words, and lines. Tuning these may get you what you want, depending on how much your documents vary. PDFMiner can also report where each piece of text sits on the page, retrieve data by object ID, and more. So delve into PDFMiner and be creative!
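As a rough illustration of the grouping controls and position reporting, here is a minimal sketch. It assumes the maintained fork pdfminer.six (not the original 2009 PDFMiner), whose `LAParams` class exposes `char_margin`, `word_margin`, and `line_margin`. The `text_blocks` helper and the `LAYOUT_KWARGS` name are made up for this example, and the threshold values are only starting points to tune against your own documents.

```python
# Sketch assuming pdfminer.six is installed (pip install pdfminer.six).
# The values below are starting points, not recommendations.
LAYOUT_KWARGS = {
    "char_margin": 2.0,  # characters closer than this (relative to char width) share a line
    "word_margin": 0.1,  # a wider gap than this on one line inserts a space
    "line_margin": 0.5,  # lines closer than this (relative to line height) share a text box
}


def text_blocks(pdf_path, layout_kwargs=LAYOUT_KWARGS):
    """Yield (page_number, bbox, text) for each text box PDFMiner groups.

    Hypothetical helper for illustration; bbox is (x0, y0, x1, y1) in points.
    """
    # Imported lazily so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    laparams = LAParams(**layout_kwargs)
    pages = extract_pages(pdf_path, laparams=laparams)
    for page_number, page in enumerate(pages, start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text()
```

Calling `list(text_blocks("some.pdf"))` with a larger `line_margin` should merge more lines into each box, while a smaller one splits them apart; which direction helps depends on your documents.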
That said, your problem is genuinely hard to solve: in a PDF, text is not a continuous stream but many small groups of characters placed at absolute positions on the page. The format's main job is to keep the layout intact; it is focused on presentation, not on content.
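To make the difficulty concrete, here is a small pure-Python sketch of what any extractor has to do: take positioned fragments and guess which ones form a line and where the spaces go. The fragment tuple layout, the `fragments_to_lines` function, and both tolerances are assumptions made up for this illustration; this is not how PDFMiner itself is implemented.

```python
def fragments_to_lines(fragments, line_tol=2.0, space_gap=1.5):
    """Rebuild text lines from (x0, x1, baseline_y, text) fragments.

    line_tol:  max baseline difference (points) for two fragments to share a line.
    space_gap: horizontal gap (points) above which a space is inserted.
    """
    # PDF y coordinates grow upward, so sort by descending y to read top-down.
    ordered = sorted(fragments, key=lambda f: (-f[2], f[0]))

    lines = []  # each entry: [baseline_y of first fragment, list of fragments]
    for frag in ordered:
        if lines and abs(lines[-1][0] - frag[2]) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[2], [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f[0])  # left to right within the line
        text = frags[0][3]
        for prev, cur in zip(frags, frags[1:]):
            if cur[0] - prev[1] > space_gap:
                text += " "
            text += cur[3]
        result.append(text)
    return result


sample = [
    (72.0, 90.0, 700.0, "Hel"),
    (90.2, 100.0, 700.3, "lo"),
    (105.0, 130.0, 699.8, "world"),
    (72.0, 110.0, 686.0, "Second"),
]
print(fragments_to_lines(sample))  # ['Hello world', 'Second']
```

The output is only as good as the two thresholds: a tight `space_gap` splits words, a loose one glues them together, which is why results vary so much from one document to the next.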
Etienne, Dec 05 '09 at 3:07