I am looking for an elegant way to find out on which page(s) of a document each word occurs, given a set of words that I have stored in a Python dictionary / list.
At first I looked at the .docx format and at python-docx, which lets me search the text, but the docx / XML format clearly has no real page attribute. If I parse the document, I could look for occurrences of <w:br w:type="page"/> in the XML tree, but unfortunately these only mark explicit page breaks, not the automatic ones that occur when the text flows onto a new page.
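To make the XML idea concrete, here is a minimal sketch using only the standard library. It splits the text of word/document.xml at page-break markers. Besides <w:br w:type="page"/>, it also looks for <w:lastRenderedPageBreak/>, which Word may write when it saves a file to record where pages started the last time the document was laid out. The sketch assumes those markers, when present, cover every page start, so explicit breaks are ignored in that case to avoid counting a page twice. The function names `text_by_page` and `pages_for_words` are made up for illustration, and the result is only as good as the markers stored in the file.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def text_by_page(docx_path):
    """Return a list of page texts, split at the page-break markers in the XML."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    elements = list(root.iter())  # document order
    # Assumption: if rendered breaks were stored, they mark every page start,
    # so explicit breaks are skipped in that case to avoid double counting.
    use_rendered = any(el.tag == W_NS + "lastRenderedPageBreak" for el in elements)

    pages = [[]]
    for el in elements:
        if el.tag == W_NS + "lastRenderedPageBreak":
            if use_rendered:
                pages.append([])
        elif el.tag == W_NS + "br":
            if el.get(W_NS + "type") == "page":
                if not use_rendered:
                    pages.append([])
            else:
                pages[-1].append(" ")  # a line break separates words
        elif el.tag in (W_NS + "p", W_NS + "tab"):
            pages[-1].append(" ")  # keep neighbouring words apart
        elif el.tag == W_NS + "t" and el.text:
            pages[-1].append(el.text)
    return ["".join(parts) for parts in pages]


def pages_for_words(docx_path, words):
    """Build [(word, [page, ...]), ...] with 1-based page numbers."""
    pages = text_by_page(docx_path)
    result = []
    for word in words:
        pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
        found = [n for n, text in enumerate(pages, start=1) if pattern.search(text)]
        result.append((word, found))
    return result
```

Headers, footers, footnotes and text boxes live in other parts of the package and are not covered here, and a file that was never saved by Word (for example one generated by a library) will usually contain no rendered breaks at all.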
I even considered converting the files to PDF and using something like PDFMiner to parse the resulting document.
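A rough sketch of that PDF route, assuming the .docx has already been exported to PDF by some other means and that pdfminer.six is installed: `extract_pages` from `pdfminer.high_level` yields one layout object per page, so the page number comes for free. The helper names `pdf_page_texts` and `index_words` are hypothetical, and matching is done on whole words, case-insensitively, which may or may not be what is wanted.

```python
import re


def pdf_page_texts(pdf_path):
    """Yield the plain text of each page of a PDF, in page order."""
    # Imported here so index_words() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for layout in extract_pages(pdf_path):
        yield "".join(
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        )


def index_words(page_texts, words):
    """Build [(word, [page, ...]), ...] with 1-based page numbers."""
    hits = {word: [] for word in words}
    patterns = {
        word: re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
        for word in words
    }
    for page_number, text in enumerate(page_texts, start=1):
        for word in words:
            if patterns[word].search(text):
                hits[word].append(page_number)
    return [(word, hits[word]) for word in words]
```

With a converted file at hand, `index_words(pdf_page_texts("document.pdf"), ["foo", "bar", "baz"])` would return a list of (word, pages) tuples; the file name is a placeholder. Words hyphenated across a line break in the PDF would be missed by this simple matching.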
Is there a direct way to search a .docx document for a string and return the pages on which it occurs, for example
[('foo', [1, 4, 7]), ('bar', [2]), ('baz', [2, 5, 8, 9])]
python pdfminer python-docx
birgit