Search for a word on a page (s) in a document

Question

Search for a word on a page (s) in a document

I am looking for an elegant solution to find on which page (s) in a document a specific word that I recorded in the python dictionary / list.

At first I looked at the .docx format and looked at PythonDocx , which has a search function, but there clearly is not really a page attribute in docx / xml format. If I analyze the document, I could look for occurrences of <w:br w:type="page"/> in the xml tree, but unfortunately they do not show inadvertent page breaks.

I even considered converting files to PDF and using something like PDFminer to parse a document.

Is there any direct solution for finding a .docx document for a string and returning the pages to which it occurs, for example

 [('foo' ,[1, 4, 7 ]), ('bar', [2]), ('baz', [2, 5, 8, 9 )]

+7

python pdfminer python-docx

birgit Sep 05 '15 at 10:21

source share

1 answer

mabe02 · Answer 1 · 2016-01-22T22:31:40+0000

Parsing XML files that make up docx

It seems that the biggest problem in your question is how to analyze page by page. This answer of the document is not always the same, and it depends on the fields, settings of the sheet of paper, the application that you use to open it, etc. A good argument for the accuracy of any script for this purpose can be found in the google group .

However, if you can be almost 100% satisfied, you will start to find the solution suggested in this google group :

I found that I can unzip the .docx file and extract docProps/app.xml and then docProps/app.xml XML using ElementTree to get the <Pages></Pages> element. I found that most of the time this number is accurate, but I have seen several examples where the number in this element is incorrect.

Use Win32com.Client

Another approach might be to use win32com.client to open a file, paginate it, perform a search, and then return the results in the format you want.

You can find an example syntax in this answer :

 from win32com.client import Dispatch #open Word word = Dispatch('Word.Application') word.Visible = False word = word.Documents.Open(doc_path) #get number of sheets word.Repaginate() num_of_sheets = word.ComputeStatistics(2)

You can also see this answer regarding find and replace in a text document using win32com.client.

Search for a word on a page (s) in a document

Parsing XML files that make up docx

Use Win32com.Client

More articles: