Extracting data from MS Word using pywin32

Question

I am running Python 3.3 on Windows and I need to pull lines out of Word documents. I have been searching far and wide all week for the best way to do this. I initially tried saving the .docx files as .txt and parsing them with regular expressions, but I ran into formatting problems with hidden characters (I used a script to open each .docx and save it as .txt). I wonder whether a proper File > Save As > .txt would strip out the odd formatting so that I could parse the text correctly. I don't know, but I abandoned this method.

I tried the docx module but was told it is not compatible with Python 3.3. So I am left with pywin32 and COM. I used this successfully with Excel to get the data I need, but I am having trouble with Word because there is FAR less documentation, and reading through the object model on the Microsoft website is over my head.

Here is what I have so far to open the document(s):

import win32com.client as win32
import glob, os

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)

So at this point I can do something like

print(doc.Content.Text)

and look at the contents of the files, but the odd formatting is still there, and I have no idea how to actually parse the text to capture the data I need. I can write an RE that successfully finds the strings I'm looking for; I just don't know how to apply it in a program that uses COM.

For reference, here is the script I used to save .docx as .txt:

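import fnmatch, os
import win32com.client

# EnsureDispatch builds the type library wrappers, which is what fills win32com.client.constants
wordapp = win32com.client.gencache.EnsureDispatch('Word.Application')

for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        # splitext drops only the extension; rstrip('docx') strips trailing characters, not a suffix
        docastxt = os.path.splitext(doc)[0] + '.txt'
        wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()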

python ms-word pywin32

griffsterb, Nov 26 '13 at 20:12

abarnert · Answer 1 · 2013-11-26T21:41:56+0000

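If you don't want to learn how Word models documents, and how that model is exposed through the Office object model, a much simpler option is to have Word save a plain-text copy of each file.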

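There are a lot of choices here. Do you use tempfile to create temporary text files, or keep permanent ones next to the doc files? Do you use Unicode (which, in Microsoft terms, means UTF-16-LE with a BOM) or some other encoding? And so on. So I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to adjust it.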

wdFormatUnicodeText = 7  # WdSaveFormat value for Unicode text

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # Same path as the document, with a .txt extension
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    # The 'utf-16' codec reads the BOM to pick the byte order
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
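
Note that `process_the_file` in the loop above is only a placeholder. Here is a minimal sketch of what it might look like, assuming you want every match of your RE along with the line it was found on. The pattern below is made up for illustration; swap in the RE you already have.

```python
import re

# Hypothetical pattern, purely for illustration: replace with your own RE.
PATTERN = re.compile(r'Invoice\s+#(\d+)')

def process_the_file(f, pattern=PATTERN):
    """Return a list of (line_number, matched_text) for every match in f.

    f can be any iterable of text lines, such as a file object
    opened with encoding='utf-16'.
    """
    results = []
    for lineno, line in enumerate(f, start=1):
        # finditer yields every non-overlapping match on the line
        for m in pattern.finditer(line):
            results.append((lineno, m.group(0)))
    return results
```

Because it only needs an iterable of strings, you can try it on a plain list of lines without starting Word at all.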

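If you do need to pick out paragraphs, headings, etc., then you will have to learn the object model. However, you may be able to get away with saving each file as wdFormatFilteredHTML, which Python can parse nicely. (BeautifulSoup is a lot easier to use than win32com-Word.)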
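
To give a rough idea of the HTML route, here is a sketch that pulls heading and paragraph text out of an HTML string. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs. The class and function names are my own, and I am assuming the wdFormatFilteredHTML output puts the text you care about in ordinary `<p>` and `<h1>` to `<h6>` elements; check a saved file before relying on that.

```python
from html.parser import HTMLParser

class BlockTextParser(HTMLParser):
    """Collect (tag, text) pairs for each <p> and <h1>..<h6> element."""

    BLOCK_TAGS = {'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished (tag, text) pairs
        self._tag = None   # block tag currently open, if any
        self._parts = []   # text fragments seen inside it

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._tag = tag
            self._parts = []

    def handle_data(self, data):
        # Keep text only while inside a block element
        if self._tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace, including line breaks
            text = ' '.join(''.join(self._parts).split())
            if text:
                self.blocks.append((tag, text))
            self._tag = None

def extract_blocks(html):
    """Return the (tag, text) pairs found in an HTML string."""
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

You would read the saved HTML file into a string, pass it to `extract_blocks`, and then run your RE over the text of each block. Empty paragraphs are skipped, and inline tags such as `<b>` are flattened into the surrounding text.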

Fred the Fantastic · Answer 2 · 2013-11-27T15:00:41+0000

You could try oodocx, a fork of python-docx that is compatible with Python 3.3. It has a replace method. For example:

from oodocx import oodocx

d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')

