I am running Python 3.3 on Windows and I need to pull lines out of Word documents. I have been searching far and wide all week for the best way to do this. I initially tried saving the .docx files as .txt and parsing them with regular expressions (RE), but hidden characters caused formatting problems. I used a script to open each .docx and save it as .txt. I wonder whether a proper File > Save As > .txt would strip out the odd formatting so that I could parse the result correctly. I don't know, because I abandoned this method.
I tried the docx module but was told that it is not compatible with Python 3.3, so I am left with pywin32 and COM. I used these successfully with Excel to get the data I needed, but I am having trouble with Word because there is FAR less documentation, and reading through the object model on the Microsoft website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
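for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    # Word resolves relative paths against its own working directory, so pass an absolute path
    doc = word.Documents.Open(os.path.abspath(infile))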
So at this point I can do something like
print(doc.Content.Text)
and look at the contents of the files, but the odd formatting is still there, and I have no idea how to actually parse the text to capture the data I need. I can write a RE that successfully finds the strings I'm looking for; I just don't know how to use it in a program that goes through COM.
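One possible way to connect the two pieces, as a minimal sketch rather than a definitive answer: treat `doc.Content.Text` as an ordinary Python string, clean out Word's control characters, and then run the RE on each line. The sketch assumes the "odd formatting" comes from the markers Word puts in its text (a carriage return at the end of each paragraph, `\x07` at the end of each table cell, `\x0b` for manual line breaks, `\x0c` for page breaks, `\xa0` for non-breaking spaces). The function names `clean_word_text`, `find_matches`, and `scan_document` are made up for illustration, and `word` is assumed to be the object returned by `EnsureDispatch('Word.Application')` above.

```python
import re


def clean_word_text(raw):
    """Split raw Word text into plain, non-empty lines."""
    lines = []
    # Paragraph marks (\r), manual line breaks (\x0b) and page breaks (\x0c)
    # all end a visible line in the document.
    for piece in re.split(r'[\r\n\x0b\x0c]', raw):
        # \x07 marks the end of a table cell; \xa0 is a non-breaking space.
        piece = piece.replace('\x07', '').replace('\xa0', ' ').strip()
        if piece:
            lines.append(piece)
    return lines


def find_matches(raw, pattern):
    """Return every match of pattern found in the cleaned lines of raw."""
    regex = re.compile(pattern)
    hits = []
    for line in clean_word_text(raw):
        hits.extend(m.group(0) for m in regex.finditer(line))
    return hits


def scan_document(word, path, pattern):
    """Open one document through COM, search its text, and close it again."""
    doc = word.Documents.Open(path)
    try:
        return find_matches(doc.Content.Text, pattern)
    finally:
        doc.Close(False)  # False: close without saving changes
```

Inside the loop above this would be called as `scan_document(word, os.path.abspath(infile), my_pattern)`, where `my_pattern` stands for whatever RE already finds the strings. Because only `scan_document` touches COM, the cleaning and matching can be tried out on a pasted string without starting Word.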
The script I used to convert docx to txt:
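import fnmatch, os
import win32com.client

# EnsureDispatch builds the wrappers that populate win32com.client.constants
wordapp = win32com.client.gencache.EnsureDispatch('Word.Application')
for path, dirs, files in os.walk(r'mypath'):
    for filename in fnmatch.filter(files, '*.docx'):
        doc = os.path.abspath(os.path.join(path, filename))
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        # splitext swaps the extension; rstrip('docx') strips characters, not a suffix
        docastxt = os.path.splitext(doc)[0] + '.txt'
        wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()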