Extract data from MS Word

I am looking for a way to extract / copy data from Word files to a database. Our corporate procedures contain customer meeting protocols documented in MS Word files, mainly due to history and inertia.

I want to be able to pull action items from these meeting protocols into a database so that we can access them from the web interface, turn them into tasks, and update them as they are completed.

What is the best way to do this:

  • a VBA macro from inside Word to create a CSV and then load into a DB?
  • VBA macro in Word with database connection (how to connect to MySQL from VBA?)
  • Python script via win32com, then load into DB?

The last option is attractive to me, since the web interface is built with Django, but I have never used win32com or tried scripting Word from Python.

EDIT: I started extracting the text using VBA because it makes it easier to work with the Word object model. I have run into a problem, though: all the text is in tables, and when I pull the strings from the cells I want, a strange little box character appears at the end of each string. My code looks like this:

sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum

num_rows = Application.ActiveDocument.Tables(2).Rows.Count

For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n

Close #fnum

Where is the little control character coming from? Is it some kind of character code that comes from Word?

+5
6 answers

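Word marks the end of every table cell with a special end-of-cell character, and that marker is included in the cell's Range.Text.

So yes: the strange character is coming from Word itself.

I would use Left() to strip it off, i.e.

 Left(Target, Len(Target)-1)

By the way, instead of

 num_rows = Application.ActiveDocument.Tables(2).Rows.Count
 For n = 1 To num_rows
      Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text

try this:

 For Each row In Application.ActiveDocument.Tables(2).Rows
      Descr = row.Cells(2).Range.Text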
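
If the export ends up being post-processed in Python, the same clean-up could look roughly like the sketch below. It assumes the end-of-cell marker arrives as a carriage return followed by Chr(7) (so two trailing characters rather than one); `clean_cell` and `rows_to_csv` are hypothetical helper names, not part of any library. The csv module is used instead of joining with Chr(44) so that commas inside a description do not break the columns.

```python
import csv
import io

# Assumption: a cell's Range.Text ends with "\r" followed by "\x07" (Chr(13) & Chr(7)).
CELL_END_CHARS = "\r\x07"


def clean_cell(text):
    """Strip the trailing end-of-cell marker characters from a cell's text."""
    return text.rstrip(CELL_END_CHARS)


def rows_to_csv(rows):
    """Return CSV text for (description, assignee, target) rows.

    Rows whose target is empty after cleaning are skipped, which mirrors
    the intent of the `If Target = ""` test in the question.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        descr, assign, target = (clean_cell(cell) for cell in row)
        if target:
            writer.writerow([descr, assign, target])
    return buf.getvalue()
```

Note that the emptiness test only works on the cleaned text: a raw cell is never an empty string, because the marker is always present.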
+4

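I have not scripted Word much myself, but simple things are fairly easy with win32com. Something like:

from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')  # open via the Documents collection
# 2 is wdFormatText (plain text) in Word's WdSaveFormat enumeration
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=2)
doc.Close()
word.Quit()

This is untested, but something like that should open the file and save it as plain text; you can then read the text into Python and manipulate it from there. There is probably a way to read the contents of the document directly, too; the documentation can be hard to find, but if you have VBA documentation or experience, it should carry over.

Take a look at this post: http://mail.python.org/pipermail/python-list/2002-October/168785.html and scroll down to COMTools.py; it has some good examples.

You can also run makepy.py (part of pythonwin) to generate Python "signatures" for the available COM functions, and read through the result as a kind of documentation.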
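
Reading the table cells directly, rather than going through a text file, might look like the following sketch. It reuses the object-model calls from the question's VBA (`Tables(2)`, `Rows.Count`, `Cell(n, c).Range.Text`); the function names, the default table index and the column numbers are assumptions taken from that code, and `load_from_word` needs Windows, pywin32 and an installed Word, so treat it as untested.

```python
# Assumption: each cell's text ends with "\r\x07" (Word's end-of-cell marker).
CELL_END_CHARS = "\r\x07"


def extract_action_items(doc, table_index=2, columns=(2, 3, 4)):
    """Return one tuple per table row, skipping rows whose last column is empty.

    `doc` is a Word Document COM object (or anything with the same attributes).
    """
    table = doc.Tables(table_index)
    items = []
    for n in range(1, table.Rows.Count + 1):  # Word collections are 1-based
        values = tuple(
            table.Cell(n, col).Range.Text.rstrip(CELL_END_CHARS) for col in columns
        )
        if values[-1]:
            items.append(values)
    return items


def load_from_word(path):
    """Open `path` in Word via COM and return its action items (Windows only)."""
    from win32com.client import Dispatch  # imported lazily: requires pywin32

    word = Dispatch("Word.Application")
    doc = word.Documents.Open(path)
    try:
        return extract_action_items(doc)
    finally:
        doc.Close(False)  # False: do not save changes
        word.Quit()
```

Keeping the extraction separate from the COM start-up means the row logic can be exercised with stand-in objects on a machine without Word. Tables with merged cells may raise errors from `Cell(n, col)`, which this sketch does not handle.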

+1

You could use OpenOffice. It can open Word files, and it can also run Python macros.

+1

I would say look at the related questions on the right; the top one seems to have some good ideas for going the Python route.

0

How about saving the file as XML, then using Python or something else to pull the data out of the Word file and into the database?
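
As a rough sketch of the Python side, the standard library's ElementTree can walk the saved XML and collect table cells. This assumes the tables appear as `tbl` / `tr` / `tc` elements with the text in `t` elements, as they do in WordprocessingML; tags are matched by local name so the exact namespace does not matter. `tables_from_word_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Return an element's tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def tables_from_word_xml(xml_text):
    """Return every table in the XML as a list of rows, each a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for tbl in root.iter():
        if _local(tbl.tag) != "tbl":
            continue
        rows = []
        for tr in tbl:
            if _local(tr.tag) != "tr":
                continue
            row = []
            for tc in tr:
                if _local(tc.tag) != "tc":
                    continue
                # Concatenate all text runs in the cell (paragraph breaks are dropped).
                row.append("".join(t.text or "" for t in tc.iter() if _local(t.tag) == "t"))
            rows.append(row)
        tables.append(rows)
    return tables
```

With this approach there is no end-of-cell marker to strip, since the cell boundaries are expressed as elements rather than characters.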

0

You can programmatically save a Word document as HTML and import the table(s) it contains into Access. This requires very little effort.
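
If the target is a Python or Django application rather than Access, the tables in the saved HTML could be read with the standard library's html.parser, as in this minimal sketch. `TableExtractor` and `tables_from_html` are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell as tables -> rows -> cells."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # list of text fragments while inside a <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the HTML layout.
            text = " ".join("".join(self._cell).split())
            self.tables[-1][-1].append(text)
            self._cell = None


def tables_from_html(html_text):
    """Parse HTML text and return all tables found in it."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables
```

The resulting lists of rows can then be filtered and written to the database with whatever layer the application already uses.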

0
