How to read table contents in MS-Word file using Python?

How can I read and process the contents of each cell in a table in a DOCX file?

I am using Python 3.2 on Windows 7, with PyWin32 to access the MS-Word document.

I am new to this and do not know how to access the table cells correctly. So far, I have only done this:

import win32com.client as win32
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False 
doc = word.Documents.Open("MyDocument")
3 answers

Here is what works for me in Python 2.7:

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument

To find out how many tables your document has:

doc.Tables.Count

Then you can select the desired table by its index. Note that, unlike Python, COM indexing starts at 1:

table = doc.Tables(1)
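
As a small illustration of the 1-based indexing, here is a sketch of a helper that walks any COM collection exposing Count and call-style indexing, such as doc.Tables. The name iter_com_collection is my own and not part of PyWin32.

```python
def iter_com_collection(collection):
    """Yield each item of a 1-based COM collection, e.g. doc.Tables."""
    # COM collections run from 1 to Count inclusive, not 0 to Count - 1.
    for index in range(1, collection.Count + 1):
        yield collection(index)


# Hypothetical usage with the doc object opened above:
# for table in iter_com_collection(doc.Tables):
#     print(table.Rows.Count, table.Columns.Count)
```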

To select a cell:

table.Cell(Row=1, Column=1)

To get its contents:

table.Cell(Row=1, Column=1).Range.Text
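
One thing to watch for: the text Word returns for a cell ends with an end-of-cell marker ("\r\x07"), so you usually want to strip it. Below is a rough sketch that reads a whole table into a list of lists. The helper names clean_cell_text and read_table are my own, and the sketch assumes a regular grid. Tables with merged cells may not have a cell at every row/column position, which this does not handle.

```python
def clean_cell_text(raw_text):
    """Strip Word's trailing end-of-cell marker from a cell's Range.Text."""
    # Word terminates cell text with a paragraph mark (\r) and a cell mark (\x07).
    return raw_text.rstrip("\r\x07")


def read_table(table):
    """Return the table contents as a list of rows, each a list of strings."""
    rows = []
    # COM indexing is 1-based, so both ranges start at 1 and include Count.
    for r in range(1, table.Rows.Count + 1):
        row = []
        for c in range(1, table.Columns.Count + 1):
            row.append(clean_cell_text(table.Cell(Row=r, Column=c).Range.Text))
        rows.append(row)
    return rows


# Hypothetical usage with the objects from above:
# print(read_table(doc.Tables(1)))
```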

Hope this helps.

EDIT:

An example of a function that returns the index of a column based on its header:

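def Column_index(header_text):
    # Cell text from Word ends with "\r\x07" (end-of-cell marker); strip it before comparing.
    for i in range(1, table.Columns.Count + 1):
        if table.Cell(Row=1, Column=i).Range.Text.rstrip("\r\x07") == header_text:
            return i
    return None  # no column with that header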

Then, for example, you can access a cell this way:

table.Cell(Row=1, Column=Column_index("The Column Header")).Range.Text

A late addition: as of 2015, you can use the python-docx library, documented at https://python-docx.readthedocs.org/en/latest/. For example:

from docx import Document

wordDoc = Document('<path to docx file>')

for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
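
If the first row of a table holds column headers, the same attributes (table.rows, row.cells, cell.text) can be used to turn each data row into a dict. This is only a sketch: table_to_dicts is a name I made up, and it assumes unique header texts and no merged cells.

```python
def table_to_dicts(table):
    """Map each data row to a dict keyed by the text of the header row."""
    # Works on any object exposing .rows, .cells and .text as python-docx tables do.
    rows = [[cell.text.strip() for cell in row.cells] for row in table.rows]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, values)) for values in data]


# Hypothetical usage with python-docx:
# from docx import Document
# records = table_to_dicts(Document('<path to docx file>').tables[0])
```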

Here is a solution that uses only the Python standard library.

The DOCX file format is described by the Office Open XML standard. A .docx file is a ZIP archive whose main text is stored in word/document.xml.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
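
For reuse, the same idea can be wrapped in a function that returns the data instead of printing it. The sketch below is one possible variant, not the only way to do it: read_docx_tables and make_sample_docx are names I invented. It looks only at direct child rows, cells and paragraphs, so the text of a nested table is reported once as its own table rather than repeated in its parent cell. This assumes rows are direct children of the table element, which may not hold for every document. The sample archive contains only word/document.xml, which is enough for this function but is not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_docx_tables(source):
    """Return every table as a list of rows, each row a list of cell strings."""
    # source may be a path or a file-like object; both are accepted by ZipFile.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    tables = []
    for tbl in root.iter(W + 'tbl'):
        rows = []
        for tr in tbl.findall(W + 'tr'):
            cells = []
            for tc in tr.findall(W + 'tc'):
                # Join the text runs of each paragraph, then join paragraphs with newlines.
                paragraphs = [''.join(t.text or '' for t in p.iter(W + 't'))
                              for p in tc.findall(W + 'p')]
                cells.append('\n'.join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables


def make_sample_docx():
    """Build a tiny in-memory stand-in for a .docx file (illustration only)."""
    ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    cell = '<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>'
    body = ''.join('<w:tr>%s</w:tr>' % ''.join(cell % value for value in row)
                   for row in (('A', 'B'), ('C', 'D')))
    xml_text = ('<w:document xmlns:w="%s"><w:body><w:tbl>%s</w:tbl></w:body></w:document>'
                % (ns, body))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml_text)
    buffer.seek(0)
    return buffer


print(read_docx_tables(make_sample_docx()))
```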