How to get the author of an office file in python?

The name explains the problem, there are doc and docs documents that I want to restore their copyright information so that I can restructure my files.

os.statreturns only the size and date and time, information associated with the real file.
open(filename, 'rb').read(200)returns a lot of characters that I could not parse.

For reading files xlsxthere is a module called xlrd. However, this still does not allow me to read docor files docx. I know that new office files are not easy to read in programs non-msoffice, so if this is not possible, collecting information from old office files will be sufficient.

+5
source share
3 answers

Since the files docxare just zipped XML, you can simply unzip the docx file and presumably extract the author information from the XML file. Not quite sure where it will be stored, just looking back at it, I sometimes suspect that it is stored as dc:creatorin docProps/core.xml.

Here you can open the docx file and get the creator:

import zipfile, lxml.etree

# open zipfile
zf = zipfile.ZipFile('my_doc.docx')
# use lxml to parse the xml file we are interested in
doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
# retrieve creator
ns={'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text
+6
source

You can use COM interop to access the Word object model. This link talks about the technique: http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/

- , . BuiltInDocumentProperties. - " ".

- .ActiveDocument.BuiltInDocumentProperties( " " )

+2

hachoir-metadata. script, . , .

+2
source

All Articles