Newer formats (such as .docx) are often just zipped XML, so you can use standard libraries to unzip them and parse the XML. Here is some code to grab the creator of the document, previously posted at fooobar.com/questions/1083641/... :
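import zipfile
import lxml.etree

# A .docx file is a zip archive; the core metadata lives in docProps/core.xml
with zipfile.ZipFile('my_doc.docx') as zf:
    doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
# Dublin Core namespace used by the creator element
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text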
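If installing lxml is not an option, roughly the same thing can be done with only the Python standard library. The following is a minimal sketch, not a definitive implementation: the function name read_core_properties and the choice of fields in FIELDS are illustrative assumptions, and it assumes the file is a zip package that actually contains docProps/core.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative selection of fields (an assumption, not an exhaustive list).
FIELDS = {
    "creator": "dc:creator",
    "title": "dc:title",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(path):
    """Return a dict of core properties; fields absent from the file map to None."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    props = {}
    for key, tag in FIELDS.items():
        # The properties are direct children of the root element.
        node = root.find(tag, NS)
        props[key] = node.text if node is not None else None
    return props
```

Calling read_core_properties('my_doc.docx')['creator'] should give the same value as the lxml version above. Returning None for missing fields avoids the IndexError that the [0] lookup would raise when a document has no creator element.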
For older formats (such as the binary .doc files), you can look at the hachoir-metadata library.