Python - remove header and footer from docx file

I need to remove headers and footers from many docx files. I'm trying to use the python-docx library, but it does not yet support headers and footers in docx documents (work is in progress).

Is there a way to achieve this in Python?

As I understand it, docx is an XML-based format, but I don't know how to work with it.

P.S. I have an idea to use lxml or BeautifulSoup to parse the XML and replace some parts, but it looks dirty.

UPD: Thanks, Sean, for a good starting point. I made some changes to the script. This is my final version (it is useful for me because I need to edit many .docx files). I use BeautifulSoup because the standard XML parser could not produce a valid XML tree. Also, my docx documents do not have a header and footer in the XML: the header and footer images are simply placed at the top of the page. You can use lxml instead of Soup for faster speed.

    import os
    import shutil as su
    import tempfile
    import zipfile

    from bs4 import BeautifulSoup


    def get_xml_from_docx(docx_filename):
        """
            Return content of document.xml file inside docx document
        """
        with zipfile.ZipFile(docx_filename) as zf:
            xml_info = zf.read('word/document.xml')
        return xml_info


    def write_and_close_docx(docx_filename, edited_xml, output_filename):
        """ Create a temp directory, expand the original docx zip.
            Write the modified xml to word/document.xml
            Zip it up as the new docx
        """
        tmp_dir = tempfile.mkdtemp()

        with zipfile.ZipFile(docx_filename) as zf:
            zf.extractall(tmp_dir)
            # Get a list of all the files in the original docx zipfile
            filenames = zf.namelist()

        with open(os.path.join(tmp_dir, 'word/document.xml'), 'w', encoding='utf-8') as f:
            f.write(str(edited_xml))

        # Now, create the new zip file and add all the files into the archive.
        # The with block closes the archive so its central directory gets written.
        with zipfile.ZipFile(output_filename, 'w', zipfile.ZIP_DEFLATED) as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)

        # Clean up the temp dir
        su.rmtree(tmp_dir)


    if __name__ == '__main__':
        directory = 'your_directory/'
        os.makedirs('edited', exist_ok=True)
        for file in os.listdir(directory):
            if file.endswith('.docx'):
                word_doc = os.path.join(directory, file)
                # splitext, not rstrip('.docx'): rstrip removes characters, not a suffix
                base_name = os.path.splitext(file)[0]
                new_word_doc = os.path.join('edited', base_name + '-edited.docx')
                tree = get_xml_from_docx(word_doc)
                soup = BeautifulSoup(tree, 'xml')
                shapes = soup.find_all('shape')
                for shape in shapes:
                    # Shapes without a style attribute are skipped
                    if 'margin-left:0pt' in shape.get('style', ''):
                        shape.parent.decompose()
                write_and_close_docx(word_doc, soup, new_word_doc)

So that's it from me :) I know the code is not clean, sorry for that.

1 answer

Well, I had never thought about this before, but I just created a test.docx with a header and footer. Once you have a docx like this, you can unzip it to get the XML files it is made of. For my simple test, this gave:

    word/
    _rels             footer1.xml       styles.xml
    document.xml      footnotes.xml     stylesWithEffects.xml
    endnotes.xml      header1.xml       theme
    fontTable.xml     settings.xml      webSettings.xml
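The same listing can be produced from Python without unzipping anything to disk. This is only a minimal sketch using the standard `zipfile` module: the helper names `list_docx_parts` and `header_footer_parts` are made up for illustration, and the in-memory archive is a stand-in for a real docx so the snippet runs on its own.

```python
import io
import zipfile


def list_docx_parts(docx_file):
    """Return the names of all parts stored in a docx archive (path or file object)."""
    with zipfile.ZipFile(docx_file) as zf:
        return zf.namelist()


def header_footer_parts(names):
    """Pick out the parts that look like header or footer definitions."""
    return [
        name for name in names
        if name.startswith(('word/header', 'word/footer')) and name.endswith('.xml')
    ]


# Stand-in archive built in memory (hypothetical content, not a valid Word file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    for part in ('word/document.xml', 'word/header1.xml',
                 'word/footer1.xml', 'word/styles.xml'):
        zf.writestr(part, '<placeholder/>')
buffer.seek(0)

parts = list_docx_parts(buffer)
print(header_footer_parts(parts))
```

With a real file you would pass its path instead of the buffer, for example `list_docx_parts('test.docx')`.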

Opening word/document.xml shows you the main area of interest. You can see the elements that tie the header and footer to the document. In my simple case, I got:

    <w:headerReference w:type="default" r:id="rId7"/>
    <w:footerReference w:type="default" r:id="rId8"/>

and

    <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
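Given those elements, one possible approach is to drop the `w:headerReference` and `w:footerReference` children from each `w:sectPr`. The sketch below does that with the standard `xml.etree.ElementTree` module. It rests on an assumption I have not verified against every Word version: that removing the references is enough for the header and footer to stop being displayed. The function name `strip_header_footer_refs` and the `SAMPLE` string are made up for illustration, and the header1.xml / footer1.xml parts are left in the archive untouched.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
R_NS = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'

# Fully qualified tag names of the elements to drop.
REFERENCE_TAGS = ('{%s}headerReference' % W_NS, '{%s}footerReference' % W_NS)


def strip_header_footer_refs(root):
    """Remove header/footer references under every w:sectPr.

    Returns the number of elements removed.
    """
    removed = 0
    # list() so the tree is not modified while it is being walked.
    for sect_pr in list(root.iter('{%s}sectPr' % W_NS)):
        for child in list(sect_pr):
            if child.tag in REFERENCE_TAGS:
                sect_pr.remove(child)
                removed += 1
    return removed


# Cut-down stand-in for word/document.xml (hypothetical sample).
SAMPLE = (
    '<w:document xmlns:w="%s" xmlns:r="%s"><w:body>'
    '<w:p><w:r><w:t>MY BODY</w:t></w:r></w:p>'
    '<w:sectPr>'
    '<w:headerReference w:type="default" r:id="rId7"/>'
    '<w:footerReference w:type="default" r:id="rId8"/>'
    '<w:pgSz w:w="12240" w:h="15840"/>'
    '</w:sectPr></w:body></w:document>'
) % (W_NS, R_NS)

root = ET.fromstring(SAMPLE)
removed_count = strip_header_footer_refs(root)
sect_pr = root.find('.//{%s}sectPr' % W_NS)
remaining = [child.tag.split('}')[1] for child in sect_pr]
print(removed_count, remaining)
```

Note that the `w:header` and `w:footer` attributes on `w:pgMar` are distances, not references, so the sketch leaves them alone.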

The whole document is really small, so here it is (cut off after the page margins):

    <w:document
        xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
        xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:mv="urn:schemas-microsoft-com:mac:vml"
        xmlns:o="urn:schemas-microsoft-com:office:office"
        xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
        xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
        xmlns:v="urn:schemas-microsoft-com:vml"
        xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
        xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
        xmlns:w10="urn:schemas-microsoft-com:office:word"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
        xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
        xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
        xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
        xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
        mc:Ignorable="w14 wp14">
      <w:body>
        <w:p w:rsidR="009E6E8F" w:rsidRDefault="009E6E8F"/>
        <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/>
        <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/>
        <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA">
          <w:r>
            <w:t>MY BODY</w:t>
          </w:r>
          <w:bookmarkStart w:id="0" w:name="_GoBack"/>
          <w:bookmarkEnd w:id="0"/>
        </w:p>
        <w:sectPr w:rsidR="00B53FFA" w:rsidSect="009E6E8F">
          <w:headerReference w:type="default" r:id="rId7"/>
          <w:footerReference w:type="default" r:id="rId8"/>
          <w:pgSz w:w="12240" w:h="15840"/>
          <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>

So XML manipulation will not be a problem, in either functionality or performance, for something of this size. Here is some code that should get your document into Python, parsed as an XML tree, and saved back as a docx. I have to go now, so this is not a complete solution, but I think it should get you well on your way. If you still have problems, I will come back later and see where you are with it.

    import os
    import shutil as su
    import tempfile
    import xml.etree.ElementTree as ET
    import zipfile


    def get_word_xml(docx_filename):
        """ Return the raw bytes of word/document.xml from the docx archive. """
        with zipfile.ZipFile(docx_filename) as zf:
            xml_content = zf.read('word/document.xml')
        return xml_content


    def write_and_close_docx(docx_filename, xml_root, output_filename):
        """ Create a temp directory, expand the original docx zip.
            Write the modified xml to word/document.xml
            Zip it up as the new docx
        """
        tmp_dir = tempfile.mkdtemp()

        with zipfile.ZipFile(docx_filename) as zf:
            zf.extractall(tmp_dir)
            # Get a list of all the files in the original docx zipfile
            filenames = zf.namelist()

        # Note: ElementTree renames namespace prefixes (ns0, ns1, ...) on output
        # unless they were registered with ET.register_namespace beforehand.
        ET.ElementTree(xml_root).write(
            os.path.join(tmp_dir, 'word/document.xml'),
            encoding='UTF-8', xml_declaration=True)

        # Now, create the new zip file and add all the files into the archive
        with zipfile.ZipFile(output_filename, 'w', zipfile.ZIP_DEFLATED) as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)

        # Clean up the temp dir
        su.rmtree(tmp_dir)


    def get_xml_tree(xml_bytes):
        """ Parse the document XML and return its root element. """
        return ET.fromstring(xml_bytes)


    word_doc = 'TEXT.docx'
    new_word_doc = 'SLIM.docx'
    doc = get_word_xml(word_doc)
    tree = get_xml_tree(doc)
    write_and_close_docx(word_doc, tree, new_word_doc)
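If the temporary directory feels heavy, the archive can also be copied part by part, swapping in the edited XML on the way. This is a swapped-in alternative to the temp-directory approach above, not what the script above does. It is a minimal sketch with a made-up helper name, `rewrite_docx_part`, and in-memory stand-in archives so it runs without a real docx.

```python
import io
import zipfile


def rewrite_docx_part(src, dst, part_name, new_bytes):
    """Copy the archive `src` to `dst`, replacing one part with `new_bytes`."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, 'w', zipfile.ZIP_DEFLATED) as zout:
        for name in zin.namelist():
            # Every part is copied unchanged except the one being replaced.
            data = new_bytes if name == part_name else zin.read(name)
            zout.writestr(name, data)


# Stand-in source archive (hypothetical content, not a valid Word file).
source = io.BytesIO()
with zipfile.ZipFile(source, 'w') as zf:
    zf.writestr('[Content_Types].xml', '<Types/>')
    zf.writestr('word/document.xml', '<old/>')
source.seek(0)

target = io.BytesIO()
rewrite_docx_part(source, target, 'word/document.xml', b'<new/>')

# Read the result back to check what ended up in the new archive.
with zipfile.ZipFile(target) as zf:
    result = {name: zf.read(name) for name in zf.namelist()}
print(result)
```

With real files, `src` and `dst` would be paths such as 'TEXT.docx' and 'SLIM.docx', and `new_bytes` the serialized, edited document XML.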