How to load an xlsx file in Python that originally had a .xls file extension?

I use xlrd to process .xls files and openpyxl to process .xlsx files, and this works well.

Then I was sent a .xls file, so I tried xlrd.open_workbook() and got:

 XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve' 

I looked at this question, and I assume that my file, although it ends with the extension .xls, should actually be .xlsx. And indeed, I can view it in a text editor:

 <?xml version="1.0" encoding="UTF-8"?>
 <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
           xmlns:x="urn:schemas-microsoft-com:office:excel"
           xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
           xmlns:html="http://www.w3.org/TR/REC-html40">
 : : :

(for privacy reasons, I cannot publish the whole file, but it is probably not required for our analysis).

So I assumed that if I just copied ( cp ) it to a .xlsx name, I could open it with openpyxl.load_workbook(), but I get:

 BadZipfile: File is not a zip file 

If it is actually xls (unlikely) but cannot be opened with xlrd, and if it is actually xlsx but cannot be opened with openpyxl even after I cp it to a .xlsx name, what should I do?

Note: if I open the .xls in Excel, save it as .xlsx and try again with openpyxl, it does load, but that manual step is a luxury I will not have when running my program.

+6
3 answers

One thing is clear: the file you are trying to open has a different format than its extension suggests.

As you already know, Excel file formats include (but are not limited to) xls and xlsx.

  • The Excel 2003 format (xls) is a binary format. This means that if you open an xls file with a text editor, you will only see gibberish.

  • The Excel 2007 format (xlsx) is completely different. An xlsx file is a zip file with a bunch of XML files inside. You can use a zip archiver to extract the contents of an xlsx file and then edit the XML files with any text editor. However, opening the xlsx file directly with a text editor is like opening a zip file with a text editor: you will only see gibberish.

The fact that you can open the file with a text editor (and read its contents) shows that it is not an xls or xlsx file. Your file is neither a binary file nor a zip file; it is a simple XML file.
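The same check can be done in code by looking at the first bytes of the file instead of its extension. The following is a minimal sketch, not a complete format detector: the helper names `sniff_spreadsheet_format` and `sniff_file` are made up for illustration, and a match on the zip or OLE2 signature only tells you the container type (any zip archive or any OLE2 document would match), not that the content is really a workbook.

```python
# Signatures ("magic bytes") found at the start of each container format.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx (zip archive)


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Skip an optional UTF-8 byte order mark and leading whitespace.
    text = head.lstrip(b"\xef\xbb\xbf").lstrip()
    if text.startswith(b"<?xml") or text.startswith(b"<Workbook"):
        return "xml"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read a small prefix of the file at `path` and classify it."""
    with open(path, "rb") as f:
        return sniff_spreadsheet_format(f.read(64))
```

For the file in the question, `sniff_file` would be expected to report "xml", since the content starts with `<?xml`.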

In addition, this error message says a lot.

 BadZipfile: File is not a zip file 

This means that openpyxl is trying to open your file as an xlsx file, and therefore as a zip file. But when it tries to extract the contents, it fails because your file is not a zip file at all.
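The zip failure can be reproduced with only the standard library. This sketch assumes nothing about openpyxl internals; it just shows that `zipfile` rejects XML content and accepts a real archive. The byte strings are stand-ins, not the file from the question.

```python
import io
import zipfile

# Stand-in for the start of an XML file that carries a misleading extension.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><Workbook></Workbook>'

# Build a tiny real zip archive in memory for comparison.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi")
buf.seek(0)

xml_is_zip = zipfile.is_zipfile(io.BytesIO(xml_bytes))
zip_is_zip = zipfile.is_zipfile(buf)

# Opening the XML bytes as a zip archive raises BadZipFile.
try:
    zipfile.ZipFile(io.BytesIO(xml_bytes))
    raised = False
except zipfile.BadZipFile:
    raised = True

print(xml_is_zip, zip_is_zip, raised)
```

Calling `zipfile.is_zipfile(path)` before handing a file to an xlsx reader is one way to fail early with a clearer message.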

But if the file is neither an xlsx file nor an xls file, how can Microsoft Excel read it? I wondered too. After some research, I believe your file is in XML Spreadsheet 2003 format. This example looks very similar to the contents of the file you posted. Because Microsoft Excel supports this format, it is not surprising that it can read your file.

Unfortunately, Python libraries like xlrd and openpyxl only support xls and xlsx file formats, so they won’t be able to read your file. I think you just have to manually convert it to a supported format.
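If manual conversion is not an option, the file can in principle be read with the standard library XML parser, since it is plain XML. The following is a rough sketch under assumptions: it supposes the file follows the usual XML Spreadsheet 2003 layout (Workbook / Worksheet / Table / Row / Cell / Data in the `urn:schemas-microsoft-com:office:spreadsheet` namespace shown in the question), the helper name `read_xml_spreadsheet` is made up, all values are returned as text, and features such as `ss:Index` on rows, merged cells and styles are ignored.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <Workbook> element shown in the question.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}


def read_xml_spreadsheet(source):
    """Return {sheet_name: [[cell_text, ...], ...]} (hypothetical helper).

    `source` is a file path or a file-like object.
    """
    root = ET.parse(source).getroot()
    sheets = {}
    for ws in root.findall("ss:Worksheet", NS):
        name = ws.get(f"{{{SS}}}Name")
        rows = []
        for row in ws.findall("ss:Table/ss:Row", NS):
            values = []
            for cell in row.findall("ss:Cell", NS):
                # ss:Index (1-based) marks a jump over empty cells.
                index = cell.get(f"{{{SS}}}Index")
                if index is not None:
                    values.extend([None] * (int(index) - 1 - len(values)))
                data = cell.find("ss:Data", NS)
                values.append(data.text if data is not None else None)
            rows.append(values)
        sheets[name] = rows
    return sheets
```

Whether this is enough depends on what the real file contains; it is meant as a starting point, not a replacement for a spreadsheet library.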

+8

I am not on OSX, so this is not verified. You may be able to use the appscript package, even though it is no longer supported, to open the offending file and save it in another format.

 from appscript import *
 excel = app('Microsoft Excel')
 wb = excel.open('/path/to/file.xls')
 # not sure of the exact name of the k.* file format constant
 wb.save_as('/path/to/fileout.xlsx', file_format=k.XLSX_file_format)
+2

I had a similar problem. It turned out that the loader needed an absolute path to the file, for example "c:/dir/filename.xlsx" instead of "filename.xlsx". Relative paths worked on OSX, but not on Windows.
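One way to rule out path problems is to resolve the name to an absolute path before passing it to the loader. A small sketch, where `to_absolute` is a made-up helper and the base directory defaults to the current working directory:

```python
from pathlib import Path


def to_absolute(filename, base_dir=None):
    """Resolve `filename` against `base_dir` (default: current working directory)."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / filename).resolve()


path = to_absolute("filename.xlsx")
print(path)
```

The resulting `Path` can be converted with `str(path)` if the loader expects a plain string.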

0
