I have been struggling with this exact problem for the last few days and have written a small .NET utility to extract and normalise Excel files so that they are much easier to store in source control. I have published the executable here:
https://bitbucket.org/htilabs/ooxmlunpack/downloads/OoXmlUnpack.exe
...and the source is here:
https://bitbucket.org/htilabs/ooxmlunpack
If there is any interest, I am happy to make it more configurable, but for now you should put the executable in a folder (for example, the root of your source repository), and when you run it, it will:
- Scan the folder and its subfolders for any .xlsx and .xlsm files
- Take a copy of each file as *.orig
- Unzip each file and re-write it without compression
- Pretty-print any files in the archive that are valid XML
- Delete the calcchain.xml file from the archive (since it changes a lot and does not affect the contents of the file)
- Inline any unformatted text values (otherwise they are kept in a lookup table, which causes large changes in the internal XML if even a single cell is modified)
- Delete the values from any cells that contain formulas (since they can simply be recalculated the next time the sheet is opened)
- Create a *.extracted subfolder containing the extracted contents of the zip archive
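
As a rough illustration of the backup, re-write-without-compression and extract steps above, here is a minimal Python sketch. It is not the tool's own code (which is .NET); the function names `unpack_workbook` and `unpack_tree` are made up for this example. It also simplifies one thing: the XML is pretty-printed only in the extracted copy, not inside the archive itself. It assumes the workbooks come from a trusted source, so entry names are not checked for unsafe paths.

```python
import shutil
import zipfile
from pathlib import Path
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def unpack_workbook(path: Path) -> Path:
    """Back up one workbook, re-store it uncompressed, and extract it."""
    # Keep an untouched copy next to the original, e.g. Book.xlsx.orig
    backup = path.with_name(path.name + ".orig")
    shutil.copy2(path, backup)

    # Read every entry up front so the original file can be overwritten.
    with zipfile.ZipFile(backup) as src:
        entries = [
            (info.filename, src.read(info))
            for info in src.infolist()
            if not info.is_dir()
        ]

    # Re-write the archive with ZIP_STORED, i.e. no compression at all.
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as dst:
        for name, data in entries:
            dst.writestr(name, data)

    # Extract a copy for diffing, e.g. into Book.xlsx.extracted/
    out_dir = path.with_name(path.name + ".extracted")
    for name, data in entries:
        target = out_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        try:
            # Pretty-print anything that parses as XML.
            data = minidom.parseString(data).toprettyxml(
                indent="  ", encoding="utf-8"
            )
        except ExpatError:
            pass  # not valid XML: keep the bytes unchanged
        target.write_bytes(data)
    return out_dir


def unpack_tree(root: Path) -> None:
    """Process every .xlsx and .xlsm file under root, including subfolders."""
    for pattern in ("*.xlsx", "*.xlsm"):
        # Materialise the list first, since new files are created while we walk.
        for workbook in sorted(root.rglob(pattern)):
            unpack_workbook(workbook)
```

Calling `unpack_tree(Path("."))` from the repository root would mimic running the executable there. Removing calcchain.xml and inlining shared strings are left out, since both need matching changes elsewhere in the package.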
Obviously, not all of these things are necessary, but the end result is a spreadsheet file that will still open in Excel, and which is much more amenable to diffing and incremental compression. In addition, storing the extracted files makes it much more obvious in the version history which changes were applied in each version.
If there is any appetite for it, I am happy to make the tool more configurable, as I guess not everyone will want the contents extracted, or possibly the values removed from formula cells, but both are very useful to me at the moment.
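
To make the formula-cell step concrete, here is a small Python sketch of the idea, again not the tool's actual .NET implementation. The function name `strip_formula_values` is hypothetical. It relies on the SpreadsheetML layout in which a cell `<c>` holds its formula in `<f>` and its cached result in `<v>`.

```python
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by worksheet parts.
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ET.register_namespace("", NS)


def strip_formula_values(sheet_xml: bytes) -> bytes:
    """Remove cached <v> results from every cell that has an <f> formula."""
    root = ET.fromstring(sheet_xml)
    for cell in root.iter(f"{{{NS}}}c"):
        if cell.find(f"{{{NS}}}f") is not None:
            for value in cell.findall(f"{{{NS}}}v"):
                cell.remove(value)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

One caveat with this sketch: real worksheets usually declare extra namespaces, and ElementTree may rename their prefixes when writing the XML back out, so a production version would need to preserve them.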
In tests, a 2 MB spreadsheet "unpacks" to 21 MB, but I was then able to store five versions of it, with small changes between each, in a 1.9 MB Mercurial data file, and to visualise the differences between versions effectively using Beyond Compare in text mode.
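
A toy Python experiment hints at why the larger uncompressed file can still version well. The data here is synthetic and the byte-by-byte comparison is only a crude stand-in for what a version control system's delta storage does, so treat it as an illustration rather than a measurement.

```python
import zlib


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Synthetic "sheet": 2000 cells, then one cell value edited in place.
rows = [f'<c r="A{i}"><v>{i * 37 % 1000}</v></c>' for i in range(1, 2001)]
changed = list(rows)
changed[1000] = '<c r="A1001"><v>38</v></c>'  # was <v>37</v>, same length

before = "\n".join(rows).encode()
after = "\n".join(changed).encode()

# Uncompressed, the edit touches a single byte.
raw_diff = differing_bytes(before, after)

# Compressed, the same edit disturbs more of the output stream.
packed_diff = differing_bytes(zlib.compress(before), zlib.compress(after))

print(raw_diff, packed_diff)
```

The uncompressed versions differ in a single byte, whereas the compressed streams differ in more places, which is why deltas between compressed files tend to be larger.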
NB: although I am using Mercurial, I read this question while researching my solution, and there is nothing Mercurial-specific about it, so it should work fine for Git or any other VCS.
Jon G Jun 10 '14 at 16:12