Make git ignore the date in PDF files

First: I know the general advice is not to track generated files.

Let's say I want to track generated PDF files anyway and have git ignore the date written into each PDF. That is, I want git to treat two PDF files as identical if the only difference between them is the date information.

What I tried is a filter whose clean part sets the date to some arbitrary fixed value.

(--- a comment ----
Basically, the filter does something along these lines:

    ## dump the pdf metadata to a file and replace the dates
    pdftk "$FILENAME" dump_data | sed -e '{N;s/Date\nInfoValue: D:.*/Date\nInfoValue: D:19790101072619/}' > "$TMPFILE"
    ## update the pdf metadata
    pdftk "$FILENAME" update_info "$TMPFILE" output "$TMPFILE2"

) --- end of comment ----
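For context, a clean filter like this is normally hooked into git through a `filter.<name>.clean` config entry plus a matching line in `.gitattributes`. The sketch below shows that wiring; the filter name `pdfdate`, the helper function name, and the script path used in the usage line are placeholders chosen for illustration, not names from my actual setup.

```shell
#!/bin/bash
# Register a clean filter for PDF files in an existing repository.
# "pdfdate" is a hypothetical filter name; any name works as long as
# the config key and the .gitattributes entry agree.
# Usage: setup_pdf_filter REPO_DIR PATH_TO_FILTER_SCRIPT
setup_pdf_filter() {
    local repo="$1"
    local script="$2"
    # git runs this command on staging (git add) and when checking
    # whether a working-tree file differs from the index
    git -C "$repo" config filter.pdfdate.clean "$script"
    # route every PDF in the repository through the filter
    echo '*.pdf filter=pdfdate' >> "$repo/.gitattributes"
}
```

It would be called as, for example, `setup_pdf_filter . "$HOME/bin/pdf-clean-date.sh"`. Git feeds the file content to the script on stdin and takes whatever the script writes to stdout as the content to store.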

The filter works (the cleaned PDF has its date set to my arbitrary value), but files checked out again from the git repository, which have already been through the 'clean' filter, end up with a modified status.

So my filter apparently does not do what I want here.

My question is:
1) Can I use a smarter filter approach to make git completely ignore the date values in the PDF file? And how?
or
2) What would be the right approach, if not filters?

2 answers

I finally got this solved with help from the git mailing list. In the end this was not a git problem but a problem with how my filter interacts with pdftk. (Maybe an encoding thing? I didn't dig deeper.)

A useful post on the git mailing list is here: http://permalink.gmane.org/gmane.comp.version-control.git/224797

Basically, the filter script that I wrote was not idempotent, which means that re-applying the clean filter to an already cleaned file changes the file again.
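A quick way to check for this property is to run the filter twice and compare the results byte for byte. The helper below is a minimal sketch of that check; the function name `is_idempotent` is made up for illustration, and it assumes the filter reads from stdin and writes to stdout, as a git clean filter does.

```shell
#!/bin/bash
# Return 0 if running the filter command twice yields the same bytes as
# running it once, i.e. the filter is idempotent on this input.
# Usage: is_idempotent INPUT_FILE FILTER_COMMAND [ARGS...]
is_idempotent() {
    local input="$1"; shift
    local once twice status
    once=$(mktemp) || return 2
    twice=$(mktemp) || return 2
    "$@" < "$input" > "$once"     # first pass: clean the original
    "$@" < "$once"  > "$twice"    # second pass: clean the cleaned file
    cmp -s "$once" "$twice"       # silent byte-wise comparison
    status=$?
    rm -f "$once" "$twice"
    return $status
}
```

For example, `is_idempotent some.pdf ./my-clean-filter.sh && echo "idempotent"` (both names are placeholders) prints the message only if the second pass leaves the cleaned file unchanged.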

Background: when pdftk updates a PDF's metadata using exactly the metadata it extracted from that same PDF, it still changes the PDF file, which surprised me.

So I added a safety check to my filter (skip the update if the metadata would not change), and the problem disappeared.

For reference, here is the complete filter:

    #!/bin/bash

    ## use GNU coreutils on OS X explicitly
    ## (install via homebrew, for instance:
    ##  > brew install coreutils
    ##  > brew install gnu-sed
    ## )
    if [ "${OSTYPE:0:6}" == "darwin" ]; then
        MKTMP=gmktemp
        SED=gsed
    else
        MKTMP=mktemp
        SED=sed
    fi

    ## read the pdf from the file given as argument, or from stdin
    FILEASARG=true
    if [ "$#" == 0 ]; then
        FILEASARG=false
    fi

    if $FILEASARG ; then
        FILENAME="$1"
    else
        FILENAME=`$MKTMP`
        cat /dev/stdin > "${FILENAME}"
    fi

    TMPFILE=`$MKTMP`
    TMPFILE2=`$MKTMP`
    TMPFILE3=`$MKTMP`

    ## dump the pdf metadata to a file and replace the dates
    pdftk "$FILENAME" dump_data > "$TMPFILE3"
    $SED -e '/Date/{ N; s/Date\nInfoValue: D:.*/Date\nInfoValue: D:19790101072619/ }' < "$TMPFILE3" > "$TMPFILE"

    ## if the metadata did not change, do nothing
    ## (diff output is discarded so that only the pdf reaches stdout)
    if diff "$TMPFILE3" "$TMPFILE" > /dev/null; then
        rm -f "$TMPFILE" "$TMPFILE2" "$TMPFILE3"
        ## note: FILEASARG is "true" or "false", so this test always
        ## succeeds and the pdf is always written to stdout
        if [ -n "$FILEASARG" ] ; then
            cat "$FILENAME"
        fi
        exit 0
    fi

    ## update the pdf metadata
    pdftk "$FILENAME" update_info "$TMPFILE" output "$TMPFILE2"

    ## overwrite the original pdf
    mv -f "$TMPFILE2" "$FILENAME"

    ## clean up
    rm -f "$TMPFILE" "$TMPFILE2" "$TMPFILE3"

    if [ -n "$FILEASARG" ] ; then
        cat "$FILENAME"
    fi


If you control the generation of the PDFs, you might consider embedding a hash of the PDF's contents in the PDF keywords field when generating it. This hash then identifies the PDF file independently of the date field.

Then on the git side you can jury-rig something in .gitattributes (using extract -p keywords on the PDF file) to perform the diff on the PDF file.

I guess this might work.
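One way this could be wired up is git's `textconv` mechanism for diff drivers, sketched below. This is an assumption about how to realise the idea rather than something I have tested: the driver name `pdfkw` and the helper function name are made up, and `extract` is assumed to be installed and to print the keywords field when called with `-p keywords`.

```shell
#!/bin/bash
# Configure a diff driver that compares PDFs by their extracted keywords.
# "pdfkw" is a hypothetical driver name.
# Usage: setup_pdf_keyword_diff REPO_DIR
setup_pdf_keyword_diff() {
    local repo="$1"
    # git appends the file path to this command and diffs its text output
    git -C "$repo" config diff.pdfkw.textconv "extract -p keywords"
    # use the driver for every PDF in the repository
    echo '*.pdf diff=pdfkw' >> "$repo/.gitattributes"
}
```

Note that a textconv driver only changes what `git diff` shows. It does not change whether git considers a file modified, so on its own it would not stop a PDF with a new date from showing up in `git status`.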
