Identical data frames with different digests in R?

Question

I have two large data frames, a and b, for which identical(a, b) is TRUE, as is all.equal(a, b), but identical(digest(a), digest(b)) is FALSE. What can cause this?

What's more, I tried to dig deeper by applying digest to batches of rows. Incredibly, at least to me, the digest values agree on the sub-frames all the way up to the last row of data.

Here is the sequence of comparisons:

    > identical(a, b)
    [1] TRUE
    > all.equal(a, b)
    [1] TRUE
    > digest(a)
    [1] "cac56b06078733b6fb520442e5482684"
    > digest(b)
    [1] "fdd5ab78ca961982d195f800e3cf60af"
    > digest(a[1:nrow(a),])
    [1] "e44f906723405756509a6b17b5949d1a"
    > digest(b[1:nrow(b),])
    [1] "e44f906723405756509a6b17b5949d1a"

Every method I can think of indicates that the two objects are identical, but their digest values are different. Is there anything else in the data frames that can create such inconsistencies?

For more information: the objects are about 10M rows x 12 columns. Here's the output of str():

    'data.frame': 10056987 obs. of 12 variables:
     $ V1 : num 1 11 21 31 41 61 71 81 91 101 ...
     $ V2 : num 1 1 1 1 1 1 1 1 1 1 ...
     $ V3 : num 2 3 2 3 4 5 2 4 2 4 ...
     $ V4 : num 1 1 1 1 1 1 1 1 1 1 ...
     $ V5 : num 1.8 2.29 1.94 2.81 3.06 ...
     $ V6 : num 0.0653 0.0476 0.0324 0.034 0.0257 ...
     $ V7 : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
     $ V8 : num 0.00653 0.00476 0.00324 0.0034 0.00257 ...
     $ V9 : num 1.8 2.3 1.94 2.81 3.06 ...
     $ V10: num 0.1957 0.7021 0.0604 0.1866 0.9371 ...
     $ V11: num 1704 1554 1409 1059 1003 ...
     $ V12: num 23309 23309 23309 23309 23309 ...
    > print(object.size(a), units = "Mb")
    920.7 Mb

Update 1: On a whim, I converted them to matrices. The digests are the same.

    > aM = as.matrix(a)
    > bM = as.matrix(b)
    > identical(aM, bM)
    [1] TRUE
    > digest(aM)
    [1] "c5147d459ba385ca8f30dcd43760fc90"
    > digest(bM)
    [1] "c5147d459ba385ca8f30dcd43760fc90"

Then I tried converting back to data frames, and the digest values are equal (and equal to the earlier value for a).

    > aMF = as.data.frame(aM)
    > bMF = as.data.frame(bM)
    > digest(aMF)
    [1] "cac56b06078733b6fb520442e5482684"
    > digest(bMF)
    [1] "cac56b06078733b6fb520442e5482684"

So b looks like the bad boy, and it has a colorful past. b came from a much larger data frame, say B. I took only the columns of B that appear in a and checked whether they were equal. They were equal, but had different digests. I renamed the columns (from "InformativeColumnName1" to "V1", etc.) to rule out any problems there, although all.equal and identical do tend to report when column names differ.

Since I work in two different programs and do not have access to a and b at the same time, the easiest way for me to check the calculations is to compare digest values. However, something seems strange in the way I extract columns from a data frame and then apply digest() to the result.

ANSWER: It turns out, to my amazement (horror, embarrassment, you name it), that identical is very forgiving about attributes. I had assumed that only all.equal was forgiving about attributes.

This was discovered thanks to Tommy's suggestion of identical(d1, d2, attrib.as.set=FALSE). Running attributes(a) is a bad, bad idea: the deluge of row names went on for a while before Ctrl-C could interrupt it. Here is the output of names(attributes()):

    > names(attributes(a))
    [1] "names" "row.names" "class"
    > names(attributes(b))
    [1] "names" "class" "row.names"

They are in different orders! Kudos to digest() for being straight with me.
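
For anyone who wants to run the same check without printing the attributes themselves, comparing only the attribute names is enough. This is a minimal sketch in base R; the helper name attr_order_report is made up for illustration and is not part of digest or base R.

```r
# Hypothetical helper: compare the attribute names of two objects
# without printing the (possibly huge) attribute values.
attr_order_report <- function(x, y) {
  nx <- names(attributes(x))
  ny <- names(attributes(y))
  list(
    same_set   = setequal(nx, ny),   # same attribute names, ignoring order
    same_order = identical(nx, ny)   # same attribute names, in the same order
  )
}

# Toy objects with the same attributes stored in a different order
d1 <- structure(1, foo = 1, bar = 2)
d2 <- structure(1, bar = 2, foo = 1)
res <- attr_order_report(d1, d2)
res
```

When same_set is TRUE but same_order is FALSE, the objects are in the situation described here: identical() with its defaults accepts them, while anything sensitive to storage order may not.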

UPDATE

To help others with this problem: it seems that simply reordering the attributes is sufficient to get the same hash value. Tinkering with attribute order is new to me, so it might break something, but it works in my case. Note that this takes a little time if the objects are large; I do not know of a faster method. (I have also been wanting to move to matrices or data tables instead of data frames, and this may be another incentive to do so.)

    # Reorder the attributes of a alphabetically by name
    tmpA0 = attributes(a)
    tmpA1 = tmpA0[sort(names(tmpA0))]
    a2 = a
    attributes(a2) = tmpA1

    # Same for b
    tmpB0 = attributes(b)
    tmpB1 = tmpB0[sort(names(tmpB0))]
    b2 = b
    attributes(b2) = tmpB1

    digest(a2) # e04e624692d82353479efbd713ec03f6
    digest(b2) # e04e624692d82353479efbd713ec03f6

    identical(b, b2, attrib.as.set = FALSE) # FALSE
    identical(b, b2, attrib.as.set = TRUE) # TRUE
    identical(a2, b2, attrib.as.set = FALSE) # TRUE
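
The same reordering can be wrapped in a small function so it is applied the same way in both programs before hashing. This is only a sketch under the same caveat as above (reordering attributes might break something); the name sort_attributes is hypothetical, and method = "radix" is my own choice so that the sort order does not depend on the locale of the machine.

```r
# Hypothetical helper: return a copy of x whose attributes are stored
# in a fixed order (sorted by attribute name).
sort_attributes <- function(x) {
  att <- attributes(x)
  if (length(att) > 1) {
    # method = "radix" sorts character vectors in the C locale,
    # so the resulting order is the same on every machine
    attributes(x) <- att[sort(names(att), method = "radix")]
  }
  x
}

# Toy objects with the same attributes stored in a different order
d1 <- structure(1, foo = 1, bar = 2)
d2 <- structure(1, bar = 2, foo = 1)

before <- identical(d1, d2, attrib.as.set = FALSE)
after  <- identical(sort_attributes(d1), sort_attributes(d2),
                    attrib.as.set = FALSE)
c(before = before, after = after)
```

In my case the call would be digest(sort_attributes(a)) in one program and digest(sort_attributes(b)) in the other.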

+7

r hash dataframe

Iterator Sep 28 '11 at 15:05

2 answers

Our digest package uses R's internal serialize() function to produce what we feed to the hash-generating functions (md5, sha1, ...).

So I strongly suspect that the two objects differ in something like an attribute. Until you can create something reproducible that does not depend on your 1e7 x 12 dataset, there is little we can do.
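
As a small illustration of why serialization matters here, the sketch below uses only base R and toy objects, not the poster's data and not the digest internals. It shows two objects that identical() accepts by default but whose serialized bytes differ, which is enough for any hash of those bytes to differ too.

```r
# Two toy objects with the same attributes stored in a different order
d1 <- structure(1, foo = 1, bar = 2)
d2 <- structure(1, bar = 2, foo = 1)

# identical() ignores attribute order by default
same_object <- identical(d1, d2)

# serialize() with connection = NULL returns the bytes as a raw vector
s1 <- serialize(d1, connection = NULL)
s2 <- serialize(d2, connection = NULL)

# The raw vectors differ because the attributes are written in storage order
same_bytes <- identical(s1, s2)

c(same_object = same_object, same_bytes = same_bytes)
```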

Also, the digest() function can output intermediate results and (as of the recent 0.5.1 version) even raw vectors. That may help. Lastly, you can always contact us (as the package maintainers / authors) off-line, which happens to be the recommended way in R land, StackOverflow's popularity notwithstanding.

+7

Dirk Eddelbuettel Sep 28 '11 at 15:10

Tommy · Accepted Answer · 2011-09-28T15:24:59+0000

Without the actual data frames it is of course hard to know, but one difference could be the order of the attributes. identical ignores attribute order by default, but setting attrib.as.set=FALSE changes that:

    d1 <- structure(1, foo=1, bar=2)
    d2 <- structure(1, bar=2, foo=1)
    identical(d1, d2) # TRUE
    identical(d1, d2, attrib.as.set=FALSE) # FALSE
