Is it better to use integer64, numeric, or character in R for large integer numbers?

Question

I am working with a dataset that has multiple columns that represent integer ID numbers (e.g. transactionId and accountId). These ID numbers are often 12 digits, which makes them too large to be stored as a 32-bit integer.

What is the best approach in this situation?

Read the identifier as a character string.
Read the identifier as integer64 using bit64.
Read the identifier as numeric (i.e. double).

I have been warned about the dangers of testing equality with doubles, but I'm not sure that will be a problem when they are used as identifiers: I may merge and filter on them, but I never do arithmetic on the ID numbers.

Character strings intuitively seem like they should be slower to test for equality and to merge on, but perhaps in practice it doesn't make much difference.
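To make the concern about doubles concrete, here is a minimal base R sketch. The ID value is made up for illustration. It relies on the fact that a double represents every integer exactly up to 2^53, which is far above any 12-digit number.

```r
# A made-up 12-digit identifier, used only for illustration.
id <- 123456789012

# It does not fit in R's 32-bit integer type (max 2147483647).
too_big_for_int <- id > .Machine$integer.max

# Doubles hold every integer exactly up to 2^53 (about 9.007e15),
# so equality tests on 12-digit IDs are exact.
fits_exactly <- id < 2^53
step_is_exact <- (id + 1) - id == 1

# Beyond 2^53, neighbouring integers collapse to the same double.
collapses <- 2^53 == 2^53 + 1

# Printing can switch to scientific notation; format() avoids that.
id_label <- format(id, scientific = FALSE)

print(c(too_big_for_int, fits_exactly, step_is_exact, collapses))
print(id_label)
```

So equality on 12-digit IDs stored as doubles is safe as long as no ID exceeds 2^53; the remaining risk is mostly in how the values are printed or written back to text.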

Rob Donnelly, Feb 03 '16 at 8:17

2 answers

Ewald stieger · Answer 1 · 2016-02-03T08:38:20+0000

If you are after performance, use bit64.

With integer64 vectors you can store very large integers at the expense of 64 bits, which is a factor of 7 better than int64 from the int64 package. Due to the smaller memory footprint, the atomic vector architecture, and the use of S3 instead of S4 classes, most operations are one to three orders of magnitude faster: example speedups are 4x for serialization, 250x for addition, 900x for coercion, and 2000x for object creation. integer64 also avoids the ongoing (potentially infinite) garbage-collection penalty observed while int64 objects exist (see the code in the example section).
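As a rough sketch of what this looks like in practice, the snippet below converts made-up 12-digit IDs to integer64 and filters on them. It assumes the bit64 package is installed and is wrapped in a guard so it does nothing otherwise; the ID values are hypothetical.

```r
# Assumes bit64 is installed: install.packages("bit64")
if (requireNamespace("bit64", quietly = TRUE)) {
  library(bit64)

  # Made-up IDs, held as text so no precision is lost before conversion.
  ids_chr <- c("123456789012", "999999999999", "123456789012")
  ids_i64 <- as.integer64(ids_chr)

  # Equality is an exact 64-bit integer comparison.
  is_match <- ids_i64 == as.integer64("123456789012")

  # Converting back to character gives the original digits.
  round_trip <- as.character(ids_i64)

  print(class(ids_i64))
  print(is_match)
}
```

One thing to keep in mind is that integer64 is an S3 class layered on top of doubles, so functions that are not aware of the class may treat the underlying values as ordinary doubles.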

See the following PDF file: https://cran.r-project.org/web/packages/bit64/bit64.pdf

ctbrown · Answer 2 · 2016-10-11T19:05:58+0000

See Roland's comment on the original question: your identifiers should be character vectors. Since identifiers are very unlikely to be used in mathematical operations, it is usually safer to store them as character vectors. He also points out that merging on character vectors in data.table is very fast. Perhaps not as fast as merging on integers, but fast nonetheless. In most cases this should be good enough.
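A minimal sketch of merging and filtering on character IDs follows. It uses base R's `merge` rather than data.table so that it runs without extra packages, and the tables, column values, and names such as `owner` are invented for illustration.

```r
# Hypothetical tables keyed by a 12-digit accountId stored as character.
transactions <- data.frame(
  accountId = c("100000000001", "100000000002", "100000000001"),
  amount = c(25.0, 40.5, 13.2),
  stringsAsFactors = FALSE
)
accounts <- data.frame(
  accountId = c("100000000001", "100000000002"),
  owner = c("alice", "bob"),
  stringsAsFactors = FALSE
)

# Join on the character key; matching is exact string comparison.
merged <- merge(transactions, accounts, by = "accountId")

# Filtering is an ordinary string equality test.
one_account <- merged[merged$accountId == "100000000001", ]

print(merged)
print(one_account)
```

With data.table the call has the same shape, since its `merge` method also takes a `by` argument.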
