R: reading in a CSV file removes leading zeros

I understand that reading a CSV file can remove leading zeros, but for some of my files read.csv keeps the leading zeros even though I never set colClasses explicitly. What confuses me is that in other cases it does remove them. So my question is: in what cases does read.csv remove leading zeros?

2 answers

read.csv, read.table and related functions read everything in as character strings. Then, depending on the arguments to the function (colClasses in particular, but also others) and on options, the function tries to "simplify" the columns. If enough of a column looks numeric and you have not told the function otherwise, it converts it to a numeric column, which drops any leading 0s (and any trailing 0s after the decimal point). If something in the column does not look like a number, the column is not converted to numeric; it is either kept as character or converted to a factor, and both keep the leading 0s. The function does not always look at the entire column to make its decision, so a column that is obviously non-numeric to you may still be converted.

The safest approach (and the fastest) is to specify colClasses, so R doesn't need to guess (and you don't need to guess what R is going to guess).
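As a rough illustration of the difference, the sketch below reads the same data twice, once letting R guess and once with colClasses. The sample data and the column names id and amount are made up for the example; the text argument is used only so that no file is needed.

```r
# Hypothetical sample data: an id column whose values have leading zeros.
csv_text <- "id,amount\n00123,1.50\n00456,2.00\n"

# Default behaviour: R guesses the column types, so id becomes an integer
# column and the leading zeros are dropped.
guessed <- read.csv(text = csv_text)
str(guessed)

# With colClasses: id is kept as character, so the leading zeros survive.
explicit <- read.csv(text = csv_text,
                     colClasses = c(id = "character", amount = "numeric"))
str(explicit)
```

Naming the entries of colClasses, as done here, ties each class to a column by name rather than by position.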


Basically an addition to @GregSnow's answer, quoting the manual.

All quotes are from ?read.csv:

Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate. Quotes are (by default) interpreted in all fields, so a column of values like "42" will result in an integer column.
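The conversion step can be tried directly by calling type.convert on character vectors. This is only a small sketch with made-up values; as.is = TRUE is passed so that non-numeric input stays character instead of becoming a factor.

```r
# Every value looks like an integer, so the vector is converted
# and the leading zeros are lost.
ids <- c("007", "042", "100")
converted <- type.convert(ids, as.is = TRUE)
str(converted)

# One value does not look like a number, so the whole vector
# stays character and the leading zeros are kept.
mixed <- c("007", "042", "A100")
kept <- type.convert(mixed, as.is = TRUE)
str(kept)

# Quoting the values in the file does not protect them: quotes are
# interpreted first, then the column is converted as usual.
quoted <- read.csv(text = 'code\n"042"\n"007"\n')
str(quoted)
```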

Also:

The number of data columns is determined by looking at the first five lines of input ...

This suggests that read.csv looks at the first 5 lines and guesses from them whether a column is numeric/integer; otherwise it keeps the column as character (and thus keeps the leading 0s).

If you are still interested in learning more, I suggest you study the code in edit(read.csv) and edit(read.table), which are long but show each step the functions take.

Finally, as an aside, it is usually recommended to specify colClasses :

Less memory will be used if colClasses is specified as one of the six atomic vector classes. This can be particularly so when reading a column that takes many distinct numeric values, as storing each distinct value as a character string can take up to 14 times as much memory as storing it as an integer.

That said, if you are really concerned about memory usage or speed, you should use fread from data.table; even there, specifying colClasses gives a speedup.

