read.csv does not work properly in R

I'm at a dead end. Normally read.csv works as expected, but this time I ran into a case where its behavior surprised me. Most likely this is user error on my part, but any help would be appreciated.

Here is the file URL:

 http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip 

Here is my code to get the file, unzip and read it:

    URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
    download.file(URL, destfile="temp.zip")
    unzip("temp.zip")
    tmp <- read.table("sfa0910.csv", header=T, stringsAsFactors=F,
                      sep=",", row.names=NULL)

Here is my problem. When I open the .csv file in Excel, the data look as expected. When I read the data into R, the first column is actually named row.names and R reads in one additional column of data, but I cannot figure out where the "error" occurs that causes row.names to show up as a column. It just seems that the data have shifted over by one column.

However, it is strange that the last column in R appears to contain the correct data.

Here are the first few rows of the first few columns:

    tmp[1:5, 1:7]
      row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
    1    100654      R     4496       R     1044       R       23
    2    100663      R    10646       R     1496       R       14
    3    100690      R      380       R        5       R        1
    4    100706      R     6119       R      774       R       13
    5    100724      R     4638       R     1209       R       26

Any thoughts on what I might be doing wrong?

3 answers

I have a fix, building in part on mnel's comments:

    dat <- readLines(paste("sfa", '0910', ".csv", sep=""))
    # count the commas on each line
    ncommas <- sapply(seq_along(dat), function(x) {
      sum(attributes(gregexpr(',', dat[x])[[1]])$match.length)
    })
    > head(ncommas)
    [1] 450 451 451 451 451 451

All lines after the first have an extra delimiter, which Excel ignores.

    # strip the final (extra) comma from every line after the header
    for (i in seq_along(dat)[-1]) {
      dat[i] <- gsub('(.*),', '\\1', dat[i])
    }
    write(dat, 'temp.csv')
    tmp <- read.table('temp.csv', header=T, stringsAsFactors=F, sep=",")
    > tmp[1:5, 1:7]
      UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
    1 100654        R    4496        R    1044        R      23
    2 100663        R   10646        R    1496        R      14
    3 100690        R     380        R       5        R       1
    4 100706        R    6119        R     774        R      13
    5 100724        R    4638        R    1209        R      26

Moral of the story... listen to Joshua Ulrich ;)

Quick fix: open the file in Excel and save it. This also removes the extra delimiters.

As an alternative:

    # read just the header line and use it to build column names,
    # adding a dummy name for the extra trailing field
    dat <- readLines(paste("sfa", '0910', ".csv", sep=""), n=1)
    dum.names <- unlist(strsplit(dat, ','))
    tmp <- read.table(paste("sfa", '0910', ".csv", sep=""), header=F,
                      stringsAsFactors=F, col.names=c(dum.names, 'XXXX'),
                      sep=",", skip=1)
    # drop the dummy last column
    tmp1 <- tmp[, -dim(tmp)[2]]

My advice: use count.fields() as a quick diagnostic when delimited files do not behave as expected.

First, count the number of fields in each line and tabulate the result with table():

    table(count.fields("sfa0910.csv", sep = ","))
    #  451  452
    #    1 6852

This tells you that all but one row contain 452 fields. So what is the aberrant line?

    which(count.fields("sfa0910.csv", sep = ",") != 452)
    # [1] 1

So the first line is the problem. On inspection, all lines except the first end with two commas.
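
One quick way to do that inspection is a sketch like this (the regular expression simply tests whether a line ends in two commas):

    dat <- readLines("sfa0910.csv")
    # lines that do NOT end in two commas -- only the header, as noted above
    which(!grepl(",,$", dat))
    # [1] 1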

Now the question is: what does this mean? Should there be an extra field in the header row that was omitted? Or were the trailing commas added to the other lines by mistake? It is probably best to contact the person who generated the data, if possible, to resolve the ambiguity.
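
As a side note, this header/data mismatch is also what produces the row.names column in the question's output: when the header line has one fewer field than the data lines, read.table assumes the first data column holds row names, and with row.names=NULL it keeps that column under the name row.names. A tiny made-up file (not the IPEDS data) shows the effect:

    # header has 3 fields, every data line has 4
    writeLines(c("A,B,C",
                 "1,2,3,99",
                 "4,5,6,88"), "mismatch.csv")

    read.table("mismatch.csv", header=TRUE, sep=",", row.names=NULL)
    #   row.names A B  C
    # 1         1 2 3 99
    # 2         4 5 6 88

Everything sits one column to the left of its header name, which is exactly the symptom in the question.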


I know you have already found the answer, but since your answer helped me figure this out, I will share it:

If you read into R a file with a different number of columns in different rows, for example:

    1,2,3,4,5
    1,2,3,4
    1,2,3

it will be read in with the missing columns filled with NA, like this:

    1,2,3,4,5
    1,2,3,4,NA
    1,2,3,NA,NA

BUT! If the row with the most columns is not the first row, for example:

    1,2,3,4
    1,2,3,4,5
    1,2,3

then it will be read in a somewhat confusing way:

    1,2,3,4
    1,2,3,4
    5,NA,NA,NA
    1,2,3,NA

(pretty baffling before you figure out the problem, and pretty simple after!)
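
Here is a small self-contained sketch of the first (well-behaved) case, using a made-up file name. Worth knowing: per ?read.table, the number of data columns is determined by looking at the first five lines of input, which is why the confusing wrap-around shown above typically bites in larger files where the widest row appears further down.

    # made-up ragged file for illustration
    writeLines(c("1,2,3,4,5",
                 "1,2,3,4",
                 "1,2,3"), "ragged.csv")

    # read.csv has fill = TRUE by default, so shorter rows are padded with NA
    read.csv("ragged.csv", header = FALSE)
    #   V1 V2 V3 V4 V5
    # 1  1  2  3  4  5
    # 2  1  2  3  4 NA
    # 3  1  2  3 NA NA

Supplying col.names explicitly (as in the alternative fix above) sidesteps that guess, since read.table then takes the number of columns from col.names when it is longer.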

Just hope this can help someone!

