Techniques for finding bad data in read.csv in R

I read in a data file that looks like this:

 userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
 "2","John Smith," john.smith@gmail.com ","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"

but the actual file is about 20,000 entries. I use the following R code to read it:

 user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="") 

The reason I have quote="" is that without it the import stops prematurely, and I end up with a total of 9569 observations. While I don't understand precisely why quote="" overcomes this problem, it seems to do so.

Except that it introduces other problems that I then have to "fix". The first one I noticed is that the dates end up as strings that still include the quotation marks, and they won't convert to actual dates when I apply as.Date() to them.
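To illustrate on a made-up two-row sample (not my real file): with quote="" the quote characters become part of the field values, so as.Date() only works after stripping them.

```r
# Sample data is invented; with quote = "" the literal quote characters
# stay in each value, so remove them before converting to Date.
raw <- 'registrationDate\n"2004-07-23 14:19:32"\n"2008-01-19 08:47:45"\n'
d <- read.csv(text = raw, quote = "", stringsAsFactors = FALSE)
d$registrationDate <- as.Date(gsub('"', '', d$registrationDate))
```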

Now I could fix the lines and punch my way through, but I'd rather learn more about what I'm doing. Can someone explain:

  • Why quote="" corrects "bad data"
  • What is the best practice method for figuring out what causes read.csv stop earlier? (If I just look at the input in the +/- specified line, I see nothing wrong).

Here are the lines around the problem spot. I don't see any damage, do you?

 "16888","user1"," user1@gmail.com ","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025" "16889","user2"," user2@gmail.com ","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025" "16890","user3"," user3@gmail.com ","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025" "16891","user4"," user4@gmail.com ","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025" "16892","user5"," user5@gmail.com ","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025" 

* Update *

It gets stranger. Although the total number of rows imported is 9569, the last few rows do correspond to the last few rows of the data. So I assume that something happened during the import to cause many rows to be dropped; in fact, 15914 - 9569 = 6345 records. When I have quote="", I get 15914.

So my question changes: is there a way to get read.csv to report the lines it decides not to import?
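A sketch of the kind of diagnostic I mean, using the file path from above: compare the physical line count with what read.csv returns, then look for lines whose field count is off.

```r
# Count physical lines versus imported rows (path as in the question).
f <- "~/Desktop/dbdump/users.txt"
n_lines <- length(readLines(f))                 # physical lines in the file
n_rows  <- nrow(read.csv(f, na.strings = "\\N"))
n_lines - 1 - n_rows                            # rows "lost", minus the header

# Lines whose field count (quotes ignored) differs from the expected 22
# are the first places to look:
bad <- which(count.fields(f, quote = "", sep = ",") != 22)
```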

* UPDATE 2 *

@DWin, I had to remove na.strings = "\\N" because the count.fields function doesn't accept it. With that, I get the following output, which looks interesting, but I don't understand it.

      3     4    22    23    24
      1    83 15466   178     4

The second command produces a lot of output (and stops when it reaches max.print). But its first line looks like this:

 [1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4 

I don't understand: isn't the output supposed to show how many fields each input record has? Obviously the first lines all have more than 2, 4, 2, ... fields. I feel like I'm getting closer, but I'm still confused!

2 answers

One problem I found (thanks to data.table ) is the missing quote (") after John Smith . Could the same thing be happening in other lines of your file?

If I add the "missing" quote after John Smith , it reads in fine.
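To see why a single missing quote can throw off thousands of rows, here is a minimal made-up example: the unbalanced quote makes read.csv treat everything up to the next quote character as part of one field, swallowing the line breaks in between.

```r
# Two well-formed rows versus the same data with one closing quote dropped.
good <- 'a,b\n"1","x"\n"2","y"\n'
bad  <- 'a,b\n"1","x\n"2","y"\n'   # closing quote after x is missing
nrow(read.csv(text = good))        # 2 rows, as expected
nrow(read.csv(text = bad))         # fewer rows: lines were merged into one field
```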

I saved this data in data.txt :

 userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
 "2","John Smith"," john.smith@gmail.com ","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
 "16888","user1"," user1@gmail.com ","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
 "16889","user2"," user2@gmail.com ","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
 "16890","user3"," user3@gmail.com ","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
 "16891","user4"," user4@gmail.com ","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
 "16892","user5"," user5@gmail.com ","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"

And this is the code. Both fread and read.csv work fine.

 require(data.table)
 dat1 <- fread("data.txt", header = T, na.strings = "\\N")
 dat1
 dat2 <- read.csv("data.txt", header = T, na.strings = "\\N")
 dat2

The count.fields function can be very useful in determining where to look for garbled data.

This gives a tabulation of fields per line that ignores quoting, which could be a problem if there are embedded commas:

 table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") ) 

This gives a tabulation ignoring both quotes and "#" (octothorpe) as a comment character:

 table( count.fields("~/Desktop/dbdump/users.txt", quote="", comment.char="") ) 
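A note worth checking: count.fields defaults to sep = "" (any whitespace), so when sep = "," is omitted, the counts are whitespace-separated tokens rather than CSV fields, and the two tabulations will look very different.

```r
# Made-up one-record example showing the effect of the sep default.
csv_line <- '"16888","user1","TeilS12"'
count.fields(textConnection(csv_line), quote = "")              # 1: no whitespace, so one token
count.fields(textConnection(csv_line), quote = "", sep = ",")   # 3: actual CSV fields
```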

You need to determine what is being reported in the first tabulation ... most of those lines were probably fine ... You can get a list of the line positions with non-22 counts (using the comma separator and no-quote settings):

 which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22) 
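Once you have those positions, you can pull out the raw lines for inspection; count.fields counts every line of the file, including the header, so the positions line up with readLines indices directly.

```r
# Inspect the raw text of the suspect lines (path as in the question).
f   <- "~/Desktop/dbdump/users.txt"
bad <- which(count.fields(f, quote = "", sep = ",") != 22)
writeLines(readLines(f)[bad])
```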

Sometimes a problem can be fixed with fill=TRUE , but only if the missing commas are at the ends of lines.
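A small made-up illustration of that point: fill=TRUE pads a short row with NA, which only helps when the absent fields are trailing ones.

```r
# read.table defaults to fill = FALSE and would error on the short row;
# fill = TRUE pads the missing trailing field with NA instead.
short <- 'a b c\n1 2 3\n4 5\n'
read.table(text = short, header = TRUE, fill = TRUE)   # row 2: c is NA
```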

