I read in a data file that looks like this:
userId,fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith"," john.smith@gmail.com ","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
but the actual file has about 20,000 records. I use the following R code to read it:
user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")
The reason I have quote="" is that without it the import stops prematurely, leaving me with a total of only 9569 observations. Why exactly quote="" overcomes this problem isn't clear to me, but it seems to.
Except that it introduces other problems that I then have to "fix". The first one I saw is that the dates end up as strings that include the quotation marks, and they refuse to convert to actual dates when I apply as.Date() to them.
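For now I'm working around the date problem with something like this (just a sketch of my workaround; it assumes the column comes through as registrationDate, as in the header):

user$registrationDate <- gsub('"', '', user$registrationDate)  # strip the leftover quote characters
user$registrationDate <- as.Date(user$registrationDate)        # the time-of-day part gets dropped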
Now, I could patch up the strings and brute-force my way through, but I'd rather learn more about what I'm doing. Can someone explain:
- Why quote="" fixes the "bad data"
- What the best-practice way is to figure out what makes read.csv stop early (if I just look at the input around the indicated line, I see nothing wrong; I've also put a sketch of what I tried after the sample lines below)
Here are the lines around the problem area. I don't see any damage in them, do you?
"16888","user1"," user1@gmail.com ","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025" "16889","user2"," user2@gmail.com ","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025" "16890","user3"," user3@gmail.com ","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025" "16891","user4"," user4@gmail.com ","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025" "16892","user5"," user5@gmail.com ","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
* Update *
This got harder. Even though the number of imported rows is 9569, if I look at the last few rows they correspond to the last few rows of the data file. Therefore I surmise that something happened during the import that caused many rows to be skipped, 15914 - 9569 = 6345 records in fact. When I have quote="" in there, I get 15914.
So my question can be restated: is there a way to get read.csv to report the lines it decides not to import?
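To make the question concrete, this is roughly what I'd like to automate (a sketch; it assumes the first column comes through as userId in both imports):

all_rows  <- read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote = "")
some_rows <- read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N")
# with quote="" the values keep their double quotes, so strip them before comparing
dropped <- setdiff(gsub('"', '', all_rows$userId), as.character(some_rows$userId))
length(dropped)   # roughly the 6345 missing records, if nothing else is off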
* UPDATE 2 *
@Dwin, I had to remove na.strings="\\N" because the count.fields function doesn't permit it. With that change I get the output below, which looks interesting but which I don't understand.
    3     4    22    23    24
    1    83 15466   178     4
The second command produces a lot of output (and stops when max.print is reached). But here is its first line:
[1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4
I don't understand this. If the output is supposed to show how many fields each input record has, then clearly the first lines all have more than 2, 4, 2, etc. fields. I feel like I'm getting close, but I'm still confused!
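What I plan to try next, building on this, is to pull out the raw lines whose field count isn't the 22 the header promises (a sketch, assuming 22 is indeed the right count):

n   <- count.fields("~/Desktop/dbdump/users.txt", quote = "", sep = ",")
odd <- which(n != 22)   # 1 + 83 + 178 + 4 = 266 suspect lines per the table above
readLines("~/Desktop/dbdump/users.txt")[odd[1:5]]   # eyeball a few of them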