So, I have a data file (semicolon-separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level dataset broken into segments, sub-segments, and sub-sub-segments (totaling ~200 factors) for 40 years. In short, it is huge, and it is not going to fit into memory if I try to simply read it.
So my question is this: given that I want all the counties but only a single year (and just the highest level of segment, which comes to about 100,000 rows in the end), what would be the best way to get this rollup into R?
Currently I'm trying to chop out the irrelevant years with Python, getting around the file-size limit by reading and operating on one line at a time, but I would prefer an R-only solution (CRAN packages are fine). Is there a similar way to read a file a piece at a time in R?
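For reference, the kind of mechanism I have in mind looks roughly like this in R (a minimal sketch; the file name and chunk size are just placeholders): open a connection and pull a fixed number of lines per call, so only one chunk is ever held in memory.

## Minimal sketch of reading a file a piece at a time in R.
## The connection remembers its position, so each readLines() call
## returns the next chunk; only that chunk lives in memory.
con <- file('bad_data.txt', 'rt')   # placeholder file name
while (length(chunk <- readLines(con, n = 10000)) > 0) {
  ## ... filter / process 'chunk' here ...
}
close(con)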
Any ideas would be greatly appreciated.
Update:
- Limitations
- Needs to run on my machine, so no EC2 instances
- As R-only as possible. Speed and resources are not a concern in this case ... provided my machine doesn't explode ...
- As you can see below, the data contains mixed types, which I need to work with later
- Data
- The data is 3.5 GB, with about 8.5 million rows and 17 columns
- A couple thousand rows (~2k) are malformed, with only one column instead of 17
- These are entirely unimportant and can be discarded
- I only need ~100,000 rows from this file (see below)
Sample data:
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC  [Malformed row]
[8.5 Mill rows]
I want to chop out some of the columns and pick two of the 40 available years (2009 and 2010, out of 1980-2020) so that the data can fit into R:
County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]
Results:
After trying all the suggestions that were made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check mark because he supplied a sample implementation.
I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only the columns I want.
It should also be noted that this is much less efficient than Python ... as in, Python chomps through the 3.5 GB file in 5 minutes while R takes about 60 ... but if all you have is R, then this is the ticket.
## Open a connection separately to hold the cursor position
file.in <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')

line <- readLines(file.in, n = 1)
line.split <- strsplit(line, ';')

# Stitching together only the columns we want
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)

## Use a loop to read in the rest of the lines
line <- readLines(file.in, n = 1)
while (length(line)) {
  line.split <- strsplit(line, ';')
  if (length(line.split[[1]]) > 1) {
    if (line.split[[1]][3] == '2009') {
      cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
    }
  }
  line <- readLines(file.in, n = 1)
}
close(file.in)
close(file.out)
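On the speed gap: one way it could probably be narrowed, while staying in R, is to read bigger blocks per readLines() call and filter them with vectorized string operations instead of looping one line at a time. The sketch below is untested and keeps the same placeholder file names and the same single-year filter as the code above; the 100,000-line block size is arbitrary.

## Untested variant: process the file in blocks of 100,000 lines to cut
## down on per-line overhead. Same files and same filter as above.
file.in <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')

## Write the header row with only the columns we want
header <- strsplit(readLines(file.in, n = 1), ';')[[1]]
cat(header[c(1:5, 8)], sep = ';', file = file.out, fill = TRUE)

repeat {
  block <- readLines(file.in, n = 100000)
  if (length(block) == 0) break                 # end of file
  fields <- strsplit(block, ';', fixed = TRUE)
  ## Keep only well-formed rows from the year we want
  keep <- vapply(fields, function(x) length(x) >= 8 && x[3] == '2009', logical(1))
  out <- vapply(fields[keep], function(x) paste(x[c(1:5, 8)], collapse = ';'),
                character(1))
  writeLines(out, file.out)
}

close(file.in)
close(file.out)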
Things I tried that didn't work:
- sqldf
- This is definitely what I'll use for this type of problem in the future if the data is well formed. However, if it's not, SQLite chokes on it (see the sketch after this list for how I'd use it).
- MapReduce
- Honestly, the docs intimidated me a bit on this one, so I never got around to trying it. It also looked like it required the object to be in memory, which would defeat the point if that were the case.
- bigmemory
- This approach linked cleanly to the data, but it can only handle one type at a time. As a result, all my character vectors were dropped when put into a big.matrix. If I need to design large datasets for the future, though, I'd consider using only numbers just to keep this option alive.
- scan
- scan seemed to have similar type issues to bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
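For completeness, here is roughly how I imagine the sqldf route would look if the file were well formed (an untested sketch; the file and column names are taken from the sample above). read.csv.sql() stages the file in a temporary SQLite database and returns only the query result, so the full 3.5 GB never has to fit in R's memory; in my case it's the ~2k malformed rows that trip it up.

library(sqldf)

## Hypothetical sketch: only works if every row has all 17 columns.
## Inside the SQL statement the data is referred to as 'file'.
chopped <- read.csv.sql('bad_data.txt',
                        sql = "select County, State, Year, Quarter, Segment, GDP
                               from file where Year in (2009, 2010)",
                        header = TRUE,
                        sep = ';')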