Trimming a huge (3.5 GB) csv file for reading in R

So I have a data file (semicolon-separated) with a lot of detail and many incomplete rows (which cause Access and SQL to choke). It's a county-level dataset broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) covering 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it in.

So my question is this: I want all the counties, but only a single year (and only the highest level of segment... which works out to about 100,000 rows in the end). What would be the best way to go about pulling this rollup into R?

I'm currently trying to chop out the irrelevant years with Python, getting around the file-size limit by reading and operating one line at a time, but I'd prefer an R-only solution (CRAN packages are fine). Is there a similar way to read files in a piece at a time in R?

Any ideas would be greatly appreciated.

Update:

  • Constraints
    • Needs to use my machine, so no EC2 instances
    • As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
    • As you can see below, the data contains mixed types, which I need to operate on later
  • Data
    • The data is 3.5 GB, with about 8.5 million rows and 17 columns
    • A couple thousand rows (~2k) are malformed, with only one column instead of 17
      • These are entirely unimportant and can be dropped
    • I only need ~100,000 rows out of this file (see below)

Sample data:

    County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
    Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
    Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
    NC  [Malformed row]
    [8.5 Mill rows]

I want to chop out a few columns and pick two of the 40 available years (2009-2010, out of 1980-2020), so that the data can fit into R:

    County; State; Year; Quarter; Segment; GDP; ...
    Ada County;NC;2009;4;FIRE;80.1; ...
    Ada County;NC;2010;1;FIRE;82.5; ...
    [~200,000 rows]

Results:

After trying out all the suggestions, I decided that readLines, proposed by JD and Marek, would work best. I gave Marek the check because he supplied a sample implementation.

I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only the columns I want.

It should also be noted that this is much LESS efficient than Python... as in, Python chomps through the 3.5 GB file in 5 minutes while R takes about 60... but if all you have is R, then this is the ticket.

    ## Open connections separately to hold the cursor position
    file.in <- file('bad_data.txt', 'rt')
    file.out <- file('chopped_data.txt', 'wt')

    ## Read and write the header, stitching together only the columns we want
    line <- readLines(file.in, n = 1)
    line.split <- strsplit(line, ';')
    cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)

    ## Use a loop to read in the rest of the lines
    line <- readLines(file.in, n = 1)
    while (length(line)) {
      line.split <- strsplit(line, ';')
      if (length(line.split[[1]]) > 1) {        # skip the malformed one-column rows
        if (line.split[[1]][3] == '2009') {     # keep only rows whose Year field matches
          cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
        }
      }
      line <- readLines(file.in, n = 1)
    }
    close(file.in)
    close(file.out)

Failings by approach:

  • sqldf
    • This is definitely what I'll use for this type of problem in the future if the data is well formed. However, if it isn't, SQLite chokes.
  • MapReduce
    • To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It also looked like it required the object to be in memory as well, which would defeat the point if that were the case.
  • bigmemory
    • This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large datasets for the future, though, I'd consider using only numbers just to keep this option alive.
  • scan
    • scan seemed to have similar type issues to bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
+83
r csv
Jun 22 '10 at 16:00
13 answers

My attempt with readLines. This piece of code creates a csv containing only the selected years.

    file_in <- file("in.csv", "r")
    file_out <- file("out.csv", "a")
    x <- readLines(file_in, n = 1)
    writeLines(x, file_out) # copy headers

    B <- 300000 # chunk size; depends on how much memory one chunk may take
    while (length(x)) {
      # keep rows whose 3rd field (Year) is 2009 or 2010; the space is optional
      ind <- grep("^[^;]*;[^;]*; ?20(09|10)", x)
      if (length(ind)) writeLines(x[ind], file_out)
      x <- readLines(file_in, n = B)
    }
    close(file_in)
    close(file_out)
+36
Jun 24 '10 at 8:45

Is there a similar way to read in files a piece at a time in R?

Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read the data in a line at a time, you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although, if you feel like firing up a big-memory instance on Amazon EC2, you can get up to 64 GB of RAM. That should hold your file, plus plenty of room to manipulate the data.

If you need more speed, then Shane's recommendation to use MapReduce is a very good one. However, if you go the route of a big-memory instance on EC2, you should look at the multicore package for using all the cores on the machine.

If you find yourself wanting to read many gigs of delimited data into R, you should at least research the sqldf package, which allows you to import directly into sqldf from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
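
For reference, a minimal sketch of the sqldf route, assuming a well-formed semicolon-separated file (the file name is hypothetical, and column names are taken from the question's sample; per the Results section above, it was the malformed rows that tripped sqldf up on this particular file). read.csv.sql loads the file into a temporary SQLite database, so only the query result ever reaches R's memory.

    library(sqldf)

    # The whole file goes into a temporary on-disk SQLite db;
    # only the 2009/2010 rows come back as a data.frame.
    df <- read.csv.sql("bad_data.txt",
                       sql = "select * from file where Year in (2009, 2010)",
                       header = TRUE,
                       sep = ";")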

+10
Jun 22 '10

I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this.

Alternatively, R provides several packages for dealing with big data that goes beyond memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
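
To make that concrete, here is a rough sketch of the bigmemory route (the file and column names are assumptions based on the question's sample). Note the limitation the asker ran into: a big.matrix holds a single atomic type, so the character columns would have to be recoded as numbers beforehand.

    library(bigmemory)

    # File-backed matrix: the 3.5 GB lives on disk, not in RAM.
    # Assumes the character columns (County, Segment, ...) have already
    # been recoded as integers, since a big.matrix is single-typed.
    x <- read.big.matrix("numeric_data.csv", sep = ";", header = TRUE,
                         type = "double",
                         backingfile = "gdp.bin", descriptorfile = "gdp.desc")

    # Pull only the 2009/2010 rows back into an ordinary in-memory matrix
    keep  <- which(x[, "Year"] %in% c(2009, 2010))
    small <- x[keep, ]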

+9
Jun 22 '10 at 16:14

The ff package is a transparent way to work with huge files.

You can take a look at the package site and/or a presentation about it.
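
A rough sketch of how that might look (hypothetical file name; read.table.ffdf streams the file in chunks into an on-disk ffdf object, so RAM usage stays bounded):

    library(ff)

    # Chunked read into an on-disk ffdf object
    ffd <- read.table.ffdf(file = "bad_data.txt", sep = ";", header = TRUE,
                           next.rows = 500000)   # rows read per chunk

    # Pull the Year column into RAM (one column of ~8.5 M values is small),
    # then extract just the 2009/2010 rows as an ordinary data.frame
    idx   <- which(ffd$Year[] %in% c(2009, 2010))
    small <- ffd[idx, ]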

I hope this helps.

+6
Sep 09 '12 at 14:37

You could import your data into an SQLite database and then use RSQLite to select subsets.
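
A sketch of that workflow, assuming the csv has already been imported into a table called gdp (for instance with the sqlite3 command-line .import tool); the database, table, and column names here are made up from the question's sample:

    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), "county_gdp.sqlite")

    # Only the filtered subset ever comes back into R
    subset_df <- dbGetQuery(con, "
      SELECT County, State, Year, Quarter, Segment, GDP
      FROM gdp
      WHERE Year IN (2009, 2010)")

    dbDisconnect(con)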

+5
Jun 22 '10 at 16:36

There is a fairly new package called colbycol that lets you read in only the variables you want from enormous text files:

http://colbycol.r-forge.r-project.org/

It passes any arguments along to read.table, so the combination should let you subset pretty tightly.

+5
May 4 '12 at 12:22

Have you taken a look at bigmemory? Check out this and this.

+4
Jun 22 '10 at 16:30

How about using readr and the read_*_chunked family?

So in your case:

TestFile.CSV

    County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP
    Ada County;NC;2009;4;FIRE;Financial;Banks;80.1
    Ada County;NC;2010;1;FIRE;Financial;Banks;82.5
    lol
    Ada County;NC;2013;1;FIRE;Financial;Banks;82.5

Actual code

    require(readr)
    f <- function(x, pos) subset(x, Year %in% c(2009, 2010))
    read_csv2_chunked("testfile.csv", DataFrameCallback$new(f), chunk_size = 1)

This applies f to each chunk, remembering the column names, and combines the filtered results at the end. See ?callback, which is the source of this example.

This results in:

    # A tibble: 2 × 8
          County State  Year Quarter Segment `Sub-Segment` `Sub-Sub-Segment`   GDP
    *      <chr> <chr> <int>   <int>   <chr>         <chr>             <chr> <dbl>
    1 Ada County    NC  2009       4    FIRE     Financial             Banks   801
    2 Ada County    NC  2010       1    FIRE     Financial             Banks   825

You could even increase chunk_size, but in this example there are only 4 lines.

+4
May 02 '17 at 9:50 a.m.

Perhaps you can migrate to MySQL or PostgreSQL to free yourself from the MS Access limitations.

It is fairly easy to connect R to these systems with a DBI-based database connector (available on CRAN).
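
A sketch of that setup, with made-up connection details and the table/column names from the question's sample; the database holds the full 3.5 GB and R only pulls back the slice it needs:

    library(DBI)
    # library(RMySQL) or library(RPostgreSQL), depending on the backend

    con <- dbConnect(RMySQL::MySQL(),
                     dbname = "county_gdp", host = "localhost",
                     user = "me", password = "secret")

    df <- dbGetQuery(con, "
      SELECT County, State, Year, Quarter, Segment, GDP
      FROM gdp
      WHERE Year IN (2009, 2010)")

    dbDisconnect(con)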

+3
Jun 22 '10 at 16:36

scan() has both an nlines argument and a skip argument. Is there some reason you can't just use those to read in a chunk of lines at a time, checking the date to see whether it's relevant? If the input file is ordered by date, you can store an index that tells you what your skip and nlines values should be, which would speed the process up in the future.
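
A rough sketch of that idea (hypothetical file name), reading 100,000 lines per pass and keeping only the rows whose Year field is 2009 or 2010. Note that with a plain file name, scan() has to skip past the already-read lines on every pass, which is part of why the asker found readLines on an open connection a better fit.

    block <- 100000   # lines per pass
    skip  <- 1        # skip the header row
    keep  <- list()

    repeat {
      chunk <- scan("bad_data.txt", what = character(), sep = "\n",
                    skip = skip, nlines = block, quiet = TRUE)
      if (length(chunk) == 0) break
      # Year is the 3rd semicolon-separated field
      keep[[length(keep) + 1]] <- grep("^[^;]*;[^;]*;20(09|10);", chunk, value = TRUE)
      skip <- skip + block
    }
    rows <- unlist(keep)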

+3
Jun 23 '10 at 20:38

These days, 3.5 GB just isn't that big; I can access a machine with 244 GB of RAM (r3.8xlarge) in the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem with big-data-type solutions? How much is your time worth? Yes, it will take you an hour or two to figure out how to use AWS, but you can learn the basics on a free tier, upload the data, read the first 10k lines into R to check that it works, and then fire up a big-memory instance like r3.8xlarge and read it all in! Just my 2c.

+1
Oct 12 '14 at 16:20

Nowadays (2017), I would suggest going for Spark and its R interface.

  • the syntax can be written in a simple, dplyr-like manner

  • it copes well enough with small memory (small in the 2017 sense of the word)

However, it can be an intimidating experience to get started...
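
For illustration, a minimal sketch using sparklyr, one of the R front ends to Spark (the file name is hypothetical, and a local Spark installation is assumed; spark_install() can set one up):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    gdp <- spark_read_csv(sc, name = "gdp", path = "bad_data.txt",
                          delimiter = ";", header = TRUE)

    small <- gdp %>%
      filter(Year %in% c(2009, 2010)) %>%
      select(County, State, Year, Quarter, Segment, GDP) %>%
      collect()   # only the filtered subset is pulled into R

    spark_disconnect(sc)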

0
Sep 26 '17 at 3:39

I would go for a DB and then run some queries via DBI to extract the samples you need.

Please take care before importing a 3.5 GB csv file into SQLite. Or at least double-check that your HUGE db fits within SQLite's limits: http://www.sqlite.org/limits.html

It's a damn big DB you've got there. I would go for MySQL if you need speed. But be prepared to wait many hours for the import to finish, unless you have some unconventional hardware or you're writing from the future...

Amazon EC2 could also be a good solution for spinning up a server instance running R and MySQL.

My two humble cents.

-2
Jun 22 '10 at 21:14


