How to use zoo or xts with big data?

How can I use the zoo or xts R packages with very large datasets (100 GB)? I know there are packages such as bigrf, ff, and bigmemory that can handle data of this size, but they only support their own limited set of commands and do not provide zoo or xts functionality, and I don't know how to make zoo or xts work with them. How can I do this?

I have also seen database-related tools such as sqldf, HadoopStreaming, RHadoop, and some others used by Revolution R. What do you recommend, or are there other options?

I just want to merge the series, clean them, and do some cointegration analysis and plots. I would rather not have to code and implement new functions for every operation I need, working on small pieces of the data each time.

Added: I'm on Windows.

1 answer

I had a similar problem (although I was only working with 9-10 GB). My experience is that R cannot process that much data on its own, especially since your dataset contains time-series data.

If your dataset contains many zeros, you can handle it using sparse matrices - see the Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ); this guide may also come in handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ ).
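As a minimal sketch of the sparse-matrix idea (the dimensions and density here are made up for illustration), the Matrix package stores only the non-zero entries, so a mostly-zero dataset takes a small fraction of the memory of its dense equivalent:

```r
library(Matrix)

# Hypothetical mostly-zero data: a 10,000 x 1,000 matrix with
# only ~10,000 non-zero cells (about 0.1% density)
set.seed(1)
i <- sample(1:10000, 10000, replace = TRUE)   # row indices of non-zeros
j <- sample(1:1000,  10000, replace = TRUE)   # column indices
x <- rnorm(10000)                             # the non-zero values
m <- sparseMatrix(i = i, j = j, x = x)

# Compare memory footprints: the sparse form stores only the
# non-zero triplets, the dense form stores every cell
print(object.size(m))             # small
print(object.size(as.matrix(m))) # orders of magnitude larger
```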

I used PostgreSQL - the corresponding R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It lets you query your PostgreSQL database using SQL syntax, and the results are loaded into R as a data frame. It can be slow (depending on the complexity of your query), but it is reliable and convenient for data aggregation.
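A sketch of that workflow, assuming a hypothetical database "tsdb" with a table "prices"(ts timestamp, price numeric) - all names and credentials here are placeholders. The point is to let the database do the aggregation and pull back only a result set small enough for zoo/xts:

```r
library(RPostgreSQL)
library(xts)

# Connect to the (hypothetical) database
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "tsdb", host = "localhost",
                 user = "me", password = "secret")

# Aggregate inside PostgreSQL; only the daily averages come back to R
df <- dbGetQuery(con,
  "SELECT date_trunc('day', ts) AS day, avg(price) AS price
     FROM prices
    WHERE ts >= '2011-01-01'
    GROUP BY 1
    ORDER BY 1")

# The modest result set fits in memory and converts cleanly to xts
series <- xts(df$price, order.by = as.Date(df$day))

dbDisconnect(con)
```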

Disadvantage: you need to load the data into the database first. Your raw data should be clean and stored in a readable format (txt/csv). This will probably be the biggest problem if your data is not yet in a reasonable format. Loading "good" data into the database, however, is easy (see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV files into a PostgreSQL table? ).
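For the loading step, one option that stays entirely in R is dbWriteTable from RPostgreSQL, reading the CSV in chunks so the 100 GB file never has to fit in memory at once. The file and table names below are hypothetical, and for the fastest bulk loads PostgreSQL's own COPY command (linked above) is still preferable:

```r
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "tsdb", host = "localhost",
                 user = "me", password = "secret")

# Read and append the CSV one chunk at a time
chunk_size <- 1e6
skip <- 0
header <- names(read.csv("prices.csv", nrows = 1))
repeat {
  chunk <- read.csv("prices.csv", skip = skip + 1, nrows = chunk_size,
                    header = FALSE, col.names = header)
  if (nrow(chunk) == 0) break
  dbWriteTable(con, "prices", chunk, append = TRUE, row.names = FALSE)
  skip <- skip + nrow(chunk)
  if (nrow(chunk) < chunk_size) break   # last (partial) chunk
}

dbDisconnect(con)
```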

I would recommend PostgreSQL or any other relational database for your task. I have not tried Hadoop, but CouchDB nearly drove me round the bend. Stick with good old SQL.

