How to use tidyr (or similar reshaping tools) on "big" data in a PostgreSQL database (Redshift)

I have 4 billion rows of data in a 12-node Redshift cluster. I can successfully connect to it with the RPostgreSQL package and use dplyr to manipulate the underlying data.

However, I would like to make some changes to the data that I would normally do with reshape2 (dcast) or tidyr (spread). Neither package works on my database object. I could run collect(), but that would be problematic because the resulting data frame would be too large to fit in memory (hence the reason I want to work in the database). My overall goal is to use dcast/spread to make the data wider, creating 0/1 flags in the process. This works like a charm with small data samples on my machine (see the sketch below), but not on the DB.
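
To make the goal concrete, here is a minimal in-memory sketch of the widening-with-0/1-flags pattern; the id and category values are made up for illustration:

    library(dplyr)
    library(tidyr)

    sample_df <- data.frame(
      id       = c(1, 1, 2),
      category = c("click", "purchase", "click"),
      stringsAsFactors = FALSE
    )

    # One row per id, one 0/1 column per category level
    sample_df %>%
      mutate(flag = 1L) %>%
      spread(category, flag, fill = 0L)
    #>   id click purchase
    #> 1  1     1        1
    #> 2  2     1        0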

Below is the code that works for me: the database connection and basic filtering with dplyr. When I try to use tidyr / reshape2 on the same object, R throws errors to the effect that the object's type is not recognized.

    Redshift <- src_postgres('dev',
                             host = 'xxx.aws.com',
                             port = 5439,
                             user = "user",
                             password = "pwd")

    ### create table reference ###
    df <- tbl(Redshift, "df_cj_allact")

    # simple and default R commands analyzing data frames
    dim(df)
    colnames(df)
    head(df)

    df2 <- df %>%
      filter(id != '0') %>%
      arrange(id, timestamp, category)  # seems to work! 2157398, was 2306109 (6% loss)
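
For reference, a call of roughly this shape is what fails on the database-backed table (the flag column is made up for illustration); the same pipeline runs fine on a local data frame:

    library(tidyr)

    # Fails on a tbl_sql such as df2: tidyr only has methods for
    # in-memory data frames, so spread() cannot be translated to SQL
    df_wide <- df2 %>%
      mutate(flag = 1L) %>%
      spread(category, flag, fill = 0L)
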
1 answer

The tidyr package does not support a database backend; it can only manipulate in-memory data. dplyr works with database tables as well as in-memory objects. You could use a machine with more memory (say, on AWS) and reshape with data.table, or think about sharding the data.
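
That said, the 0/1-flag widening itself can often be pushed into the database with plain dplyr, since conditional aggregation translates to SQL (MAX(CASE WHEN ...)). A minimal sketch, assuming category takes values such as 'click' and 'purchase' (hypothetical levels; one line per level is needed because SQL requires the output columns to be fixed up front):

    library(dplyr)

    df_wide <- df2 %>%
      group_by(id) %>%
      summarise(
        flag_click    = max(ifelse(category == 'click',    1L, 0L)),
        flag_purchase = max(ifelse(category == 'purchase', 1L, 0L))
      )

    # The heavy lifting happens in Redshift; collect() only the result
    df_wide_local <- collect(df_wide)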

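And if you do go the bigger-machine route, the equivalent in-memory reshape with data.table might look like this (a sketch, viable only if the collected data fits in RAM):

    library(data.table)

    dt <- as.data.table(collect(df2))  # requires enough local memory

    # Count occurrences per (id, category), then clamp counts to 0/1 flags
    wide <- dcast(dt, id ~ category, fun.aggregate = length, value.var = "category")
    flag_cols <- setdiff(names(wide), "id")
    wide[, (flag_cols) := lapply(.SD, function(x) as.integer(x > 0)), .SDcols = flag_cols]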
