I have 4 billion rows of data in a 12-node Redshift cluster. I can connect to it successfully with the RPostgreSQL package and use dplyr for basic data manipulation.
However, I would like to reshape the data, which I would normally do with reshape2 (dcast) or tidyr (spread). I have found that neither package works on my database object. I could run "collect", but that would be a problem because the resulting data frame would be too large to fit in memory (which is why I want to do the work in the database). My overall goal is to use dcast / spread to make the data wider while creating 0/1 flags in the process. This works like a charm on small data samples on my machine, but not on the database.
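To make the goal concrete, here is a minimal sketch of the reshaping I mean, run on a small local sample. The data and the column names `id` and `activity` are made up for illustration and are not the real columns of my table.

```r
library(dplyr)
library(tidyr)

# Hypothetical local sample: one row per (id, activity) event.
sample_df <- data.frame(
  id = c(1, 1, 2, 3),
  activity = c("a", "b", "a", "c"),
  stringsAsFactors = FALSE
)

# Widen to one row per id, with one 0/1 flag column per activity.
wide <- sample_df %>%
  distinct(id, activity) %>%          # guard against duplicate events
  mutate(flag = 1L) %>%               # 1 where the activity occurred
  spread(activity, flag, fill = 0L)   # missing combinations become 0
```

This gives one row per `id` and one flag column per distinct `activity` value. It is exactly this step that I cannot get to run against the database table without collecting it first.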
Below is the code that does work for me: the database connection and basic filtering with dplyr. When I try to use tidyr / reshape2 on the result, R throws syntax errors saying the type is not recognized.
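library(dplyr)
# Connect to the Redshift cluster through dplyr's PostgreSQL backend
Redshift <- src_postgres('dev', host = 'xxx.aws.com', port = 5439, user = "user", password = "pwd")
# Lazy reference to the remote table; no rows are pulled into R here
df <- tbl(Redshift, "df_cj_allact")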