How to select from multiple tables in a dataset in Big Query using dplyr and bigrquery?

I am trying to query multiple tables from a dataset in BigQuery using dplyr and bigrquery. The dataset contains several tables, one per day of data for the year. I can run a query against a single table (i.e. one day of data) with the following code, but I cannot get it to work across several tables at once (e.g. a month or a year of data). Any help would be greatly appreciated.

 connection <- src_bigquery("my_project", "dataset1")
 first_day <- connection %>%
   tbl("20150101") %>%
   select(field1) %>%
   group_by(field1) %>%
   summarise(number = n()) %>%
   arrange(desc(number))

Thanks,

Juan

3 answers

As far as I know, dplyr and bigrquery do not support table wildcard functions at the moment. If you are not afraid of an ugly hack, you can extract the query that dplyr builds and sends to BigQuery, and edit it so that it points at several tables instead of one.

Set your billing information and connect to BigQuery:

 my_billing <- ##########
 bq_db <- src_bigquery(
   project = "bigquery-public-data",
   dataset = "noaa_gsod",
   billing = my_billing
 )
 gsod <- tbl(bq_db, "gsod1929")

Selecting from a single table (for comparison only):

 gsod %>%
   filter(stn == "030750") %>%
   select(year, mo, da, temp) %>%
   collect

 Source: local data frame [92 x 4]

     year    mo    da  temp
    (chr) (chr) (chr) (dbl)
 1   1929    10    01  45.2
 2   1929    10    02  49.2
 3   1929    10    03  48.2
 4   1929    10    04  43.5
 5   1929    10    05  42.0
 6   1929    10    06  51.0
 7   1929    10    07  48.0
 8   1929    10    08  43.7
 9   1929    10    09  45.1
 10  1929    10    10  51.3
 ..   ...   ...   ...   ...

Selecting from several tables by manually editing the query that dplyr generates:

 multi_query <- gsod %>%
   filter(stn == "030750") %>%
   select(year, mo, da, temp) %>%
   dplyr:::build_query(.)

 multi_tables <- paste("[bigquery-public-data:noaa_gsod.gsod", c(1929, 1930), "]",
                       sep = "", collapse = ", ")

 query_exec(
   query = gsub("\\[gsod1929\\]", multi_tables, multi_query$sql),
   project = my_billing
 ) %>% tbl_df

 Source: local data frame [449 x 4]

     year    mo    da  temp
    (chr) (chr) (chr) (dbl)
 1   1930    06    11  51.8
 2   1930    05    20  46.8
 3   1930    05    21  48.5
 4   1930    07    04  56.0
 5   1930    08    08  54.5
 6   1930    06    06  52.0
 7   1930    01    14  36.8
 8   1930    01    27  32.9
 9   1930    02    08  35.6
 10  1930    02    11  38.5
 ..   ...   ...   ...   ...

Verifying the results:

 table(.Last.value$year)

 1929 1930
   92  357
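As a less fragile variant of the same hack, legacy BigQuery SQL also provides the `TABLE_DATE_RANGE` function, which expands to the union of all date-suffixed tables in a range, so there is no comma-separated table list to paste together by hand. A sketch, assuming the daily tables are named `YYYYMMDD` inside `dataset1` as in the question (`field1`, the project, and the date range are the question's placeholders):

```r
library(bigrquery)

# Sketch: legacy SQL's TABLE_DATE_RANGE unions every table whose name is the
# given prefix plus a YYYYMMDD date inside the range. Here the "prefix" is
# just the dataset, because the tables are named 20150101, 20150102, ...
sql <- "
SELECT field1, COUNT(*) AS number
FROM TABLE_DATE_RANGE([my_project:dataset1.],
                      TIMESTAMP('2015-01-01'),
                      TIMESTAMP('2015-01-31'))
GROUP BY field1
ORDER BY number DESC
"

# Sent directly with query_exec, bypassing dplyr's SQL generation entirely.
january <- query_exec(query = sql, project = "my_project")
```

This only works with legacy SQL (the `[project:dataset.table]` bracket syntax), which is what `query_exec` used by default at the time.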

Standard BigQuery SQL supports wildcard tables. Adapting the example from the question, the following R code selects all of the daily tables in the dataset.

 library(dplyr)
 library(bigrquery)

 connection <- src_bigquery("my_project", "dataset1")
 multi_days <- connection %>%
   tbl("*") %>%
   select(field1) %>%
   group_by(field1) %>%
   summarise(number = n()) %>%
   arrange(desc(number))
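The same wildcard mechanism can be narrowed to a date range: the pseudo-column `_TABLE_SUFFIX` holds whatever part of the table name the `*` matched, so filtering on it restricts which daily tables are scanned. A sketch, assuming the daily tables are named `YYYYMMDD` as in the question (project, dataset, and `field1` are the question's placeholders):

```r
library(dplyr)
library(bigrquery)

# Sketch: match tables 2015* and keep only January via _TABLE_SUFFIX,
# which here contains the "MMDD" part of each table name.
connection <- src_bigquery("my_project", "dataset1")
january <- connection %>%
  tbl("2015*") %>%
  filter(`_TABLE_SUFFIX` %>% between("0101", "0131")) %>%
  group_by(field1) %>%
  summarise(number = n()) %>%
  arrange(desc(number))
```

Because `_TABLE_SUFFIX` is compared as a string, zero-padded month/day names are what make the `between()` range behave as a date range.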

Here is another example using one of the BigQuery public datasets. In this case only a subset of the tables is selected, those between 1994 and 2000, and the query computes the average temperature for each year. (Note: you will need to change the billing value to your own BigQuery project ID in order to run the query.)

 library(dplyr)
 library(bigrquery)

 bq_src <- src_bigquery(
   project = "bigquery-public-data",
   dataset = "noaa_gsod",
   billing = "api-project-123456789"
 )

 results <- bq_src %>%
   tbl("gsod*") %>%
   filter(`_TABLE_SUFFIX` %>% between("1994", "2000")) %>%
   group_by(year) %>%
   summarise(temp = mean(temp, na.rm = TRUE)) %>%
   arrange(year)

 print(results)

What about the `list_tabledata` function in bigrquery? I tested the following piece of code with the same table naming scheme as yours, and it leaves as many .RData files in your working directory as there are days in your date range.

 library(bigrquery)

 project <- "my_project"
 dataset <- "dataset1"
 day <- seq(from = as.Date("20150101", format = "%Y%m%d"),
            to   = as.Date("20150131", format = "%Y%m%d"),
            by   = "days")
 for (i in seq_along(day)) {
   table_name <- format(day[i], "%Y%m%d")
   t <- list_tabledata(project, dataset, table_name, max_pages = Inf)
   # paste0 avoids the stray space that paste() would insert in the file name
   save(t, file = paste0(table_name, ".RData"))
 }
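If the goal is to query a month at once rather than archive each day, the same loop can collect every day's rows into a single data frame with `dplyr::bind_rows()` instead of writing one .RData file per day. A sketch using the question's placeholder project, dataset, and daily `YYYYMMDD` table names:

```r
library(bigrquery)
library(dplyr)

project <- "my_project"
dataset <- "dataset1"

# One table name per day of January 2015, e.g. "20150101", "20150102", ...
days   <- seq(from = as.Date("2015-01-01"), to = as.Date("2015-01-31"), by = "days")
tables <- format(days, "%Y%m%d")

# Download each daily table and stack the results into one data frame.
month <- bind_rows(lapply(tables, function(tbl) {
  list_tabledata(project, dataset, tbl, max_pages = Inf)
}))
```

Note this downloads every row of every table client-side, so it suits small daily tables; for large data the wildcard or multi-table query approaches above push the aggregation into BigQuery instead.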

Hope it works!
Lourdes Hernandez

