Dataproc + BigQuery examples - any available?

According to the Dataproc docs, it has "built-in and automatic integration with BigQuery."

I have a table in BigQuery. I want to read that table and run some analysis on it using the Dataproc cluster I created (via a PySpark job), and then write the results of that analysis back into BigQuery. You may ask, "Why not just do the analysis in BigQuery directly!?" - the reason is that we build complex statistical models, and SQL is too high-level to develop them in. We need something like Python or R, ergo Dataproc.

Are there any examples of Dataproc + BigQuery usage? I cannot find any.

+9
google-cloud-platform google-bigquery google-cloud-dataproc
4 answers

For starters, as noted in this question, the BigQuery connector is pre-installed on Cloud Dataproc.

Here is an example of how to read data from BigQuery into Spark. In this example, we read data from BigQuery to perform a word count. The data is read from BigQuery into Spark using SparkContext.newAPIHadoopRDD; the Spark documentation has more information about that method.

    import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
    import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
    import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
    import com.google.gson.JsonObject
    import org.apache.hadoop.io.LongWritable

    val projectId = "<your-project-id>"
    val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
    val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
    val outputTableSchema =
      "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
    val jobName = "wordcount"

    val conf = sc.hadoopConfiguration

    // Set the job-level projectId.
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

    // Use the systemBucket for temporary BigQuery export data used by the InputFormat.
    val systemBucket = conf.get("fs.gs.system.bucket")
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

    // Configure input and output for BigQuery access.
    BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
    BigQueryConfiguration.configureBigQueryOutput(
      conf, fullyQualifiedOutputTableId, outputTableSchema)

    val fieldName = "word"

    val tableData = sc.newAPIHadoopRDD(
      conf,
      classOf[GsonBigQueryInputFormat],
      classOf[LongWritable],
      classOf[JsonObject])
    tableData.cache()
    tableData.count()
    tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)

You will need to adapt this example to your own settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
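The snippet above stops after previewing a few records. As a minimal sketch of the actual word count (assuming the publicdata:samples.shakespeare schema, which has word and word_count fields), you could continue like this:

    // Sketch only: sum the per-corpus counts for each word across the whole table.
    val wordCounts = tableData
      .map(entry => entry._2)                               // keep the JsonObject record
      .map(json => (json.get(fieldName).getAsString,        // the "word" field
                    json.get("word_count").getAsLong))      // its count in one corpus
      .reduceByKey(_ + _)                                   // sum counts across corpora
    wordCounts.take(10).foreach(println)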

Finally, if you end up using the BigQuery connector with MapReduce, this page has examples of how to write MapReduce jobs with it.

+8

You can also use the spark-bigquery connector https://github.com/samelamin/spark-bigquery to query BigQuery directly from Spark on Dataproc.
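For reference, a hedged usage sketch based on that project's README (the method names and the project/bucket placeholders below are assumptions and may differ between releases):

    import com.samelamin.spark.bigquery._

    val sqlContext = spark.sqlContext
    sqlContext.setBigQueryProjectId("<your-project-id>")
    sqlContext.setBigQueryGcsBucket("<your-temp-bucket>")

    // Run a query directly against BigQuery and get the result back as a DataFrame.
    val df = sqlContext.bigQuerySelect(
      "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")

    // Save the analysis results back into a BigQuery table.
    df.saveAsBigQueryTable("<your-project-id>:<your-dataset>.<your-table>")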

+1

The above example does not show how to write data to the output table. You need to do this:

    .saveAsNewAPIHadoopFile(
      hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
      classOf[String],
      classOf[JsonObject],
      classOf[BigQueryOutputFormat[String, JsonObject]],
      hadoopConf)

where the key (String) is actually ignored.
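Putting the pieces together, here is a hedged sketch of the full write path. It assumes a wordCounts RDD of (word, count) pairs like the one sketched under the first answer, and it assumes the BigQueryOutputFormat class name from the pre-installed connector (the class name has changed across connector releases, so check the version on your cluster):

    import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, BigQueryOutputFormat}
    import com.google.gson.JsonObject

    val hadoopConf = sc.hadoopConfiguration

    // Shape the results into the (key, JsonObject) pairs that BigQueryOutputFormat
    // expects; the String key is ignored, only the JsonObject record is written.
    val outputRDD = wordCounts.map { case (word, count) =>
      val json = new JsonObject()
      json.addProperty("Word", word)
      json.addProperty("Count", count)
      (word, json)
    }

    outputRDD.saveAsNewAPIHadoopFile(
      hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
      classOf[String],
      classOf[JsonObject],
      classOf[BigQueryOutputFormat[String, JsonObject]],
      hadoopConf)

This writes to the table configured earlier via BigQueryConfiguration.configureBigQueryOutput, staging the data in GCS along the way.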

0

https://github.com/GoogleCloudPlatform/spark-bigquery-connector is the new connector, currently in beta. It exposes BigQuery through the Spark data source API and is easy to use.
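A minimal sketch of what reading and writing look like through that data source API (the table and bucket names are placeholders, and the option names may change while the connector is in beta):

    // Read a BigQuery table into a DataFrame.
    val df = spark.read.format("bigquery")
      .option("table", "bigquery-public-data.samples.shakespeare")
      .load()

    // Do the analysis in Spark.
    df.createOrReplaceTempView("shakespeare")
    val counts = spark.sql(
      "SELECT word, SUM(word_count) AS word_count FROM shakespeare GROUP BY word")

    // Write the results back; the connector stages data through a temporary GCS bucket.
    counts.write.format("bigquery")
      .option("table", "<your-dataset>.<your-output-table>")
      .option("temporaryGcsBucket", "<your-temp-bucket>")
      .save()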

0
