BigQuery - check if table exists

I have a dataset in BigQuery. This dataset contains several tables.

I follow these steps programmatically using the BigQuery API:

  • Query tables in a dataset. Since my result set is too large, I turn on the allowLargeResults parameter and redirect the output to a destination table.

  • Then I export the data from the destination table to a GCS bucket (roughly as in the sketch below).
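For context, a minimal sketch of step 2 using the google-cloud-bigquery client (the dataset, table, and bucket names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    # Export the step 1 destination table to a GCS bucket.
    table_ref = client.dataset('my_dataset').table('xyz')
    extract_job = client.extract_table(table_ref, 'gs://my_bucket/xyz-*.csv')
    extract_job.result()  # wait for the export to finish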

Requirements:

  • Assuming my process failed in step 2, I would like to restart this step.

  • But before restarting, I would like to check / verify that a specific target table named "xyz" already exists in the dataset.

  • If it exists, I would like to re-run step 2.

  • If it does not exist, I would like to do foo.

How can I do this?

Thanks in advance.

+12
google-cloud-storage export google-api google-bigquery
6 answers

Here is a snippet of Python code that will tell you whether a table exists (deleting it in the process, so be careful!):

    def doesTableExist(project_id, dataset_id, table_id):
        bq.tables().delete(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return False

On the other hand, if you do not want to delete the table in the process, you can try:

    from googleapiclient.errors import HttpError

    def doesTableExist(project_id, dataset_id, table_id):
        try:
            bq.tables().get(
                projectId=project_id,
                datasetId=dataset_id,
                tableId=table_id).execute()
            return True
        except HttpError as err:
            if err.resp.status != 404:
                raise
            return False

If you want to know where bq comes from, you can create it with build_bq_client, defined here: http://code.google.com/p/bigquery-e2e/source/browse/samples/ch12/auth.py
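For illustration, a minimal sketch of how such a bq service object can be built with the google-api-python-client library (credential setup is omitted; the linked auth.py shows the complete build_bq_client()):

    from googleapiclient.discovery import build

    # `credentials` is assumed to be an already-obtained OAuth2 credentials object.
    bq = build('bigquery', 'v2', credentials=credentials)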

In general, if you are using this to decide whether you should run a job that will modify the table, it can be a better idea to just run the job anyway and use WRITE_TRUNCATE as the write disposition.

Another approach is to create a predictable job id and retry the job with that id. If the job already exists, then it has already run (though you may want to double-check that it did not fail).
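For example, both ideas might look like this with the google-cloud-bigquery client (a sketch; the query, table, and job id are made-up placeholders):

    from google.cloud import bigquery
    from google.cloud.exceptions import Conflict

    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig()
    job_config.destination = client.dataset('my_dataset').table('xyz')
    # WRITE_TRUNCATE makes the job safe to repeat: it overwrites the table.
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

    try:
        # A predictable job id means a retry cannot start the job twice.
        job = client.query('SELECT ...',  # placeholder query
                           job_config=job_config,
                           job_id='extract-prep-xyz-20180101')
        job.result()
    except Conflict:
        # The job id already exists, so the job already ran;
        # fetch it to inspect its state.
        job = client.get_job('extract-prep-xyz-20180101')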

+9

Alex F's solution works on v0.27, but will not work in later versions. For v0.28+, the solution below will work.

    from google.cloud import bigquery
    from google.cloud.exceptions import NotFound

    project_nm = 'gc_project_nm'
    dataset_nm = 'ds_nm'
    table_nm = 'tbl_nm'

    client = bigquery.Client(project_nm)
    dataset = client.dataset(dataset_nm)
    table_ref = dataset.table(table_nm)

    def if_tbl_exists(client, table_ref):
        try:
            client.get_table(table_ref)
            return True
        except NotFound:
            return False

    if_tbl_exists(client, table_ref)
+10

Enjoy:

    from googleapiclient.errors import HttpError

    def doesTableExist(bigquery, project_id, dataset_id, table_id):
        try:
            bigquery.tables().get(
                projectId=project_id,
                datasetId=dataset_id,
                tableId=table_id).execute()
            return True
        except HttpError as err:
            # Anything other than "not found" is a real error.
            if err.resp.status != 404:
                raise
            return False

Edit: the exception handling has been corrected.

+1

With my_bigquery being an instance of the google.cloud.bigquery.Client class (already authenticated and associated with a project):

    my_bigquery.dataset(dataset_name).table(table_name).exists()  # returns boolean

It makes an API call to check for the existence of the table via a GET request.

Source: https://googlecloudplatform.github.io/google-cloud-python/0.24.0/bigquery-table.html#google.cloud.bigquery.table.Table.exists

This works for me with version 0.27 of the Google BigQuery Python module.

0

Inline SQL alternative

Tarheel's answer is probably the most correct at the moment, but I took into account Ivan's comment above that "404 can also mean that a resource does not exist for a number of reasons", so here is a solution that always successfully executes a metadata query and returns a result.

This is not the fastest approach, because it always has to execute a query, and BigQuery has overhead for small queries.

The trick I have seen before is to query the table metadata for the object and UNION it with a dummy row, which guarantees that a record is always returned even if the object does not exist. There is also a LIMIT 1 and an ORDER BY, so that the single returned record represents the table if it exists. See the SQL in the code below.

  • Although the docs claim that BigQuery standard SQL is ISO compatible, it does not support information_schema; instead there is __TABLES_SUMMARY__
  • a dataset is needed, because you cannot query __TABLES_SUMMARY__ without specifying one
  • the dataset is not a parameter in the SQL, because you cannot parameterize object names without SQL-injection problems (apart from the magic _TABLE_SUFFIX, see https://cloud.google.com/bigquery/docs/querying-wildcard-tables )
    #!/usr/bin/env python
    """
    Inline SQL way to check a table exists in BigQuery

    e.g.
    print(table_exists(dataset_name='<dataset_goes_here>', table_name='<real_table_name>'))
    True
    print(table_exists(dataset_name='<dataset_goes_here>', table_name='imaginary_table_name'))
    False
    """
    from __future__ import print_function
    from google.cloud import bigquery


    def table_exists(dataset_name, table_name):
        client = bigquery.Client()
        query = """
            SELECT table_exists
            FROM (
                SELECT true AS table_exists, 1 AS ordering
                FROM __TABLES_SUMMARY__
                WHERE table_id = @table_name
                UNION ALL
                SELECT false AS table_exists, 2 AS ordering
            )
            ORDER BY ordering
            LIMIT 1"""
        query_params = [bigquery.ScalarQueryParameter('table_name', 'STRING', table_name)]
        job_config = bigquery.QueryJobConfig()
        job_config.query_parameters = query_params
        if dataset_name is not None:
            dataset_ref = client.dataset(dataset_name)
            job_config.default_dataset = dataset_ref
        query_job = client.query(query, job_config=job_config)
        results = query_job.result()
        for row in results:
            # There is only one row because of LIMIT 1 in the SQL
            return row.table_exists
0

You can now use exists() to check whether a dataset exists, in the same way as for a table in BigQuery.
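For instance (a sketch assuming the same 0.2x-era client API as the answer above, where dataset and table objects expose exists(); the names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    dataset = client.dataset('my_dataset')
    print(dataset.exists())               # True if the dataset exists
    print(dataset.table('xyz').exists())  # True if the table exists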

0
