BigQuery: create a JSON data type column

I am trying to load JSON with the following structure into BigQuery:

 {
   "key_a": "value_a",
   "key_b": { "key_c": "value_c", "key_d": "value_d" },
   "key_e": { "key_f": "value_f", "key_g": "value_g" }
 }

The keys under key_e are dynamic; that is, for one record key_e will contain key_f and key_g, while for another it will contain key_h and key_i. New keys can be created at any time, so I cannot define a RECORD with nullable fields for every possible key.

Instead, I want to create a column with a JSON data type that I can then query with the JSON_EXTRACT() function. I tried loading key_e as a STRING column, but value_e is parsed as JSON and the load fails.

How can I load a JSON section into a single BigQuery column when there is no JSON data type?

+5
3 answers

Storing JSON as a single string column inside BigQuery is definitely an option. But if you have a large amount of data, it can drive up query cost, since all of your data ends up in one column, and the query logic can get quite messy.

If you have the luxury of changing your "design" slightly, I would recommend the approach below, which uses the REPEATED mode.

Table layout:

 [
   { "name": "key_a", "type": "STRING" },
   { "name": "key_b", "type": "RECORD", "mode": "REPEATED", "fields": [
     { "name": "key",   "type": "STRING" },
     { "name": "value", "type": "STRING" }
   ] },
   { "name": "key_e", "type": "RECORD", "mode": "REPEATED", "fields": [
     { "name": "key",   "type": "STRING" },
     { "name": "value", "type": "STRING" }
   ] }
 ]

JSON loading example

 {"key_a": "value_a1", "key_b": [{"key": "key_c", "value": "value_c"}, {"key": "key_d", "value": "value_d"}], "key_e": [{"key": "key_f", "value": "value_f"}, {"key": "key_g", "value": "value_g"}]}
 {"key_a": "value_a2", "key_b": [{"key": "key_x", "value": "value_x"}, {"key": "key_y", "value": "value_y"}], "key_e": [{"key": "key_h", "value": "value_h"}, {"key": "key_i", "value": "value_i"}]}

Note: the file must be newline-delimited JSON; that is, each record must sit on a single line of its own.
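If your source data is still in the nested form from the question, a small script can repack the dynamic keys into the key/value layout above and emit newline-delimited JSON ready for a load job. A minimal sketch in JavaScript (the `repack` helper and the sample input are illustrative, not part of the original answer):

```javascript
// Repack an object with dynamic keys ({key_f: "value_f", ...})
// into the REPEATED key/value layout ([{key: "key_f", value: "value_f"}, ...]).
function repack(record) {
  const toPairs = (obj) =>
    Object.entries(obj).map(([key, value]) => ({ key, value }));
  return {
    key_a: record.key_a,
    key_b: toPairs(record.key_b),
    key_e: toPairs(record.key_e),
  };
}

// Emit newline-delimited JSON: JSON.stringify puts each record on one line.
const records = [
  { key_a: "value_a1",
    key_b: { key_c: "value_c", key_d: "value_d" },
    key_e: { key_f: "value_f", key_g: "value_g" } },
];
const ndjson = records.map((r) => JSON.stringify(repack(r))).join("\n");
console.log(ndjson);
```

Because the repacked keys are ordinary STRING values, new dynamic keys need no schema change: they simply become new rows in the repeated record.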

+5

You cannot do this directly in BigQuery, but you can make it work in two passes:

(1) Import the JSON data as a CSV file with a single string column.

(2) Transform each row to pack the "any type" field into a string. Write a UDF that takes a row and emits the final set of columns you want, then append the result of that query to the destination table.

Example

I'll start with this JSON:

 {"a": 0, "b": "zero", "c": { "woodchuck": "a"}}
 {"a": 1, "b": "one", "c": { "chipmunk": "b"}}
 {"a": 2, "b": "two", "c": { "squirrel": "c"}}
 {"a": 3, "b": "three", "c": { "chinchilla": "d"}}
 {"a": 4, "b": "four", "c": { "capybara": "e"}}
 {"a": 5, "b": "five", "c": { "housemouse": "f"}}
 {"a": 6, "b": "six", "c": { "molerat": "g"}}
 {"a": 7, "b": "seven", "c": { "marmot": "h"}}
 {"a": 8, "b": "eight", "c": { "badger": "i"}}

Import it into BigQuery as a CSV with a single STRING column (I called it "blob"). I had to set the delimiter character to something arbitrary and unlikely (a thorn, 'þ'); otherwise it defaulted to ',' and split the data.

Verify that the table imported correctly. You should see your simple single-column schema, and the preview should look just like your source file.

Next, write a query to convert it into the desired shape. For this example, we want the following schema:

 a (INTEGER)
 b (STRING)
 c (STRING -- packed JSON)

We can do this with a UDF:

 // Map a JSON string column ('blob') => { a (integer), b (string), c (json-string) }
 bigquery.defineFunction(
   'extractAndRepack',                  // Name of the function exported to SQL
   ['blob'],                            // Names of input columns
   [{'name': 'a', 'type': 'integer'},   // Output schema
    {'name': 'b', 'type': 'string'},
    {'name': 'c', 'type': 'string'}],
   function (row, emit) {
     var parsed = JSON.parse(row.blob);
     var repacked = JSON.stringify(parsed.c);
     emit({a: parsed.a, b: parsed.b, c: repacked});
   }
 );
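Since the row-handling logic is plain JavaScript, it can be exercised locally before deploying the UDF. A quick stand-alone check (the `emit` callback here is a simple collector standing in for BigQuery's, not part of its API):

```javascript
// The same per-row logic as the UDF body, runnable outside BigQuery.
function extractAndRepack(row, emit) {
  var parsed = JSON.parse(row.blob);
  var repacked = JSON.stringify(parsed.c); // keep "c" as packed JSON text
  emit({ a: parsed.a, b: parsed.b, c: repacked });
}

// Collect emitted rows into an array instead of a table.
const rows = [];
extractAndRepack(
  { blob: '{"a": 0, "b": "zero", "c": { "woodchuck": "a"}}' },
  (r) => rows.push(r)
);
console.log(rows[0]);
```

Note that `JSON.stringify(parsed.c)` is what keeps the dynamic-key section intact as a string, so it survives into the target STRING column unchanged.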

And the corresponding query:

 SELECT a, b, c FROM extractAndRepack(JsonAnyKey.raw) 

Now just run the query (selecting the desired destination table), and you'll have your data in the shape you want.

 Row   a   b      c
 1     0   zero   {"woodchuck":"a"}
 2     1   one    {"chipmunk":"b"}
 3     2   two    {"squirrel":"c"}
 4     3   three  {"chinchilla":"d"}
 5     4   four   {"capybara":"e"}
 6     5   five   {"housemouse":"f"}
 7     6   six    {"molerat":"g"}
 8     7   seven  {"marmot":"h"}
 9     8   eight  {"badger":"i"}
+3

One way to do this is to load the file as CSV instead of JSON (quoting values or escaping any newlines in the middle of records); it will then land in a single STRING column inside BigQuery.
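One way to guard against stray newlines breaking the one-record-per-line requirement is to re-serialize each record before writing the file: JSON.stringify emits no raw line breaks, escaping any newline inside a string value as \n. A minimal sketch (the `toSingleLine` helper is illustrative):

```javascript
// Collapse pretty-printed JSON to one physical line per record so the file
// can be imported as a one-column CSV.
function toSingleLine(jsonText) {
  // JSON.stringify emits no raw newlines; a newline inside a string
  // value comes out escaped as \n instead.
  return JSON.stringify(JSON.parse(jsonText));
}

const pretty = '{\n  "a": 0,\n  "b": "zero",\n  "c": { "woodchuck": "a" }\n}';
const line = toSingleLine(pretty);
console.log(line.includes("\n")); // false: one physical line
```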

P.S. You're right that a native JSON data type would make this scenario more natural, and the BigQuery team is well aware of it.

+1
