Best architecture for converting JSON to SQL?

I'm just wondering if anyone has any thoughts on converting the structure of a JSON document into SQL. This has to be done for data integration / storage.

The JSON fields are relatively static, but new "fields" can appear every 2-4 weeks.

Given the nature of this, and the conversion to SQL, my thought was to parse all the static fields into SQL columns. Fortunately, the "dynamic" fields are grouped in one section of the JSON document.

My idea was to simply dump this section of "dynamic" fields (which can contain 50 or 100 fields and, as noted, changes slowly) into one additional SQL column.

That way, at least the ETL process stays relatively static no matter how the JSON fields change.

Then a second layer, possibly a "view", would parse this giant column into its individual fields. I.e., the giant column might contain "color: red; status: open; city: Rome", and a set of string functions would parse it to fill in the color, status and city fields, probably in the view.
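Purely to illustrate the idea, here is a minimal sketch of what that view might look like, assuming MySQL, a made-up orders_raw table, and a blob column that always follows the "key: value; key: value" layout in a fixed order:

-- Hypothetical sketch of the "parse the giant column in a view" idea.
-- Assumes dynamic_blob always looks like "color: red; status: open; city: Rome"
-- with the fields in a fixed order; table and column names are invented.
CREATE VIEW orders_parsed AS
SELECT
  id,
  TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(dynamic_blob, ';', 1), ':', -1)) AS color,
  TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(dynamic_blob, ';', 2), ':', -1)) AS status,
  TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(dynamic_blob, ';', 3), ':', -1)) AS city
FROM orders_raw;

Note that this relies on the fields always appearing in the same order; keying on field names instead would take noticeably more string gymnastics.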

I'm not sure if this is crazy thinking or not. Another option would be to run MySQL statements on the fly (to add columns) as new fields turn up in the JSON documents, but that comes with its own set of problems.
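For what it's worth, that on-the-fly option would boil down to the loader issuing DDL like the following whenever it meets an unknown key (column name hypothetical):

-- Issued at run time by the loader when a new JSON key appears;
-- every new field becomes a schema change.
ALTER TABLE orders_raw ADD COLUMN warehouse_zone VARCHAR(64) NULL;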

Anyone have any thoughts on this?

And let's say the data is only ever inserted, never updated. In that case the parsing only has to happen once per row. Would a view still be the best option performance-wise? Or just another table?

+6
3 answers

Simple strategy: pull the fields that are fixed and that you know about out of the JSON. Put them in SQL tables.

Fields that you do not recognize, leave as JSON. If the database supports a JSON type, put them there. Otherwise, save them in a large string field.

Do not go parsing the JSON into anonymous fields, especially when the fields change every week (or so). Most databases these days support JSON to some degree, so you can let the database engine do the parsing when you query the data.
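As a minimal sketch of that split, assuming MySQL 5.7+ and made-up table and column names:

-- Known fields become real columns; everything else stays in a JSON column.
CREATE TABLE events (
  id     BIGINT PRIMARY KEY,
  color  VARCHAR(32),   -- fixed, well-known field
  status VARCHAR(32),   -- fixed, well-known field
  extra  JSON           -- everything you do not recognize yet
);

-- Let the engine parse the JSON at query time instead of at load time.
SELECT id, color, extra->>'$.city' AS city
FROM events
WHERE extra->>'$.priority' = 'high';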

+5

With lots of fields that change, get added / deleted, etc., maintaining ETL pipelines can become quite cumbersome. Rockset was designed to solve this problem: it indexes all the fields in your JSON documents, along with all the type information, and provides a SQL API on top of that.

For example, with a Rockset collection named new_col, I can start by adding a single document to the empty collection, which looks like this:

{ "my-field":"xyz", "my-other-field":"mno" } 

... and then request it.

rockset> select * from new_col;
+------------------+----------------------------------------+---------+------------+------------------+
| _event_time      | _id                                    | _meta   | my-field   | my-other-field   |
|------------------+----------------------------------------+---------+------------+------------------|
| 1542417599528000 | 70d268fb-fa00-40fe-879a-f18dd0732e4a-1 | {}      | xyz        | mno              |
+------------------+----------------------------------------+---------+------------+------------------+

Now, if a new JSON document arrives with some new fields - maybe with some arrays, nested fields, etc. - I can still query it using SQL.

 { "my-field":"xyz2", "my-other-field":[ { "c1":"this", "c2":"field", "c3":"has", "c4":"changed" } ] } 

I add this to the same collection, and I can query it just as before.

rockset> select * from new_col;
+------------------+----------------------------------------+---------+------------+--------------------------------------------------------
| _event_time      | _id                                    | _meta   | my-field   | my-other-field
|------------------+----------------------------------------+---------+------------+--------------------------------------------------------
| 1542417764940000 | 3cf51333-ca2c-401b-9a15-1138a4c73ffe-1 | {}      | xyz2       | [{'c2': 'field', 'c1': 'this', 'c4': 'changed', 'c3': '
| 1542417599528000 | 70d268fb-fa00-40fe-879a-f18dd0732e4a-1 | {}      | xyz        | mno
+------------------+----------------------------------------+---------+------------+--------------------------------------------------------

I can also flatten out fields from nested objects and arrays at query time and build the table I want to end up with, without having to do any transformations in advance.

rockset> select mof.* from new_col, unnest(new_col."my-other-field") as mof limit 10;
+------+-------+------+---------+
| c1   | c2    | c3   | c4      |
|------+-------+------+---------|
| this | field | has  | changed |
+------+-------+------+---------+

On top of this, strict type information is preserved, which means that mixed types and the like won't confuse things. Adding a third document:

 { "my-field":"xyz3", "my-other-field":[ { "c1":"unexpected", "c2":99, "c3":100, "c4":101 } ] } 

This still adds my document as expected.

rockset> select mof.* from new_col, unnest(new_col."my-other-field") as mof limit 10;
+------------+-------+------+---------+
| c1         | c2    | c3   | c4      |
|------------+-------+------+---------|
| unexpected | 99    | 100  | 101     |
| this       | field | has  | changed |
+------------+-------+------+---------+

... and the fields are strongly typed.

rockset> select typeof(mof.c2) from new_col, unnest(new_col."my-other-field") as mof limit 10;
+-----------+
| ?typeof   |
|-----------|
| int       |
| string    |
+-----------+

Full disclosure: I work at Rockset. There is a free tier if you want to try this out.

Edit 2 : I wrote a blog post about this. See https://rockset.com/blog/running-sql-on-nested-json/

+2

It sounds like you have a handle on the "static" fields. Have you considered a tag system for the "dynamic" fields? Perhaps a table that stores the value, with a foreign key to a master tag list (a list of all known tags, including value type definitions such as string, int, etc.) and a foreign key to the entity the field value belongs to? You would of course still have to maintain the ETL process for the well-known core tags, but this seems like it would make things a bit easier. When new tags show up, you can simply run some (hopefully) tested SQL transactions that add the new tags to your system and wire them back up in your application.
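As a rough sketch of that layout (all table and column names are made up):

-- Master list of known tags, with a declared value type.
CREATE TABLE tag (
  tag_id     INT PRIMARY KEY,
  name       VARCHAR(64) NOT NULL,
  value_type VARCHAR(16) NOT NULL   -- e.g. 'string', 'int', 'date'
);

-- One row per dynamic value, keyed to the tag and to the owning entity.
CREATE TABLE entity_tag_value (
  entity_id BIGINT NOT NULL,        -- FK to your main entity table
  tag_id    INT    NOT NULL,
  value     VARCHAR(255),           -- stored as text, cast according to value_type
  PRIMARY KEY (entity_id, tag_id),
  FOREIGN KEY (tag_id) REFERENCES tag (tag_id)
);

-- A new field discovered in the JSON then becomes a row, not a schema change.
INSERT INTO tag (tag_id, name, value_type) VALUES (42, 'warehouse-zone', 'string');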

Having said all that, I would most likely step back a bit, do some more design work, and come up with a strategy that handles this at the application level rather than trying to solve it at the persistence level. DDD + domain events, producer / consumer models, pub / sub, actor semantics, or some other strategy that deals with the problem going forward. It sounds like most of this could be tied to some maintenance screens in order to keep things consistent at the application level if you want to make changes and redefine some of your business objects / entities.

0
