When fields keep changing, appearing, and disappearing, maintaining ETL pipelines can be quite cumbersome. Rockset was designed to solve this problem: it indexes every field in your JSON documents, along with all the type information, and provides a SQL API on top of that.
For example, starting with an empty Rockset collection named new_col, I can add a single document that looks like this:
{ "my-field":"xyz", "my-other-field":"mno" }
... and then query it:
rockset> select * from new_col;
+------------------+----------------------------------------+---------+------------+------------------+
| _event_time      | _id                                    | _meta   | my-field   | my-other-field   |
|------------------+----------------------------------------+---------+------------+------------------|
| 1542417599528000 | 70d268fb-fa00-40fe-879a-f18dd0732e4a-1 | {}      | xyz        | mno              |
+------------------+----------------------------------------+---------+------------+------------------+
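Because the field names contain hyphens, they have to be double-quoted when referenced in SQL. As a minimal sketch (the filter value here is just for illustration), a query against a specific field might look like:

rockset> select "my-field" from new_col where "my-other-field" = 'mno';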
Now, if a new JSON document arrives with some new fields - maybe some arrays, nested objects, etc. - I can still query it using SQL.
{ "my-field":"xyz2", "my-other-field":[ { "c1":"this", "c2":"field", "c3":"has", "c4":"changed" } ] }
I add this document to the same collection and can query it just as before.
rockset> select * from new_col;
+------------------+----------------------------------------+---------+------------+--------------------------------------------------------
| _event_time      | _id                                    | _meta   | my-field   | my-other-field
|------------------+----------------------------------------+---------+------------+--------------------------------------------------------
| 1542417764940000 | 3cf51333-ca2c-401b-9a15-1138a4c73ffe-1 | {}      | xyz2       | [{'c2': 'field', 'c1': 'this', 'c4': 'changed', 'c3': '
| 1542417599528000 | 70d268fb-fa00-40fe-879a-f18dd0732e4a-1 | {}      | xyz        | mno
+------------------+----------------------------------------+---------+------------+--------------------------------------------------------
I can also flatten the fields of nested objects and arrays at query time and build the table I actually want - without doing any transformations in advance.
rockset> select mof.* from new_col, unnest(new_col."my-other-field") as mof limit 10;
+------+-------+------+---------+
| c1   | c2    | c3   | c4      |
|------+-------+------+---------|
| this | field | has  | changed |
+------+-------+------+---------+
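Since the unnest behaves like a join against its parent row, top-level fields can be combined with the flattened ones in the same select list to shape the result however I need. A rough sketch reusing the same collection and field names (the particular column choice is just illustrative):

rockset> select new_col."my-field", mof.c1, mof.c4
         from new_col, unnest(new_col."my-other-field") as mof
         limit 10;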
On top of that, strict type information is stored for every field, which means mixed types and the like will not trip me up. Adding a third document:
{ "my-field":"xyz3", "my-other-field":[ { "c1":"unexpected", "c2":99, "c3":100, "c4":101 } ] }
The document is still added as expected.
rockset> select mof.* from new_col, unnest(new_col."my-other-field") as mof limit 10;
+------------+-------+------+---------+
| c1         | c2    | c3   | c4      |
|------------+-------+------+---------|
| unexpected | 99    | 100  | 101     |
| this       | field | has  | changed |
+------------+-------+------+---------+
... and the fields are strongly typed.
rockset> select typeof(mof.c2) from new_col, unnest(new_col."my-other-field") as mof limit 10;
+-----------+
| ?typeof   |
|-----------|
| int       |
| string    |
+-----------+
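And if I only want values of one of the types, the same typeof can go into the predicate - roughly like this (a sketch; I'm assuming typeof works in a where clause the same way it does in the select list):

rockset> select mof.c2 from new_col, unnest(new_col."my-other-field") as mof
         where typeof(mof.c2) = 'int';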
Full disclosure: I work at Rockset. There is a free tier if you want to try this out.
Edit 2: I wrote a blog post about this. See https://rockset.com/blog/running-sql-on-nested-json/