Is it possible to have another timestamp as a dimension in Druid?

Is it possible to have a Druid data source with two (or more) time dimensions? I know that Druid is a time-series DB and I have no problem with that concept, but I would like to add another dimension that I can work with as a timestamp.

E.g. user retention: the metric is reported for a certain date, but I also need to build cohorts based on the users' registration dates and roll those dates up, possibly by weeks or months, or filter only on certain periods of time.

If this functionality is not supported, are there any plugins? Any dirty hacks?

+6
2 answers

Although I would prefer official support for timestamp dimensions in Druid, I found the dirty hack I was looking for.

DataSource Schema

First of all, I wanted to know how many users had registered on each day, with the ability to aggregate them by day / month / year.

Here is the data schema I used:

"dataSchema": { "dataSource": "ds1", "parser": { "parseSpec": { "format": "json", "timestampSpec": { "column": "timestamp", "format": "iso" }, "dimensionsSpec": { "dimensions": [ "user_id", "platform", "register_time" ], "dimensionExclusions": [], "spatialDimensions": [] } } }, "metricsSpec": [ { "type" : "hyperUnique", "name" : "users", "fieldName" : "user_id" } ], "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "DAY", "intervals": ["2015-01-01/2017-01-01"] } }, 

The data samples therefore look something like this (each record is one ingested event):

 {"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"} {"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"} 

As you can see, the "main" timestamp against which I compute the metrics is the timestamp field, while register_time is just a string dimension in ISO 8601 UTC format.

Aggregation

And now, for the fun part: I managed to aggregate by both timestamp (as a date) and register_time (again as a date), thanks to the timeFormat extraction function.

The query looks like this:

 { "intervals": "2016-01-20/2016-07-01", "dimensions": [ { "type": "extraction", "dimension": "register_time", "outputName": "reg_date", "extractionFn": { "type": "timeFormat", "format": "YYYY-MM-dd", "timeZone": "Europe/Bratislava" , "locale": "sk-SK" } } ], "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"}, "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}], "dataSource": "ds1", "queryType": "groupBy" } 

Filtering

The filtering solution is based on a JavaScript extraction function with which I can convert the date to a UNIX timestamp and use it inside (for example) a bound filter:

 { "intervals": "2016-01-20/2016-07-01", "dimensions": [ "platform", { "type": "extraction", "dimension": "register_time", "outputName": "reg_date", "extractionFn": { "type": "javascript", "function": "function(x) {return Date.parse(x)/1000}" } } ], "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"}, "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}], "dataSource": "ds1", "queryType": "groupBy" "filter": { "type": "bound", "dimension": "register_time", "outputName": "reg_date", "alphaNumeric": "true" "extractionFn": { "type": "javascript", "function": "function(x) {return Date.parse(x)/1000}" } } } 

I also tried filtering "directly" with a JavaScript filter, but I was not able to convince Druid to return the correct entries, even though I double-checked the function in various JavaScript REPLs; but hey, I'm no JavaScript expert.
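For reference, a JavaScript filter of that kind would look roughly like this (a sketch with an example cutoff date; as noted, I could not get this approach to return the right rows):

    "filter": {
      "type": "javascript",
      "dimension": "register_time",
      "function": "function(x) { return Date.parse(x) >= Date.parse('2016-01-01T00:00:00.000Z') }"
    }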

+6

Unfortunately, Druid has only one timestamp column that can be used for rollup. It currently treats all other columns as strings (except metrics, of course), so you can add additional string columns holding timestamp values, but the only thing you can do with them is filter. I suppose you could hack it that way. Hopefully Druid will allow different column types in the future, and perhaps timestamp will be one of them.
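For example, since ISO 8601 UTC strings sort lexicographically in chronological order, a plain bound filter with string bounds can already restrict such a column, with no extraction function at all (a sketch, assuming the register_time column from the other answer):

    "filter": {
      "type": "bound",
      "dimension": "register_time",
      "lower": "2016-01-01T00:00:00.000Z",
      "upper": "2016-02-01T00:00:00.000Z"
    }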

+2
