Question
I am trying to store a boolean field in Elasticsearch, but the field is explicitly allowed to be NULL. In our case, NULL means "it doesn't matter".
There seem to be several options, but it’s not entirely clear what would be best.
We are using Elasticsearch version 5.0.2.
Option 1
It would be trivial to save it as a boolean with NULL values. Those will be considered “missing” by ES.
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_boolean": { "type": "boolean" }
      }
    }
  }
}

PUT my_index/my_type/1
{ "my_boolean": true }

PUT my_index/my_type/2
{ "my_boolean": false }

PUT my_index/my_type/3
{ "my_boolean": null }
This has several problems, one of which is aggregations. There does not seem to be an easy way to get the true, false and NULL values in a single aggregation.
I know about the missing aggregation, so I know that I can do the following:
GET my_index/_search
{
  "size": 0,
  "aggregations": {
    "my_boolean": {
      "terms": { "field": "my_boolean" }
    },
    "missing_fields": {
      "missing": { "field": "my_boolean" }
    }
  }
}
But this gives one terms aggregation with two buckets (true / false) and a separate count of the documents where the field is missing. Having the three values split over two aggregations seems like it will cause problems.
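For what it is worth, the two results can be stitched together on the client. The following is only a minimal Python sketch, not something Elasticsearch provides: the function name `merge_tristate_counts` is made up, the sample dict is hand-written to resemble the `aggregations` part of the response (it is not real output), and it assumes boolean buckets carry a `key_as_string` of "true" / "false".

```python
from typing import Any, Dict, Optional


def merge_tristate_counts(aggs: Dict[str, Any]) -> Dict[Optional[bool], int]:
    """Combine the 'my_boolean' terms buckets with the 'missing_fields' count."""
    counts: Dict[Optional[bool], int] = {True: 0, False: 0, None: 0}
    for bucket in aggs["my_boolean"]["buckets"]:
        # Assumption: boolean buckets expose "key_as_string" ("true"/"false");
        # fall back to the raw key (1/0) if that field is absent.
        label = str(bucket.get("key_as_string", bucket["key"])).lower()
        counts[label in ("true", "1")] += bucket["doc_count"]
    # Documents without a value are counted by the separate missing aggregation.
    counts[None] = aggs["missing_fields"]["doc_count"]
    return counts


# Hand-written sample shaped like the "aggregations" object (illustrative only).
sample = {
    "my_boolean": {
        "buckets": [
            {"key": 1, "key_as_string": "true", "doc_count": 1},
            {"key": 0, "key_as_string": "false", "doc_count": 1},
        ]
    },
    "missing_fields": {"doc_count": 1},
}

print(merge_tristate_counts(sample))
```

This works, but every consumer of the aggregation has to repeat the merge, which is exactly the kind of hidden convention we would like to avoid.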
Option 2
Another option is to set a null_value for the field, as described in the manual. The problem is that the value must be of the correct type, and a boolean has nothing but true and false.
The null_value needs to be the same datatype as the field. For example, a long field cannot have a string null_value.
This means we could use another type that supports more than two values, for example an integer. In my head, though, that is the same as saying: let's map it as an integer and define 1 as true, 2 as false and 3 as null. This will work, but we would have an implicit mapping that everyone needs to know about (all producers / consumers / whathaveyous).
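To make the concern concrete, this is roughly what every producer and consumer would have to share if we went with the integer encoding. It is a hedged sketch only: `TriState`, `encode` and `decode` are hypothetical names, and the codes 1 / 2 / 3 are simply the ones from the paragraph above.

```python
from enum import IntEnum
from typing import Optional


class TriState(IntEnum):
    """Hypothetical shared encoding: 1 = true, 2 = false, 3 = null."""
    TRUE = 1
    FALSE = 2
    NULL = 3


def encode(value: Optional[bool]) -> int:
    """Turn a nullable boolean into the integer stored in the index."""
    if value is None:
        return int(TriState.NULL)
    return int(TriState.TRUE if value else TriState.FALSE)


def decode(code: int) -> Optional[bool]:
    """Turn a stored integer back into a nullable boolean."""
    state = TriState(code)  # raises ValueError for any code outside 1..3
    if state is TriState.NULL:
        return None
    return state is TriState.TRUE


print([encode(v) for v in (True, False, None)])
print([decode(c) for c in (1, 2, 3)])
```

Putting the mapping in one shared module at least makes it explicit, but anything that reads the index without that module still sees bare integers.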
Option 3
The final option would be to try to script our way out of this problem.
GET my_index/_search
{
  "size": 0,
  "aggregations": {
    "my_boolean": {
      "terms": {
        "script": {
          "inline": "if(doc['my_boolean'].length === 1) { if(doc['my_boolean'].value === true){ return 1;} else {return 2;} } else { return 3;}"
        }
      }
    }
  }
}
Now we get the right results in a single, consistent set of buckets.
"aggregations": {
  "my_boolean": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "1", "doc_count": 1 },
      { "key": "2", "doc_count": 1 },
      { "key": "3", "doc_count": 1 }
    ]
  }
}
Note that we still have an implicit mapping in the keys here, so this seems to have the same problem as mapping the field as an integer. Still, the stored data type is what it should be, so that may be worth something. Note also that we cannot have a bucket with the key null. We could name the buckets "true", "false" and "null" (as strings), but that is the same situation, only hidden even more.
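On the client side, the bucket keys still have to be translated back. A minimal Python sketch of that step, assuming the keys "1" / "2" / "3" produced by the script above (`KEY_TO_VALUE` and `decode_script_buckets` are made-up names for illustration):

```python
from typing import Any, Dict, List, Optional

# Implicit convention from the script: "1" = true, "2" = false, "3" = no value.
KEY_TO_VALUE: Dict[str, Optional[bool]] = {"1": True, "2": False, "3": None}


def decode_script_buckets(buckets: List[Dict[str, Any]]) -> Dict[Optional[bool], int]:
    """Map the scripted bucket keys back to True / False / None counts."""
    return {KEY_TO_VALUE[str(b["key"])]: b["doc_count"] for b in buckets}


# The buckets from the response shown above.
buckets = [
    {"key": "1", "doc_count": 1},
    {"key": "2", "doc_count": 1},
    {"key": "3", "doc_count": 1},
]

print(decode_script_buckets(buckets))
```

The lookup table is the implicit mapping again, just moved out of the index and into every client.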
Question
What is the best way to deal with this null problem? (Or maybe we should call it a "three-state boolean problem"?)
To clarify: we fear that a "non-standard" value like this may cause problems later on. The first thing we ran into was bucketing, which we could solve with the script approach, but we may well run into other problems later. Therefore, we are looking for a best practice for storing this type of data, rather than a quick fix for one specific problem.