How can we deal with NULL values that have a specific meaning?

Question

I am trying to store a boolean in Elasticsearch, but the field is explicitly allowed to be NULL. In this case, NULL means "don't care".

There seem to be several options, but it’s not entirely clear what would be best.

We are using Elasticsearch version 5.0.2.

Option 1

It would be trivial to save it as a boolean with NULL values. Those will be treated as "missing" by ES.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_boolean": { "type": "boolean" }
      }
    }
  }
}

PUT my_index/my_type/1
{ "my_boolean": true }

PUT my_index/my_type/2
{ "my_boolean": false }

PUT my_index/my_type/3
{ "my_boolean": null }

This has several problems, one of which is aggregations. There does not seem to be an easy way to get the true, false and NULL values in a single aggregation.

I know about the missing aggregation, so I know that I can do the following:

GET my_index/_search
{
  "size": 0,
  "aggregations": {
    "my_boolean": {
      "terms": { "field": "my_boolean" }
    },
    "missing_fields": {
      "missing": { "field": "my_boolean" }
    }
  }
}

But this produces a terms aggregation with two buckets (true / false) and a separate count of the documents where the field is missing. That split seems likely to cause problems.
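
As a rough illustration of what a client would have to do with that split result, here is a minimal Python sketch that folds the two aggregations back into one tri-state count. The function name `tri_state_counts` and the `sample` response fragment are hypothetical; the sketch assumes the boolean buckets expose `key` as 1/0 and `key_as_string` as "true"/"false", and it accepts either form.

```python
from typing import Any, Dict, Optional


def tri_state_counts(aggregations: Dict[str, Any]) -> Dict[Optional[bool], int]:
    """Combine the terms buckets and the missing count into one mapping."""
    counts: Dict[Optional[bool], int] = {True: 0, False: 0, None: 0}
    for bucket in aggregations["my_boolean"]["buckets"]:
        # Assumption: a boolean bucket is identified by key 1/0 or by
        # key_as_string "true"/"false"; anything else counts as false.
        label = str(bucket.get("key_as_string", bucket["key"])).lower()
        counts[label in ("true", "1")] += bucket["doc_count"]
    # The missing aggregation reports the documents without a value.
    counts[None] = aggregations["missing_fields"]["doc_count"]
    return counts


# Hypothetical response fragment for three documents: true, false, null.
sample = {
    "my_boolean": {
        "buckets": [
            {"key": 1, "key_as_string": "true", "doc_count": 1},
            {"key": 0, "key_as_string": "false", "doc_count": 1},
        ]
    },
    "missing_fields": {"doc_count": 1},
}

print(tri_state_counts(sample))
```

Every consumer of the aggregation would need merging logic of this kind, which is the awkwardness described above.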

Option 2

Another option is to actually give NULL a value, as described in the manual under null_value. The problem is that the value must be of the correct type, and a boolean has nothing besides true and false.

The null_value needs to be the same data type as the field. For example, a long field cannot have a string null_value.

This means we could use another type that supports more than two values, for example an integer, but in my head that is the same as saying: let's map it as an integer, and define 1 as true, 2 as false and 3 as null. This will work, but we will have an implicit mapping that everyone needs to know about (all producers / consumers / whatyamahaveits).
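
One way to make such an implicit mapping at least explicit in client code is a shared enum. The following is a minimal sketch, not something from the question: the names `TriState`, `encode` and `decode` are hypothetical, and the codes 1, 2 and 3 simply follow the example in the paragraph above.

```python
from enum import IntEnum
from typing import Optional


class TriState(IntEnum):
    """Hypothetical shared encoding: 1 = true, 2 = false, 3 = null."""
    TRUE = 1
    FALSE = 2
    NULL = 3


def encode(value: Optional[bool]) -> int:
    """Turn a Python tri-state value into the integer stored in the index."""
    if value is None:
        return int(TriState.NULL)
    return int(TriState.TRUE if value else TriState.FALSE)


def decode(code: int) -> Optional[bool]:
    """Turn a stored integer back into True, False or None."""
    state = TriState(code)  # raises ValueError for unknown codes
    if state is TriState.NULL:
        return None
    return state is TriState.TRUE


print([encode(v) for v in (True, False, None)])
```

This does not remove the convention; it only gives producers and consumers a single place to look it up.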

Option 3

The final option is to try to script our way out of this problem.

GET my_index/_search
{
  "size": 0,
  "aggregations": {
    "my_boolean": {
      "terms": {
        "script": {
          "inline": "if(doc['my_boolean'].length === 1) { if(doc['my_boolean'].value === true){ return 1;} else {return 2;} } else { return 3;}"
        }
      }
    }
  }
}

Now we get the right results, each state in its own bucket.

"aggregations": {
  "my_boolean": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "1", "doc_count": 1 },
      { "key": "2", "doc_count": 1 },
      { "key": "3", "doc_count": 1 }
    ]
  }
}

Note that we still have an implicit mapping of keys here, so this option seems to have the same problem as mapping the field as an integer. Nevertheless, the stored data type is what it should be, so that may count for something. Also note that we cannot have a bucket with the key null. We could name the keys "true", "false" and "null" (as strings), but that is the same situation, only hidden even better.
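
To show where that implicit mapping ends up on the consumer side, here is a small Python sketch that relabels the script's bucket keys. `KEY_LABELS` and `label_buckets` are hypothetical names, and the table is assumed to mirror the script above (1 = true, 2 = false, 3 = null).

```python
from typing import Any, Dict, List

# Assumed to mirror the script: 1 = true, 2 = false, 3 = null/missing.
KEY_LABELS = {"1": "true", "2": "false", "3": "null"}


def label_buckets(buckets: List[Dict[str, Any]]) -> Dict[str, int]:
    """Translate the script's numeric bucket keys into readable labels."""
    labelled = {label: 0 for label in KEY_LABELS.values()}
    for bucket in buckets:
        # An unknown key raises KeyError, so a changed script is noticed.
        labelled[KEY_LABELS[str(bucket["key"])]] += bucket["doc_count"]
    return labelled


# Buckets as returned in the response shown above.
buckets = [
    {"key": "1", "doc_count": 1},
    {"key": "2", "doc_count": 1},
    {"key": "3", "doc_count": 1},
]

print(label_buckets(buckets))
```

The lookup table has to be kept in sync with the script by hand, which is the hidden coupling the note describes.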

Question

What is the best way to deal with this NULL problem? (Or maybe we should call it a "three-state boolean" problem?)

To clarify: we fear that a "non-standard" value may cause problems later on. The first one we ran into was bucketing, which we could solve with the script solution, but we may encounter other problems later. Therefore, we are looking for a best practice for storing this type of data, rather than a quick fix for one specific problem.

+7
null elasticsearch tri-state-logic
2 answers

In the end, we decided to map the various states to a byte.

A missing value only works if the type is capable of holding that value, so we would have to remap the field anyway. We therefore add the extra number at insertion time.

Thus, instead of a boolean with the values true, false and null, or an integer with the values 1, 2 and null (with missing = -1), we use a byte with the values 1, 2 and 3, which stand for (in no particular order) true, false and null.

0

You can use the missing parameter of the terms aggregation (that is, not the separate missing aggregation).

That way you can keep using your boolean field and get your three buckets with the keys 0, 1 and -1 (for null).

{
  "size": 0,
  "aggregations": {
    "my_boolean": {
      "terms": {
        "field": "my_boolean",
        "missing": -1      <--- add this
      }
    }
  }
}

This avoids the drawback of changing the field's type and encoding it as some other data type (integer / string), and it also frees you from using scripts, which will not scale very well.
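
For completeness, here is a minimal Python sketch of how a client might build that request body and interpret the bucket keys. The helper names `build_query` and `bucket_to_state` are hypothetical, and the key convention (1 = true, 0 = false, -1 = null) is taken from this answer rather than verified against a cluster.

```python
import json
from typing import Any, Dict, Optional

MISSING_KEY = -1  # sentinel value suggested in this answer


def build_query(field: str = "my_boolean") -> Dict[str, Any]:
    """Build a terms aggregation that puts documents without a value
    into their own bucket via the "missing" parameter."""
    return {
        "size": 0,
        "aggregations": {
            field: {"terms": {"field": field, "missing": MISSING_KEY}}
        },
    }


def bucket_to_state(key: int) -> Optional[bool]:
    """Assumes bucket keys 1 (true), 0 (false) and MISSING_KEY (null)."""
    if key == MISSING_KEY:
        return None
    return bool(key)


print(json.dumps(build_query(), indent=2))
```

The sentinel lives only in the query, so the mapping and the stored documents stay plain booleans.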

+3