Store complex (e.g. labels + id) metadata in a SOLR document

I use SOLR to store documents containing some metadata that consists of several values, usually an identifier with a label. A simple example is the name of a city together with its unique identifier. The identifier is required because different cities may have the same name, such as Berlin in Germany and Berlin in the USA. The name is obviously required because I want to search by it.

If I use faceting, I would like to get back two facets labeled "Berlin". If I restrict my search (using another metadata field) to documents from Germany, I expect to get only one facet, for the German Berlin. Obviously, this will not work if I store the identifier and label in two separate SOLR fields.

I would assume that this is not an unusual requirement, but I could not find any useful information. My current approaches:

  • Implementing a full custom field type in Java: It’s hard for me to evaluate, because I am currently just a SOLR user, not a SOLR developer.

  • Put the identifier and label in the same string (for example, “123: Berlin” and “456: Berlin”) and define a custom field type in schema.xml with a custom analyzer that separates the value. This sounds reasonable to me, but I'm not 100% sure it will work with faceting.

  • I found some references to subfields, but only on older pages, and I was unable to find useful documentation.

Is there any known way to solve this problem in SOLR?

+8
lucene solr
4 answers

There seems to be no ready-made solution.

  • Your option #2 should work fine with some client-side changes.
  • You can index your data with id_name as one string field. This requires a change at index time; if you use the DataImportHandler (DIH), a transformer makes it easier.
  • You will then have a unique facet for each identifier, and the client can always split the facet values for display.

You can also check out facet pivots, which can provide a hierarchical view.
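
The combined-field idea can be sketched on the client side roughly as follows. This is only an illustration in Python: it assumes the "id: label" format from the question and assumes the facet counts arrive as a flat [value, count, ...] list, which is how Solr's JSON output usually lays them out. The function names and the counts are made up for the example.

```python
def combine(city_id, label):
    """Build the single string value to index, e.g. '123: Berlin'."""
    return f"{city_id}: {label}"


def split_facet(value):
    """Split a combined facet value back into (id, label) for display."""
    city_id, _, label = value.partition(": ")
    return city_id, label


def pairs(flat):
    """Turn a flat [value, count, value, count, ...] list into (value, count) pairs."""
    return list(zip(flat[::2], flat[1::2]))


# Hypothetical facet counts for the combined field (assumed flat layout).
facet_field = ["123: Berlin", 7, "456: Berlin", 2]

# Each facet stays distinct by id, while the label is available for display.
display = [(split_facet(value), count) for value, count in pairs(facet_field)]
```

Both entries would be shown to the user as "Berlin", while the id keeps them apart when building a follow-up filter query.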

+1

Pivot faceting can work.

Say you have fields: cityId, cityName, country

Build a pivot facet on city identifier and city name using the query parameter:

 facet.pivot=cityId,cityName 

At the first level, as with a standard facet, you will get each city identifier. At the second level you will get the name of each city. Given that each city identifier has only one name, you can simply read the name of each city from the next facet level (under the pivot element in the XML).

 <lst name="facet_pivot">
   <arr name="cityId,city">
     <lst>
       <str name="field">cityId</str>
       <str name="value">1</str>
       <int name="count">1</int>
       <arr name="pivot">
         <lst>
           <str name="field">city</str>
           <str name="value">berlin</str>
           <int name="count">1</int>
         </lst>
       </arr>
     </lst>
     <lst>
       <str name="field">cityId</str>
       <str name="value">2</str>
       <int name="count">1</int>
       <arr name="pivot">
         <lst>
           <str name="field">city</str>
           <str name="value">berlin</str>
           <int name="count">1</int>
         </lst>
       </arr>
     </lst>
     <lst>
       <str name="field">cityId</str>
       <str name="value">3</str>
       <int name="count">1</int>
       <arr name="pivot">
         <lst>
           <str name="field">city</str>
           <str name="value">melbourne</str>
           <int name="count">1</int>
         </lst>
       </arr>
     </lst>
   </arr>
 </lst>

Basically, if the identifier is unique, you are guaranteed to have only one pivot value at the second level.
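
Reading the name for each identifier out of the pivot result could look roughly like this. It is a sketch in Python that assumes the response was requested as JSON (wt=json) and that each pivot entry carries "field", "value", "count" and a nested "pivot" list, mirroring the XML above. The function name is hypothetical.

```python
def id_to_name(pivot_entries):
    """Map each first-level pivot value to its single second-level value."""
    mapping = {}
    for entry in pivot_entries:
        children = entry.get("pivot", [])
        if children:
            # The id is unique, so only one name is expected here.
            mapping[entry["value"]] = children[0]["value"]
    return mapping


# Sample data mirroring the XML response above, as parsed JSON.
sample = [
    {"field": "cityId", "value": "1", "count": 1,
     "pivot": [{"field": "city", "value": "berlin", "count": 1}]},
    {"field": "cityId", "value": "2", "count": 1,
     "pivot": [{"field": "city", "value": "berlin", "count": 1}]},
    {"field": "cityId", "value": "3", "count": 1,
     "pivot": [{"field": "city", "value": "melbourne", "count": 1}]},
]

names = id_to_name(sample)
```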

Optionally, if you want to group your “Berlins” together, just reverse the order of the pivot fields:

 facet.pivot=cityName,cityId 

and you will get “Berlin” at the first level and possibly several identifiers at the second level. As a bonus, you can add country as a third level, so that you can read the country for each city from there.

+2

That should work. If you add a filter query, for example fq=country_name:Germany, it should return facets only for cities in Germany. Please take a look at the example below:

Suppose you have 4 fields in your schema:

id, city_name, country_name, state_name

Sample data:

id: 1
city_name: Berlin
country_name: Germany
state_name: Some_State1

id: 2
city_name: Berlin
country_name: USA
state_name: Some_State2

id: 3
city_name: Dublin
country_name: Ireland
state_name: Some_State3

id: 4
city_name: Dublin
country_name: USA
state_name: California

id: 5
city_name: Dublin
country_name: USA
state_name: Virginia

If you want to get a facet for all cities named Dublin:

 /select/?q=*:*&facet=true&facet.field=country_name&facet.field=city_name&fq=city_name:Dublin 

As a result, the facet count for Dublin will be 3.


Now, if you want to get the facet for all cities named Dublin and limit the country to the USA, your query will look like this:

 /select/?q=*:*&facet=true&facet.field=country_name&facet.field=city_name&fq=city_name:Dublin&fq=country_name:USA 

As a result, the count for the Dublin facet will be 2, because there are two Dublins in the USA, one in California and the other in Virginia.

NOTE: I added &fq=country_name:USA
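
The two requests above differ only in the extra filter query. A small sketch of building them from a client, in Python, might look like this. The /select/ path and field names are taken from the examples above; the function name is hypothetical.

```python
from urllib.parse import urlencode


def build_facet_query(city, country=None):
    """Build the facet request, optionally restricted to one country."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", "country_name"),
        ("facet.field", "city_name"),
        ("fq", f"city_name:{city}"),
    ]
    if country is not None:
        # Each additional fq narrows the result set further.
        params.append(("fq", f"country_name:{country}"))
    # Keep '*' and ':' unescaped so the URL matches the examples above.
    return "/select/?" + urlencode(params, safe="*:")


all_dublins = build_facet_query("Dublin")
us_dublins = build_facet_query("Dublin", "USA")
```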

0

A pretty simple suggestion: use two fields at index time, populated via copyField, for values like "123: Berlin".

One is an unanalyzed, stored string field used for faceting, with parsing and cleanup done on the client side. The other is an indexed, non-stored copy used for searching, cleaned with a simple regular expression, e.g. PatternReplaceCharFilterFactory.

No custom parsers or new field types are needed, as you already indicated in your second approach.
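
The cleanup step on the searchable copy amounts to a regular-expression replacement. The sketch below shows the equivalent operation in Python rather than in Solr configuration, only to illustrate what such a pattern would do. The pattern and the function name are assumptions based on the "123: Berlin" format.

```python
import re

# Assumed pattern: a numeric id, a colon, and optional whitespace at the start.
ID_PREFIX = re.compile(r"^\d+:\s*")


def searchable_label(combined):
    """Strip the leading 'id: ' part so that only the label is searched."""
    return ID_PREFIX.sub("", combined)
```

With this pattern, "123: Berlin" is reduced to "Berlin", while a value without an id prefix is left unchanged.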

0