Elasticsearch Java API Terms aggregation weirdness

Question

I am currently indexing tags (industries) for an entity with a data structure like this:

industry: ["Consulting & Recruitment","Professional Services","Education & Training"]

I am applying a termsAggregation to the query as:

AggregationBuilders.terms("industry").field("industry");

What I expect to come out:

Key: "Consulting & Recruitment"
docCount: 100

What I actually get:

Key: "Consulting"
docCount: 100
Key: "Recruitment"
docCount: 100

Is there a way to correct this?

Thanks

I have deeply described this problem and given two solutions here.
– Vineeth Mohan, Commented Oct 9, 2015 at 10:32

bittusarkar · Accepted Answer · 2015-02-20 11:49:05Z

It looks like the field industry was indexed using the default analyzer, which breaks the input string at word boundaries and lowercases the pieces. In your case the indexed tokens would therefore be "consulting", "recruitment", "professional", "services", "education", and "training". A terms aggregation works on the tokens that are indexed, so it returns "consulting" and "recruitment" as separate keys instead of "Consulting & Recruitment". The fix is to make the field industry not analyzed (on a string field, "index": "not_analyzed" in the mapping). The values "Consulting & Recruitment", "Professional Services", and "Education & Training" will then be indexed as is and you'll get the expected results.
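To make the difference concrete, here is a small plain-Java sketch that does not use Elasticsearch at all. It assumes a crude stand-in for the default analyzer (split on anything that is not a letter or digit, then lowercase); the class name TermsBucketsSketch and both helper methods are made up for illustration, and the real Lucene analysis chain does more than this. It only shows why bucket keys differ between an analyzed and a not-analyzed field.

```java
import java.util.*;

class TermsBucketsSketch {
    // Crude approximation of the default analyzer (an assumption, not Lucene's
    // actual implementation): split on non-letter/digit runs, then lowercase.
    static List<String> analyze(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // For each indexed term, count how many documents contain it.
    // A document contributes at most once per term, like docCount does.
    static Map<String, Integer> termsBuckets(List<List<String>> docs, boolean analyzed) {
        Map<String, Integer> buckets = new TreeMap<>();
        for (List<String> doc : docs) {
            Set<String> terms = new HashSet<>();
            for (String value : doc) {
                if (analyzed) {
                    terms.addAll(analyze(value)); // value is split into several terms
                } else {
                    terms.add(value);             // value is indexed as one term, as is
                }
            }
            for (String term : terms) {
                buckets.merge(term, 1, Integer::sum);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("Consulting & Recruitment", "Professional Services"),
            Arrays.asList("Consulting & Recruitment", "Education & Training"));

        System.out.println("analyzed:     " + termsBuckets(docs, true));
        System.out.println("not analyzed: " + termsBuckets(docs, false));
    }
}
```

With the analyzed variant, "Consulting & Recruitment" never appears as a key because it was never stored as a single term; with the not-analyzed variant each tag is its own bucket.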

Milind J · Accepted Answer · 2015-02-20 21:19:14Z

Check the analyzer for this field; I believe it's set to Standard or similar. Your content is therefore broken down into words that become tokens, and punctuation such as '&' is dropped rather than kept as a token/key when aggregating.

Elasticsearch indexes your documents with these tokens ('consulting', 'recruitment'). Given how Elasticsearch is designed to work, this behavior is expected: when searching by the keyword 'consulting', ES returns the documents containing that keyword, ranked by relevance score.

If you insist on getting "Consulting & Recruitment" as a whole key or token, then you need to stop the tokenizer from splitting it into multiple terms.

Look at the pattern tokenizer to customize how values are split into tokens. It's like designing a tokenizer that treats "Consulting & Recruitment" as one big word; your tokens would then be less well defined, so your search might suffer.

One solution is to change the format of your data: use an industry-type-code field holding a short code for each industry, and a separate industry-name field holding the text content. Index industry-type-code with the standard analyzer (provided each code is a single alphanumeric word, it stays a single token) and industry-name as an additional analyzed field. For normal search operations use the field industry-name; for aggregations use the field industry-type-code.

{
  "mappings": {
    "industries" : {
      "properties" : {
        "industry-type-code" : {
          "type" :    "string",
          "analyzer": "standard"
        },
        "industry-name" : {
          "type" :   "string",
          "analyzer": "standard"
        }
      }
    }
  }
}
