Simple elasticsearch regexp

Question

I'm trying to write a query that will give me all the documents where the field "id" is of the form "SOMETHING-SOMETHING-4SOMETHING-SOMETHING-SOMETHING".

For instance, ab-ba-4a-b-a is a valid id.

I wrote this query:

{
  "query": {
    "regexp": {
      "id": {
        "value": ".*-.*-4.*-.*-.*"
      }
    }
  }
}

It gets no hits. What's wrong with this? I can see many ids of this form.

Could you also let me know which version of ES you are using?
– Kamal Kunjapur, Commented Jul 6, 2020 at 14:45
I'm using 7.8, but the answer you wrote was great! You shouldn't have deleted it. It works.
– Oria Gruber, Commented Jul 6, 2020 at 14:52
I've posted it once again. I thought that if you were using version 2.x, I might have to modify it a bit. But I'm happy your query has been resolved!
– Kamal Kunjapur, Commented Jul 6, 2020 at 14:53

Kamal Kunjapur · Accepted Answer · 2020-07-06 14:44:37Z

If the id field is of type keyword, the regexp query should work fine.

However, if it is of type text, notice how Elasticsearch breaks the value into tokens before storing it.

POST /_analyze
{
  "text": "abc-abc-4bc-abc-abc",
  "analyzer": "standard"
}

Response:

{
  "tokens" : [
    {
      "token" : "abc",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "abc",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "4bc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "abc",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "abc",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

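Notice that the analyzer breaks the value abc-abc-4bc-abc-abc into five separate tokens, none of which contains a hyphen. A regexp query is matched against the individual terms in the index, and the pattern has to match a whole term, so .*-.*-4.*-.*-.* cannot match any of them. Take a look at what Analysis and Analyzers are and how they are applied only to text fields.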

The keyword datatype, by contrast, exists for the cases where you do not want your text to be analyzed (i.e. broken into tokens before it goes into the inverted index); it stores the string value internally as it is.
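To make the difference concrete, here is a rough Python sketch. It is only an approximation: it assumes the standard analyzer can be imitated by lowercasing and splitting on non-alphanumeric characters (the real analyzer follows Unicode text segmentation rules), and it uses re.fullmatch to imitate the fact that a regexp query has to match an entire indexed term. The helper names are made up for illustration.

```python
import re

PATTERN = r".*-.*-4.*-.*-.*"

def approx_standard_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def any_term_matches(terms, pattern=PATTERN):
    """A regexp query must match a whole indexed term, hence fullmatch."""
    return any(re.fullmatch(pattern, term) for term in terms)

value = "abc-abc-4bc-abc-abc"

text_terms = approx_standard_tokens(value)  # roughly what a text field indexes
keyword_terms = [value]                     # a keyword field indexes the value as one term

print(text_terms)                       # ['abc', 'abc', '4bc', 'abc', 'abc']
print(any_term_matches(text_terms))     # False
print(any_term_matches(keyword_terms))  # True
```

None of the five short tokens contains a hyphen, so the pattern can only match the single untokenized term.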

Now, if your mapping is dynamic, ES by default creates two fields for string values: a text field and its keyword sibling, something like below:

{
  "mappings" : {
    "properties" : {
      "id" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}

In that case, just apply the query you have to the id.keyword field:

POST <your_index_name>/_search
{
  "query": {
    "regexp": {
      "id.keyword": ".*-.*-4.*-.*-.*"
    }
  }
}
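If you are not sure which case applies to your index, the choice between id and id.keyword can be read off the mapping. Below is a minimal Python sketch of that decision. It assumes the mapping has the same shape as the example above (in a real GET _mapping response this body is nested under the index name), and regexp_field is a hypothetical helper, not part of any Elasticsearch client.

```python
def regexp_field(mapping, field):
    """Return the name of a keyword (untokenized) field to run a regexp on, or None."""
    props = mapping.get("mappings", {}).get("properties", {})
    spec = props.get(field, {})
    if spec.get("type") == "keyword":
        return field  # the field itself is stored as a single term
    # Otherwise look for a keyword sub-field, e.g. id.keyword
    for sub_name, sub_spec in spec.get("fields", {}).items():
        if sub_spec.get("type") == "keyword":
            return f"{field}.{sub_name}"
    return None

# Same shape as the dynamic mapping shown in the answer
dynamic_mapping = {
    "mappings": {
        "properties": {
            "id": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
        }
    }
}

target = regexp_field(dynamic_mapping, "id")
query = {"query": {"regexp": {target: {"value": ".*-.*-4.*-.*-.*"}}}}
print(target)  # id.keyword
```

Keep in mind that with "ignore_above": 256, values longer than 256 characters are not indexed in the keyword sub-field, so very long ids would not be found this way.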

Hope that helps!
