I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match on anything in "info-for/media" in the url field e.g. http://mydomain.co.uk/info-for/media/press-release-1. To try and get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.

POST to localhost:9200/_search

{
    "query" : {
        "match_all" : { },
        "filtered" : {
            "filter" : {
                "regexp" : {
                    "url" : ".*info-for/media.*"
                }
            }
        }
    }
}

This returns 0 hits, but does parse correctly. .*info.* does get results containing the url, but unfortunately is too broad, e.g. matching any urls containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception, or no matches. Can anybody help explain what I'm doing wrong?

1 Answer

First, to the extent possible, try to never use regular expressions or wildcards that don't have a prefix. The way a search for .*foo.* is executed is that every single term in the index's dictionary is matched against the pattern, and the matching terms are then combined into an OR-query. This is O(n) in the number of unique terms in your corpus, and the subsequent search is quite expensive as well.
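To make the cost concrete, here is a rough Python sketch of that expansion step. It is a simplified model for illustration only, not Lucene's actual implementation:

```python
import re

def expand_wildcard(dictionary, pattern):
    """Simplified model: a non-prefixed wildcard/regexp is resolved by
    checking the pattern against EVERY term in the index's dictionary,
    then OR-ing the matches together. This scan is O(n) in the number
    of unique terms."""
    regex = re.compile(pattern)
    return [term for term in dictionary if regex.fullmatch(term)]

# A tiny pretend term dictionary:
terms = ["info", "info-for", "information", "media", "press-release-1"]

# Every term containing "info" matches, including "information" --
# which is exactly the "too broad" behaviour described in the question.
print(expand_wildcard(terms, ".*info.*"))
```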

This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/

Secondly, your url field is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no info-for/media term in the dictionary for the regexp to match.

What you probably want to do is index the path and the domain separately, using a path_hierarchy tokenizer to generate the terms.

Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis

That is, /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar and /foo, and the domain foo.example.com is tokenized to foo.example.com, example.com and com.
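The token generation can be sketched in a few lines of Python (illustrative only; the real tokenizer lives inside Elasticsearch/Lucene, where the domain variant corresponds to running path_hierarchy with a "." delimiter and reverse set to true):

```python
def path_hierarchy(path, delimiter="/"):
    """Emit one token per path prefix: /foo, /foo/bar, /foo/bar/baz."""
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def domain_hierarchy(domain):
    """Reversed variant: foo.example.com -> foo.example.com, example.com, com."""
    parts = domain.split(".")
    return [".".join(parts[i:]) for i in range(len(parts))]

print(path_hierarchy("/foo/bar/baz"))       # ['/foo', '/foo/bar', '/foo/bar/baz']
print(domain_hierarchy("foo.example.com"))  # ['foo.example.com', 'example.com', 'com']
```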

A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
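As a sketch of how that could look (the analyzer, type and field names here — path_analyzer, page, path — are made up for illustration, and the syntax targets the 1.x-era filtered query used in the question), the index settings might be:

```json
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "path_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "path_hierarchy"
                }
            }
        }
    },
    "mappings" : {
        "page" : {
            "properties" : {
                "path" : { "type" : "string", "analyzer" : "path_analyzer" }
            }
        }
    }
}
```

and the search would then replace the regexp filter with a term filter:

```json
{
    "query" : {
        "filtered" : {
            "query" : { "match_all" : { } },
            "filter" : {
                "term" : { "path" : "/info-for/media" }
            }
        }
    }
}
```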

4 Comments

A simpler option is to map this field as a multi field with a non-analyzed version, and run the regexp filter on the not-analyzed field. In general, the regexp filter makes more sense on a non analyzed field.
That'd still be a very expensive query to execute.
Thanks @AlexBrasetvik I'm having some difficulty POSTing a JSON version of the mapping/analyzer config to my index _settings endpoint. It can't find the analyzer I've declared. Sample JSON would be really helpful if you have it, thanks.
@AlexBrasetvik why would it still be expensive to execute regex on non_analyzed fields?
