I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match on anything in "info-for/media" in the url field e.g. http://mydomain.co.uk/info-for/media/press-release-1. To try and get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.

POST to localhost:9200/_search

{
    "query" : {
        "match_all" : { },
        "filtered" : {
            "filter" : {
                "regexp" : {
                    "url" : ".*info-for/media.*"
                }
            }
        }
    }
}

This returns 0 hits, but does parse correctly. .*info.* does get results containing the url, but unfortunately is too broad, e.g. matching any urls containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception, or no matches. Can anybody help explain what I'm doing wrong?

1 Answer

First, to the extent possible, try to never use regular expressions or wildcards that don't have a prefix. The way a search for .*foo.* is executed is that every single term in the index's dictionary is matched against the pattern, and the matching terms are then combined into an OR-query. This is O(n) in the number of unique terms in your corpus, and the subsequent search is quite expensive as well.
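To make the cost concrete, here is a rough Python sketch of that expansion step. It is a simplified model for illustration only, not Lucene's actual implementation:

```python
import re

def expand_wildcard(dictionary, pattern):
    """Simplified model: a non-prefixed wildcard/regexp is resolved by
    checking the pattern against EVERY term in the index's dictionary,
    then OR-ing the matches together. This scan is O(n) in the number
    of unique terms."""
    regex = re.compile(pattern)
    return [term for term in dictionary if regex.fullmatch(term)]

# A tiny pretend term dictionary:
terms = ["info", "info-for", "information", "media", "press-release-1"]

# Every term containing "info" matches, including "information" --
# which is exactly the "too broad" behaviour described in the question.
print(expand_wildcard(terms, ".*info.*"))
```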

This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/

Secondly, your url field is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no info-for/media term in the dictionary for the regexp to match.

What you probably want to do is index the path and the domain separately, using a path_hierarchy tokenizer to generate the terms.

Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis

That is, /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar and /foo, and the domain foo.example.com is tokenized to foo.example.com, example.com and com.
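The token generation can be sketched in a few lines of Python (illustrative only; the real tokenizer lives inside Elasticsearch/Lucene, where the domain variant corresponds to running path_hierarchy with a "." delimiter and reverse set to true):

```python
def path_hierarchy(path, delimiter="/"):
    """Emit one token per path prefix: /foo, /foo/bar, /foo/bar/baz."""
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def domain_hierarchy(domain):
    """Reversed variant: foo.example.com -> foo.example.com, example.com, com."""
    parts = domain.split(".")
    return [".".join(parts[i:]) for i in range(len(parts))]

print(path_hierarchy("/foo/bar/baz"))       # ['/foo', '/foo/bar', '/foo/bar/baz']
print(domain_hierarchy("foo.example.com"))  # ['foo.example.com', 'example.com', 'com']
```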

A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
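As a sketch of how that could look (the analyzer, type and field names here — path_analyzer, page, path — are made up for illustration, and the syntax targets the 1.x-era filtered query used in the question), the index settings might be:

```json
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "path_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "path_hierarchy"
                }
            }
        }
    },
    "mappings" : {
        "page" : {
            "properties" : {
                "path" : { "type" : "string", "analyzer" : "path_analyzer" }
            }
        }
    }
}
```

and the search would then replace the regexp filter with a term filter:

```json
{
    "query" : {
        "filtered" : {
            "query" : { "match_all" : { } },
            "filter" : {
                "term" : { "path" : "/info-for/media" }
            }
        }
    }
}
```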

4 Comments

A simpler option is to map this field as a multi field with a non-analyzed version, and run the regexp filter on the not-analyzed field. In general, the regexp filter makes more sense on a non analyzed field.
That'd still be a very expensive query to execute.
Thanks @AlexBrasetvik I'm having some difficulty POSTing a JSON version of the mapping/analyzer config to my index _settings endpoint. It can't find the analyzer I've declared. Sample JSON would be really helpful if you have it, thanks.
@AlexBrasetvik why would it still be expensive to execute regex on non_analyzed fields?
