1

I have some text in Elasticsearch containing URLs in various formats (http://www..., www....). What I want to do is search for all texts containing, e.g., google.com.

For the current search I use something like this query:

query = {
    "query": {
        "bool": {
            "must": [
                # restrict to documents with dfrom < cdate <= dto
                {"range": {"cdate": {"gt": dfrom, "lte": dto}}},
                # user-supplied query string; terms are AND-ed by default
                {
                    "query_string": {
                        "default_operator": "AND",
                        "default_field": "text",
                        "analyze_wildcard": True,
                        "query": searchString,
                    }
                },
            ]
        }
    }
}

But a query such as google.com never returns any results, whereas searching for a term such as test (without quotes) works fine. I do want to use query_string because I'd like to use boolean operators, but I really need to be able to search for substrings, not only whole words.

Thank you !

2
  • What is the mapping of your url field? Commented Jan 19, 2016 at 21:42
  • "text" is just a text field. Commented Jan 19, 2016 at 22:28

2 Answers

1

It is true indeed that http://www.google.com will be tokenized by the standard analyzer into http and www.google.com and thus google.com will not be found.

So the standard analyzer alone will not help here; we need a token filter that correctly transforms URL tokens. If your text field contained only URLs, another option would have been the UAX Email URL tokenizer, but since the field can contain any other text (e.g. user comments), it won't work.

Fortunately, there's a new plugin around called analysis-url which provides a URL token filter, and this is exactly what we need (after a small modification I begged for, thanks @jlinn ;-) )

First, you need to install the plugin:

bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.2.0/elasticsearch-analysis-url-2.2.0.zip

Then, we can start playing. We need to create the proper analyzer for your text field:

curl -XPUT localhost:9200/test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true,
          "passthrough": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": [
            "url_host"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "url_host"
        }
      }
    }
  }
}'

With this analyzer and mapping, we can properly index the host you want to be able to search for. For instance, let's analyze the string blabla bla http://www.google.com blabla using our new analyzer.

curl -XGET 'localhost:9200/test/_analyze?analyzer=url_host&pretty' -d 'blabla bla http://www.google.com blabla'

We'll get the following tokens:

{
  "tokens" : [ {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "www.google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 5
  } ]
}

As you can see the http://www.google.com part will be tokenized into:

  • www.google.com
  • google.com i.e. what you expected
  • com

So now if your searchString is google.com you'll be able to find all the documents which have a text field containing google.com (or www.google.com).
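To make the effect of the filter more concrete, here is a rough pure-Python emulation of the token stream shown above. It is only a sketch of the observable output, not the plugin's actual implementation: the names host_tokens and analyze are made up for this illustration, and treating only tokens that contain "://" as URLs is an assumption.

```python
from urllib.parse import urlsplit


def host_tokens(token):
    """Return the host of a URL token plus each of its parent domains.

    Non-URL tokens are passed through unchanged (assumption: only tokens
    containing "://" are treated as URLs in this sketch).
    """
    if "://" not in token:
        return [token]
    host = urlsplit(token).hostname
    if not host:
        return [token]
    labels = host.split(".")
    # "www.google.com" -> ["www.google.com", "google.com", "com"]
    return [".".join(labels[i:]) for i in range(len(labels))]


def analyze(text):
    """Whitespace-tokenize the text, then expand every URL token."""
    tokens = []
    for word in text.split():
        tokens.extend(host_tokens(word))
    return tokens


tokens = analyze("blabla bla http://www.google.com blabla")
print(tokens)
# ['blabla', 'bla', 'www.google.com', 'google.com', 'com', 'blabla']
```

Because google.com ends up in the token list as a term of its own, an exact term lookup for it succeeds without any wildcard.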

1 Comment

@pinas Were you able to try this out?
0

Full-text search always relies on exact term matches in the inverted index, unless you perform a wildcard search, which forces a traversal of the terms in the index. Using a wildcard at the beginning of your query string leads to a full traversal of the index and is not recommended.
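The difference between the two lookups can be illustrated with a toy inverted index in Python. This is a deliberately simplified sketch: it splits on whitespace only (which is not what the standard analyzer does), the function names are hypothetical, and a real Lucene index uses far more efficient term dictionaries than a plain dict.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def build_inverted_index(docs):
    """Map each whitespace-separated term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def term_search(index, term):
    """Exact match: a single dictionary lookup."""
    return set(index.get(term, set()))


def wildcard_search(index, pattern):
    """Wildcard match: every term in the index is tested against the pattern."""
    hits = set()
    for term, doc_ids in index.items():
        if fnmatchcase(term, pattern):
            hits |= doc_ids
    return hits


docs = {1: "see http://www.google.com for details", 2: "just a test"}
index = build_inverted_index(docs)

print(term_search(index, "test"))              # {2}
print(term_search(index, "google.com"))        # set()
print(wildcard_search(index, "*google.com*"))  # {1}
```

The exact lookup for google.com finds nothing because no indexed term equals it, while the leading-wildcard search finds the document only by scanning every term.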

Consider indexing not just the URL but also the domain (obtained by stripping off the protocol, the subdomain and anything following the domain) in a field that uses the Keyword Tokenizer. Then you can search for domains against this field.
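A minimal sketch of how the domains could be pulled out of free text before indexing, using only the Python standard library. The function name extract_domains and the regular expression are assumptions made for this example. Only a leading "www." is stripped, since removing arbitrary subdomains correctly would need a public suffix list.

```python
import re
from urllib.parse import urlsplit

# Matches "http://...", "https://..." and bare "www...." up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def extract_domains(text):
    """Return the distinct host names of all URLs in text, in order of appearance."""
    domains = []
    for match in URL_PATTERN.findall(text):
        match = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        # urlsplit only fills in the hostname when a scheme is present.
        candidate = match if "://" in match else "http://" + match
        try:
            host = urlsplit(candidate).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if not host:
            continue
        if host.startswith("www."):
            host = host[len("www."):]
        if host not in domains:
            domains.append(host)
    return domains


comment = "see http://www.google.com/search?q=1 and www.pastebin.com/abc, thanks"
print(extract_domains(comment))  # ['google.com', 'pastebin.com']
```

The resulting list could then be stored in its own field next to the full comment text, so that a lookup by domain stays an exact match.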

1 Comment

Hi, I might have explained this badly. What I do is index posts from a company-internal wiki (actually the comments) and try to make these searchable. One query that I'd like to execute is to find all pastebin links that were ever mentioned in these comments. So I do not index only the URLs but full-text comments which might contain pastebin links. In the future I'd like to run queries like "all comments that contain a pastebin link and the words 'test engine'". query_string seems fine for the boolean operations, but the wildcards are not working.
