What should be the regular expression pattern for a tokenizer in Elasticsearch for matching C# and C++ each separately?

Right now we have one analyzer for this, but whenever we search for C#, C++ also shows up as a match, and vice versa.

  • What are you asking? Do you want to analyse some C++ code? or are you talking about the APIs to ES? Commented Feb 13, 2015 at 17:31
  • Not C++ code; I want to analyze special characters like +, #, and $ as tokens, which is currently not possible with the standard analyzer. Commented Feb 16, 2015 at 5:44

1 Answer


Assuming I understand you correctly, one thing you can do is set up an analyzer that tokenizes only on whitespace. The default standard analyzer splits on symbols as well as whitespace and discards the symbols, so "C++" and "C#" are both indexed as the term "c", and a search for either one matches both documents.
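To make the collision concrete, here is a minimal Python sketch of the two behaviours. It is only a rough model, not the actual Lucene tokenizers: the function names are made up for illustration, and the "standard-like" version simply keeps runs of ASCII letters and digits.

```python
import re

def standard_like_tokens(text):
    # Rough stand-in for the standard analyzer (an assumption, not Lucene's
    # real algorithm): keep runs of letters/digits, drop symbols, lowercase.
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def whitespace_tokens(text):
    # Rough stand-in for a whitespace tokenizer followed by a lowercase filter.
    return [tok.lower() for tok in text.split()]

# Both inputs collapse to the same final term "c" in the standard-like model.
print(standard_like_tokens("some text with C++"))  # ['some', 'text', 'with', 'c']
print(standard_like_tokens("some text with C#"))   # ['some', 'text', 'with', 'c']

# Splitting only on whitespace keeps the symbols, so the terms stay distinct.
print(whitespace_tokens("some text with C++"))     # ['some', 'text', 'with', 'c++']
print(whitespace_tokens("some text with C#"))      # ['some', 'text', 'with', 'c#']
```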

One way around this (though it might cause you other headaches) is to use an analyzer like this:

"whitespace_analyzer": {
   "type": "custom",
   "tokenizer": "whitespace",
   "filter": [
      "lowercase",
      "asciifolding"
   ]
}

Or, in a full toy example, I can set up an index like:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "analyzer": {
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "analyzer": "whitespace_analyzer"
            }
         }
      }
   }
}

Then add a few docs via the bulk API:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc", "_id":1}}
{"text_field": "some text with C++"}
{"index":{"_index":"test_index","_type":"doc", "_id":2}}
{"text_field": "some text with C#"}
{"index":{"_index":"test_index","_type":"doc", "_id":3}}
{"text_field": "some text with Objective-C"}

Now a search for "C++" only gives me back the document that contains that term:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C++"
        }
    }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.70273256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.70273256,
            "_source": {
               "text_field": "some text with C++"
            }
         }
      ]
   }
}

And likewise with "C#":

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C#"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.70273256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.70273256,
            "_source": {
               "text_field": "some text with C#"
            }
         }
      ]
   }
}

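This solution may or may not give you what you want, because the whitespace tokenizer does not split on punctuation either: a token such as "C++." (with a trailing period) is indexed as-is and will not match a search for "C++".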
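The punctuation caveat can be illustrated with a small Python sketch. It is a simplified model only: `analyze` and `search` are hypothetical helpers that imitate whitespace tokenization plus lowercasing and an exact-term match, not Elasticsearch itself.

```python
def analyze(text):
    # Whitespace tokenization plus lowercasing; punctuation stays attached.
    return [tok.lower() for tok in text.split()]

# Two sample documents: one clean, one with punctuation glued to the terms.
docs = {
    1: "some text with C++",
    4: "some text with Objective-C, C#, and C++.",
}

def search(term):
    # A document matches if it shares at least one analyzed term with the query.
    query_tokens = set(analyze(term))
    return [doc_id for doc_id, text in docs.items()
            if query_tokens & set(analyze(text))]

print(analyze(docs[4]))
# ['some', 'text', 'with', 'objective-c,', 'c#,', 'and', 'c++.']

print(search("C++"))  # [1]  (document 4 only contains the term 'c++.')
print(search("C#"))   # []   (document 4 only contains the term 'c#,')
```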

Here is the code I used for the whitespace-analyzer example:

http://sense.qbox.io/gist/92871671ea7313356cbbd1ea900c3d55944bd20b

EDIT: Here is a slightly more advanced solution that can help solve the punctuation problem. I got the idea from this article. The basic idea is that you can declare certain symbol characters to be alphanumeric, so that a word_delimiter token filter keeps them inside tokens while still splitting on other punctuation.
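As a rough illustration of that idea, the following Python sketch treats "#", "+", and "@" as alphanumeric and splits on everything else. It is a deliberately simplified model: the real word_delimiter filter has further default behaviours (for example splitting on case changes and letter/digit transitions) that are ignored here, and `EXTRA_ALPHANUM` and `analyze` are names invented for this example.

```python
import re

# Characters promoted to "alphanumeric", mirroring the idea of a type_table.
EXTRA_ALPHANUM = "#+@"

# A token is a run of letters, digits, or any promoted symbol.
TOKEN_RE = re.compile("[A-Za-z0-9" + re.escape(EXTRA_ALPHANUM) + "]+")

def analyze(text):
    tokens = []
    # Step 1: split on whitespace and lowercase.
    for chunk in text.lower().split():
        # Step 2: split each chunk on anything not declared alphanumeric.
        tokens.extend(TOKEN_RE.findall(chunk))
    return tokens

print(analyze("some text with Objective-C, C#, and C++."))
# ['some', 'text', 'with', 'objective', 'c', 'c#', 'and', 'c++']
```

In this model the trailing comma and period are stripped, so "c#" and "c++" come out as clean terms, while "Objective-C" is split into "objective" and "c".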

So I create the index using a custom token filter, then add the same three docs plus another one that the previous solution would not handle correctly:

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "filter": {
            "symbol_filter": {
               "type": "word_delimiter",
               "type_table": [
                  "# => ALPHANUM",
                  "+ => ALPHANUM",
                  "@ => ALPHANUM"
               ]
            }
         },
         "analyzer": {
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "symbol_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "analyzer": "whitespace_analyzer"
            }
         }
      }
   }
}

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc", "_id":1}}
{"text_field": "some text with C++"}
{"index":{"_index":"test_index","_type":"doc", "_id":2}}
{"text_field": "some text with C#"}
{"index":{"_index":"test_index","_type":"doc", "_id":3}}
{"text_field": "some text with Objective-C"}
{"index":{"_index":"test_index","_type":"doc", "_id":4}}
{"text_field": "some text with Objective-C, C#, and C++."}

Now querying for "C++" returns both documents that contain that token:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C++"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.643841,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.643841,
            "_source": {
               "text_field": "some text with C++"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "4",
            "_score": 0.40240064,
            "_source": {
               "text_field": "some text with Objective-C, C#, and C++."
            }
         }
      ]
   }
}

Here is the code for the word_delimiter example:

http://sense.qbox.io/gist/5c583b4e99b8f3b088925ccdb894695aa0c257cb
