Creating Custom Elasticsearch Analyzers

Published on May 6, 2018 by Bo Andersen

In a previous post, you saw how to configure one of the built-in analyzers as well as a token filter. Now it’s time to see how we can build our own custom analyzer. We do that by defining which character filters, tokenizer, and token filters the analyzer should consist of, and potentially configuring them.

PUT /analyzers_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_stop": {
          "type": "standard",
          "stopwords": "_english_"
        },
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "standard",
            "lowercase",
            "trim",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}

The above request adds two analyzers and one token filter, the latter of which is used within the custom analyzer. Apart from the custom my_stemmer token filter, only built-in character filters and token filters are used, along with the standard tokenizer. You can find a list of the available ones in the Elasticsearch documentation. Note that it's considered good practice to include the standard token filter in custom analyzers even though it doesn't do anything today; that's simply a way of future-proofing things in case it does something in future versions of Elasticsearch.
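By the way, if you want to sanity check a single building block before wiring everything together, the Analyze API also accepts a tokenizer and a list of token filters directly. The request below is just a quick sketch of that idea; it runs the lowercase filter and our custom my_stemmer filter against a single word, which should come back as the token "drink."

POST /analyzers_test/_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "my_stemmer"
  ],
  "text": "drinking"
}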

Great, so the index has been created with our settings. Let’s use the Analyze API to verify that the analyzer works as we expect.

POST /analyzers_test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm in the mood for drinking semi-dry red wine!"
}
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 4,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "the",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "mood",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "for",
      "start_offset": 16,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "drink",
      "start_offset": 20,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "semi",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "dry",
      "start_offset": 42,
      "end_offset": 54,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "red",
      "start_offset": 55,
      "end_offset": 58,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "wine",
      "start_offset": 59,
      "end_offset": 63,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

Taking a glance at the results, we can see that the letter “i” has been lowercased within the first token. We can also see that the HTML tags have been stripped out, and that the word “drinking” has been stemmed to “drink.” Great, so the analyzer is good to go and we can now use it within field mappings.
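To give you an idea of what that last step could look like, here is a minimal sketch of such a mapping. The description field is just a placeholder name, and the default mapping type in the URL is only there because this example assumes a pre-7.0 cluster; on Elasticsearch 7 and later you would leave the type out of the path.

PUT /analyzers_test/_mapping/default
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

Any text indexed into that field will then be passed through my_analyzer at index time, and also at query time unless you specify a separate search_analyzer.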

Bo Andersen

About the Author

I am a back-end web developer with a passion for open source technologies. I have been a PHP developer for many years, and also have experience with Java and Spring Framework. I currently work full time as a lead developer. Apart from that, I also spend time on making online courses, so be sure to check those out!
