Creating Custom Elasticsearch Analyzers

Published on May 6, 2018 by Bo Andersen

In a previous post, you saw how to configure one of the built-in analyzers as well as a token filter. Now it’s time to see how we can build our own custom analyzer. We do that by defining which character filters, tokenizer, and token filters the analyzer should consist of, and potentially configuring them.

PUT /analyzers_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_stop": {
          "type": "standard",
          "stopwords": "_english_"
        },
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "trim",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}

The above request creates an index with two analyzers and one custom token filter, my_stemmer, which is used within the custom analyzer. Apart from that filter, the custom analyzer consists entirely of built-in building blocks: the html_strip character filter, the standard tokenizer, and the lowercase and trim token filters. You can find a list of the available character filters, tokenizers, and token filters in the Elasticsearch documentation.
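As for the first analyzer, english_stop is simply the built-in standard analyzer configured with the English stop word list. Once the index exists, you can inspect its behavior with the Analyze API as well; for example, the following request should produce the tokens “quick,” “brown,” and “fox,” since the terms are lowercased and the stop word “The” is removed.

POST /analyzers_test/_analyze
{
  "analyzer": "english_stop",
  "text": "The quick brown FOX"
}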

Great, so the index has been created with our settings. Let’s use the Analyze API to verify that the analyzer works as we expect.

POST /analyzers_test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm in the mood for drinking semi-dry red wine!"
}

The response contains the following tokens:

{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 4,
      "end_offset": 6,
      "type": "",
      "position": 1
    },
    {
      "token": "the",
      "start_offset": 7,
      "end_offset": 10,
      "type": "",
      "position": 2
    },
    {
      "token": "mood",
      "start_offset": 11,
      "end_offset": 15,
      "type": "",
      "position": 3
    },
    {
      "token": "for",
      "start_offset": 16,
      "end_offset": 19,
      "type": "",
      "position": 4
    },
    {
      "token": "drink",
      "start_offset": 20,
      "end_offset": 28,
      "type": "",
      "position": 5
    },
    {
      "token": "semi",
      "start_offset": 37,
      "end_offset": 41,
      "type": "",
      "position": 6
    },
    {
      "token": "dry",
      "start_offset": 42,
      "end_offset": 54,
      "type": "",
      "position": 7
    },
    {
      "token": "red",
      "start_offset": 55,
      "end_offset": 58,
      "type": "",
      "position": 8
    },
    {
      "token": "wine",
      "start_offset": 59,
      "end_offset": 63,
      "type": "",
      "position": 9
    }
  ]
}

Taking a glance at the results, we can see that the letter “I” has been lowercased within the first token, that the HTML tags have been stripped out by the html_strip character filter, that the standard tokenizer split “semi-dry” into two tokens, and that the word “drinking” has been stemmed to “drink.” Great, so the analyzer is good to go, and we can now use it within field mappings.
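To put the custom analyzer to use, reference it within a field mapping. Below is a minimal sketch, assuming Elasticsearch 7 or later (with the typeless mapping API) and a hypothetical description field:

PUT /analyzers_test/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

With this mapping in place, text indexed into the description field is run through my_analyzer at both index time and search time, unless a different search_analyzer is specified for the field.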

Bo Andersen

About the Author

I am a back-end web developer with a passion for open source technologies. I have been a PHP developer for many years, and also have experience with Java and Spring Framework. I currently work full time as a lead developer. Apart from that, I also spend time on making online courses, so be sure to check those out!

2 comments on »Creating Custom Elasticsearch Analyzers«

  1. manolis

    Hello!
    When I try to execute the above code, I get this response:
    “The [standard] token filter has been removed.”
    Any fix for that?
    Thank you!

  2. Yes, simply remove it from the query. I have updated the post to reflect this.
