Creating Custom Elasticsearch Analyzers

Published on May 6, 2018 by Bo Andersen

In a previous post, you saw how to configure one of the built-in analyzers as well as a token filter. Now it’s time to see how we can build our own custom analyzer. We do that by defining which character filters, tokenizer, and token filters the analyzer should consist of, and potentially configuring them.

PUT /analyzers_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_stop": {
          "type": "standard",
          "stopwords": "_english_"
        },
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "standard",
            "lowercase",
            "trim",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}

The above request adds two analyzers and one token filter, the latter of which is used within the custom analyzer. Apart from the custom my_stemmer token filter, only built-in character filters and token filters are used, along with the standard tokenizer. You can find a list of the available ones in the Elasticsearch documentation. Note that it's considered good practice to include the standard token filter in custom analyzers even though it doesn't do anything today; that's simply a way of future-proofing things in case it does something in future versions of Elasticsearch.
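By the way, if you want to sanity check a single building block before wiring everything together, the Analyze API also accepts a tokenizer and a list of token filters directly. The request below is just a quick sketch of that idea; it runs the lowercase filter and our custom my_stemmer filter against a single word, which should come back as the token "drink."

POST /analyzers_test/_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "my_stemmer"
  ],
  "text": "drinking"
}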

Great, so the index has been created with our settings. Let’s use the Analyze API to verify that the analyzer works as we expect.

POST /analyzers_test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm in the mood for drinking semi-dry red wine!"
}
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 4,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "the",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "mood",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "for",
      "start_offset": 16,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "drink",
      "start_offset": 20,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "semi",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "dry",
      "start_offset": 42,
      "end_offset": 54,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "red",
      "start_offset": 55,
      "end_offset": 58,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "wine",
      "start_offset": 59,
      "end_offset": 63,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

Taking a glance at the results, we can see that the letter “i” has been lowercased within the first token. We can also see that the HTML tags have been stripped out, and that the word “drinking” has been stemmed to “drink.” Great, so the analyzer is good to go and we can now use it within field mappings.
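To give you an idea of what that last step could look like, here is a minimal sketch of such a mapping. The description field is just a placeholder name, and the default mapping type in the URL is only there because this example assumes a pre-7.0 cluster; on Elasticsearch 7 and later you would leave the type out of the path.

PUT /analyzers_test/_mapping/default
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

Any text indexed into that field will then be passed through my_analyzer at index time, and also at query time unless you specify a separate search_analyzer.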

Bo Andersen

About the Author

I am a back-end web developer with a passion for open source technologies. I have been a PHP developer for many years, and also have experience with Java and Spring Framework. I currently work full time as a lead developer. Apart from that, I also spend time on making online courses, so be sure to check those out!
