Creating Custom Elasticsearch Analyzers
In a previous post, you saw how to configure one of the built-in analyzers as well as a token filter. Now it’s time to see how to build our own custom analyzer. We do that by defining which character filters, tokenizer, and token filters the analyzer should consist of, and by configuring each of them where necessary.
PUT /analyzers_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_stop": {
          "type": "standard",
          "stopwords": "_english_"
        },
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "trim",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
The above request creates the index with two analyzers and one custom token filter, which is used within the custom analyzer. Apart from that custom token filter, my_analyzer consists entirely of built-in building blocks: the html_strip character filter, the standard tokenizer, and the lowercase and trim token filters. You can find a list of all the available character filters, tokenizers, and token filters in the Elasticsearch documentation.
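Before analyzing anything, you can optionally ask Elasticsearch to echo the stored index settings back to you as a sanity check; the response should contain the analysis configuration we just defined.

GET /analyzers_test/_settings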
Great, so the index has been created with our settings. Let’s use the Analyze API to verify that the analyzer works as we expect.
POST /analyzers_test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm in the mood for drinking <strong>semi-dry</strong> red wine!"
}
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 4,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "the",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "mood",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "for",
      "start_offset": 16,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "drink",
      "start_offset": 20,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "semi",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "dry",
      "start_offset": 42,
      "end_offset": 54,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "red",
      "start_offset": 55,
      "end_offset": 58,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "wine",
      "start_offset": 59,
      "end_offset": 63,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}
Taking a glance at the results, we can see that the letter “i” has been lowercased within the first token. We can also see that the HTML tags have been stripped out, and that the word “drinking” has been stemmed to “drink.” Great, so the analyzer is good to go and we can now use it within field mappings.
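As an example, here is how the analyzer could be assigned to a text field. This is a minimal sketch assuming Elasticsearch 7 or later (where mapping types have been removed), and the description field is just a hypothetical name; use whichever field should be analyzed this way.

PUT /analyzers_test/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

With this mapping in place, values indexed into the description field are run through my_analyzer at index time, and the same analyzer is applied to query strings targeting the field unless a separate search_analyzer is configured.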