Fuzzy Searches
In the previous lectures, I showed you how to perform a variety of searches. For full text searches, the exact terms that I was searching for had to match somewhere in the document. But what if a user enters something that is not an exact match, but should still be considered a match? For example, a user might make a typo and enter past instead of pasta; in this case, we might still want the pasta products and recipes to match. Enabling such behavior is quite easy thanks to something called fuzzy searches. Fuzziness for a query is interpreted as a so-called Levenshtein edit distance, which is the number of single-character changes that need to be made to one string to turn it into another string. This means that for the term past to match pasta, a fuzziness of one is needed. So fuzziness is the number of characters that are allowed to differ from a given search term.
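To make the idea concrete, here is a minimal sketch of the Levenshtein edit distance in Python. This is not part of Elasticsearch; it is just an illustration of how the distance between two strings is computed:

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # Classic dynamic-programming approach: prev[j] holds the
    # distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("past", "pasta"))  # 1: one insertion turns "past" into "pasta"
```

A single inserted character separates past from pasta, so their edit distance is one, which is why a fuzziness of one is enough for them to match.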
I will first show you how to use fuzzy searches with query string searches. To perform a fuzzy search, simply append a tilde (~) sign to the end of a term, followed by an optional integer. If no number is provided after the tilde sign, the default edit distance is two.
GET /ecommerce/product/_search?q=past~1
In this example, I am searching for the term past with an edit distance of one. If I run this query, you will see that products whose names include the term pasta are included, because the edit distance between this term and the search query is one. You can do this for multiple terms within a query if you wish.
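For instance, using the query_string query (the request-body equivalent of the q parameter), each term can carry its own tilde and edit distance. Here, tomto is a hypothetical misspelling of tomato, and the name field is an assumption:

```
GET /ecommerce/product/_search
{
  "query": {
    "query_string": {
      "query": "past~1 tomto~2",
      "default_field": "name"
    }
  }
}
```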
Let’s see how this is done using the query DSL, which is the more common approach. Instead of simply nesting key-value pairs of field names and values within a match query, I will add a key with the field name and an object as its value instead. This is because an object allows me to use a fuzziness property to control how much the field value is allowed to vary from the search query.
GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "name": {
        "query": "past",
        "fuzziness": 1
      }
    }
  }
}
This query is identical to the query that I showed you before with the query string approach. It searches for the term past where one character may be different. If we take a look at the results, we can see that the matching products do not have the term past in the name, but rather the terms paste and pasta, which match with a difference of one character. This satisfies the constraint that I specified in my query, so the documents are considered matches. Note that the maximum fuzziness that is supported by Lucene is two. A fuzziness greater than two is too expensive in terms of performance and will be ignored by Lucene.
The term that I searched for in my query only consists of four characters. If I set the fuzziness property to two, for example, this could produce quite unpredictable results of low quality. This is because the allowed edit distance is so close to the total number of characters in the search query; in this case, 50% of the characters would be allowed to be different. The solution to this problem is to adjust the maximum edit distance based on the length of the search query. While this is easy to do within an application, Elasticsearch conveniently provides this functionality out of the box. By setting the fuzziness parameter to the string value auto instead, Elasticsearch will take care of this for us.
GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "name": {
        "query": "past",
        "fuzziness": "AUTO"
      }
    }
  }
}
When a fuzziness of auto is used, Elasticsearch will determine the most appropriate maximum edit distance based on the length of the search query. If a query is 0 to 2 characters long, it must match exactly. If it’s between 3 and 5 characters, an edit distance of 1 is used. If the query is longer than 5 characters, an edit distance of 2 is used. While this logic is very simple, it is convenient not having to do this within the application that utilizes Elasticsearch. In fact, using auto for the fuzziness property is recommended in most cases.
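The AUTO rule is simple enough to sketch in a few lines of Python. This is an illustration of the logic Elasticsearch applies, not code you need to write:

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance chosen for AUTO fuzziness,
    based on the length of the search term."""
    length = len(term)
    if length <= 2:
        return 0  # terms of 0-2 characters must match exactly
    if length <= 5:
        return 1  # terms of 3-5 characters allow one edit
    return 2      # longer terms allow two edits

print(auto_fuzziness("past"))    # 1
print(auto_fuzziness("pastas"))  # 2
```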
Now that I have shown you how to perform fuzzy queries, I have a few notes and performance considerations to share with you. First and foremost, it’s important to understand that when performing fuzzy queries, the query is compared to the terms that are stored within an index. The terms are the result of analyzing text fields when adding or updating documents within an index. Therefore, it’s the analyzed terms rather than the actual stored document that are searched. Because of this text analysis, the query may be compared to an unanticipated value, which can sometimes lead to confusing results. This is important to keep in mind while debugging a search query and trying to figure out why a given value matches a fuzzy query.
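You can inspect which terms actually end up in the index with the Analyze API. Assuming the name field uses the standard analyzer, and using a made-up product name for illustration:

```
GET /ecommerce/_analyze
{
  "analyzer": "standard",
  "text": "Pasta - Spaghetti"
}
```

The standard analyzer splits and lowercases the text, so the fuzzy query is compared against the terms pasta and spaghetti rather than the original field value.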
Apart from using a match query with a fuzziness property, there are also other ways of performing fuzzy queries, the most important one being the fuzzy query. This query does not perform any analysis of the query text and provides only a subset of the functionality of a match query. It is therefore less versatile, and the match query should be preferred unless you have a good reason to do otherwise.
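For completeness, here is what the dedicated fuzzy query looks like. Note that the value is matched against the index terms as-is, without being analyzed first:

```
GET /ecommerce/product/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "past",
        "fuzziness": 1
      }
    }
  }
}
```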
There are a few things to consider in terms of performance. Even though Lucene’s implementation of fuzzy searches is fast, it is slower than running a plain match query, and the time it takes to run a fuzzy query increases with the number of unique terms in an index. The reason for this difference in performance is that for match queries, Lucene performs a binary search on the terms index that it stores internally. Binary searches are fast and scale extremely well, even for very large data sets. Fuzzy searches instead use a Deterministic Finite Automaton, or DFA, which is naturally slower than a binary search; a fuzzy query needs to process a larger number of terms, so these queries are always going to be slower. Don’t worry if you don’t know what a binary search or a DFA is, as it’s not important for you to know this. Just know that fuzzy queries are slower than simple match queries, so if you encounter performance issues, this might be a good place to optimize your queries.
One way of doing this would be to require exact prefix matches within the query. What this means is that the first X characters must match exactly and may not differ from those of the search query. This reduces the number of characters that can be different and therefore reduces the number of terms that a fuzzy search must process. For instance, you can require that the first two characters of a search query match exactly. The longer the prefix length, the faster the query will be. For small or medium-sized data sets, you don’t have to worry about this, so for most people, this is just something to keep in mind if the amount of data in your index grows a lot. Using this technique is a tradeoff between performance and the quality of search results, as any typo within the first characters would cause some relevant documents not to match. Like I said, you don’t want to do this in general, but if you have a lot of data and need a performance boost, then this could be a solution. To do this, simply add the prefix_length property to the object within a match query, with an integer defining the number of leading characters that must match exactly.
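For example, to require the first two characters to match exactly while still allowing fuzziness in the rest of the term:

```
GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "name": {
        "query": "past",
        "fuzziness": "AUTO",
        "prefix_length": 2
      }
    }
  }
}
```

With this query, only terms beginning with pa are candidates for fuzzy matching, which shrinks the set of terms the DFA has to process.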
That’s it for fuzzy searches! In the next article, we will take a look at proximity searches that are somewhat related to fuzzy searches. Until then, happy searching!