Proximity Searches

This entry is part 29 of 35 in the series Complete Guide to Elasticsearch

Now that we have taken a look at fuzzy searches, it’s time to take a quick look at proximity searches, which are somewhat related. We looked at phrase searches in a previous article, where the search query is enclosed within quotation marks and the terms must appear in the given order. A proximity search relaxes this: the terms may appear in a different order or further apart than in the search query. This is useful if you don’t just want to ensure that the terms exist within a field, but also that they appear close to each other, i.e. in the same context. Where a fuzzy query lets you specify the maximum edit distance in characters within a term, a proximity search lets you specify the maximum edit distance in words or terms within a phrase.

I will reuse the search query from the article on phrase queries. Here it is before I turn it into a proximity search query.

GET /ecommerce/product/_search?q=name:"pasta spaghetti"

Notice that the request uses a query string search for the phrase pasta spaghetti. Remember that by default, matching documents must contain the words or terms in the search query in exactly that order. By appending a tilde (~) and a natural number, I can define how different the order or how far apart the two words in this query may be. In this case, I will add a proximity of two.

GET /ecommerce/product/_search?q=name:"pasta spaghetti"~2
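The request above can be typed as-is into a console, but when the same search is sent from a script, the quotation marks and the space in the phrase need to be URL-encoded. Here is a minimal Python sketch of that, using only the standard library. The base URL `http://localhost:9200` and the helper name `proximity_search_url` are assumptions made for illustration; they do not come from this series.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Assumption: a node running locally; adjust to wherever your cluster lives.
BASE_URL = "http://localhost:9200"


def proximity_search_url(index, doc_type, field, phrase, slop):
    """Build the URL for a query string proximity search: field:"phrase"~slop."""
    query = f'{field}:"{phrase}"~{slop}'
    # quote_via=quote encodes the space as %20 rather than '+'.
    params = urlencode({"q": query}, quote_via=quote)
    return f"{BASE_URL}/{index}/{doc_type}/_search?{params}"


url = proximity_search_url("ecommerce", "product", "name", "pasta spaghetti", 2)
print(url)

# Decoding the URL again should give back the original query string.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded)
```

The round trip at the end is only a sanity check that nothing in the phrase or the tilde suffix was lost during encoding.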

As you can see, the syntax for adding proximity to the search is exactly the same as when adding fuzziness to a query string search. The only difference is that this search query contains quotation marks, making it a phrase search.

I have added another document to my index to illustrate how this works, so let’s just go ahead and run this query. Looking at the results, you can see that one of the matches has a name where the words pasta and spaghetti are reversed compared to the query. Had I not added a proximity of two or more to my phrase query, this document would not have matched. You might, however, wonder why a proximity of one is not enough. The answer has to do with how the edit distance is calculated. Consider the name Pasta – Spaghetti, which contains two terms. You might think that a proximity of one would be sufficient to reverse them, because moving spaghetti into the first position requires only one edit. However, this is not the case: the term spaghetti first has to be moved to the first position, after which the term pasta has to be moved to the second position. Internally, this counts as two edits. It can therefore sometimes be a bit tricky to figure out why a given phrase does not match.
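The edit counting described above can be sketched in a few lines of Python. This is a simplified model and not Elasticsearch's actual implementation: it assumes each query term occurs once in the field, that text is lowercased and split on non-word characters, and that no stop words are removed. The function names `tokenize` and `required_slop` and the reversed name `Spaghetti Pasta` are hypothetical. For each term it compares the position in the document with the position in the query; the spread of those shifts is the smallest proximity that would match.

```python
import re


def tokenize(text):
    """Rough stand-in for an analyzer: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def required_slop(query_terms, doc_terms):
    """Smallest proximity at which the phrase matches, or None if a term is missing.

    Simplified model: only the first occurrence of each term is considered.
    """
    shifts = []
    for query_pos, term in enumerate(query_terms):
        if term not in doc_terms:
            return None
        doc_pos = doc_terms.index(term)
        # How far the term sits from where the query expects it.
        shifts.append(doc_pos - query_pos)
    return max(shifts) - min(shifts)


query = tokenize("pasta spaghetti")

# Same order as the query: an ordinary phrase match, no proximity needed.
in_order = required_slop(query, tokenize("Pasta – Spaghetti"))

# Hypothetical name with the two terms reversed: each term shifts by one
# position in opposite directions, so the spread is two.
reversed_order = required_slop(query, tokenize("Spaghetti Pasta"))

# One extra word between the terms, as in the description used later on.
one_gap = required_slop(
    tokenize("spaghetti pasta"), tokenize("Spaghetti and pasta is nice!")
)

print(in_order, reversed_order, one_gap)
```

Under these assumptions, swapping two adjacent terms costs two because both terms end up one position away from where the query expects them, in opposite directions.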

Notice that the description for the product that I added reads “Spaghetti and pasta is nice!” There is an “and” between spaghetti and pasta, so a plain phrase search for spaghetti pasta would not match this description. With a proximity search, I can still match it using a phrase search. I am going to search the description field for the phrase spaghetti pasta with a proximity of one.

GET /ecommerce/product/_search?q=description:"spaghetti pasta"~1

As you can see in the results, the document still matches because of the proximity that I specified. Note that the closer the text in a field is to the phrase specified in the query string, the more relevant that document is considered to be. You can also specify a very high value for the proximity. Documents in which the terms are far apart will then still match, as long as all of the terms are present in the field, but documents where the terms are near each other receive a higher score. So if you want to boost documents where the terms are near each other without excluding documents where they are not, this is the way to do it.

Now that I have shown you how to perform proximity searches with the query string approach, I will show you how to accomplish the same thing with the query DSL.

GET /ecommerce/product/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "pasta spaghetti",
        "slop": 2
      }
    }
  }
}

The name key matches the field that is to be searched. The slop property is used when performing proximity searches with the query DSL.

This query is equivalent to the earlier query string search on the name field with a proximity of two, and if we inspect the results, we can see that they are the same.
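To make the correspondence between the two forms concrete, here is a small Python sketch that turns a query string of the shape used in this article into the matching query DSL body. It is an illustration only: the helper `to_match_phrase` is hypothetical, and the regular expression handles just the single `field:"phrase"~N` shape, not the full query string syntax.

```python
import json
import re

# Matches only the simple shape field:"some phrase"~N (the ~N part is optional).
# Assumption: this is a teaching aid, not a real query string parser.
PROXIMITY_RE = re.compile(r'^(?P<field>\w+):"(?P<phrase>[^"]+)"(?:~(?P<slop>\d+))?$')


def to_match_phrase(query_string):
    """Translate field:"phrase"~N into a match_phrase request body with slop N."""
    match = PROXIMITY_RE.match(query_string)
    if match is None:
        raise ValueError(f"unsupported query string: {query_string!r}")
    # Without a tilde suffix the phrase must match exactly, i.e. slop 0.
    slop = int(match.group("slop") or 0)
    return {
        "query": {
            "match_phrase": {
                match.group("field"): {
                    "query": match.group("phrase"),
                    "slop": slop,
                }
            }
        }
    }


body = to_match_phrase('name:"pasta spaghetti"~2')
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the request body shown above, with the number after the tilde ending up in the slop property.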

And there you have it! That’s how to perform proximity searches in Elasticsearch.
