Aggregations

This entry is part 35 of 35 in the series Complete Guide to Elasticsearch

Aggregations are a way of grouping and extracting statistics from your data. If you are familiar with relational databases, you can think of them as the equivalent of SQL’s GROUP BY clause and aggregate functions such as SUM. Elasticsearch lets you execute a search and return hits as normal while also returning aggregated results in the same response. The aggregated results are separate from the search hits, and because everything is returned by a single request, you avoid multiple network roundtrips.

There are three types of aggregations: metric, bucket, and pipeline aggregations. Pipeline aggregations are experimental at the time of writing and intended for quite advanced use cases, so I will not explain them in this article, because the functionality is likely to change.

Metric aggregations work on values extracted from the aggregated documents. These values are usually extracted from a specific document field, which is specified in the query. It is also possible to gain more control and generate the values with a script, but this is beyond the scope of this article. Most metric aggregations output a single numeric metric, such as the sum aggregation, and are called single-value numeric metrics aggregations. Others generate multiple metrics, such as the stats aggregation, and are called multi-value numeric metrics aggregations. This sounds more complicated than it is, so don’t worry too much about it, because I will show you examples of this now.

The aggregation that is easiest to understand is the sum aggregation. This single-value aggregation sums the numeric values that are extracted from the aggregated documents. Let’s write a simple query that calculates the total quantity of all products in the ecommerce index.

GET /ecommerce/product/_search
{
  "query": {
    "match_all": { }
  },
  "size": 0,
  "aggs": {
    "quantity_sum": {
      "sum": {
        "field": "quantity"
      }
    }
  }
}

The aggs key may also be named aggregations if you prefer. quantity_sum is the name of the aggregation, and sum is the type of the aggregation.
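If you send these requests from application code rather than a console, the naming matters when you read the response, because the result is keyed by the aggregation name you chose. The Python sketch below only builds the request body and reads a hand-written response fragment. The helper name metric_agg_body and the value 1234.0 are assumptions for illustration, not output from the ecommerce index.

```python
def metric_agg_body(name, agg_type, field, query=None):
    """Hypothetical helper: build a search body with one metric aggregation."""
    return {
        "query": query or {"match_all": {}},
        "size": 0,  # return no search hits, only the aggregation results
        "aggs": {name: {agg_type: {"field": field}}},
    }


body = metric_agg_body("quantity_sum", "sum", "quantity")

# Illustrative response fragment; the value is made up for this sketch.
sample_response = {"aggregations": {"quantity_sum": {"value": 1234.0}}}

# A single-value metric result is found under the name given in the request.
total = sample_response["aggregations"]["quantity_sum"]["value"]
print(total)
```

Note that the response uses the aggregations key even when the request uses the shorter aggs form.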

Because I use the match_all query, all products are included in the aggregation. Let me change the query a bit to see what the total quantity is for pasta products only.

GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "name": {
        "query": "pasta"
      }
    }
  },
  "size": 0,
  "aggs": {
    "quantity_sum": {
      "sum": {
        "field": "quantity"
      }
    }
  }
}

The aggregated quantity is now lower, because only pasta products are used in the aggregation.

Another single-value aggregation is the avg aggregation, which calculates the average value for a document field. I will just modify the query slightly.

GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "name": {
        "query": "pasta"
      }
    }
  },
  "size": 0,
  "aggs": {
    "avg_quantity": {
      "avg": {
        "field": "quantity"
      }
    }
  }
}

This query now calculates the average quantity for all pasta products.

Two more aggregations are min and max, which calculate the lowest and highest value for a document field, respectively. I will slightly modify the query to use the min aggregation.

GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "name": {
        "query": "pasta"
      }
    }
  },
  "size": 0,
  "aggs": {
    "min_quantity": {
      "min": {
        "field": "quantity"
      }
    }
  }
}

As the results show, the lowest value of the quantity field among the pasta products is two. As you can probably imagine, all I have to do to use the max aggregation is replace the min aggregation type with max.

GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "name": {
        "query": "pasta"
      }
    }
  },
  "size": 0,
  "aggs": {
    "max_quantity": {
      "max": {
        "field": "quantity"
      }
    }
  }
}

This time, the results show that the maximum value for the quantity field is 87.

These aggregations were so-called single-value aggregations, because they only output a single value. Let’s now take a look at a multi-value aggregation, namely the stats aggregation. This aggregation computes stats from the aggregated documents. In fact, these stats include the values from the aggregations that I have just shown you.

GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "name": {
        "query": "pasta"
      }
    }
  },
  "size": 0,
  "aggs": {
    "quantity_stats": {
      "stats": {
        "field": "quantity"
      }
    }
  }
}

This is a multi-value aggregation, simply because it returns multiple values. Looking at the results, we can see that the count of documents is returned, along with the minimum, maximum and average value for the field. The sum of all of the values for the field is also calculated and returned.
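To make the relationship between stats and the single-value aggregations concrete, here is a small Python sketch that computes the same five values over a plain list. The function name stats and the sample quantities are hypothetical. It only mirrors the arithmetic, not how Elasticsearch computes the result across shards, and the handling of an empty list is a choice made for this sketch.

```python
def stats(values):
    """Compute count, min, max, avg and sum, like the fields of a stats result."""
    count = len(values)
    if count == 0:
        # Assumption for this sketch: nothing to measure, so min/max/avg are None.
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0.0}
    total = float(sum(values))
    return {
        "count": count,
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": total / count,
        "sum": total,
    }


# Hypothetical quantities for a handful of pasta products.
quantities = [2, 15, 40, 87]
result = stats(quantities)
print(result)
```

Each single-value aggregation shown earlier corresponds to exactly one key of this result, which is why stats can replace several separate requests.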

There are other metric aggregations, but these were the most important ones. Please have a look at the documentation if you want a full list of available metric aggregations.

Now that I have discussed metric aggregations, I will move on to bucket aggregations, which are slightly more complicated. Instead of calculating metrics over fields, bucket aggregations create buckets or sets of documents. Each bucket is associated with a criterion, which documents must satisfy to be part of that bucket. This is not easy to understand without seeing an example, so let me show you just that.

I will use a range aggregation for this example, which lets me group documents together based on ranges for the quantity field, as the following query shows.

GET /ecommerce/product/_search
{
  "query": {
    "match_all": { }
  },
  "size": 0,
  "aggs": {
    "quantity_ranges": {
      "range": {
        "field": "quantity",
        "ranges": [
          { "from": 1, "to": 50 },
          { "from": 50, "to": 100 }
        ]
      }
    }
  }
}

Note that for each range, the from value is included and the to value is excluded. Within the results, we can see that 483 documents have a quantity in the 1-50 range, while 506 documents belong to the 50-100 range. A use case for this aggregation could be to find the number of products within certain price ranges, allowing the user to apply a filter for a given price range.
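The boundary rule is easy to get wrong, so here is a minimal Python sketch of the bucketing logic under that rule. The function range_buckets and the sample quantities are hypothetical and chosen to sit right on the boundaries. The output only imitates the from, to, and doc_count fields of a bucket.

```python
def range_buckets(values, ranges):
    """Count values per range, with 'from' inclusive and 'to' exclusive."""
    buckets = []
    for r in ranges:
        low, high = r["from"], r["to"]
        doc_count = sum(1 for v in values if low <= v < high)
        buckets.append({"from": low, "to": high, "doc_count": doc_count})
    return buckets


# Hypothetical quantities sitting on the range boundaries.
quantities = [1, 49, 50, 99, 100]
ranges = [{"from": 1, "to": 50}, {"from": 50, "to": 100}]

buckets = range_buckets(quantities, ranges)
for bucket in buckets:
    print(bucket)
```

A quantity of 50 lands in the second bucket only, and a quantity of 100 lands in neither, so adjacent ranges that share a boundary never count a document twice.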

An interesting feature that is available for bucket aggregations is the ability to add sub-aggregations, which is something that is not possible for metric aggregations. A sub-aggregation operates on the documents in each bucket that its parent bucket aggregation created. I hope that doesn’t sound too confusing, but don’t worry if it does, because I am going to give you an example of this now.

Let’s say that I wanted stats on the documents in each bucket. To do this, I can add a sub-aggregation of the type stats, which is the metric aggregation that I showed you earlier. This metric aggregation will then operate on the documents contained within each bucket that the range aggregation creates. Let’s see that in action.

GET /ecommerce/product/_search
{
  "query": {
    "match_all": { }
  },
  "size": 0,
  "aggs": {
    "quantity_ranges": {
      "range": {
        "field": "quantity",
        "ranges": [
          { "from": 1, "to": 50 },
          { "from": 50, "to": 100 }
        ]
      },
      "aggs": {
        "quantity_stats": {
          "stats": {
            "field": "quantity"
          }
        }
      }
    }
  }
}

The results now include stats for the documents within each bucket, so the stats aggregation ran within the context of each of the buckets. You can nest aggregations even further, although you typically won’t need to do this. For example, you could add a sub-aggregation to further sub-divide the quantity ranges, and then add yet another sub-aggregation of the type stats to get stats for each of these ranges. In fact, there is no hard limit on how deep you can nest aggregations, but of course performance will decrease the more levels you add.
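The idea that a sub-aggregation runs once per bucket can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function range_with_stats, the sample quantities, and the choice to skip the stats for an empty bucket. It is not how Elasticsearch executes the request internally.

```python
def range_with_stats(values, ranges):
    """Group values into ranges, then compute stats inside each bucket."""
    result = []
    for r in ranges:
        # Parent bucket aggregation: 'from' is inclusive, 'to' is exclusive.
        members = [v for v in values if r["from"] <= v < r["to"]]
        bucket = {"from": r["from"], "to": r["to"], "doc_count": len(members)}
        if members:
            # Sub-aggregation: sees only the values that fell into this bucket.
            bucket["quantity_stats"] = {
                "count": len(members),
                "min": min(members),
                "max": max(members),
                "avg": sum(members) / len(members),
                "sum": sum(members),
            }
        result.append(bucket)
    return result


# Hypothetical quantities for a few products.
quantities = [2, 10, 48, 50, 87]
ranges = [{"from": 1, "to": 50}, {"from": 50, "to": 100}]

nested = range_with_stats(quantities, ranges)
for bucket in nested:
    print(bucket)
```

The minimum and maximum in each bucket always stay within that bucket's range, which is a quick sanity check that the metric really ran per bucket rather than over the whole index.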

There are quite a few bucket aggregations, but going through them all now would take a long time, and you probably don’t want to hear me ramble on about it for hours, so I will just mention one more useful bucket aggregation. The terms aggregation builds a bucket for each unique term in a given field. This could be useful for a gender field containing either Male or Female, because it could then be used to count how many persons of each gender there are in the index. This is quite useful, so I am going to show you how to use this aggregation in a later lecture when building a sample search engine.
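Until then, the per-term bucketing described above can be sketched in Python with a counter. The function terms_buckets, the gender field values, and the sample documents are hypothetical. The sketch returns every term, whereas the real aggregation returns only the top terms by default, with a size parameter controlling how many.

```python
from collections import Counter


def terms_buckets(docs, field):
    """Build one bucket per unique value of a field, largest bucket first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


# Hypothetical person documents; the last one has no gender field at all.
people = [
    {"name": "A", "gender": "Male"},
    {"name": "B", "gender": "Female"},
    {"name": "C", "gender": "Female"},
    {"name": "D"},
]

gender_buckets = terms_buckets(people, "gender")
for bucket in gender_buckets:
    print(bucket)
```

Documents without the field simply do not end up in any bucket in this sketch, so the bucket counts can add up to less than the number of documents.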

That’s all I wanted you to know about aggregations. Like I said, there are quite a few aggregations that I haven’t covered in this article, but I have discussed the most important ones that you need to know. The rest are more for edge use cases, so you are much less likely to encounter them.
