Batch Processing
In the previous articles, we saw how to add, replace, update, and delete individual documents. In this article, you will see how to perform these operations in batches using the _bulk API. Performing operations in batches is very efficient because it limits network overhead; when performing 20 updates, for instance, a batch operation results in a single network roundtrip instead of 20.
The _bulk API expects a line with the action (i.e., index, create, update, or delete) and its metadata, optionally followed by a line containing the source for the action. The source document is optional because it is not needed for deletions. This pattern can be repeated as many times as you like. It is important to note that the _bulk API uses newlines (that is, \n) to infer the semantics of a given line, so the JSON cannot be pretty-printed to span multiple lines. Also, the final line of data must end with the \n newline character.
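Schematically, the body of a bulk request thus looks as follows, where each line (including the last one) is terminated by a newline character:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
...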
Let’s see an example of this and add two documents to our index. Excuse the formatting below, but the request should span only five lines: one with the HTTP verb and the request URI, and two for each product (one action line and one source line).
POST /ecommerce/product/_bulk
{"index":{"_id":"1002"}}
{"name":"Why Elasticsearch is Awesome","price":"50.00","description":"This book is all about Elasticsearch!","status":"active","quantity":12,"categories":[{"name":"Software"}],"tags":["elasticsearch","programming"]}
{"index":{"_id":"1003"}}
{"name":"Peanuts","price":"3.00","description":"Peanuts with salt.","status":"active","quantity":56,"categories":[{"name":"Food"}],"tags":["snacks"]}
Result:
{
  "took": 339,
  "errors": false,
  "items": [
    {
      "index": {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "1002",
        "_version": 1,
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 201
      }
    },
    {
      "index": {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "1003",
        "_version": 1,
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 201
      }
    }
  ]
}
As we can see in the results, both of our actions were successful. Open up Kibana and search for “peanuts” to verify that the documents have been added.
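If you prefer to stay in the Console, a simple search query should confirm this as well. Here is a minimal sketch that matches against the name field of our product documents:
GET /ecommerce/product/_search
{ "query": { "match": { "name": "peanuts" } } }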
Next, let’s try to make a bulk request with more than one kind of action. I will delete the product with an ID of 1 and update one of the products that I just added; specifically, I will change the quantity of the book from 12 to 11 to simulate that a copy was sold.
POST /ecommerce/product/_bulk
{ "delete" : { "_id" : "1" } }
{ "update" : { "_id" : "1002" } }
{ "doc" : { "quantity" : 11 } }
Again, let’s verify that everything worked.
GET /ecommerce/product/1
The document with an ID of 1 no longer exists within the ecommerce index, meaning that it was successfully deleted. Next, go to Kibana and search for “awesome” to find the book that we added and verify that the quantity is now 11.
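Alternatively, we can retrieve the updated document directly from the Console and inspect its quantity field:
GET /ecommerce/product/1002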
Before the end of this article, I just want to mention a few things. On the lines where you specify the action to be performed, it is also possible to include the name of the index and the type of document to use. In our example, this information was present in the URL, but if we add it as the _index and _type properties, respectively, we do not have to specify it in the URL. This is useful if a bulk request affects multiple indexes and/or types. If the index and/or type are provided in the URL, they serve as defaults for bulk items that do not specify them explicitly.
When issuing bulk requests, both the index and type are optional in the URL. As long as the information is available within the actions, you can do any of the following.
POST /ecommerce/product/_bulk
POST /ecommerce/_bulk
POST /_bulk
In this example, I will leave out both the index and the type from the URL and specify them within the action. We will simulate that we have sold another copy of the book.
POST /_bulk
{ "update" : { "_id" : "1002", "_index" : "ecommerce", "_type" : "product" } }
{ "doc" : {"quantity" : 10 } }
Going into Kibana and refreshing the search will reveal that we have updated the document just as with the previous approach.
Another thing worth mentioning is that all of the actions in a bulk request are executed sequentially and in order. If a single action fails for whatever reason, the remaining actions are still processed and are unaffected by the failure. The response to a bulk request includes the status of each action that was executed, in the same order in which the actions were requested. This is useful for checking whether or not a specific action succeeded or failed.
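To see this behavior for yourself, you can issue a bulk request in which one action is bound to fail. In the sketch below, I assume that no document with an ID of 9999 exists in the index; the first update should then fail with a 404 status (a document_missing_exception), while the second update is still applied, and the response should have its top-level errors property set to true.
POST /ecommerce/product/_bulk
{ "update" : { "_id" : "9999" } }
{ "doc" : { "quantity" : 1 } }
{ "update" : { "_id" : "1002" } }
{ "doc" : { "quantity" : 9 } }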
That’s all there is to bulk requests. You should use them whenever you need to modify a fair number of documents at the same time, as the reduced network overhead improves performance.