In this article, we will walk through the terminology you should be familiar with when working with Elasticsearch.
First, we will discuss what it means for Elasticsearch to be a near real-time search engine, something that was briefly mentioned in a previous article in this series. It means that when you make a change to an index, the change becomes visible to searches across the Elasticsearch cluster within about one second. This is unlike relational database systems hosted on a single machine, where changes become available instantly. The slight delay is a consequence of Elasticsearch’s distributed architecture, which is also what makes the search engine so scalable. In fact, for large clusters, a delay of one second is really impressive!
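The delay can be pictured with a small toy model. The sketch below assumes that newly written documents sit in a buffer until a periodic refresh makes them searchable. The class and its method names are made up for illustration and are not Elasticsearch's API or its actual implementation.

```python
class ToyIndex:
    """Toy model of near real-time search (illustration only)."""

    def __init__(self):
        self._buffer = []      # documents written but not yet searchable
        self._searchable = []  # documents that searches can see

    def index(self, doc):
        # A write lands in the buffer first.
        self._buffer.append(doc)

    def refresh(self):
        # A refresh moves everything in the buffer into the searchable set.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        # Only refreshed documents are considered.
        return [doc for doc in self._searchable if term in doc]


toy = ToyIndex()
toy.index("red running shoes")
before_refresh = toy.search("shoes")  # the new document is not visible yet
toy.refresh()                         # assumed to happen roughly once per second
after_refresh = toy.search("shoes")   # now the document can be found
```

The point of the sketch is only the ordering: a write succeeds first, and the document shows up in search results a moment later.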
A cluster is a collection of one or more nodes (or servers). A cluster can consist of as many nodes as you want, which makes it extremely scalable. Together, the nodes contain all of the data in the cluster, and the cluster provides indexing and search capabilities across all of them. In practice, this means that when performing searches, you do not need to worry about which particular node a given document is stored on. A cluster is identified by a unique name, which defaults to “elasticsearch”.
We briefly touched on the concept of nodes in the section above, but let’s elaborate a bit. A node is a single server that stores searchable data and is part of a cluster. If a cluster contains only one node, that node stores all of the data; otherwise, each node contains a subset of the cluster’s data. Nodes participate in a cluster’s indexing and search capabilities, meaning that when operations are performed on the cluster, the nodes collaborate to complete the requests. Like clusters, nodes are identified by names, the default being the name of a random Marvel character.
By default, a node will join a cluster named “elasticsearch” unless configured otherwise. Starting a single node on a network will therefore create a single-node cluster named “elasticsearch”, which we will see happen automatically when we get to installing Elasticsearch.
Another key concept in Elasticsearch is the index. An index is a collection of documents, where a document could be a product, an order or something similar. Products and orders would each be a type within the index. We will see what a type is in just a moment. The easiest way to understand an index is to think of it as a database within a relational database system. The comparison does not always hold, because it depends on how you design your cluster, but it is a good starting point and holds true for most use cases.
Indexes are identified by names that you choose, and these names must be lowercase. The names are used when indexing, searching, updating and deleting documents within indexes. You can define as many indexes as you want within a cluster, depending on the scale of your project, but most people will have one or just a few.
Within an index are types of documents. A type represents a class or category of similar documents, such as products or users. A type consists of a name and a mapping, where the mapping does not need to be explicitly defined. You can think of a type as being equivalent to a table within a relational database such as MySQL. An index can have one or more types, and each type can have its own mapping, which we will discuss in the next section.
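To make the role of index and type names a little more concrete, here is a minimal sketch of how they can be combined to address a single document. It assumes the `/index/type/id` URL layout used by the versions of Elasticsearch that have types; the helper function and the names "ecommerce" and "product" are hypothetical and chosen only for illustration.

```python
def document_path(index: str, doc_type: str, doc_id: str) -> str:
    """Build the path that addresses one document (illustrative helper).

    Only the lowercase rule for index names is checked here; Elasticsearch
    applies further naming restrictions that this sketch does not model.
    """
    if not index or index != index.lower():
        raise ValueError(f"index name must be lowercase: {index!r}")
    return f"/{index}/{doc_type}/{doc_id}"


# Hypothetical index "ecommerce" with a type "product" and document ID "1".
path = document_path("ecommerce", "product", "1")
print(path)
```

An uppercase index name such as "Ecommerce" would be rejected by this helper with a `ValueError`.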
Because Lucene, which Elasticsearch is built on, has no concept of document types, the type of each document is stored in a _type field. Internally, when you search for a specific type of document, Elasticsearch applies a filter on this field.
A document type has a mapping, which is similar to the schema of a table in a relational database. It describes the fields that a document of a given type may have, along with their data types, such as string, integer, date, etc. It also includes information on how fields should be indexed and how they should be stored by Lucene. Specifying a mapping explicitly is optional, however.
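The filtering idea can be illustrated with plain Python data. This is only a sketch of the concept, assuming each stored document carries its type in a `_type` field; the sample documents and the `search_by_type` helper are invented for the example and do not reflect how Lucene executes the filter.

```python
# Sample documents of two different types stored side by side.
documents = [
    {"_type": "product", "name": "Running shoes"},
    {"_type": "user", "name": "Alice"},
    {"_type": "product", "name": "Rain jacket"},
]


def search_by_type(docs, doc_type):
    """Keep only the documents whose _type field matches doc_type."""
    return [doc for doc in docs if doc["_type"] == doc_type]


products = search_by_type(documents, "product")
product_names = [doc["name"] for doc in products]
```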
Thanks to dynamic mapping, it is optional to define a mapping before adding documents to an index. If no mapping is defined, it will be inferred automatically when a document is added, based on its data.
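As a rough illustration of what inferring a mapping from data can look like, the sketch below guesses a type name for each field of a document. The rules and type names are deliberately simplified assumptions and are not Elasticsearch's actual dynamic mapping rules; the functions and the sample document are hypothetical.

```python
from datetime import datetime


def infer_field_type(value):
    """Guess a simplified type name from a Python value."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        # Assumption: strings shaped like YYYY-MM-DD are treated as dates.
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    return "unknown"


def infer_mapping(document):
    """Infer a field-name -> type-name mapping from one document."""
    return {field: infer_field_type(value) for field, value in document.items()}


mapping = infer_mapping({
    "name": "Running shoes",
    "price": 59.99,
    "in_stock": 12,
    "added": "2016-03-01",
    "active": True,
})
```

The first document to introduce a field determines the inferred type in this sketch, which is one reason you may prefer to define a mapping explicitly.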
A document is a basic unit of information that can be indexed. It consists of fields, which are key/value pairs, where a value can be of various types, such as strings, dates, objects, etc. A document corresponds to an object in an object-oriented programming language, and a document type corresponds to a class. An example of a document could be a single user or product. Documents are expressed as JSON objects, and you can store as many documents within an index as you want.
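For instance, a user document could look like the following when built in Python and serialized to JSON. The field names and values are made up for illustration.

```python
import json

# A hypothetical user document: fields are key/value pairs, and a value
# can itself be an object (here, the nested "address").
user = {
    "name": "Jane Doe",
    "signed_up": "2016-03-01",
    "address": {"city": "Copenhagen", "country": "Denmark"},
}

# Documents are expressed as JSON objects.
user_json = json.dumps(user, indent=2)
print(user_json)
```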
Now that we have walked through a few concepts, let’s complete the analogy to relational databases. Just as an index corresponds to a database and a type corresponds to a table, a document can be thought of as the equivalent of a row in a database table. The fields of a document correspond to columns, and a mapping corresponds to the schema of a table.
We will now discuss shards, a term that also exists in the world of relational databases. Perhaps you have heard of the concept of “sharding” a database, but don’t worry if you haven’t. An index can be divided into multiple pieces, and each piece is called a shard. This is useful if an index contains more data than the hardware of a single node can store. For example, an index might contain 1 terabyte of data while the node has a hard drive of only 500 gigabytes. The index can then be split into shards, some of which are stored on another node that has enough space for them.
A shard is a fully functional and independent index and can be stored on any node within a cluster. The number of shards can be specified when creating an index, but it defaults to five. Shards allow you to scale horizontally by content volume, i.e. index space. Sharding also allows operations to be distributed and parallelized across shards, which increases the performance of a cluster.
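One way to picture how documents end up spread across shards is to hash a routing value, such as the document ID, and take the result modulo the number of shards. The sketch below follows that idea. CRC32 is used purely as a stand-in hash and is not the function Elasticsearch uses, and `pick_shard` is a hypothetical helper.

```python
import zlib


def pick_shard(routing_value: str, number_of_shards: int = 5) -> int:
    """Map a routing value (for example a document ID) to a shard number."""
    # Stand-in hash: CRC32 is deterministic across runs, unlike Python's
    # built-in hash() for strings.
    checksum = zlib.crc32(routing_value.encode("utf-8"))
    return checksum % number_of_shards


# Assign a few hypothetical document IDs to one of five shards.
assignments = {doc_id: pick_shard(doc_id) for doc_id in ["1", "2", "3", "42"]}
print(assignments)
```

Because the same ID always hashes to the same shard, a document can be located again later without searching every shard.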
While shards improve the scalability of the content volume for indexes, replicas ensure high availability. A replica is a copy of a shard, which can take over in case a shard or a node fails. A replica never resides on the same node as the original shard, meaning that if the given node fails, the replica will be available on another node. Replicas also allow for scaling search volume, because replicas can handle search queries.
By default, Elasticsearch creates five primary shards and one replica of each for every index. Unless configured otherwise, an index therefore has five replica shards, one for each primary shard, for a total of ten shards.
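The arithmetic, and the rule that a replica never lives on the same node as its primary shard, can be sketched as follows. The round-robin placement is an assumption made for illustration and is much simpler than Elasticsearch's real shard allocation; the function names and node names are hypothetical.

```python
def total_shards(primaries: int = 5, replicas_per_primary: int = 1) -> int:
    """Total number of shards: each primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)


def place_shards(nodes, primaries=5):
    """Place each primary (P) and its single replica (R) on different nodes.

    Simplified round-robin: the replica goes on the node after its primary.
    """
    if len(nodes) < 2:
        raise ValueError("a replica needs a second node to live on")
    layout = {node: [] for node in nodes}
    for shard in range(primaries):
        primary_node = nodes[shard % len(nodes)]
        replica_node = nodes[(shard + 1) % len(nodes)]
        layout[primary_node].append(f"P{shard}")
        layout[replica_node].append(f"R{shard}")
    return layout


print(total_shards())  # 5 primaries * (1 + 1 replica each) = 10
layout = place_shards(["node-1", "node-2", "node-3"])
for node, shards in layout.items():
    print(node, shards)
```

If any single node in this layout fails, every shard it held still has a copy on another node.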