Field Data Types
Having introduced mapping, we will now go into a bit more detail about data types. There are four categories of data types in Elasticsearch, namely core data types, complex data types, geo data types and specialized data types, all of which we will take a look at now. Feel free to skip this article and move on to the next one if you are already familiar with the various data types.
Core Data Types
The first category that I will talk about is core data types, beginning with strings.
Strings
Not surprisingly, string fields accept string values. This data type can be subdivided into full text and keywords, which I will discuss now.
Strings – Full Text
Full text fields are typically used for text-based relevance searches, such as searching for products by name. These fields are analyzed, which means that data is passed through an analyzer to convert strings into a list of individual terms. This happens before the data gets indexed and is what enables Elasticsearch to search for individual words within a full text field. Full text fields are not used for sorting and are rarely used for aggregations.
Strings – Keywords
The keyword type is used for storing exact values such as tags and statuses. The exact string value is added to the index as a single term, because keyword fields are not analyzed. Fields of this type are typically used for filtering, such as finding all products that are on sale. They are also used for sorting and aggregations quite often.
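To make the difference between the two string flavors concrete, here is a small Python sketch. The two functions are hypothetical stand-ins that I made up for illustration: the "analyzer" just lowercases and splits on non-alphanumeric characters, which is only a rough approximation of what a real Elasticsearch analyzer does.

```python
import re


def analyze_full_text(value):
    """Rough stand-in for an analyzer: lowercase, then split into terms."""
    return [term for term in re.split(r"[^0-9a-z]+", value.lower()) if term]


def analyze_keyword(value):
    """Keyword fields are not analyzed: the whole value is a single term."""
    return [value]


print(analyze_full_text("Quick Brown Fox"))  # ['quick', 'brown', 'fox']
print(analyze_keyword("Quick Brown Fox"))    # ['Quick Brown Fox']
```

With the full text variant, a search for "fox" can find the document, while the keyword variant only matches the exact value.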
Numeric
The next core data type is the numeric data type. As you can probably tell, this data type is used for storing numeric values such as integers, floats and doubles, e.g. 1 and 1.5. The numeric data type supports the following numeric types:
- long (signed 64-bit integer)
- integer (signed 32-bit integer)
- short (signed 16-bit integer)
- byte (signed 8-bit integer)
- double (double-precision 64-bit floating point)
- float (single-precision 32-bit floating point)
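If you are unsure which integer type a field needs, the bit widths listed above translate directly into value ranges. The following Python sketch computes them; the helper names are my own and are not part of Elasticsearch.

```python
# Bit widths of the signed integer types listed above, smallest first.
INTEGER_TYPES = {"byte": 8, "short": 16, "integer": 32, "long": 64}


def signed_range(bits):
    """Return the (min, max) values of a signed integer with the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_integer_type(value):
    """Pick the narrowest listed type that can hold the given whole number."""
    for name, bits in INTEGER_TYPES.items():
        low, high = signed_range(bits)
        if low <= value <= high:
            return name
    raise ValueError(f"{value} does not fit in a signed 64-bit integer")


print(signed_range(8))               # (-128, 127)
print(smallest_integer_type(40000))  # integer
```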
Dates
There is also a “date” data type in Elasticsearch. A date can be a string containing a formatted date (e.g. 2016-01-01 or 2016/01/01 12:00:00), an integer representing the seconds since the epoch, or a long number representing the milliseconds since the epoch. The latter is how Elasticsearch stores dates internally.
Dates – Formats
When dealing with dates, it is possible to choose their formats. The default format is defined as a date with an optional time or the number of milliseconds since the epoch. This is defined as follows: strict_date_optional_time||epoch_millis. Below are a few examples of some of the default formats that you can use when adding dates.
- 2016-01-01 (date only)
- 2016-01-01T12:00:00Z (date including time)
- 1410020500000 (milliseconds since the epoch)
Note that the default date parser supports formats defined by the ISO standard, so these examples are not the only formats that you can use. However, I would recommend using these as a convention within your project. If you define your own date format, then you can use various constant values that Elasticsearch provides, which act as placeholders when defining the format. You can find a full list of these in the documentation. You can also combine several formats by separating them with two pipes (||), much like the boolean OR in a programming language; each format is tried in turn until one matches.
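The "try one format, then the next" behavior of the two pipes can be imitated in a few lines of Python. This is only a sketch of the idea, not how Elasticsearch parses dates: the function names are hypothetical, only two ISO layouts are covered, and treating a date without a time zone as UTC is my own assumption.

```python
from datetime import datetime, timezone

# Only the two ISO layouts from the examples above are handled in this sketch.
ISO_LAYOUTS = ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d")


def parse_date_optional_time(value):
    """Parse a date with an optional time part, e.g. 2016-01-01T12:00:00Z."""
    for layout in ISO_LAYOUTS:
        try:
            parsed = datetime.strptime(str(value), layout)
        except ValueError:
            continue
        # Assumption: a value without a zone is treated as UTC.
        return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
    raise ValueError(f"not an ISO date: {value!r}")


def parse_epoch_millis(value):
    """Parse the number of milliseconds since the epoch."""
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


def parse_date(value, parsers=(parse_date_optional_time, parse_epoch_millis)):
    """Try each parser in turn, like formats separated by ||."""
    for parser in parsers:
        try:
            return parser(value)
        except (ValueError, TypeError):
            continue
    raise ValueError(f"no format matched: {value!r}")


print(parse_date("2016-01-01"))
print(parse_date("2016-01-01T12:00:00Z"))
print(parse_date(1410020500000))
```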
Boolean
If a field can only contain on or off values, then the boolean data type is appropriate. Just like in JSON, boolean fields can accept true and false values. They also accept strings and numbers, which will be interpreted as a boolean. I would not recommend relying on this unless you have a specific reason to. False values include the boolean false, a string with a value of “false”, “off”, “no”, “0” or an empty string (“”), or the number 0. Everything that is not interpreted as false will be interpreted as true.
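The rule above is easy to express in code. The Python sketch below mirrors the rule exactly as stated in this article; the function name is hypothetical, and comparing the strings case-sensitively is an assumption on my part.

```python
# The strings that the text above lists as being interpreted as false.
FALSE_STRINGS = {"false", "off", "no", "0", ""}


def interpret_boolean(value):
    """Interpret a value the way the rule above describes."""
    # bool must be checked before numbers, because in Python True == 1.
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value not in FALSE_STRINGS
    if isinstance(value, (int, float)):
        return value != 0
    raise TypeError(f"unsupported value: {value!r}")


print(interpret_boolean("off"))  # False
print(interpret_boolean("yes"))  # True
print(interpret_boolean(0))      # False
```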
Binary
It’s also possible to store binary data within a field, such as a file. This data will be stored as a Base64 encoded string and will not be searchable. An example is the string aHR0cDovL2NvZGluZ2V4cGxhaW5lZC5jb20=.
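If you have never worked with Base64, the following Python snippet decodes the example string above and shows how raw bytes could be encoded before being sent to a binary field. The helper name to_binary_field is made up for this example.

```python
import base64

encoded = "aHR0cDovL2NvZGluZ2V4cGxhaW5lZC5jb20="

# Decode the Base64 string back into the original bytes.
decoded = base64.b64decode(encoded)
print(decoded)  # b'http://codingexplained.com'


def to_binary_field(raw):
    """Encode raw bytes as a Base64 string suitable for a binary field."""
    return base64.b64encode(raw).decode("ascii")


print(to_binary_field(decoded) == encoded)  # True
```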
Complex Data Types
Next, we will discuss complex data types.
Object
JSON documents are hierarchical, meaning that an object may contain inner objects. In Elasticsearch, these are flattened before being indexed and are stored as simple key/value pairs. Consider the following example.
{
  "message": "Some text...",
  "customer.age": 26,
  "customer.address.city": "Copenhagen",
  "customer.address.country": "Denmark"
}
If customer is an object that has an age property, then this property will be stored within a field named customer.age. The same applies to nested objects, as you can see with the address object.
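To illustrate the flattening, here is a short Python sketch that turns a hierarchical document into the dotted key/value pairs shown above. It only imitates the end result; the flatten function is my own and says nothing about how Elasticsearch implements this internally.

```python
def flatten(obj, prefix=""):
    """Flatten nested dictionaries into dotted key/value pairs."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


document = {
    "message": "Some text...",
    "customer": {
        "age": 26,
        "address": {"city": "Copenhagen", "country": "Denmark"},
    },
}

for field, value in flatten(document).items():
    print(field, "=", value)
```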
Array
Often, you might want to store an array of values. You might be surprised to hear that Elasticsearch does not have an array type. This is because any field can contain zero or more values by default, without having to explicitly define this. This means that by simply passing an array of values to a field, Elasticsearch will automatically store these values, even for a field that has been mapped as a string, for example. Do, however, note that all of the values within an array must be of the same data type. If no explicit mapping defines the data type, then Elasticsearch will infer it based on the first value in the array.
Below, you can see a few examples of adding arrays to an Elasticsearch field.
- Array of strings: ["Elasticsearch", "rocks"]
- Array of integers: [1, 2]
- Array of arrays: [1, [2, 3]] (equivalent to [1, 2, 3])
- Array of objects: [{ "name": "Andy", "age": 26 }, { "name": "Brenda", "age": 32 }]
Most notable is that the array of arrays is flattened into a single array of values. Besides the simple data types, it is also possible to add complex data types to an array, such as objects.
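The flattening of nested arrays can be pictured with the following Python sketch. The function name is hypothetical, and the code only reproduces the outcome described above.

```python
def flatten_values(values):
    """Flatten arbitrarily nested lists into a single list of values."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten_values(value))
        else:
            flat.append(value)
    return flat


print(flatten_values([1, [2, 3]]))  # [1, 2, 3]
```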
Arrays – Objects
Above, I showed you an example of an array of objects. However, this does not work as you might expect. As mentioned above, Elasticsearch flattens object hierarchies into key/value pairs because Lucene has no concept of inner objects like JSON does. This means that you cannot query each object independently of the other objects in the array. Consider the example below.
{ "users": [{ "name": "Andy", "age": 26 }, { "name": "Brenda", "age": 32 }] }
This array of objects is stored in a form similar to this:
{ "users.name": ["Andy", "Brenda"], "users.age": [32, 26] }
The array contains two user objects, each with the name and age properties. When this array is stored in Lucene, all of the values for a given property in the objects are added to a single multi-value field, in this case fields named users.name and users.age. In doing so, the associations between the properties are lost. In this particular example, the association between the name Andy and the age 26 is lost, because each field simply holds a set of values with no link back to the object they came from. The consequence is that a search for a user named Andy who is 32 years old would match this document, even though no such user exists. The good news is that there is a solution to this problem, in case you need to be able to perform such a search.
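The lost association is easier to see in code. The Python sketch below flattens the users array into one set of values per field and then runs a naive "name AND age" check against it. Both functions are hypothetical and only simulate the behavior described above.

```python
def flatten_objects(field, objects):
    """Collect the values of each property into one multi-value field."""
    fields = {}
    for obj in objects:
        for key, value in obj.items():
            fields.setdefault(f"{field}.{key}", set()).add(value)
    return fields


def matches(fields, name, age):
    """Naive check: each condition is tested against its own flat field."""
    return name in fields["users.name"] and age in fields["users.age"]


users = [{"name": "Andy", "age": 26}, {"name": "Brenda", "age": 32}]
fields = flatten_objects("users", users)

print(matches(fields, "Andy", 26))  # True, as expected
print(matches(fields, "Andy", 32))  # True, although no such user exists
```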
Nested
The solution to the problem mentioned above is the nested data type. This data type should be used when you need to index arrays of objects and maintain the independence of each object. This means that the values for all of the objects will not be mixed together as you saw above. Internally, each object in the array is indexed as a separate hidden document. The result of using this data type is that you can query for specific objects based on their field values by using a special query named nested, which queries the nested objects as if they were separate documents. You do not need to be aware of this detail to use the data type, though.
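As a rough illustration, the mapping fragment and query for the users example could look like the Python dictionaries below, which are simply serialized to JSON. The field names come from the example above; where exactly the properties fragment goes in a mapping request depends on your Elasticsearch version, so treat this as a sketch rather than a copy-and-paste recipe.

```python
import json

# Mapping fragment: declare users as nested instead of a plain object.
users_mapping = {
    "properties": {
        "users": {"type": "nested"}
    }
}

# Nested query: both conditions must hold within the same user object.
nested_query = {
    "query": {
        "nested": {
            "path": "users",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"users.name": "Andy"}},
                        {"match": {"users.age": 26}},
                    ]
                }
            },
        }
    }
}

print(json.dumps(users_mapping, indent=2))
print(json.dumps(nested_query, indent=2))
```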
Geo Data Types
Next, we will briefly walk through some of the geo data types, which are unsurprisingly used for storing geographical data.
Geo-point
The geo-point data type is used for storing latitude-longitude pairs and can be used for searching for documents within a radius of some coordinates, sorting by distance from some coordinates, etc. Values for a geo-point field can be added in four different formats. The first one is as an object with the lat and lon properties.
{
  "location": {
    "lat": 33.5206608,
    "lon": -86.8024900
  }
}
The second is as a string with the latitude and longitude separated by a comma, and in that order.
{
  "location": "33.5206608,-86.8024900"
}
The third option is a geohash, which I am not going to explain further.
{
  "location": "drm3btev3e86"
}
And the fourth way is as an array.
{
  "location": [-86.8024900, 33.5206608]
}
Notice that, compared to the other formats, the latitude and longitude have been switched around, so the array is in longitude, latitude order. The reason for this is to conform with GeoJSON.
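Because the ordering differs between the formats, it is easy to mix up the coordinates. The Python sketch below normalizes the object, string and array formats into a (lat, lon) tuple. The function is hypothetical, and geohashes are deliberately not handled.

```python
def to_lat_lon(value):
    """Normalize a geo-point value into a (lat, lon) tuple."""
    if isinstance(value, dict):
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, str):
        if "," not in value:
            raise ValueError("geohash strings are not handled by this sketch")
        lat, lon = value.split(",")  # string format: latitude first
        return float(lat), float(lon)
    if isinstance(value, (list, tuple)):
        lon, lat = value  # array format: longitude first, as in GeoJSON
        return float(lat), float(lon)
    raise TypeError(f"unsupported geo-point value: {value!r}")


print(to_lat_lon({"lat": 33.5206608, "lon": -86.8024900}))
print(to_lat_lon("33.5206608,-86.8024900"))
print(to_lat_lon([-86.8024900, 33.5206608]))
```

All three calls produce the same tuple, with the latitude first.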
Geo-shape
A geo shape field can contain shapes such as rectangles and polygons, and can be used to construct complex shapes. This data type should therefore be used when a geo point is not sufficient.
A geo shape can be constructed in many ways, some of which include adding it as a linestring or as a polygon. A linestring is an array of two or more positions, which is essentially an array of arrays. If the array only contains two positions, then the result is a straight line. A polygon is also added as an array of arrays, where each inner array is a ring of points. It is important to note that the first and last points of a ring must be the same, which closes the polygon.
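Here is what a linestring and a polygon could look like, written as Python dictionaries in GeoJSON style. The coordinates are made up for the example and, as in GeoJSON, are in longitude, latitude order. The is_closed_ring helper is hypothetical and simply checks the closing rule described above.

```python
# A straight line between two (made-up) positions.
linestring = {
    "type": "linestring",
    "coordinates": [[-77.03653, 38.897676], [-77.009051, 38.889939]],
}

# A polygon with a single outer ring; the first and last points are equal.
polygon = {
    "type": "polygon",
    "coordinates": [
        [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]
    ],
}


def is_closed_ring(ring):
    """A ring needs at least four positions and must end where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]


print(is_closed_ring(polygon["coordinates"][0]))  # True
print(is_closed_ring(linestring["coordinates"]))  # False
```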
Specialized Data Types
Next, we will briefly discuss some specialized data types whose use cases are quite narrow.
IPv4
The first one is the IPv4 data type. Unsurprisingly, it is used to map IPv4 addresses. Elasticsearch stores the values for an IPv4 field as long values internally.
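To see how an IPv4 address maps to a single number, consider the Python sketch below, which uses the standard ipaddress module. It only illustrates the numeric representation of an address; the function names are my own.

```python
import ipaddress


def ipv4_to_long(address):
    """Convert a dotted IPv4 address into its numeric value."""
    return int(ipaddress.IPv4Address(address))


def long_to_ipv4(value):
    """Convert a numeric value back into a dotted IPv4 address."""
    return str(ipaddress.IPv4Address(value))


print(ipv4_to_long("192.168.1.1"))  # 3232235777
print(long_to_ipv4(3232235777))     # 192.168.1.1
```

Storing addresses as numbers is what makes range comparisons between addresses straightforward.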
Completion
There is also a data type that is used for auto-complete functionality, named completion. It is a so-called prefix suggester, and although it does not do spell correction, it is useful for providing the user with suggestions while searching, as Google does. It does this by storing a finite state transducer as part of the index, which allows for very fast loads and executions. However, this implementation detail is not important as long as you know when to use the type.
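To give a feel for what a prefix suggester does, here is a toy version in Python. Note that it is not a finite state transducer: I have swapped in a sorted list with binary search because it is much shorter to write, and the class name and its parameters are made up.

```python
import bisect


class PrefixSuggester:
    """Toy prefix suggester backed by a sorted list (not an FST)."""

    def __init__(self, phrases):
        self._phrases = sorted(phrase.lower() for phrase in phrases)

    def suggest(self, prefix, size=5):
        """Return up to `size` phrases that start with the given prefix."""
        prefix = prefix.lower()
        # Binary search for the first phrase that is >= the prefix.
        start = bisect.bisect_left(self._phrases, prefix)
        results = []
        for phrase in self._phrases[start:]:
            if len(results) >= size or not phrase.startswith(prefix):
                break
            results.append(phrase)
        return results


suggester = PrefixSuggester(["Elasticsearch", "Elastic Stack", "Kibana", "Logstash"])
print(suggester.suggest("elas"))  # ['elastic stack', 'elasticsearch']
```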
Token Count
The token_count data type is an integer field which accepts string values. This may sound strange, but what happens is that the string values are analyzed, and the number of tokens in each string is indexed. An example use case would be a name property that has a length sub-field of the token_count type. A search query could then be executed to find persons whose name contains a given number of tokens, split by whitespace for instance.
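The name example could be sketched as follows in Python. The count_tokens function is a hypothetical stand-in that splits on whitespace, and the mapping fragment assumes a version of Elasticsearch that has the text type and the standard analyzer; on older versions the parent field would be mapped as a string instead.

```python
def count_tokens(value):
    """Stand-in for analysis: count whitespace-separated tokens."""
    return len(value.split())


# Mapping fragment: name gets a sub-field, name.length, holding the count.
name_mapping = {
    "name": {
        "type": "text",  # assumption: a version that has the text type
        "fields": {
            "length": {"type": "token_count", "analyzer": "standard"}
        },
    }
}

print(count_tokens("John Smith"))             # 2
print(count_tokens("Rachel Alice Williams"))  # 3
```

A query for name.length equal to 3 would then only match the second name.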
Attachment
There is also an attachment data type, which lets Elasticsearch index attachments for many common formats such as PDF files, spreadsheets, PowerPoint presentations and more. The attachments are stored as a Base64 encoded string. This functionality is available as a plugin that must be installed on all nodes. Doing so is easy and can be done by running the command below, assuming that the node is running in a UNIX environment.
sudo /path/to/elasticsearch/bin/plugin install mapper-attachments
If you do install this plugin, then note that you must restart each node after the installation has completed.
That’s it! That was a walkthrough of the most important data types in Elasticsearch that you are most likely to need or encounter.