How To Install an Elasticsearch Cluster on Ubuntu 18.04

In this post we will setup a 3 Node Elasticsearch Cluster which will be installed on Ubuntu 18.04.

Elasticsearch is a distributed, restful search and analytics engine built on Apache Lucene.

Elasticsearch has become the most popular search engine and is commonly used for log analytics, full-text search, security intelligence, business analytics etc.

Elasticsearch Logo

In this tutorial, we will install a 3 node cluster and go through some API examples on creating indexes, ingesting documents, searches etc.

This cluster will consist of 3 data nodes, so with this scenario a master node will be elected and with a 3 node cluster, we would want to avoid a split brain and have quorum of master-eligible nodes.

In a future post, I will be going through how to setup a Elasticsearch Cluster with Dedicated Masters

But before we get to that, let’s cover some basics:

Table of Contents

Elasticsearch Basic Concepts

a Elasticsearch Cluster is made up of a number of nodes;
Each Node contains Indexes, where a Index is a Collection of Documents;
Master nodes are responsible for Cluster related tasks, creating / deleting indexes, tracking of nodes, allocate shards to nodes;
Data nodes are responsible for hosting the actual shards that has the indexed data also handles data related operations like CRUD, search, and aggregations;
Indexes is split into Multiple Shards;
Shards exists of Primary Shards and Replica Shards;
A Replica Shard is a Copy of a Primary Shard which is used for HA/Redundancy;
Shards gets placed on random nodes throughout the cluster;
A Replica Shard will NEVER be on the same node as the Primary Shard’s associated shard-id.

Elasticsearch cluster architecture — *Representation of Nodes, Index and Shards on 2 Nodes (as an example)*

Note on Master Elections

The minimum number of master eligible nodes that need to join a newly elected master in order for an election is configured via the setting discovery.zen.minimum_master_nodes. This configuration is extremely important, as it makes each master-eligible node aware of the minimum number of master-eligible nodes that must be visible in order to form a cluster.

Without this setting or incorrect configuration, this might lead to a split brain, where let’s say something went wrong and upon nodes rejoining the cluster, it may form 2 different clusters, which we want to avoid at all costs.

From consulting elasticsearch documentation, to avoid a split brain, this setting should be set to a quorum of master-eligible nodes via the following formula:

(master_eligible_nodes / 2) + 1
# in our case:
(3/2) + 1 = 2

It is recommended to avoid having only two master eligible nodes, since a quorum of two is two. To read more on elasticsearch cluster master election process, have a look at their documentation

Prerequisites

We need to set the internal IP addresses of our nodes to either our hosts file or DNS server. To keep it simple, I will add them to my host file. This needs to applied to both nodes:

$ sudo su - 
$ cat > /etc/hosts << EOF
127.0.0.1 localhost
172.31.0.77 es-node-1
172.31.0.45 es-node-2
172.31.0.48 es-node-3
EOF

Now that our host entries are set, we can start with the fun stuff.

Install Elasticsearch

The below instructions need to be applied to both nodes.

Get the Elasticsearch repositories and update your system so that your servers are aware of the newly added Elasticsearch repository:

$ apt update && apt upgrade -y
$ apt install software-properties-common python-software-properties apt-transport-https -y
$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
$ apt update

Elasticsearch relies on Java, so install the java development kit:

$ apt install default-jdk -y

Verify that java is installed:

$ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)

Install Elasticsearch:

$ apt install elasticsearch -y

Once Elasticsearch is installed, repeat these steps on the second node. Once that is done, move on to the configuration section.

Configure Elasticsearch

For nodes to join the same cluster, they should all share the same cluster name.

We also need to specify the the discovery hosts as the masters so that the nodes can be discoverable. Since we are installing a 3 node cluster, all nodes will contribute to a master and data node role.

Feel free to inspect the Elasticsearch cluster configuration, but I will be overwriting the default configuration with the config that I need.

Make sure to apply the configuration on both nodes:

$ cat > /etc/elasticsearch/elasticsearch.yml << EOF
cluster.name: es-cluster
node.name: \${HOSTNAME}
node.master: true
node.data: true
path.logs: /var/log/elasticsearch
path.data: /usr/share/elasticsearch/data
bootstrap.memory_lock: true
network.host: 0.0.0.0
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2"]
EOF

Important settings for your elasticsearch cluster is described on their docs:

Increase the file descriptors on the nodes, as instructed by the documentation:

$ cat > /etc/default/elasticsearch << EOF
ES_STARTUP_SLEEP_TIME=5
MAX_OPEN_FILES=65536
MAX_LOCKED_MEMORY=unlimited
EOF

Ensure that pages are not swapped out to disk by requesting the JVM to lock the heap in memory by setting LimitMEMLOCK=infinity.

Set the maxiumim file descriptor number for this process: LimitNOFILE and increase the number of threads using LimitNPROC:

$ vim /usr/lib/systemd/system/elasticsearch.service
[Service]
LimitMEMLOCK=infinity
LimitNOFILE=65535
LimitNPROC=4096
...

Increase the limit on the number of open files descriptors to user elasticsearch of 65536 or higher

$ cat > /etc/security/limits.conf << EOF
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
EOF

Increase the value of the mmap counts as elasticsearch uses mmapfs directory to store its indices:

$ sysctl -w vm.max_map_count=262144

For a permanent setting, update vm.max_map_count in /etc/sysctl.conf and run:

$ sysctl -p /etc/sysctl.conf

Change the permissions of the elasticsearch data path, so that the elasticsearch user and group has permissions to read and write from the configured path:

$ chown -R elasticsearch:elasticsearch /usr/share/elasticsearch

Make sure that you have applied these steps to all the nodes before continuing.

Start Elasticsearch

Enable Elasticsearch on boot time and start the Elasticsearch service:

$ systemctl enable elasticsearch
$ systemctl start elasticsearch

Verify that Elasticsearch is running:

$ netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::9200                 :::*                    LISTEN      278/java
tcp6       0      0 :::9300                 :::*                    LISTEN      278/java

Using Elasticsearch Restful API

In this section we will get comfortable with using Elasticsearch API, by covering the following examples:

Cluster Overview;
How to view Node, Index and Shard information;
How to Ingest data into Elasticsearch;
Who to Search data in Elasticsearch;
How to delete your Index

View Cluster Health

From any node, use a HTTP client such as curl to investigate the current health of the cluster by looking at the cluster API:

$ curl -XGET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

As you can see the cluster status is Green, which means everything works as expected.

In Elasticsearch you get Green, Yellow and Red statuses. Yellow would essentially mean that one or more replica shards are in a unassigned state. Red status means that some or all primary shards are unassigned which is really bad.

From this output we can also see the number of data nodes, primary shards, unassigned shards etc.

This is a good place to get an overall view of your Elasticsearch cluster’s health.

View the Number of Nodes in your Cluster

By looking at that /_cat/nodes API we can get information about our nodes that is part of our cluster:

$ curl -XGET http://localhost:9200/_cat/nodes?v
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.31.0.77              10          95   0    0.00    0.00     0.00 mdi       -      es-node-1
172.31.0.45              11          94   0    0.00    0.00     0.00 mdi       -      es-node-2
172.31.0.48              25          96   0    0.07    0.02     0.00 mdi       *      es-node-3

As you can see, we can see information about our nodes such as the JVM Heap, CPU, Load averages, node role of each node and which node is master.

As we are not running dedicated masters, we can see that node es-node-3 got elected as master.

Create your first Elasticsearch Index

Note that when you create a index, the default primary shards are set to 5 and the default replica shard count is set to 1. You can change the replica shard count after a index has been created, but not the primary shard count as that you will need to set on index creation.

Let’s create a Elasticsearch index named myfirstindex :

$ curl -XPUT http://localhost:9200/myfirstindex
{"acknowledged":true,"shards_acknowledged":true,"index":"myfirstindex"}

Now that your index has been created, let’s have a look at the /_cat/indices API to get information about our indices:

$ curl -XGET http://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myfirstindex xSX9nOQJQ2qNIq4A6_0bTw   5   1          0            0      1.2kb           650b

From the output you will find that we have 5 primary shards and 1 replica shard, with 0 documents in our index and that our cluster is in a green state, meaning that our primary and replica shards has been assigned to the nodes in our cluster.

Note that a replica shard will NEVER reside on the same node as the primary shard for HA and Redundancy.

Let’s go a bit deeper and have a look at the shards, to see how our shards are distributed through our cluster, using the /_cat/shards API:

$ curl -XGET http://localhost:9200/_cat/shards?v
index        shard prirep state   docs store ip             node
myfirstindex 1     r      STARTED    0  230b 172.31.0.77    es-node-2
myfirstindex 1     p      STARTED    0  230b 172.31.0.48    es-node-3
myfirstindex 4     p      STARTED    0  230b 172.31.0.48    es-node-3
myfirstindex 4     r      STARTED    0  230b 172.31.0.77    es-node-1
myfirstindex 2     r      STARTED    0  230b 172.31.0.45    es-node-2
myfirstindex 2     p      STARTED    0  230b 172.31.0.77    es-node-1
myfirstindex 3     p      STARTED    0  230b 172.31.0.45    es-node-2
myfirstindex 3     r      STARTED    0  230b 172.31.0.48    es-node-3
myfirstindex 0     p      STARTED    0  230b 172.31.0.45    es-node-2
myfirstindex 0     r      STARTED    0  230b 172.31.0.77    es-node-1

As you can see each replica shard of it’s primary are spread on different nodes.

Replicating a Yellow Cluster Status

For a yellow cluster status we know that it’s when one or more replica shards are in a unassigned state.

So let’s replicate that behavior by scaling our replica count to 3, which would mean that 5 replica shards will be in a unassigned state:

$ curl -XPUT -H 'Content-Type: application/json' \
http://localhost:9200/myfirstindex/_settings -d \
'{"number_of_replicas": 3}'

Now we have scaled the replica count to 3, but since we only have 3 nodes, we will have a yellow state cluster:

$ curl -XGET http://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   myfirstindex xSX9nOQJQ2qNIq4A6_0bTw   5   3          0            0      3.3kb           1.1kb

The cluster health status should show the number of unassigned shards, and while they are unassigned we can verify that by looking at the shards API again:

$ curl -XGET http://localhost:9200/_cat/shards?v
index        shard prirep state      docs store ip             node
myfirstindex 1     r      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 1     p      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 1     r      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 1     r      UNASSIGNED
myfirstindex 4     r      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 4     p      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 4     r      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 4     r      UNASSIGNED
myfirstindex 2     r      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 2     r      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 2     p      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 2     r      UNASSIGNED
myfirstindex 3     p      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 3     r      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 3     r      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 3     r      UNASSIGNED
myfirstindex 0     p      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 0     r      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 0     r      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 0     r      UNASSIGNED

At this point in time, we could either add another node to the cluster or scale the replication factor back to 1 to get the cluster health to green again.

I will scale it back down to a replication factor of 1:

$ curl -XPUT http://localhost:9200/myfirstindex/_settings -d '{"number_of_replicas": 1}'

Ingest Data into Elasticsearch

We will ingest 3 documents into our index, this will be a simple document consisting of a name, country and gender, for example:

{ 
  "name": "james", 
  "country": "south africa", 
  "gender": "male"
}

First we will ingest the document using a PUT HTTP method, when using a PUT method, we need to specify the document ID.

PUT methods will be use to create or update a document. For creating:

$ curl -XPUT -H 'Content-Type: application/json' \
http://localhost:9200/myfirstindex/_doc/1 -d '
{"name":"james", "country":"south africa", "gender": "male"}'

Now you will find we have one index in our cluster:

$ curl -XGET http://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myfirstindex xSX9nOQJQ2qNIq4A6_0bTw   5   1          1            0     11.3kb          5.6kb

Since we know that the document ID is “1”, we can do a GET on the document ID to read the document from the index:

$ curl -XGET http://localhost:9200/myfirstindex/people/1?pretty
{
  "_index" : "myfirstindex",
  "_type" : "people",
  "_id" : "1",
  "found" : false
}

If we ingest documents with a POST request, Elasticsearch generates the document ID for us automatically. Let’s create 2 documents:

$ curl -XPOST -H 'Content-Type: application/json' \
http://localhost:9200/myfirstindex/_doc/ -d '
{"name": "kevin", "country": "new zealand", "gender": "male"}'

$ curl -XPOST -H 'Content-Type: application/json' \
http://localhost:9200/myfirstindex/_doc/ -d '
{"name": "sarah", "country": "ireland", "gender": "female"}'

When we have a look again at our index, we can see that we now have 3 documents in our index:

$ curl -XGET http://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myfirstindex xSX9nOQJQ2qNIq4A6_0bTw   5   1          3            0       29kb         14.5kb

Search Queries

Now that we have 3 documents in our elasticsearch index, let’s explore the search API’s to get data from our index. First, let’s search for the keyword “sarah” as a source query parameter:

$ curl -XGET 'http://localhost:9200/myfirstindex/_search?q=sarah&pretty'
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "myfirstindex",
        "_type" : "_doc",
        "_id" : "cvU96GsBP0-G8XdN24s4",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "sarah",
          "country" : "ireland",
          "gender" : "female"
        }
      }
    ]
  }
}

We can also narrow our search query down to a specific field, for example, show me all the documents with the name kevin:

$ curl -XGET 'http://localhost:9200/myfirstindex/_search?q=name:kevin&pretty'
{
...
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "myfirstindex",
        "_type" : "_doc",
        "_id" : "gPU96GsBP0-G8XdNHoru",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "kevin",
          "country" : "new zealand",
          "gender" : "male"
        }
      }
    ]
  }
}

With Elasticsearch we can also search with our query in the request body, a similar query as above would look like this:

$ curl -XPOST -H 'Content-Type: application/json' \
'http://localhost:9200/myfirstindex/_search?pretty' -d '
{
  "query": {
    "match": {
      "name": "kevin"
    }
  }
}'

{
...
        "_index" : "myfirstindex",
        "_source" : {
          "name" : "kevin",
          "country" : "new zealand",
          "gender" : "male"
        }
...
}

We can use wildcard queries:

$ curl -XPOST -H 'Content-Type: application/json' \
'http://172.31.0.77:9200/myfirstindex/_search?pretty' -d '
{
  "query": {
    "wildcard": {
      "country": "*land"
    }
  }
}'

{
...
    "hits" : [
      {
        "_index" : "myfirstindex",
        "_type" : "_doc",
        "_id" : "cvU96GsBP0-G8XdN24s4",
        "_score" : 1.0,
        "_source" : {
          "name" : "sarah",
          "country" : "ireland",
          "gender" : "female"
        }
      },
      {
        "_index" : "myfirstindex",
        "_type" : "_doc",
        "_id" : "gPU96GsBP0-G8XdNHoru",
        "_score" : 1.0,
        "_source" : {
          "name" : "kevin",
          "country" : "new zealand",
          "gender" : "male"
        }
      }
    ]
...
}

Have a look at their documentation for more information on the Search API

Delete your Index

To wrap this up, we will go ahead and delete our index:

$ curl -XDELETE http://localhost:9200/myfirstindex

Going Further

If this got you curious, then definitely have a look at this Elasticsearch Cheatsheet that I’ve put together and if you want to generate lot’s of data to ingest to your elasticsearch cluster, have a look at this python script.

Our other links related to ELK :

7 comments

Links 15/7/2019: Vulkan 1.1.115 and Facebook Openwashing | Techrights July 15, 2019 - 8:37 am

[…] How To Install an Elasticsearch Cluster on Ubuntu 18.04 […]

Eleftheria Drosopoulou July 15, 2019 - 12:10 pm

Hello Antoine,

Nice blog! I am editor at Java Code Geeks (www.javacodegeeks.com). We have the JCG program (see http://www.javacodegeeks.com/join-us/jcg/), that I think you’d be perfect for.

If you’re interested, send me an email to [email protected] and we can discuss further.

Best regards,
Eleftheria Drosopoulou

Shivdas Kanade July 21, 2019 - 5:52 pm

Easy to understand and steps are very clearly mentioned in simple language.

How To Install Logstash on Ubuntu 18.04 and Debian 9 – devconnected July 22, 2019 - 9:20 pm

Cristian B. October 10, 2019 - 10:33 pm

Scaling it back down to a replication factor of 1 didn’t work in my case and the cluster remained in a yellow state with unassigned shards.
I’ve deleted and recreated my index to proceed with the guide.

schkn October 12, 2019 - 9:50 am

Thank you for the useful information.

Cristian B. May 5, 2020 - 10:37 pm

Doing “apt install software-properties-common python-software-properties apt-transport-https -y” says “Package python-software-properties is not available, but is referred to by another package. This may mean that the package is missing, has been obsoleted, or is only available from another source However the following packages replace it: software-properties-common”.

That “software-properties-common” is already in the command so perhaps “python-software-properties” just needs removing from the guide.