This blog post explores CoreOS cluster discovery problems and how to troubleshoot them. It is part 8 of my DDDocker (a dummy’s docker diary) series. Since this is one of my more popular posts on LinkedIn, I have decided to move it to my WordPress site.
In the DDDocker (7) LinkedIn post, I showed how to easily set up a CoreOS cluster using Vagrant. However, I experienced problems with the health of the CoreOS cluster, caused by the presence of an HTTP proxy. This post explores CoreOS cluster health and the troubleshooting steps.
v1 (2015-07-21): Manual Cluster Discovery & Troubleshooting
v2 (2015-07-27): Added the Appendix „Healthy Docker (CoreOS) Cluster in less than 10 Minutes“
v3 (2015-08-19): Moved the Appendix „Healthy Docker (CoreOS) Cluster in less than 10 Minutes“ to an independent blog post on WordPress
v4 (2016-06-03): Moved the article to WordPress and rewrote the introduction
In the DDDocker (7) LinkedIn post, I created a CoreOS cluster with 3 cluster nodes. This is our starting point for the current post:
Before following the instructions in the CoreOS Quick Start Guide, section „Process Management with fleet“, let us test whether the cluster members are aware of each other:
Checking the Cluster Health
We will see below that the cluster is not healthy, since I have deployed it behind an HTTP proxy.
Connect to the first node using „vagrant ssh core-01“, or via PuTTY as described in the DDDocker (7) post. Within the window, type:
Problems with Cluster Initialization behind an HTTP Proxy
In my case, this does not work:
The reason is that fleet depends on etcd and the nodes cannot connect to the public etcd discovery server on https://discovery.etcd.io, as defined in the user-data file.
This can be seen with
journalctl -b -u etcd | less
where we see a note saying „failed to connect discovery service[https://discovery.etcd.io/e86ff40c2076ccf901fdc0681f11417e]“:
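To narrow the journal output down to the discovery problem, the log can also be filtered on the command line. The following is a minimal sketch: the function name discovery_failures is hypothetical, and the search string is simply the message quoted above.

```shell
# Count journal lines that report a failed connection to the discovery service.
# Reads log text from stdin and always prints a number (0 if nothing matches).
discovery_failures() {
  grep -c 'failed to connect discovery service' || true
}

# Usage on a cluster node (not executed here):
#   journalctl -b -u etcd --no-pager | discovery_failures
```

A count greater than 0 for the current boot suggests that the node could not reach the discovery service.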
The problem: I performed the installation behind an HTTP proxy, and etcd discovery does not yet support communication via HTTP proxies (see the corresponding change request for CoreOS).
This is why the discovery server has not discovered any nodes for my token, which I can see in a browser on https://discovery.etcd.io/e86ff40c2076ccf901fdc0681f11417e.
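The same check can be done without a browser. This is a minimal sketch under an assumption that is not confirmed in this post: that the discovery service answers with an etcd-style JSON document in which every registered node carries its own "value" field. The helper name count_registered is hypothetical.

```shell
# Count the nodes registered under a discovery token.
# Reads the JSON response from stdin.
# Assumption: each registered node appears with its own "value" field.
count_registered() {
  { grep -o '"value"[[:space:]]*:' || true; } | wc -l | tr -d ' '
}

# Usage from a machine with direct Internet access (not executed here):
#   curl -s https://discovery.etcd.io/e86ff40c2076ccf901fdc0681f11417e | count_registered
```

Under that assumption, a result of 0 corresponds to the situation described here, where no node has registered on the token.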
Workaround: connect to the Internet temporarily
There are two possible workarounds/solutions:
- connect to the Internet temporarily (for discovery process only)
- add a local etcd discovery agent as suggested on http://stackoverflow.com/questions/25019355/coreos-fleetctl-list-machines-show-error.
In my case, I chose option 1, using the hotspot function of my mobile phone: only seconds after connecting to the Internet, the cluster nodes are seen on the public discovery agent:
and the cluster turns out to be „healthy“:
and also fleetctl works fine on core-01:
All cluster nodes are visible and healthy. The same picture is seen on all 3 cluster nodes, and it does not change if I cut the Internet connection and hide behind the HTTP proxy again.
We have seen that cluster discovery of a CoreOS cluster depends on the etcd discovery service. The easiest way to set up etcd is to use the public discovery service on https://discovery.etcd.io. However, you need to make sure that each cluster node can reach this service without having to pass an HTTP proxy, since etcd discovery does not support proxies yet. Even if curl or wget works through the proxy, etcd discovery will still fail.
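Whether a node reaches the discovery service directly or only through the proxy can be checked on the node itself. The sketch below is an illustration under assumptions: both helper names are hypothetical, and the proxy variables of a login shell need not match the environment in which the etcd service runs.

```shell
# List the proxy variables that are set in the current shell.
# If any are set, a successful plain curl may have gone through the proxy.
proxy_vars_set() {
  for name in http_proxy https_proxy HTTP_PROXY HTTPS_PROXY; do
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "$name=$value"
    fi
  done
}

# Try to reach a URL directly, ignoring any configured proxy.
# Returns success only if curl gets a response within 10 seconds.
reachable_without_proxy() {
  curl --noproxy '*' --silent --output /dev/null --max-time 10 "$1"
}

# Usage on a cluster node (not executed here):
#   proxy_vars_set
#   reachable_without_proxy https://discovery.etcd.io/ && echo "direct access works"
```

If the direct attempt fails while a plain curl succeeds, the node most likely depends on the proxy, which is the situation in which etcd discovery failed in this post.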
There are two ways to resolve the etcd discovery issue:
- temporarily connect to the Internet (shown in this post)
- create your own etcd discovery service, as described on http://stackoverflow.com/questions/25019355/coreos-fleetctl-list-machines-show-error (not tested on my side). For etcd newbies like me, https://github.com/coreos/etcd/issues/1404 also seems to give some insight.
The temporary connection to the Internet worked like a charm, and the cluster is still healthy after disconnecting the cluster from the Internet.
- Install a healthy Docker Cluster in less than 10 Minutes
- Docker CoreOS Cluster Failover Test in less than 15 Minutes: shows how to manage a cluster-wide service, including cluster node failover.
- CoreOS Clustering Guide: https://coreos.com/etcd/docs/latest/clustering.html
The guide is good, but I found the etcd Discovery section a little confusing: it does not say how the local etcd discovery process is started.
- How To Troubleshoot Common Issues with your CoreOS Servers: https://www.digitalocean.com/community/tutorials/how-to-troubleshoot-common-issues-with-your-coreos-servers
This tutorial seems to provide insight into many issues around CoreOS and clustering.
- Issue #1404: https://github.com/coreos/etcd/issues/1404
This issue seems to give more insight into how to start a local etcd discovery service. See also http://stackoverflow.com/questions/25019355/coreos-fleetctl-list-machines-show-error.