After installing a fresh Red Hat OpenShift cluster, go to Monitoring -> Alerting. There, you will find a Watchdog alert, which sends messages to let you know that Alertmanager is not only still running, but is also emitting signals for other alerts you might be interested in. You can hook into Watchdog alerts with an external monitoring system, which in turn can tell you that alerting in your OpenShift cluster is working.
"You need a check to check if your check checks out."
How do you do this? Before we can configure Alertmanager to send out Watchdog alerts, we need something on the receiving side, which in our case is Nagios. Follow me on this journey to get Alertmanager's Watchdog alerting against Nagios with a passive check.
Set up Nagios
OpenShift is probably not the first infrastructure element you have running under your supervision. That is why we start by capturing a message from OpenShift with a self-made Python HTTP receiving server (actually taken from the Python 3 documentation and adjusted), just to learn how to configure Alertmanager and to possibly modify the received alert message.
Also, you probably already have Nagios, Checkmk, Zabbix, or something else for external monitoring and alerting. For this journey, I chose Nagios because it is a simple, precooked, and pre-set-up option via yum install nagios. Nagios normally only does active checks. An active check means that Nagios is the initiator of a check configured by you. To know whether the OpenShift Alertmanager is working, we need a passive check in Nagios.
So, let's go and let our already existing monitoring system receive something from Alertmanager. Start by installing Nagios and the needed plugins:
$ yum -y install nagios nagios-plugins-ping nagios-plugins-ssh nagios-plugins-http nagios-plugins-swap nagios-plugins-users nagios-plugins-load nagios-plugins-disk nagios-plugins-procs nagios-plugins-dummy
Let's be more secure and change the provided default password for the Nagios administrator, using htpasswd:
$ htpasswd -b /etc/nagios/passwd nagiosadmin <very_secret_password_you_created>
Note: If you also want to change the admin's username nagiosadmin to something else, don't forget to change it also in /etc/nagios/cgi.cfg.
Now, we can enable and start Nagios for the first time:
$ systemctl enable nagios
$ systemctl start nagios
Do not forget that every time you modify your configuration files, you should run a sanity check on them. It is important to do this before you (re)start Nagios Core since it will not start if your configuration contains errors. Use the following to check your Nagios configuration:
$ /sbin/nagios -v /etc/nagios/nagios.cfg
$ systemctl reload nagios
$ systemctl status -l nagios
Dump HTTP POST content to a file
Before we start configuring, we first need an HTTP POST receiver program in order to receive a message from Alertmanager via a webhook configuration. Alertmanager sends out a JSON message to an HTTP endpoint. For that, I created a very basic Python program to dump all data received via POST into a file:
#!/usr/bin/env python3

from http.server import HTTPServer, BaseHTTPRequestHandler
from io import BytesIO

class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Hello, world!')

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        self.send_response(200)
        self.end_headers()
        response = BytesIO()
        response.write(b'This is POST request. ')
        response.write(b'Received: ')
        response.write(body)
        self.wfile.write(response.getvalue())

        dump_json = open('/tmp/content.json', 'w')
        dump_json.write(body.decode('utf-8'))
        dump_json.close()

httpd = HTTPServer(('localhost', 8000), SimpleHTTPRequestHandler)
httpd.serve_forever()
The above program definitely needs some rework. Both the location and format of the output in the file have to be changed for Nagios.
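Before that rework, you can already poke the receiver from the same machine to see it in action (it listens on localhost only). The following throwaway test client is just a sketch; it assumes the receiving server above is running on localhost:8000 and uses a made-up payload:

#!/usr/bin/env python3
# Throwaway test client (sketch): POST a small JSON document to the receiver
# above and print its reply. Assumes the receiver is running on localhost:8000.
import json
import urllib.request

payload = json.dumps({"status": "firing", "alerts": []}).encode('utf-8')
request = urllib.request.Request('http://localhost:8000/', data=payload,
                                 headers={'Content-Type': 'application/json'})
with urllib.request.urlopen(request) as response:
    print(response.read().decode('utf-8'))

# The same payload should now also be sitting in /tmp/content.json on the server side.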
Configure Nagios for a passive check
Now that this rudimentary receive program is in place, let's configure the passive checks in Nagios. I added a dummy command to the file /etc/nagios/objects/commands.cfg. That is what I understood from the Nagios documentation, but it is not really clear to me whether that is the right place and the right information. In the end, this process worked for me. But keep following; the goal at the end is Alertmanager showing up in Nagios.
Add the following to the end of the commands.cfg file:
define command {
    command_name    check_dummy
    command_line    $USER1$/check_dummy $ARG1$ $ARG2$
}
Then add this to the server's service object .cfg file:
define service {
    use                     generic-service
    host_name               box.example.com
    service_description     OCPALERTMANAGER
    notifications_enabled   0
    passive_checks_enabled  1
    check_interval          15    ; 1.5 times watchdog alerting time
    check_freshness         1
    check_command           check_dummy!2 "Alertmanager FAIL"
}
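If you want to verify this OCPALERTMANAGER service definition on its own, you can also submit one passive check result by hand through the Nagios external command file. The snippet below is a minimal sketch; it assumes the default command file path /var/spool/nagios/cmd/nagios.cmd and that the user running it is allowed to write there:

#!/usr/bin/env python3
# Minimal sketch: hand-feed one passive check result to Nagios so the
# OCPALERTMANAGER service definition can be verified before the webhook exists.
# Assumes the default Nagios command file path and write permission on it.
import time

line = "[{:.0f}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
    time.time(), "box.example.com", "OCPALERTMANAGER", 0, "OK - manual test")

with open('/var/spool/nagios/cmd/nagios.cmd', 'w') as cmd_file:
    cmd_file.write(line)

Shortly afterward, the service should show a fresh OK result in the Nagios UI.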
It would be nice if we could check that this is working via curl, but first, we have to change the sample Python program. It writes to a file by default, and for this example, it must write to a Nagios command_file.
This is the adjusted Python program to write to the command_file with the right service_description:
#!/usr/bin/env python3

from http.server import HTTPServer, BaseHTTPRequestHandler
from io import BytesIO
import time

class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Hello, world!')

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        self.send_response(200)
        self.end_headers()
        response = BytesIO()
        response.write(b'This is POST request. ')
        response.write(b'Received: ')
        response.write(body)
        self.wfile.write(response.getvalue())

        msg_string = "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}"
        datetime = time.time()
        hostname = "box.example.com"
        servicedesc = "OCPALERTMANAGER"
        severity = 0
        comment = "OK - Alertmanager Watchdog\n"

        cmdFile = open('/var/spool/nagios/cmd/nagios.cmd', 'w')
        cmdFile.write(msg_string.format(datetime, hostname, servicedesc, severity, comment))
        cmdFile.close()

httpd = HTTPServer(('localhost', 8000), SimpleHTTPRequestHandler)
httpd.serve_forever()
And with a little curl, we can check that the Python program has a connection with the command_file and that Nagios can read it:
$ curl localhost:8000 -d OK -X POST
Now we only have to trigger the POST action. All of the information sent to Nagios is hard-coded in this Python program. Hard-coding this kind of information is really not the best practice, but it got me going for now. At this point, we have an endpoint (SimpleHTTPRequestHandler) to which we can connect Alertmanager via a webhook to an external monitoring system, in this case Nagios with an HTTP helper program.
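If you want to move beyond the hard-coded values later, one possible refinement (a sketch of my own, not part of the setup above) is to derive the passive check result from the JSON body that Alertmanager sends: a firing Watchdog means the alerting pipeline is healthy, anything else is reported as CRITICAL. The helper name passive_result_from_alert is made up for this example:

#!/usr/bin/env python3
# Sketch only: build the PROCESS_SERVICE_CHECK_RESULT line from the webhook
# payload instead of hard-coding everything. A firing Watchdog maps to OK (0),
# anything else to CRITICAL (2). Paths and names mirror the setup above.
import json
import time

NAGIOS_CMD_FILE = '/var/spool/nagios/cmd/nagios.cmd'  # default command file
HOSTNAME = 'box.example.com'                          # host_name from the service object
SERVICEDESC = 'OCPALERTMANAGER'                       # service_description from the service object

def passive_result_from_alert(body: bytes) -> str:
    """Turn an Alertmanager webhook payload into a Nagios passive check result line."""
    payload = json.loads(body.decode('utf-8'))
    firing = payload.get('status') == 'firing'
    severity = 0 if firing else 2
    comment = 'OK - Alertmanager Watchdog' if firing else 'CRITICAL - Watchdog not firing'
    return "[{:.0f}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
        time.time(), HOSTNAME, SERVICEDESC, severity, comment)

# Inside do_POST(), the hard-coded block could then become:
#     with open(NAGIOS_CMD_FILE, 'w') as cmd_file:
#         cmd_file.write(passive_result_from_alert(body))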
Configure the webhook in Alertmanager
To configure the Alertmanager's Watchdog, we have to adjust the alertmanager.yaml stored in the alertmanager-main secret so that the Watchdog alert is routed to a webhook receiver. To get that file out of OpenShift, use the following command:
$ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' |base64 -d > alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
  - match:
      alertname: 'Watchdog'
    repeat_interval: 5m
    receiver: 'watchdog'
receivers:
- name: 'default'
- name: 'watchdog'
  webhook_configs:
  - url: 'http://nagios.example.com:8000/'
Note: On the Prometheus web page, you can see the possible alert endpoints. As I found out with webhook_config, you should use the plural form (webhook_configs) for that setting in alertmanager.yaml. Also, check out the example provided on the Prometheus GitHub.
To get our new configuration back into OpenShift, execute the following command:
$ oc -n openshift-monitoring create secret generic alertmanager-main --from-file=alertmanager.yaml --dry-run -o=yaml | oc -n openshift-monitoring replace secret --filename=-
In the end, you will see something similar to this received by Nagios. Actually, this is the message the Watchdog sends, via the webhook configuration, to Nagios:
{"receiver":"watchdog", "status":"firing", "alerts":[ {"status":"firing", "labels": {"alertname":"Watchdog", "prometheus":"openshift-monitoring/k8s", "severity":"none"}, "annotations": {"message":"This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"}, "startsAt":"2020-03-26T10:57:30.163677339Z", "endsAt":"0001-01-01T00:00:00Z", "generatorURL":"https://prometheus-k8s-openshift-monitoring.apps.box.example.com/graph?g0.expr=vector%281%29\u0026g0.tab=1", "fingerprint":"e25963d69425c836"}], "groupLabels":{}, "commonLabels": {"alertname":"Watchdog", "prometheus":"openshift-monitoring/k8s", "severity":"none"}, "commonAnnotations": {"message":"This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"}, "externalURL":"https://alertmanager-main-openshift-monitoring.apps.box.example.com", "version":"4", "groupKey":"{}/{alertname=\"Watchdog\"}:{}"}
In the end, if all went well, you will see a nice green 'OCPALERTMANAGER' service in the Nagios services overview.
If you want to catch up with Nagios passive checks, read more at Nagios Core Passive Checks.
Thanks for joining me on this journey!