Monitoring Airflow with Prometheus + Statsd + Grafana and spicing the game with B-Tree Filesystem

José Vicente Núñez
May 8, 2021 · 12 min read


Monitoring the Airflow PostgreSQL database using Prometheus and Grafana. And if you are patient, I’ll show you how to use the Prometheus node-exporter :-)

I want to monitor my home Airflow installation to see if its performance is good enough. I want to fine-tune how many resources it uses, and I also want to see if the poor host where it runs is able to keep up.

These metrics will only make sense if I store them in a time series database. With that I can visualize them and possibly generate some alerts using a Grafana dashboard.

Monitoring the PostgreSQL database and the host where it runs is straightforward with Prometheus, but Airflow exports its data in the statsd format, so a data transformation is needed before I can capture those metrics.

NOTE 1: Both the exporters and Grafana will run on a separate machine (my Raspberry Pi 4 server) and not on macmini2, as memory is pretty tight there. That’s the beauty of these software stacks: you can decide where to run the components so they fit best.

NOTE 2: To store the time series data, I will use a separate SSD USB external drive on the Raspberry Pi 4, formatted with btrfs, as I don’t want to wear down the memory card where the operating system is installed (I’ll explain why I chose btrfs instead of ext4 or xfs below).

First step: Make Airflow export the stats

You can run several Prometheus exporters and then poll/query their data using a single Prometheus instance.

I decided to try the much more conventional statsd path; after all, Airflow supports exporting data in this format out of the box. I followed Sarah Krasnik’s excellent guide, along with Ales Nosek’s guide, but Vova Rozhkov’s guide is the most complete in my opinion.

There is plenty of information out there. It is just a matter of piecing everything together.

Tell Airflow to export metrics using Statsd

I will focus on these tasks for my monitoring example, but there is plenty more.

Note: From my previous post you may remember that I did not use a Docker image to run Airflow due to certain hardware limitations; my setup is a bare-metal installation. Anyway, below are the statsd changes in $HOME/airflow/airflow.cfg:

[metrics]
# StatsD (https://github.com/etsy/statsd) integration settings.
# Enables sending metrics to StatsD. Make sure you change the port from 8125 to 9125
statsd_on = True
statsd_host = raspberrypi
statsd_port = 9125
statsd_prefix = airflow

Restart the services

(airflow) [josevnz@macmini2 airflow]$ systemctl --user restart airflow-webserver.service 
(airflow) [josevnz@macmini2 airflow]$ systemctl --user restart airflow-scheduler.service
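Airflow now emits plain statsd lines over UDP. If you want to convince yourself the wire format is that simple (and later smoke-test connectivity to the exporter), you can hand-craft a line yourself. This is a sketch of my own, not part of the official tooling; the metric name is made up:

```python
import socket

def send_statsd_line(name: str, value: int, metric_type: str = "c",
                     host: str = "127.0.0.1", port: int = 9125) -> int:
    """Send a single statsd metric line (e.g. 'airflow.test:1|c') over UDP.

    Returns the number of bytes sent; UDP is fire-and-forget, so this
    works even before the exporter is listening.
    """
    payload = f"{name}:{value}|{metric_type}".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))

if __name__ == "__main__":
    # 'airflow.manual_test' is an invented name; point host= at raspberrypi
    # (the statsd_exporter machine in my setup) to test end to end
    print(send_statsd_line("airflow.manual_test", 1))
```

This is exactly the kind of traffic you will see in the exporter’s debug logs later on.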

Enable the Prometheus statsd_exporter

Once Airflow is exporting the metrics, we need a converter (exporter) so the Prometheus collector can consume those metrics and store the time series data.

For this I will run the statsd exporter. I do not need to run the statsd daemon itself; instead I will have Airflow talk directly to the exporter. Airflow has a detailed list of the metrics it exports, and I recommend you take a look.

To make things easier, I will run the exporter using its official Docker image.

The exporter needs a little bit of help translating statsd labels into Prometheus labels, so I will reuse an existing mapping:

josevnz@raspberrypi:~$ curl --location --output $HOME/etc/statsd_exporter_mapping.yaml https://raw.githubusercontent.com/databand-ai/airflow-dashboards/main/statsd/statsd.conf
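For context, entries in that mapping file follow the statsd_exporter mapping syntax: a glob-style match against the incoming statsd name, a target Prometheus metric name, and labels built from the `$1`, `$2`… captures. The entry below is illustrative (written by me, not copied from the downloaded file):

```yaml
mappings:
  # Fold 'airflow.dag.<dag_id>.<task_id>.duration' timings into one
  # Prometheus metric, with the dag and task as labels
  - match: "airflow.dag.*.*.duration"
    name: "af_agg_dag_task_duration"
    labels:
      dag_id: "$1"
      task_id: "$2"
```

Without a mapping like this, every DAG/task combination would become its own metric name, which is painful to query.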

Let’s check if the configuration is valid first:

josevnz@raspberrypi:~$ /usr/bin/docker run --rm --interactive --tty --publish 9102:9102 --publish 9125:9125 --publish 9125:9125/udp --volume $HOME/etc/statsd_exporter_mapping.yaml:/etc/statsd_exporter_mapping.yaml:ro  prom/statsd-exporter --statsd.mapping-config=/etc/statsd_exporter_mapping.yaml --web.enable-lifecycle --check-config
level=info ts=2021-04-26T15:49:21.932Z caller=main.go:321 msg="Starting StatsD -> Prometheus Exporter" version="(version=0.20.1, branch=HEAD, revision=2b5239a67f716418a9dbdf70ca7bf2513fc9f7cc)"
level=info ts=2021-04-26T15:49:21.932Z caller=main.go:322 msg="Build context" context="(go=go1.16.2, user=root@f0e567f47a2a, date=20210326-17:33:12)"
WARN[0000] backtracking required because of match "*.ti_failures", matching performance may be degraded source="fsm.go:313"
WARN[0000] backtracking required because of match "*.ti_successes", matching performance may be degraded source="fsm.go:313"
...
level=info ts=2021-04-26T15:49:21.943Z caller=main.go:357 msg="Configuration check successful, exiting"

Hmm, a few warnings, but nothing is really broken.

Then we run the statsd_exporter Docker container in interactive mode with --log.level=debug to make sure Airflow and statsd_exporter are talking to each other:

/usr/bin/docker run --rm --interactive --tty --publish 9102:9102 --publish 9125:9125 --publish 9125:9125/udp --volume $HOME/etc/statsd_exporter_mapping.yaml:/etc/statsd_exporter_mapping.yaml:ro prom/statsd-exporter --statsd.mapping-config=/etc/statsd_exporter_mapping.yaml --web.enable-lifecycle --log.level=debug
level=debug ts=2021-05-02T14:44:23.042Z caller=listener.go:73 msg="Incoming line" proto=udp line=airflow.dag_processing.import_errors:0|g
level=debug ts=2021-05-02T14:44:23.152Z caller=listener.go:73 msg="Incoming line" proto=udp line=airflow.scheduler.critical_section_duration:14.026885|ms
level=debug ts=2021-05-02T14:44:23.160Z caller=listener.go:73 msg="Incoming line" proto=udp line=airflow.executor.open_slots:16|g
level=debug ts=2021-05-02T14:44:23.160Z caller=listener.go:73 msg="Incoming line" proto=udp line=airflow.executor.queued_tasks:0|g
level=debug ts=2021-05-02T14:44:23.161Z caller=listener.go:73 msg="Incoming line" proto=udp line=airflow.executor.running_tasks:0|g
level=debug ts=2021-05-02T14:44:23.236Z caller=exporter.go:121 msg="counter must be non-negative value" metric=af_agg_dag_processing_processes event_value=-1
level=debug ts=2021-05-02T14:44:23.900Z caller=listener.go:73 msg="Incoming line" proto=udp line=airflow.dag_processing.total_parse_time:0.8583512239856645|g
level=debug ts=2021-05-02T14:44:23.900Z caller=listener.go:73 msg="Incoming line" proto=udp line=airflow.dagbag_size:32|g
level=debug ts=2021-05-02T14:44:23.900Z caller=listener.go:73 msg="Incoming line" proto=udp line=airflow.dag_processing.import_errors:0|g

Once you’re happy with the chatty conversation, kill the container and restart it in detached mode with --log.level=info:

/usr/bin/docker run --name statsd-exporter --detach --publish 9102:9102 --publish 9125:9125 --publish 9125:9125/udp --volume $HOME/etc/statsd_exporter_mapping.yaml:/etc/statsd_exporter_mapping.yaml:ro  prom/statsd-exporter  --no-statsd.parse-dogstatsd-tags --no-statsd.parse-librato-tags --no-statsd.parse-signalfx-tags --statsd.mapping-config=/etc/statsd_exporter_mapping.yaml --web.enable-lifecycle

Let’s look at the logs:

josevnz@raspberrypi:~$ docker logs --follow statsd-exporter
level=info ts=2021-04-25T16:28:44.131Z caller=main.go:321 msg="Starting StatsD -> Prometheus Exporter" version="(version=0.20.1, branch=HEAD, revision=2b5239a67f716418a9dbdf70ca7bf2513fc9f7cc)"
level=info ts=2021-04-25T16:28:44.132Z caller=main.go:322 msg="Build context" context="(go=go1.16.2, user=root@f0e567f47a2a, date=20210326-17:33:12)"
level=info ts=2021-04-25T16:28:44.132Z caller=main.go:361 msg="Accepting StatsD Traffic" udp=:9125 tcp=:9125 unixgram=
level=info ts=2021-04-25T16:28:44.132Z caller=main.go:362 msg="Accepting Prometheus Requests" addr=:9102

Then make sure it is up with curl (or your favorite browser)…

josevnz@raspberrypi:~$ /usr/bin/curl --silent http://raspberrypi:9102/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000104263
go_gc_duration_seconds{quantile="0.25"} 0.000217977
go_gc_duration_seconds{quantile="0.5"} 0.000220407
go_gc_duration_seconds{quantile="0.75"} 0.000259482
go_gc_duration_seconds{quantile="1"} 0.000573545
go_gc_duration_seconds_sum 0.004404551
go_gc_duration_seconds_count 18
# HELP go_goroutines Number of goroutines that currently exist.
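The /metrics endpoint speaks the plain-text Prometheus exposition format you see above. If you ever want to pull a value from a script instead of a browser, a minimal parser is easy to write. This is my own sketch; it ignores escaping and commas inside label values, which the full format allows:

```python
def parse_metric_line(line: str):
    """Parse one exposition-format sample line into (name, labels, value).

    Returns None for comments ('# HELP', '# TYPE') and blank lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    name_part, raw_value = line.rsplit(" ", 1)
    labels = {}
    if "{" in name_part:
        name, raw_labels = name_part.split("{", 1)
        for pair in raw_labels.rstrip("}").split(","):
            key, val = pair.split("=", 1)
            labels[key] = val.strip('"')
    else:
        name = name_part
    return name, labels, float(raw_value)

if __name__ == "__main__":
    print(parse_metric_line('go_gc_duration_seconds{quantile="0"} 0.000104263'))
```

Handy for quick sanity checks before Prometheus is even in the picture.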

Time to scrape the data using Prometheus

PromQL will allow you to drill down into those metrics. It will also help you get ready for your Grafana dashboard, which speaks the same language.

So the data is sitting pretty on the statsd-exporter; it is time to fetch it and store it using Prometheus.

So first, we point the scraper to the exporter:

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  scrape_timeout: 10s
  external_labels:
    monitor: 'nunez-family-monitor'

scrape_configs:
  - job_name: 'statsd-exporter'
    static_configs:
      - targets: ['raspberrypi:9102']
  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['raspberrypi:9187']
    tls_config:
      insecure_skip_verify: true

We will run the scraper in a bit; first we need to prepare the data storage.

Setting up a filesystem for my time series data

I purchased a cheap 120 GB USB SSD and decided to format it with the ‘B-Tree Filesystem’. Btrfs is pretty good at storing small files and their metadata, can allocate inodes dynamically, and has special support for SSDs, so I was sold :-)

Create and format the partition:

# Not showing the fdisk /dev/sda partition part.
# Search on Duck Duck Go: https://duckduckgo.com/?q=linux+using+fdisk&t=brave&ia=web
josevnz@raspberrypi:~$ sudo mkfs.btrfs /dev/sda1
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.
Detected a SSD, turning off metadata duplication. Mkfs with -m dup if you want to force metadata duplication.
Label: (null)
UUID: 246e2323-dce4-4f56-b099-60a3cba59ac9
Node size: 16384
Sector size: 4096
Filesystem size: 118.91GiB
Block group profiles:
Data: single 8.00MiB
Metadata: single 8.00MiB
System: single 4.00MiB
SSD detected: yes
Incompat features: extref, skinny-metadata
Checksum: crc32c
Number of devices: 1
Devices:
ID SIZE PATH
1 118.91GiB /dev/sda1

Get the UUID and mount the filesystem with the proper permissions:

josevnz@raspberrypi:~$ sudo blkid
/dev/mmcblk0p1: LABEL_FATBOOT="system-boot" LABEL="system-boot" UUID="4D3B-86C0" TYPE="vfat" PARTUUID="4ec8ea53-01"
/dev/mmcblk0p2: LABEL="writable" UUID="79af43d1-801b-4c28-81d5-724c930bcc83" TYPE="ext4" PARTUUID="4ec8ea53-02"
/dev/loop0: TYPE="squashfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop3: TYPE="squashfs"
/dev/loop4: TYPE="squashfs"
/dev/loop5: TYPE="squashfs"
/dev/zram0: UUID="9df2de79-0f34-47c0-9bc4-28eb524c5405" TYPE="swap"
/dev/zram1: UUID="d97d6934-09cf-456b-9518-607dfac02662" TYPE="swap"
/dev/zram2: UUID="ccbdca3c-58fc-4ae1-a15a-3918056275d4" TYPE="swap"
/dev/zram3: UUID="404b8f98-39bc-415e-be39-abef791be0f9" TYPE="swap"
/dev/sda1: UUID="246e2323-dce4-4f56-b099-60a3cba59ac9" UUID_SUB="7f143f0a-6d9f-4e67-8dc0-f18a5aeee73a" TYPE="btrfs" PARTUUID="cb0d1d32-01"
sudo /usr/bin/mkdir -p -v /data
# 'sudo echo ... >> /etc/fstab' would fail: the redirection runs as your user.
# Use tee instead so the append happens with root privileges:
echo 'UUID=246e2323-dce4-4f56-b099-60a3cba59ac9 /data btrfs defaults,noatime,errors=remount-ro 0 0' | sudo tee -a /etc/fstab
sudo /sbin/mount -a

Then run the Prometheus scraper:

josevnz@raspberrypi:~$ sudo /bin/mkdir -p -v /data/prometheus
josevnz@raspberrypi:~$ sudo /bin/chown -R josevnz /data/prometheus
josevnz@raspberrypi:~$ /usr/bin/docker run --restart always --user 0 --name prometheus --detach --publish 9090:9090 --volume $HOME/etc/prometheus.yaml:/etc/prometheus/prometheus.yml:ro --volume /data/prometheus:/prometheus:rw prom/prometheus
a49e63040b110186a712ae73c8f46cb47834e2486375308bf6b7d84ded378f91
# Check if it came up...
josevnz@raspberrypi:~$ docker logs --follow prometheus
level=info ts=2021-04-25T18:25:04.027Z caller=main.go:380 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2021-04-25T18:25:04.027Z caller=main.go:418 msg="Starting Prometheus" version="(version=2.26.0, branch=HEAD, revision=3cafc58827d1ebd1a67749f88be4218f0bab3d8d)"
level=info ts=2021-04-25T18:25:04.028Z caller=main.go:423 build_context="(go=go1.16.2, user=root@75b99c73e918, date=20210331-11:56:41)"
level=info ts=2021-04-25T18:25:04.028Z caller=main.go:424 host_details="(Linux 5.4.0-1034-raspi #37-Ubuntu SMP PREEMPT Mon Apr 12 23:14:49 UTC 2021 aarch64 a49e63040b11 (none))"
level=info ts=2021-04-25T18:25:04.028Z caller=main.go:425 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-04-25T18:25:04.028Z caller=main.go:426 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-04-25T18:25:04.034Z caller=web.go:540 component=web msg="Start listening for connections" address=0.0.0.0:9090
...

And for the last test, see if the scraper is talking to the exporter (it looks much better in a full-fledged browser like Firefox or Brave):

josevnz@raspberrypi:~$ /usr/bin/curl --silent http://raspberrypi:9090/targets
<!doctype html><html lang="en"><head><meta charset="utf-8"/>...<title>Prometheus Time Series Collection and Processing Server</title>... (minified HTML/JavaScript output trimmed) ...</html>

What about the Airflow PostgreSQL database?

I also want to monitor the database used by Airflow. I will do that through the Prometheus PostgreSQL exporter:

# You can create an admin user as follows using psql
# postgres=# CREATE ROLE user WITH LOGIN SUPERUSER PASSWORD 'password';
josevnz@raspberrypi:~$ cat $HOME/etc/postgres_exporter
DATA_SOURCE_NAME="postgresql://user:password@macmini2:5432/postgres?sslmode=disable"
/usr/bin/docker run --restart always --detach --name postgres-exporter --net=host --env-file $HOME/etc/postgres_exporter quay.io/prometheuscommunity/postgres-exporter --auto-discover-databases
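One gotcha with that connection URI: if your database password contains characters like '@' or ':' it must be percent-encoded, or the exporter will fail to parse the DSN. A small helper I use to assemble it safely (the function name is my own, not part of the exporter):

```python
from urllib.parse import quote_plus

def postgres_dsn(user: str, password: str, host: str, port: int = 5432,
                 database: str = "postgres", sslmode: str = "disable") -> str:
    """Assemble a libpq-style connection URI for postgres_exporter,
    percent-encoding the credentials so special characters survive."""
    return (f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}?sslmode={sslmode}")

if __name__ == "__main__":
    # 'user'/'p@ss' are placeholder credentials
    print(postgres_dsn("user", "p@ss", "macmini2"))
```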

The data capture part is done! Time to focus on the visualization by letting Grafana talk to Prometheus.

I want to see my metrics in a nice Dashboard. Let’s ask Grafana…

This dashboard looks very similar to the Prometheus graph. Yep, it uses PromQL.

The Grafana Docker container is convenient enough for my home network.

So let’s run Grafana with a few overrides:

josevnz@raspberrypi:~$ ID=$(/usr/bin/id -u)
josevnz@raspberrypi:~$ sudo /usr/bin/mkdir -p /data/grafana
josevnz@raspberrypi:~$ sudo /bin/chown -R josevnz /data/grafana
josevnz@raspberrypi:~$ /usr/bin/docker run --name grafana --detach --user $ID --volume /data/grafana:/var/lib/grafana --publish 3000:3000 grafana/grafana

Grafana allows you to define dashboards to display the metrics you collected using Prometheus. Some of them are ready to use, like this one for PostgreSQL and this one for node_exporter. Both are beautifully done and ready to use!

Surprisingly, Airflow does not have one, and the few I saw were not exactly what I was looking for. So I decided to write my own, using PromQL:

af_agg_dag_processing_last_run_seconds{dag_file =~ "website_backups|website_security|git_backups"}
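You can also run that same PromQL expression outside Grafana, against the Prometheus HTTP API’s instant-query endpoint (GET /api/v1/query). A sketch of mine that builds the URL, using the host and port from my setup above; the special characters in the selector need URL encoding, which urlencode handles:

```python
from urllib.parse import urlencode

PROMQL = ('af_agg_dag_processing_last_run_seconds'
          '{dag_file=~"website_backups|website_security|git_backups"}')

def instant_query_url(base: str, promql: str) -> str:
    """Build a GET URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base}/api/v1/query?" + urlencode({"query": promql})

if __name__ == "__main__":
    # Fetch this URL with curl or urllib to get the result as JSON
    print(instant_query_url("http://raspberrypi:9090", PROMQL))
```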

The resulting Grafana dashboard:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 9,
  "links": [],
  "panels": [
    {
      "datasource": "Prometheus",
      "description": "Airflow task duration. Filtered out example tasks that come with Airflow.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "Duration",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "graph": false,
              "legend": false,
              "tooltip": false
            },
            "lineInterpolation": "linear",
            "lineStyle": {
              "fill": "solid"
            },
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": true
          },
          "mappings": [
            {
              "from": "",
              "id": 1,
              "text": "",
              "to": "",
              "type": 1,
              "value": ""
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 19,
        "w": 19,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "options": {
        "graph": {},
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom"
        },
        "tooltipOptions": {
          "mode": "single"
        }
      },
      "pluginVersion": "7.5.4",
      "targets": [
        {
          "exemplar": true,
          "expr": "af_agg_dag_processing_last_run_seconds{dag_file =~ \"website_backups|website_security|git_backups\"}",
          "format": "time_series",
          "hide": false,
          "instant": false,
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "title": "Airflow task durations",
      "type": "timeseries"
    }
  ],
  "schemaVersion": 27,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-30m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Nunez family Airflow tasks duration",
  "uid": "OaTTYQrMk",
  "version": 1
}

There are many interesting metrics. I will spend more time figuring out which ones I care about the most and will make them part of my dashboard.

Wait, there is more: Monitoring OS stats using the node exporter

Host performance monitoring in Grafana. This Dashboard is very detailed.

Wow, you’re still here? Well, as a reward, let me show you how you can also monitor the operating system using Prometheus and Grafana.

For this we can use the node exporter.

But before I move on: why not use a Docker container, like all the other exporters?

Well, this container would need to run with privileged flags. That’s a bad idea; even the GitHub page for node_exporter recommends against it. So instead we will use systemd and a dedicated unprivileged user to accomplish the same.

(airflow) [josevnz@macmini2 src]$ sudo /bin/bash -c 'curl --location --output - https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz| tar --directory /usr/local/sbin --gzip --extract --file - node_exporter-1.1.2.linux-amd64/node_exporter'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 642 100 642 0 0 3710 0 --:--:-- --:--:-- --:--:-- 3710
100 9029k 100 9029k 0 0 2113k 0 0:00:04 0:00:04 --:--:-- 2301k
[josevnz@macmini2 ~]$ sudo /bin/bash -c 'mv -v /usr/local/sbin/node_exporter-1.1.2.linux-amd64/node_exporter /usr/local/sbin/node_exporter; rmdir /usr/local/sbin/node_exporter-1.1.2.linux-amd64'
renamed '/usr/local/sbin/node_exporter-1.1.2.linux-amd64/node_exporter' -> '/usr/local/sbin/node_exporter'

And someone was nice enough to write systemd units, so we just need to tweak them a little bit:

[josevnz@macmini2 ~]$ sudo useradd --system --no-create-home --comment 'Prometheus user' --shell /sbin/nologin node_exporter
(airflow) [josevnz@macmini2 src]$ sudo /bin/bash -c '/usr/bin/curl --location --output /etc/sysconfig/node_exporter https://raw.githubusercontent.com/prometheus/node_exporter/master/examples/systemd/sysconfig.node_exporter'
[josevnz@macmini2 ~]$ sudo /bin/bash -c '/usr/bin/curl --location --output /etc/systemd/system/node_exporter.service https://raw.githubusercontent.com/prometheus/node_exporter/master/examples/systemd/node_exporter.service'
(airflow) [josevnz@macmini2 src]$ sudo /usr/bin/sed -i 's#/usr/sbin#/usr/local/sbin#' /etc/systemd/system/node_exporter.service
(airflow) [josevnz@macmini2 src]$ sudo systemctl enable node_exporter.service --now

And now let’s check if it works:

[josevnz@macmini2 ~]$ sudo systemctl status node_exporter.service 
● node_exporter.service - Node Exporter
Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2021-04-25 19:27:48 EDT; 17s ago
Main PID: 32584 (node_exporter)
Tasks: 4 (limit: 2310)
Memory: 13.3M
CGroup: /system.slice/node_exporter.service
└─32584 /usr/local/sbin/node_exporter --collector.textfile.directory /var/lib/node_exporter/textfile_collector
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:113 collector=thermal_zone
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:113 collector=time
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:113 collector=timex
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:113 collector=udp_queues
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:113 collector=uname
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:113 collector=vmstat
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:113 collector=xfs
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:113 collector=zfs
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.052Z caller=node_exporter.go:195 msg="Listening on" address=:9100
Apr 25 19:27:49 macmini2 node_exporter[32584]: level=info ts=2021-04-25T23:27:49.054Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
[josevnz@macmini2 ~]$ curl --silent http://macmini2:9100/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 6.4795e-05
go_gc_duration_seconds{quantile="0.25"} 6.4795e-05
go_gc_duration_seconds{quantile="0.5"} 6.4795e-05
go_gc_duration_seconds{quantile="0.75"} 6.4795e-05
go_gc_duration_seconds{quantile="1"} 6.4795e-05
go_gc_duration_seconds_sum 6.4795e-05

So far so good on my Fedora hosts. My Raspberry Pi 4 is running Ubuntu, and the setup there was much easier as there is a package ready to go (look ma, no editing!):

josevnz@raspberrypi:~$ sudo apt-get install prometheus-node-exporter
josevnz@raspberrypi:~$ sudo systemctl enable prometheus-node-exporter.service --now

Now we re-configure our Prometheus scraper to talk to the new exporters ($HOME/etc/prometheus.yaml):

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  scrape_timeout: 10s
  external_labels:
    monitor: 'nunez-family-monitor'

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['192.168.1.16:9100', 'raspberrypi.home:9100']
  - job_name: 'statsd-exporter'
    static_configs:
      - targets: ['raspberrypi:9102']
    tls_config:
      insecure_skip_verify: true

And we restart the Prometheus container:

josevnz@raspberrypi:~$ docker restart prometheus

I’m not done with you: You can also monitor your Docker daemon (dockerd) using all these tools

Yeah, it comes for free, so why not keep tabs on the Docker daemon as well?

For that we enable the metrics endpoint over TCP (it is my home network, so I’m not afraid of someone checking my container stats; mi casa es su casa :-)). The change goes in the /etc/docker/daemon.json file:

{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
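A syntax error in /etc/docker/daemon.json can keep dockerd from starting at all, so it pays to validate the file before bouncing the daemon. A small sketch of mine (the function name is invented, not a Docker API):

```python
import json

def check_daemon_config(text: str) -> dict:
    """Parse daemon.json content and confirm the metrics endpoint is set.

    Raises json.JSONDecodeError on bad syntax, ValueError if the
    'metrics-addr' key is missing.
    """
    config = json.loads(text)
    if "metrics-addr" not in config:
        raise ValueError("metrics-addr missing; dockerd will not expose /metrics")
    return config

if __name__ == "__main__":
    sample = '{"metrics-addr": "0.0.0.0:9323", "experimental": true}'
    print(check_daemon_config(sample))
```

(`python3 -m json.tool /etc/docker/daemon.json` works too for a pure syntax check.)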

You will need to bounce your Docker daemon for this change to take effect:

sudo systemctl restart docker

And then let’s configure Prometheus to scrape these data sources:

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  scrape_timeout: 10s
  external_labels:
    monitor: 'nunez-family-monitor'

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['192.168.1.16:9100', 'raspberrypi.home:9100', 'mac-pro-1-1:9100', 'dmaf5:9100']
  - job_name: 'statsd-exporter'
    static_configs:
      - targets: ['raspberrypi:9102']
  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['raspberrypi:9187']
  # New docker-exporter section. dockerd itself exports the metrics
  - job_name: 'docker-exporter'
    static_configs:
      - targets: ['raspberrypi.home:9323', 'mac-pro-1-1:9323', 'dmaf5:9323']
    tls_config:
      insecure_skip_verify: true

Restart the Prometheus container and then, in Grafana, import the excellent dashboard Docker Engine Metrics (1229).

How does the dockerd dashboard look?

See for yourself :-)

And now you can keep an eye on your Docker containers and see how well (or badly) they are performing.

Epilogue

It is amazing how easy it is to add metrics to monitor different applications (Airflow, PostgreSQL, Docker) as well as the operating system where they run. This article only scratches the surface of what you can do: for example, adding your own custom metrics to Prometheus or setting alerts in Grafana.

I have my own homework to do: I will improve my Grafana Airflow dashboard and share it once it is ready.

Please clap if you liked this article! I also want to know what you think, so please leave me a message in the comments section.
