Monitor your infrastructure with Streamlit

José Vicente Núñez
10 min read · Apr 3, 2023

Streamlit is an open source data visualization framework written in Python. In Display your application data with Streamlit, I introduced the framework and provided some basic examples of how to use it.

This article expands on the previous one, covering the following topics:

  • Using Ansible to set up a Prometheus Node Exporter and a scraper to collect metrics about your system.
  • Connecting to the Prometheus scraper to get the metrics and display them in real-time using Streamlit.

You will need the following to complete the tutorial:

  • Elevated permissions to install Prometheus Node Exporter and the scraper.
  • Some experience with Python programming.
  • Curiosity!

The first step is to learn how to collect metrics over time from your machines using Prometheus.

Monitor memory utilization periodically from host machines

Prometheus is an open source framework created to collect metrics about your system. It also provides visualization, efficient storage, and many other cool features.

In a typical on-premises Prometheus setup, you have one or more Node Exporters collecting data and a main scraper aggregating the data from all those nodes.

You can visualize the collected data directly on the collector using its local URL http://prometheuscollector:9090/, or you can have another visualization tool, like Grafana, talk to it.

In this article, I’ll develop a Streamlit application that talks to Prometheus to visualize this data. First, I’ll look at some of the required components.

The Prometheus collector (scraper)

Say your homelab has two machines, named raspberrypi and dmaf5, both running Prometheus Node Exporter agents. You can set up your Prometheus scraper configuration, prometheus.yaml, to reflect that:

---
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  scrape_timeout: 10s
  external_labels:
    monitor: 'nunez-family-lab-monitor'
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['raspberrypi.home:9100', 'dmaf5.home:9100']

How can you install all the data acquisition pieces? You can use an Ansible playbook for that.

Provision metrics collection (or how to automate the boring stuff with Ansible)

Put the following pieces together for the automation recipe to work:

  1. Inventory file: Tells Ansible where to install and deploy the Node Exporter agents
  2. Template to set up the Prometheus daemon that collects the metrics
  3. Systemd unit files for Prometheus and Node Exporter. The Node Exporter runs on each node, and Prometheus runs on the master node.
  4. This playbook, which orchestrates all the actions, from copying configurations to downloading software and finally starting processes:
# Playbook to provision Prometheus scraper and node-exporter servers
# Download binaries from: https://prometheus.io/download/#prometheus
# Checked with: ansible-lint provision_prometheus.yaml
---
- name: Setup Prometheus on your home lab
  hosts: home_lab
  become_user: root
  become: true
  vars:
    prometheus_scraper_version: "2.42.0"
    prometheus_node_exporter_version: "1.5.0"
    prometheus_url: "https://github.com/prometheus/prometheus/releases/download"
    exporter_url: "https://github.com/prometheus/node_exporter/releases/download"
    prometheus_node_exporter_port: 9100
    prometheus_install_dir: "/opt/prometheus"
    prometheus_data_dir: "/data/prometheus"
    prometheus_conf_dir: "/etc/prometheus"
    prometheus_checksum:
      amd64: "422dab055ed9c7bcaff52b718705f9192c6fac0de6b7e78dd278e70ee2663dcc"
      arm64: "b947e4890d221f6b1d09c963fd0440720f2b5d08c29e190d6365f4b3d3e92a9d"
    prometheus_node_exporter_checksum:
      amd64: "af999fd31ab54ed3a34b9f0b10c28e9acee9ef5ac5a5d5edfdde85437db7acbb"
      arm64: "e031a539af9a619c06774788b54c23fccc2a852d41437315725a086ccdb0ed16"
  tasks:
    - name: Set architecture (home lab has amd64 and arm64)
      ansible.builtin.set_fact:
        prom_arch: "{{ 'amd64' if ansible_architecture == 'x86_64' else 'arm64' }}"
    - name: Prometheus user
      tags: user
      ansible.builtin.user:
        name: prometheus
        system: true
        create_home: false
        comment: "Prometheus service account"
        shell: /sbin/nologin
        state: present
    - name: Prometheus install directory
      ansible.builtin.file:
        state: directory
        owner: "prometheus"
        mode: 0755
        path: "{{ prometheus_install_dir }}"
    - name: Scraper management
      when: "'master_lab' in group_names"
      block:
        - name: Install Prometheus Scraper
          tags: scraper
          ansible.builtin.get_url:
            dest: "/tmp/prometheus-{{ prometheus_scraper_version }}.linux-{{ prom_arch }}.tar.gz"
            url: "{{ prometheus_url }}/v{{ prometheus_scraper_version }}/prometheus-{{ prometheus_scraper_version }}.linux-{{ prom_arch }}.tar.gz"
            checksum: "sha256:{{ prometheus_checksum[prom_arch | default('arm64')] }}"
            mode: "u=rw"
        - name: Unpack Prometheus Scraper
          tags: unpack_scraper
          ansible.builtin.unarchive:
            remote_src: true
            owner: "prometheus"
            group: "prometheus"
            src: "/tmp/prometheus-{{ prometheus_scraper_version }}.linux-{{ prom_arch }}.tar.gz"
            dest: "{{ prometheus_install_dir }}"
          notify:
            - Restart Prometheus
        - name: Prometheus config directory
          ansible.builtin.file:
            state: directory
            owner: "prometheus"
            group: "prometheus"
            mode: 0770
            path: "{{ item }}"
          loop:
            - "{{ prometheus_conf_dir }}"
            - "{{ prometheus_data_dir }}"
        - name: Install Prometheus configuration
          tags: config_prometheus_copy
          ansible.builtin.template:
            src: templates/prometheus.yaml.j2
            dest: /etc/prometheus/prometheus.yaml
            owner: prometheus
            group: prometheus
            mode: 0644
          notify:
            - Restart Prometheus
        - name: Install Prometheus systemd unit
          tags: systemd_prometheus_copy
          ansible.builtin.template:
            src: templates/prometheus.service.j2
            dest: /etc/systemd/system/prometheus.service
            owner: root
            group: root
            mode: 0644
          notify:
            - Restart Prometheus
        - name: Make sure Prometheus is running
          tags: systemd_prometheus_start
          ansible.builtin.systemd:
            state: started
            name: prometheus
            enabled: true
            daemon_reload: true
    - name: Install Prometheus Node exporter
      tags: exporter
      ansible.builtin.get_url:
        dest: "/tmp/node_exporter-{{ prometheus_node_exporter_version }}.linux-{{ prom_arch }}.tar.gz"
        url: "{{ exporter_url }}/v{{ prometheus_node_exporter_version }}/node_exporter-{{ prometheus_node_exporter_version }}.linux-{{ prom_arch }}.tar.gz"
        checksum: "sha256:{{ prometheus_node_exporter_checksum[prom_arch | default('arm64')] }}"
        mode: "u=rw"
    - name: Unpack Node exporter
      tags: unpack_exporter
      ansible.builtin.unarchive:
        remote_src: true
        owner: "prometheus"
        group: "prometheus"
        src: "/tmp/node_exporter-{{ prometheus_node_exporter_version }}.linux-{{ prom_arch }}.tar.gz"
        dest: "{{ prometheus_install_dir }}"
      notify:
        - Restart Node Exporter
    - name: Install Node Exporter systemd unit
      tags: systemd_node_exporter_copy
      ansible.builtin.template:
        src: templates/node_exporter.service.j2
        dest: /etc/systemd/system/node_exporter.service
        owner: root
        group: root
        mode: 0644
      notify:
        - Restart Node Exporter
    - name: Install Node Exporter environment overrides
      tags: systemd_node_exporter_env
      ansible.builtin.copy:
        src: files/prometheus-node-exporter
        dest: /etc/default/prometheus-node-exporter
        owner: root
        group: root
        mode: 0644
      notify:
        - Restart Node Exporter
    - name: Make sure Node Exporter is running
      tags: systemd_node_exporter_start
      ansible.builtin.systemd:
        state: started
        name: node_exporter
        enabled: true
        daemon_reload: true
  handlers:
    - name: Restart Node Exporter
      tags: restart_node_exporter
      ansible.builtin.service:
        name: node_exporter
        state: restarted
    - name: Restart Prometheus
      tags: restart_prometheus
      ansible.builtin.service:
        name: prometheus
        state: restarted

You can see it in action:

$ ansible-playbook --inventory inventory provision_prometheus.yaml

After running the playbook, you can check whether it collects data by connecting to the Prometheus host you set up in your inventory using the URL http://prometheuscollector:9090/. Then you can run queries using PromQL like this one:

rate(node_cpu_seconds_total{mode="system"}[1m])

Display Prometheus metrics with Streamlit

Prometheus supports queries using the PromQL language. It also offers a REST API to expose those metrics to clients.

To demonstrate how the monitoring works, use the following metric:

node_memory_MemFree_bytes

If you query the scraper node, it returns a JSON response like this:

{ "status": "success", "data": { "resultType": "matrix", "result": [ { "metric": { "__name__": "node_memory_MemFree_bytes", "instance": "dmaf5.home:9100", "job": "node-exporter" }, "values": [ [ 1678066203.119, "1086185472" ], [ 1678066233.119, "1086185472" ], [ 1678066263.119, "1086185472" ] ] } ] } }

You can see the full example in my GitHub repo.
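
To get a feel for the API before wiring it into Streamlit, you can request the same metric with a few lines of Python. This is a minimal sketch: it assumes the scraper is reachable at http://raspberrypi:9090 (adjust the URL for your lab) and calls the standard /api/v1/query_range endpoint with the query, start, end, and step parameters:

from datetime import datetime, timedelta

import requests

PROMETHEUS_URL = "http://raspberrypi:9090"  # adjust for your lab
end = datetime.now()
start = end - timedelta(hours=1)

response = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": "node_memory_MemFree_bytes",
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "30s",
    },
    timeout=30,
)
response.raise_for_status()
# Each result carries the instance labels and a list of [timestamp, value] pairs
for result in response.json()["data"]["result"]:
    print(result["metric"]["instance"], len(result["values"]), "samples")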

The next step is writing a simple Streamlit application to collect the metrics and display them using a line chart:

import json
import logging
from datetime import datetime, timedelta
from json.decoder import JSONDecodeError
import os
import textwrap

import altair
import requests
import streamlit as st
from pandas import DataFrame, Series, Timestamp
from numpy import float64
from requests import HTTPError, RequestException

MINUTES_BACK = 60
DEFAULT_TIME_BACK = timedelta(minutes=-MINUTES_BACK)
DEFAULT_QUERY = 'node_memory_MemFree_bytes'
STEP_DURATION = "30s"


@st.cache_data
def full_url(url: str, has_time_range: bool = True) -> str:
    if has_time_range:
        return f"{url}/api/v1/query_range"  # Range query
    return f"{url}/api/v1/query"  # Instant query


def get_metrics(
        the_payload: dict[str, any],
        url: str,
        start_range: datetime = None,
        end_range: datetime = None
) -> (dict[any, any], int):
    new_query = {}
    new_query.update(the_payload)
    if start_range and end_range:
        new_query['start'] = start_range.timestamp()
        new_query['end'] = end_range.timestamp()
        new_query['step'] = STEP_DURATION
    logging.info("url=%s, params=%s", url, new_query)
    response = requests.get(url=url, params=new_query)
    return response.json(), response.status_code


def transform(m_data: dict[any, any]) -> DataFrame:
    """
    Convert a Prometheus data structure into a Pandas DataFrame
    :param m_data:
    :return: DataFrame
    """
    df = DataFrame({
        mtr['metric']['instance']: Series(
            data=[float64(vl[1]) for vl in mtr['values']],
            index=[Timestamp(vl[0], unit='s') for vl in mtr['values']],
            name="Free memory (bytes)"
        ) for mtr in m_data['data']['result']
    })
    logging.info(f"Columns: {df.columns}")
    logging.info(f"Index: {df.index}")
    logging.info(f"Data: {df}")
    return df


if __name__ == "__main__":
    st.title("Realtime Prometheus monitoring")
    data_load_state = st.text('No data loaded yet...')
    if 'PROMETHEUS_URL' not in os.environ:
        st.markdown("## Please define the following environment variable and restart this application (example below):")
        st.code(textwrap.dedent(f"""
        PROMETHEUS_URL="http://raspberrypi:9090/"
        export PROMETHEUS_URL
        streamlit run {__file__}
        """))
        st.markdown(
            "New to Prometheus? Please check the [official](https://prometheus.io/docs/prometheus/latest/querying/api/) documentation")
        data_load_state.error("No data was loaded.")
    else:
        code = 0
        metrics = {}
        try:
            PROM_URL = full_url(os.environ['PROMETHEUS_URL'], has_time_range=True)
            st.info(f"Using '{PROM_URL}'")
            query = DEFAULT_QUERY
            payload = {'query': query}
            # First query we bootstrap with a reasonable time range
            END: datetime = datetime.now()
            START = END + DEFAULT_TIME_BACK
            if payload:
                (graph, raw) = st.tabs(["Time Series", "Debugging"])
                metrics, code = get_metrics(
                    url=PROM_URL,
                    the_payload=payload,
                    start_range=START,
                    end_range=END,
                )
                data: DataFrame = DataFrame()
                if code == 200:
                    now = datetime.now()
                    data_load_state.info(f"Metrics data refreshed ({now}).")
                    logging.info(f"Metrics data refreshed ({now}).")
                    try:
                        data = transform(m_data=metrics)
                        with graph:
                            st.title("Time series")
                            # See auto-refresh dilemma: https://github.com/streamlit/streamlit/issues/168
                            if st.button('Click to refresh!'):
                                st.write("Refreshing")
                                st.experimental_rerun()
                            # st.line_chart(data=data)
                            # You can see all the possible type encodings here:
                            # https://altair-viz.github.io/altair-tutorial/notebooks/02-Simple-Charts.html
                            chart = (
                                altair.Chart(data.stack().rename("value").reset_index()).mark_line().encode(
                                    x="level_0:T",
                                    y="value:Q",
                                    color="level_1"
                                )
                            )
                            st.altair_chart(chart, use_container_width=True)
                    except ValueError as val:
                        st.exception(val)
                    with raw:
                        if not data.empty:
                            st.title("DataFrame for Free memory (bytes)")
                            st.dataframe(data)
                        st.title("Query:")
                        st.markdown(f"```{query}, start={START}, end={END}```")
                        st.title("Prometheus data:")
                        st.json(metrics)
                else:
                    st.warning(f"Hmm, invalid query?: {query}")
                    st.warning(json.dumps(metrics, indent=True))
        except (HTTPError, JSONDecodeError, RequestException, KeyError) as exp:
            st.error(f"There was a problem while running the query (HTTP_CODE={code})...")
            if isinstance(exp, KeyError):
                st.code(f"Metrics={json.dumps(metrics, indent=True)}...")
            st.exception(exp)

A few things to note:

  • The data source is a Prometheus scraper that returns a JSON document. It’s converted to a Pandas DataFrame, one of the most well-supported formats on Streamlit.
  • Next, it adds the graphical components: one tab to show the time-series data and another to display the data and queries in tabular format for debugging purposes.
  • It also adds a button to refresh the plot data manually. There are ways to auto-refresh the contents (one simple approach is sketched after this list).
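
For example, one crude way to auto-refresh (not part of the application above) is to pause and rerun the script from the bottom of the page. Streamlit 1.20 provides st.experimental_rerun() for this (newer releases call it st.rerun()). A minimal sketch, meant to be appended at the very end of the script:

import time

import streamlit as st

REFRESH_SECONDS = 30  # assumed interval; match it to your scrape_interval

st.caption(f"Auto-refreshing every {REFRESH_SECONDS} seconds...")
time.sleep(REFRESH_SECONDS)   # blocks this run only, after the page has rendered
st.experimental_rerun()       # triggers a fresh run, which re-queries Prometheus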

Before running the application, reload the Python virtual environment you created for Streamlit. For additional information, revisit Display your application data with Streamlit.
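
With the virtual environment active, point the application at your scraper through the PROMETHEUS_URL environment variable (it refuses to start without it) and launch it with streamlit run. The script name below is just a placeholder; use whatever you named the file from the listing above:

$ export PROMETHEUS_URL="http://raspberrypi:9090/"
$ streamlit run prometheus_monitor.py  # placeholder file name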

What does it look like? See it in this screenshot of the Prometheus time-series line chart:

Here is the debugging tab, where you can see the raw data, queries, and other useful stuff:

Next, I’ll compare Streamlit with other popular visualization tools to help you decide whether it’s a good fit for DevOps or systems administration work.

Compare Streamlit with other visualization tools

A brief comparison of the out-of-the-box capabilities of Grafana, InfluxDB, and Streamlit can help you decide which one to use. As usual, no single tool can do everything, and you will most likely need a mix of them.

For this analysis, the focus is on observability in general. Some features may be more or less appealing to you.

All three open source solutions support tables and time-series graphics. Specific features of each (some of which may be particularly interesting if your focus is on data science) include:

  • Grafana 9.4:
      • Embedded alerting: Yes, multiple destinations supported
      • Multiple data sources: Yes
      • Supported languages: Depends on the data source used (JSONPath, FluxQL); Grafana itself is written in Golang
      • Automatic data refresh: Yes
      • Easy to write a “live” paper? Supports Markdown but requires write permission on the dashboard
  • InfluxDB 2.6.1:
      • Embedded alerting: Yes
      • Multiple data sources: No
      • Supported languages: Flux, InfluxQL
      • Automatic data refresh: Yes
      • Easy to write a “live” paper? Supports Markdown but requires write permission on the dashboard
  • Streamlit 1.20.0:
      • Embedded alerting: No, requires external libraries and logic
      • Multiple data sources: Yes
      • Supported languages: Python
      • Automatic data refresh: No, DIY
      • Easy to write a “live” paper? Full freedom with Markdown, videos, intermixed

These three tools overlap in many areas, but there are key differences:

  1. They can all run in containers, which may simplify their deployment.
  2. Grafana is the most feature-rich. It needs to interface with an external data source, like Prometheus, InfluxDB, or a JSON REST service, to visualize data; it is not a database itself, but it has powerful alerting mechanisms (email, chat applications, and more).
  3. InfluxDB is a time-series database that also has dashboards and alerting capabilities, though more limited than Grafana.
  4. Streamlit is a framework written in Python for data science applications. It can interface with virtually any data source. Among these three, it’s the easiest to set up, but it is not a database and has no alerting capabilities.

So which one should you use? It depends. If you need quick prototyping and no alerting, Streamlit is a good fit; if you only need a database with simple visualization and alerting, then InfluxDB is for you; and for a more complete solution, although it’s more complex to set up, you can’t go wrong with Grafana.

Not everything is dashboards: A TCP port tester written with a Streamlit GUI

As a bonus, I wrote a simple TCP port scanner with Streamlit. You just need to define a YAML configuration file that tells the app which machines and ports to scan:

# Description of host and TCP ports to check
---
hosts:
  - name: raspberrypi.home
    ports:
      - 3000
      - 22
      - 9090
  - name: dmaf5.home
    ports:
      - 22
  - name: www.redhat.com
    ports:
      - 80
      - 443
  - name: github.com
    ports:
      - 80
      - 443

The script goes through every port and checks whether it’s open. Note that this app is not meant to be stealthy, and it’s not fast because it performs a sequential port scan (a possible concurrent variant is sketched after the listing):

#!/usr/bin/env python3
import logging
import textwrap
import socket
from typing import Any

from yaml import load
import streamlit as st

try:
    from yaml import CLoader as Loader
except ImportError:
    from yaml import Loader

PROGRESS_TEXT = "Scanning hosts. Please wait"


def check_tcp_port_xmas(dst_ip: str, dst_port: int) -> str:
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            result = sock.connect_ex((dst_ip, dst_port))
            if result == 0:
                return "Open"
            else:
                return "Closed"
    except (TypeError, PermissionError) as perm_err:
        logging.exception(perm_err)
        return "Error"


def load_config(stream) -> Any:
    return load(stream, Loader=Loader)


if __name__ == "__main__":
    st.title("TCP Port scanner")
    st.markdown(textwrap.dedent("""
    Simple TCP/IP port scanner.
    """))
    st.file_uploader(
        "Please provide the configuration file to load",
        accept_multiple_files=False,
        key="portscan_config"
    )
    if st.session_state['portscan_config']:
        yaml = load_config(st.session_state['portscan_config'])
        hosts_details = yaml['hosts']
        chunks = len(hosts_details)
        try:
            data_load_state = st.text('Preparing to scan...')
            with st.spinner(f"Total hosts to scan: {chunks}"):
                ip = None
                for host in hosts_details:
                    host_name = host['name'].strip()
                    ports = host['ports']
                    try:
                        ip = socket.gethostbyname(host_name)
                        for port in ports:
                            status = check_tcp_port_xmas(dst_ip=ip, dst_port=port)
                            data_load_state.text(f"Processing: {host_name}({ip}):{port}, status={status}")
                            if status == "Open":
                                st.info(f"{host_name}:{port}, status={status}")
                            elif status == "Closed":
                                st.warning(f"{host_name}:{port}, status={status}")
                            else:
                                st.error(f"{host_name}:{port}, status={status}")
                    except TypeError as os_err:
                        raise
            data_load_state.success(f"Finished scanning {chunks} hosts")
        except (KeyError, ValueError, OSError, TypeError) as err:
            st.error(hosts_details)
            st.exception(err)
    else:
        st.warning("Please load a PortTester configuration file to proceed")
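
If scan time ever becomes a problem, one option (not part of the script above) is to check each host’s ports concurrently with the standard library’s concurrent.futures module. Here is a minimal sketch of that idea; the helper names are mine, not the script’s:

import socket
from concurrent.futures import ThreadPoolExecutor


def check_tcp_port(ip: str, port: int, timeout: float = 3.0) -> str:
    """Same connect_ex() check the script uses, with a timeout added."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return "Open" if sock.connect_ex((ip, port)) == 0 else "Closed"


def scan_host(ip: str, ports: list[int], max_workers: int = 10) -> dict[int, str]:
    """Check all ports of a single host in parallel; returns {port: status}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(ports, pool.map(lambda p: check_tcp_port(ip, p), ports)))


# Example: scan_host(socket.gethostbyname("raspberrypi.home"), [22, 3000, 9090])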

This Python script requires some additional dependencies as listed in requirements.txt:

streamlit==1.20.0
pyyaml==6.0
pandas

Install the dependencies using pip:

$ pip3 install -r requirements.txt

Now run the script:

$ streamlit run porttester/port_tester.py

You will see something similar to this:

Wrap up

Streamlit is well tuned for data-science applications, but you can also use it for infrastructure monitoring and to create live reports that let you check the health of your sites quickly, with very little coding.

Originally published at https://www.redhat.com on April 3, 2023.
