The Stack Explained
- Import dashboard ID 1860 from grafana.com instead of building your own — you'll learn PromQL faster by reading working queries than by writing them from zero.
- Set up Alertmanager on day one, not "later." If nobody gets notified, you're just making pretty graphs.
- Prometheus - Pulls metrics from your services on a schedule and stores them as time-series data.
- Grafana - The dashboard layer. Connects to Prometheus and turns numbers into graphs.
- Exporters - Small agents you run on each host. They translate system stats into a format Prometheus understands.
- Alertmanager - Receives firing alerts from Prometheus and routes them to email, Slack, Telegram, whatever you use.
Docker Compose Setup
version: '3'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme

  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      # point the filesystem collector at the host root mounted above
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  prometheus_data:
  grafana_data:
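Since cadvisor comes up again later for container-level stats, here's the optional service block I'd drop in next to the others. The image and mounts follow cAdvisor's own Docker instructions, so double-check them against the release you actually pull:

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

If you add it, give it a matching scrape job in prometheus.yml pointing at cadvisor:8080.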
Prometheus Configuration
This goes in prometheus.yml in the same directory as your compose file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']

  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']
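One caveat on the docker job: it only works if the Docker daemon's built-in metrics endpoint is enabled, which it isn't by default. The usual approach is a line in /etc/docker/daemon.json followed by a daemon restart (depending on your Docker version you may also need "experimental": true); if you don't care about daemon metrics, just delete that job:

{
  "metrics-addr": "0.0.0.0:9323"
}

On Linux, host.docker.internal isn't defined automatically either, so you may need extra_hosts: ["host.docker.internal:host-gateway"] on the prometheus service in the compose file.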
Start Everything
docker compose up -d
Give it 30 seconds to pull images if it's your first run. Then check:
- Prometheus UI: http://localhost:9090
- Grafana: http://localhost:3000
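If you'd rather check from the terminal, both services have simple health endpoints. A quick sanity check, assuming the default ports from the compose file above:

# every service should show as running
docker compose ps

# Prometheus readiness endpoint
curl http://localhost:9090/-/ready

# Grafana health endpoint
curl http://localhost:3000/api/health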
Add Prometheus to Grafana
- Log into Grafana with admin/changeme
- Go to Configuration → Data Sources → Add data source (in newer Grafana versions this lives under Connections → Data sources)
- Pick Prometheus from the list
- For the URL, enter http://prometheus:9090 (not localhost — containers talk to each other by service name)
- Hit Save & Test. You should see a green checkmark.
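If you don't want to repeat that click-through every time you rebuild the container, Grafana can also load the data source from a provisioning file. A minimal sketch, assuming you mount it into the grafana container at /etc/grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true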
Import Dashboards
Building dashboards from scratch is a waste of time when you're starting out. Hundreds of good ones already exist on grafana.com. Here's how to grab one:
- Dashboards → Import
- Type in a dashboard ID (1860 is the classic Node Exporter Full)
- Point it at your Prometheus data source
- Import. Done.
Dashboard IDs I actually use:
- 1860 - Node Exporter Full
- 893 - Docker and System Monitoring
- 13946 - Container metrics
PromQL Basics
PromQL is Prometheus's query language. It's weird at first — the syntax doesn't look like SQL or anything else you've used. But these four queries cover 90% of what a homelab needs:
# Current CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk space used
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Request rate
rate(http_requests_total[5m])
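One more that's worth memorizing: Prometheus records an up metric (1 or 0) for every scrape target, so you can spot dead exporters with no extra setup:

# Targets that failed their last scrape
up == 0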
Alerting
Alerts are the entire point. Without them you're just looking at graphs after something already broke. Create an alert_rules.yml file:
# alert_rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPU
        # avg by (instance) keeps the instance label so the summary below can use it
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
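Prometheus won't load this file on its own. Mount it into the container and reference it from prometheus.yml, something along these lines (the container path is just my choice):

# docker-compose.yml, under the prometheus service's volumes:
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml

# prometheus.yml
rule_files:
  - /etc/prometheus/alert_rules.yml

Then restart the container (docker compose restart prometheus) so it picks the rules up. The Alerts page in the Prometheus UI shows whether they loaded.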
Alertmanager Setup
Alertmanager is a separate container. Add this block under services: in your docker-compose.yml:
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
Then tell it where to send notifications. This goes in alertmanager.yml:
route:
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: 'user'
        auth_password: 'password'
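Finally, Prometheus has to be told where Alertmanager lives, or firing alerts never leave Prometheus. Add this to prometheus.yml (the target assumes the service name and port from the compose block above):

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']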
Common Exporters
- node_exporter - The big one. CPU, memory, disk, network. Install this first on every host.
- cadvisor - Container-level stats. Pairs well with node_exporter if you run Docker.
- blackbox_exporter - Pings URLs, checks DNS, tests TCP ports. Good for "is my site up?" checks (scrape config sketch after this list).
- postgres_exporter - Pulls query stats and connection counts from PostgreSQL.
- mysqld_exporter - Same idea but for MySQL/MariaDB.
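The blackbox_exporter scrape config is less obvious than the others because Prometheus scrapes the exporter, which then probes the target for you, so the URL has to be passed along via relabeling. A sketch to add under scrape_configs, assuming a blackbox_exporter container reachable at blackbox_exporter:9115 with the stock http_2xx module:

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com          # the URL you actually care about
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the URL as ?target=...
      - source_labels: [__param_target]
        target_label: instance           # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox_exporter:9115   # scrape the exporter itself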
Retention and Storage
By default Prometheus only keeps 15 days of data. That's not enough for spotting long-term trends. Add these flags to the command: list of the prometheus service (the compose file above already sets the first one; when both are set, whichever limit is hit first wins):
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=10GB
If you need months or years of history, look into Thanos or Cortex. For a homelab, 30-90 days in plain Prometheus is usually fine.
Best Practices
- node_exporter first, everything else second. It gives you CPU, RAM, disk, and network in one binary.
- Dashboard 1860 on grafana.com. Import it. Tweak it later. Don't start from a blank canvas.
- If you don't configure alerts, you're just collecting data nobody looks at. Set up Alertmanager the same day you set up Prometheus.
- Bump retention to 30d minimum. The default 15 days disappears fast when you're trying to figure out why your server was slow last week.
My Setup
- Prometheus — 8 hosts, 15s scrape interval
- Grafana — dashboard 1860 + a custom one for Docker containers
- Alertmanager — routes to Telegram bot
- node_exporter on every host, cadvisor on the Docker boxes
- Resource usage: ~800MB RAM, ~2GB disk per month of retention
What I'd change next time
- Alertmanager from the start. I ran Prometheus for two weeks before adding alerts. During those two weeks I still missed a disk filling up. The dashboards existed. I just didn't look at them. Should have set up Telegram notifications on day one.
- Blackbox exporter sooner. I was monitoring system metrics but not whether my actual services were responding. Adding HTTP endpoint checks with blackbox_exporter caught a hung Nextcloud process that looked healthy in every other metric.
- Separate Grafana credentials. I left the default admin/changeme password for a month. Not great. Should have changed it immediately or set up OAuth through my reverse proxy.