Node Exporter Fusion
Telegen includes a drop-in replacement for Prometheus node_exporter, providing full compatibility with existing dashboards and alerts.
Overview
Node Exporter Fusion provides:
- 120+ system metrics - Full node_exporter compatibility
- node_* namespace - Works with existing dashboards
- Zero configuration - Automatically enabled
- eBPF enhanced - Additional metrics via eBPF
Compatibility
Telegen replaces node_exporter while maintaining full compatibility:
| Feature | node_exporter | Telegen |
|---|---|---|
| Metric namespace | node_* | node_* ✅ |
| Grafana dashboards | ✅ | ✅ |
| Alert rules | ✅ | ✅ |
| Prometheus scraping | /metrics | /metrics ✅ |
| Collectors | 50+ | 50+ ✅ |
Collectors
P0 Collectors (Always Enabled)
| Collector | Metrics | Description |
|---|---|---|
| loadavg | 3 | node_load1, node_load5, node_load15 |
| cpu | 15+ per core | CPU time per mode, frequency, info |
| meminfo | 50+ | Memory statistics from /proc/meminfo |
| diskstats | 17+ per device | Disk I/O statistics |
| filesystem | 8 per mount | Filesystem space and inodes |
| netdev | 25+ per interface | Network device statistics |
| stat | 16 | Boot time, context switches, interrupts |
Sample Metrics
# Load averages
node_load1
node_load5
node_load15
# CPU usage per mode
node_cpu_seconds_total{mode="user"}
node_cpu_seconds_total{mode="system"}
node_cpu_seconds_total{mode="idle"}
node_cpu_seconds_total{mode="iowait"}
# Memory
node_memory_MemTotal_bytes
node_memory_MemFree_bytes
node_memory_MemAvailable_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
node_memory_SwapTotal_bytes
node_memory_SwapFree_bytes
# Disk I/O
node_disk_read_bytes_total
node_disk_written_bytes_total
node_disk_reads_completed_total
node_disk_writes_completed_total
node_disk_io_time_seconds_total
node_disk_read_time_seconds_total
node_disk_write_time_seconds_total
# Filesystem
node_filesystem_size_bytes
node_filesystem_free_bytes
node_filesystem_avail_bytes
node_filesystem_files
node_filesystem_files_free
# Network
node_network_receive_bytes_total
node_network_transmit_bytes_total
node_network_receive_packets_total
node_network_transmit_packets_total
node_network_receive_errs_total
node_network_transmit_errs_total
node_network_receive_drop_total
node_network_transmit_drop_total
# System
node_boot_time_seconds
node_context_switches_total
node_forks_total
node_intr_total
node_procs_running
node_procs_blocked
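To spot-check a few of these series by hand, the Prometheus text exposition format can be parsed with a few lines of Python. This is a minimal sketch, not part of Telegen: the sample payload is made up, and the parser ignores escaped characters inside label values.

```python
import re

# Matches: metric name, optional {labels}, value. Escaped quotes or braces
# inside label values are not handled.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output, for illustration only.
EXAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.25
"""

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels, value)
```

For anything beyond a quick check, a maintained parser such as the one in the official Prometheus client libraries is the safer choice.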
Configuration
Enable/Disable Collectors
agent:
  nodeexporter:
    enabled: true
    # Listen address for /metrics endpoint
    listen_address: ":9100"
    # Metric namespace (default: node)
    namespace: "node"
    # Collectors to enable
    collectors:
      loadavg: true
      cpu: true
      meminfo: true
      diskstats: true
      filesystem: true
      netdev: true
      stat: true
      # P1 collectors
      netstat: true
      sockstat: true
      vmstat: true
      # P2 collectors
      hwmon: false     # Hardware monitoring
      thermal: false   # Thermal zones
      pressure: true   # PSI metrics
Device Filtering
Filter which devices to collect metrics from:
agent:
  nodeexporter:
    filesystem:
      # Ignore these filesystem types
      ignored_fs_types:
        - autofs
        - binfmt_misc
        - cgroup
        - configfs
        - debugfs
        - devpts
        - devtmpfs
        - fusectl
        - hugetlbfs
        - mqueue
        - nsfs
        - overlay
        - proc
        - procfs
        - pstore
        - securityfs
        - sysfs
        - tmpfs
        - tracefs
      # Ignore these mount points
      ignored_mount_points:
        - "^/(dev|proc|sys|var/lib/docker/.+)($|/)"
    diskstats:
      # Only these devices
      device_include:
        - "^sd[a-z]+$"
        - "^nvme[0-9]+n[0-9]+$"
      # Ignore these devices
      device_exclude:
        - "^loop[0-9]+$"
        - "^ram[0-9]+$"
    netdev:
      # Ignore these interfaces
      device_exclude:
        - "^veth.*"
        - "^docker.*"
        - "^br-.*"
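The include and exclude patterns above are regular expressions, so it helps to test them against real device names before deploying. The sketch below assumes that an exclude match always wins and that an empty include list admits every device. Those precedence rules are an assumption, not documented Telegen behavior, and `keep_device` is a hypothetical helper.

```python
import re


def keep_device(name, include=(), exclude=()):
    """Assumed semantics: exclude wins; an empty include list admits everything."""
    if any(re.search(pattern, name) for pattern in exclude):
        return False
    if include:
        return any(re.search(pattern, name) for pattern in include)
    return True


# Patterns copied from the diskstats example above.
DISK_INCLUDE = [r"^sd[a-z]+$", r"^nvme[0-9]+n[0-9]+$"]
DISK_EXCLUDE = [r"^loop[0-9]+$", r"^ram[0-9]+$"]

for dev in ["sda", "sda1", "nvme0n1", "nvme0n1p1", "loop0", "dm-0"]:
    print(f"{dev:10s} kept={keep_device(dev, DISK_INCLUDE, DISK_EXCLUDE)}")
```

Note that the include patterns are anchored at both ends, so partitions such as sda1 or nvme0n1p1 do not match them. Only whole disks would be reported under these assumptions.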
TLS/mTLS Configuration
Secure the metrics endpoint with TLS and optional mutual TLS (mTLS):
agent:
  nodeexporter:
    enabled: true
    listen_address: ":9100"
    # TLS configuration
    tls:
      enabled: true
      # Server certificate and key
      cert_file: "/etc/telegen/certs/server.crt"
      key_file: "/etc/telegen/certs/server.key"
      # Enable mTLS (client certificate verification)
      client_auth: true
      # CA certificate for verifying client certs
      client_ca_file: "/etc/telegen/certs/ca.crt"
| Option | Description | Default |
|---|---|---|
| tls.enabled | Enable TLS for metrics endpoint | false |
| tls.cert_file | Path to server certificate | - |
| tls.key_file | Path to server private key | - |
| tls.client_auth | Require client certificates (mTLS) | false |
| tls.client_ca_file | CA for verifying client certificates | - |
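Once mTLS is on, a quick way to confirm the endpoint still answers is to scrape it with a client certificate. The sketch below uses only the Python standard library. The client certificate paths are hypothetical, and it assumes the same CA signed the server certificate, which may not hold in your setup.

```python
import ssl
import urllib.request


def build_client_context(ca_file=None, cert_file=None, key_file=None):
    """Build an SSL context that verifies the server and optionally presents a client cert."""
    # With ca_file=None the system trust store is used instead.
    context = ssl.create_default_context(cafile=ca_file)
    if cert_file:
        # Needed only when the server sets client_auth: true (mTLS).
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def scrape(url, context, timeout=5.0):
    """Fetch the metrics page over HTTPS and return it as text."""
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (client.crt and client.key are hypothetical paths):
# ctx = build_client_context(
#     ca_file="/etc/telegen/certs/ca.crt",
#     cert_file="/etc/telegen/certs/client.crt",
#     key_file="/etc/telegen/certs/client.key",
# )
# print(scrape("https://localhost:9100/metrics", ctx)[:200])
```

Hostname verification stays enabled in this sketch, so the name in the URL has to appear in the server certificate.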
Metric Cardinality Controls
Control metric cardinality to prevent explosion from high-cardinality labels:
agent:
  nodeexporter:
    enabled: true
    # Cardinality controls
    cardinality:
      enabled: true
      # Maximum number of metric families
      max_metrics: 1000
      # Include only these metrics (regex patterns)
      include_metrics:
        - "node_cpu_.*"
        - "node_memory_.*"
        - "node_disk_.*"
        - "node_filesystem_.*"
        - "node_network_.*"
        - "node_load.*"
      # Exclude these metrics (regex patterns)
      exclude_metrics:
        - "node_scrape_.*"
        - "go_.*"
      # Drop these labels from all metrics
      drop_labels:
        - "id"
        - "name"
| Option | Description | Default |
|---|---|---|
| cardinality.enabled | Enable cardinality filtering | false |
| cardinality.max_metrics | Maximum metric families (0 = unlimited) | 0 |
| cardinality.include_metrics | Regex patterns to include | [] (all) |
| cardinality.exclude_metrics | Regex patterns to exclude | [] |
| cardinality.drop_labels | Labels to remove from all metrics | [] |
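To preview what a given set of cardinality rules would keep, the filtering can be modeled offline. The sketch below assumes include patterns are applied first, then exclude patterns, and that each pattern must match the whole metric name. None of this is confirmed Telegen behavior, and `filter_series` is a hypothetical helper.

```python
import re


def filter_series(series, include=(), exclude=(), drop_labels=()):
    """Apply include/exclude name patterns, then strip unwanted labels.

    `series` is an iterable of (name, labels) pairs. Matching against the
    whole metric name is an assumption.
    """
    kept = []
    for name, labels in series:
        if include and not any(re.fullmatch(p, name) for p in include):
            continue
        if any(re.fullmatch(p, name) for p in exclude):
            continue
        trimmed = {k: v for k, v in labels.items() if k not in drop_labels}
        kept.append((name, trimmed))
    return kept


# Made-up input series, for illustration only.
SERIES = [
    ("node_cpu_seconds_total", {"cpu": "0", "mode": "idle"}),
    ("node_load1", {}),
    ("go_goroutines", {}),
    ("node_scrape_collector_success", {"collector": "cpu"}),
    ("node_filesystem_size_bytes", {"mountpoint": "/", "name": "root"}),
]

result = filter_series(
    SERIES,
    include=[r"node_cpu_.*", r"node_filesystem_.*", r"node_load.*"],
    exclude=[r"node_scrape_.*", r"go_.*"],
    drop_labels=["id", "name"],
)
for name, labels in result:
    print(name, labels)
```

Dropping a label that distinguishes two series leaves them with identical label sets, so it is worth checking how such collisions are handled before dropping labels broadly.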
API Endpoints
The node exporter provides several HTTP endpoints:
| Endpoint | Description |
|---|---|
| /metrics | Prometheus metrics endpoint |
| /metrics/description | JSON documentation of all metrics with OTEL mappings |
| /health | Health check (JSON) |
| /ready | Readiness probe |
| /live | Liveness probe |
Metric Descriptions Endpoint
The /metrics/description endpoint returns JSON documentation for all available metrics, including OTEL semantic convention mappings:
curl http://localhost:9100/metrics/description
Response:
{
  "categories": [
    {
      "category": "CPU",
      "count": 4,
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "description": "Seconds the CPUs spent in each mode",
          "unit": "s",
          "type": "counter",
          "labels": {"cpu": "cpu", "mode": "cpu.mode"},
          "has_otel_mapping": true
        }
      ]
    }
  ],
  "total": 35,
  "otel_info": {
    "version": "v1.38.0",
    "mapped_count": 35,
    "total_metrics": 35,
    "coverage": "100.0%"
  }
}
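A response of this shape is easy to turn into a lookup table from Prometheus names to OTEL names. The sketch below relies only on the fields visible in the sample response, and `otel_mappings` is a hypothetical helper, not a Telegen API.

```python
import json


def otel_mappings(document):
    """Map each Prometheus metric name to its OTEL name, skipping unmapped metrics."""
    mappings = {}
    for category in document.get("categories", []):
        for metric in category.get("metrics", []):
            if metric.get("has_otel_mapping"):
                mappings[metric["name"]] = metric["otel_name"]
    return mappings


# Trimmed copy of the sample response, kept inline so the example runs offline.
RESPONSE = json.loads("""
{
  "categories": [
    {
      "category": "CPU",
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "has_otel_mapping": true
        }
      ]
    }
  ]
}
""")

print(otel_mappings(RESPONSE))
```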
Prometheus Integration
Scrape Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'host1:9100'
          - 'host2:9100'
Service Discovery (Kubernetes)
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):10250'
        replacement: '${1}:9100'
        target_label: __address__
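The relabel rule rewrites the kubelet port (10250) discovered for each node to the exporter port (9100). The sketch below mimics that rewrite in Python to show which addresses are affected. `relabel_address` is a hypothetical helper, and Python writes the capture group as `\1` where Prometheus uses `${1}`.

```python
import re


def relabel_address(address, regex=r"(.+):10250", replacement=r"\1:9100"):
    """Mimic a Prometheus `replace` relabel action on __address__.

    Prometheus anchors relabel regexes at both ends, hence fullmatch.
    Addresses that do not match are returned unchanged.
    """
    match = re.fullmatch(regex, address)
    if match is None:
        return address
    return match.expand(replacement)


print(relabel_address("10.0.0.5:10250"))  # kubelet address, port rewritten
print(relabel_address("node-a:443"))      # no match, left as-is
```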
Grafana Dashboards
Telegen is compatible with standard node_exporter dashboards:
Recommended Dashboards
| Dashboard | Grafana ID | Description |
|---|---|---|
| Node Exporter Full | 1860 | Comprehensive system metrics |
| Node Exporter for Prometheus | 11074 | Clean, modern layout |
| Linux Server Metrics | 180 | Classic dashboard |
Import Dashboard
- Go to Grafana → Dashboards → Import
- Enter dashboard ID (e.g., 1860)
- Select Prometheus data source
- Dashboard works immediately with Telegen
eBPF-Enhanced Metrics
Telegen adds eBPF-based metrics beyond standard node_exporter:
Additional Metrics
| Metric | Description |
|---|---|
| node_tcp_rtt_microseconds | TCP round-trip time |
| node_tcp_retransmits_total | TCP retransmissions |
| node_process_open_fds | Open file descriptors per process |
| node_cgroup_cpu_usage_seconds_total | Per-cgroup CPU usage |
| node_cgroup_memory_usage_bytes | Per-cgroup memory usage |
Enable eBPF Enhancements
agent:
  nodeexporter:
    enabled: true
    # Enable eBPF-enhanced metrics
    ebpf_enhanced: true
Migration from node_exporter
Step 1: Deploy Telegen
Deploy Telegen alongside node_exporter:
# Telegen on different port initially
agent:
  nodeexporter:
    listen_address: ":9101" # Different port
Step 2: Compare Metrics
Verify metric compatibility:
# Compare CPU metrics (Prometheus has no port label; the port is part of the instance label)
node_cpu_seconds_total{instance=~".+:9100"} # node_exporter
node_cpu_seconds_total{instance=~".+:9101"} # Telegen
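Beyond spot-checking individual series, it can be useful to diff the full set of metric names exposed by the two endpoints. The sketch below works on the raw text of each /metrics response. The payloads are made up, and the name extraction is deliberately naive.

```python
def metric_names(exposition_text):
    """Collect the distinct metric names in a text-format scrape."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first "{" (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def compare(reference_text, candidate_text):
    """Return (missing, extra) metric names of the candidate relative to the reference."""
    reference = metric_names(reference_text)
    candidate = metric_names(candidate_text)
    return sorted(reference - candidate), sorted(candidate - reference)


# Made-up payloads standing in for the two /metrics responses.
NODE_EXPORTER = 'node_load1 0.5\nnode_cpu_seconds_total{cpu="0",mode="idle"} 10\n'
TELEGEN = (
    'node_load1 0.5\n'
    'node_cpu_seconds_total{cpu="0",mode="idle"} 10\n'
    'node_tcp_retransmits_total 3\n'
)

missing, extra = compare(NODE_EXPORTER, TELEGEN)
print("missing from Telegen:", missing)
print("only in Telegen:", extra)
```

An empty "missing" list suggests existing dashboards and alerts will find every series they reference. Label sets would need a separate check.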
Step 3: Switch Scrape Targets
Update Prometheus to scrape Telegen:
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['host:9100'] # Now points to Telegen
Step 4: Remove node_exporter
Once verified, remove node_exporter.
Textfile Collector
Import custom metrics from files:
agent:
  nodeexporter:
    textfile:
      enabled: true
      directory: "/var/lib/node_exporter/textfile_collector"
Create Custom Metrics
# /var/lib/node_exporter/textfile_collector/custom.prom
# HELP node_custom_metric A custom metric
# TYPE node_custom_metric gauge
node_custom_metric{label="value"} 42
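Scripts that generate these files usually write to a temporary file and rename it into place, so a scrape never sees a half-written file. The sketch below shows that pattern. `write_prom_file` is a hypothetical helper, and a temporary directory stands in for the configured textfile directory.

```python
import os
import tempfile


def write_prom_file(directory, filename, lines):
    """Write metrics to a temp file, then rename it into place.

    The rename is atomic on POSIX filesystems when both paths are in the
    same directory, so readers see either the old file or the new one.
    """
    target = os.path.join(directory, filename)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        # mkstemp creates files with mode 0600; relax it so an agent
        # running as another user can read the result.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return target


METRIC_LINES = [
    "# HELP node_custom_metric A custom metric",
    "# TYPE node_custom_metric gauge",
    'node_custom_metric{label="value"} 42',
]

with tempfile.TemporaryDirectory() as demo_dir:
    path = write_prom_file(demo_dir, "custom.prom", METRIC_LINES)
    with open(path) as handle:
        print(handle.read(), end="")
```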
Performance
Resource Usage
| Metric | Value |
|---|---|
| CPU overhead | < 0.5% |
| Memory | ~20MB |
| Scrape time | < 100ms |
Optimization
For large systems (many disks, interfaces):
agent:
  nodeexporter:
    # Increase scrape timeout
    timeout: 10s
    # Reduce collection frequency
    collector_interval: 30s
    # Limit concurrent collectors
    max_procs: 2
Common Queries
CPU Usage
# CPU utilization percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Per-CPU utilization
1 - avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle"}[5m]))
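The idle-based expressions work because node_cpu_seconds_total{mode="idle"} counts the seconds a core spent idle, and irate() takes the per-second increase between the last two samples in the range. The arithmetic for a single core can be sketched as follows, with made-up counter readings.

```python
def cpu_utilization_percent(idle_start, idle_end, elapsed_seconds):
    """Utilization of one CPU from two readings of its idle-seconds counter."""
    idle_fraction = (idle_end - idle_start) / elapsed_seconds
    return 100.0 * (1.0 - idle_fraction)


# Made-up readings taken 15 s apart: the core idled for 7.5 of those seconds.
print(cpu_utilization_percent(1000.0, 1007.5, 15.0))
```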
Memory Usage
# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Used memory in bytes (total minus free, buffers, and cache)
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes
Disk I/O
# Disk read/write rate
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Disk I/O utilization
rate(node_disk_io_time_seconds_total[5m]) * 100
Network
# Network throughput in bits per second (bytes * 8)
rate(node_network_receive_bytes_total[5m]) * 8
rate(node_network_transmit_bytes_total[5m]) * 8
# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
Filesystem
# Filesystem usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
# Inode usage
(1 - node_filesystem_files_free / node_filesystem_files) * 100
Alerting Examples
groups:
  - name: node
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is running out of memory"
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} disk space is below 10%"
      - alert: HostHighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"
Next Steps
- Auto Discovery - Automatic service detection
- Distributed Tracing - Application tracing
- Agent Mode - Full configuration