Node Exporter Fusion

Telegen includes a drop-in replacement for Prometheus node_exporter, providing full compatibility with existing dashboards and alerts.

Overview

Node Exporter Fusion provides:

  • 120+ system metrics - Full node_exporter compatibility
  • node_* namespace - Works with existing dashboards
  • Zero configuration - Automatically enabled
  • eBPF enhanced - Additional metrics via eBPF

Compatibility

Telegen replaces node_exporter while maintaining full compatibility:

Feature              node_exporter   Telegen
Metric namespace     node_*          node_*
Grafana dashboards   ✅              ✅
Alert rules          ✅              ✅
Prometheus scraping  /metrics        /metrics
Collectors           50+             50+

Collectors

P0 Collectors (Always Enabled)

Collector    Metrics            Description
loadavg      3                  node_load1, node_load5, node_load15
cpu          15+ per core       CPU time per mode, frequency, info
meminfo      50+                Memory statistics from /proc/meminfo
diskstats    17+ per device     Disk I/O statistics
filesystem   8 per mount        Filesystem space and inodes
netdev       25+ per interface  Network device statistics
stat         16                 Boot time, context switches, interrupts

Sample Metrics

# Load averages
node_load1
node_load5
node_load15

# CPU usage per mode
node_cpu_seconds_total{mode="user"}
node_cpu_seconds_total{mode="system"}
node_cpu_seconds_total{mode="idle"}
node_cpu_seconds_total{mode="iowait"}

# Memory
node_memory_MemTotal_bytes
node_memory_MemFree_bytes
node_memory_MemAvailable_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
node_memory_SwapTotal_bytes
node_memory_SwapFree_bytes

# Disk I/O
node_disk_read_bytes_total
node_disk_written_bytes_total
node_disk_reads_completed_total
node_disk_writes_completed_total
node_disk_io_time_seconds_total
node_disk_read_time_seconds_total
node_disk_write_time_seconds_total

# Filesystem
node_filesystem_size_bytes
node_filesystem_free_bytes
node_filesystem_avail_bytes
node_filesystem_files
node_filesystem_files_free

# Network
node_network_receive_bytes_total
node_network_transmit_bytes_total
node_network_receive_packets_total
node_network_transmit_packets_total
node_network_receive_errs_total
node_network_transmit_errs_total
node_network_receive_drop_total
node_network_transmit_drop_total

# System
node_boot_time_seconds
node_context_switches_total
node_forks_total
node_intr_total
node_procs_running
node_procs_blocked
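
To spot-check these series outside Prometheus, the text returned by /metrics can be parsed directly. The following is a minimal sketch, not part of Telegen: `parse_samples` is a hypothetical helper, the sample values are made up for illustration, and the parser deliberately ignores timestamps and label values that contain commas or escaped quotes.

```python
def parse_samples(text):
    """Parse simple Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, rest = series.partition("{")
        labels = {}
        if rest:
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Illustrative input only; real values come from the /metrics endpoint.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_memory_MemTotal_bytes 1.6e+10
"""

for name, labels, value in parse_samples(SAMPLE):
    print(name, labels, value)
```

For production use, a maintained parser such as the one in the official Prometheus Python client is a safer choice than this hand-rolled version.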

Configuration

Enable/Disable Collectors

agent:
  nodeexporter:
    enabled: true
    
    # Listen address for /metrics endpoint
    listen_address: ":9100"
    
    # Metric namespace (default: node)
    namespace: "node"
    
    # Collectors to enable
    collectors:
      loadavg: true
      cpu: true
      meminfo: true
      diskstats: true
      filesystem: true
      netdev: true
      stat: true
      
      # P1 collectors
      netstat: true
      sockstat: true
      vmstat: true
      
      # P2 collectors
      hwmon: false      # Hardware monitoring
      thermal: false    # Thermal zones
      pressure: true    # PSI metrics

Device Filtering

Filter which devices to collect metrics from:

agent:
  nodeexporter:
    filesystem:
      # Ignore these filesystem types
      ignored_fs_types:
        - autofs
        - binfmt_misc
        - cgroup
        - configfs
        - debugfs
        - devpts
        - devtmpfs
        - fusectl
        - hugetlbfs
        - mqueue
        - nsfs
        - overlay
        - proc
        - procfs
        - pstore
        - securityfs
        - sysfs
        - tmpfs
        - tracefs
      
      # Ignore these mount points
      ignored_mount_points:
        - "^/(dev|proc|sys|var/lib/docker/.+)($|/)"
    
    diskstats:
      # Only these devices
      device_include:
        - "^sd[a-z]+$"
        - "^nvme[0-9]+n[0-9]+$"
      
      # Ignore these devices
      device_exclude:
        - "^loop[0-9]+$"
        - "^ram[0-9]+$"
    
    netdev:
      # Ignore these interfaces
      device_exclude:
        - "^veth.*"
        - "^docker.*"
        - "^br-.*"
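
Before rolling out a filter, it can help to test the patterns against the device names present on a host. The sketch below assumes that an exclude match always wins and that, when include patterns are set, a device must match at least one of them. Telegen's actual precedence is not stated here, so treat `device_allowed` as a hypothetical model rather than the real implementation.

```python
import re


def device_allowed(name, include=(), exclude=()):
    """Return True if a device name passes the (assumed) filter semantics."""
    if any(re.search(pattern, name) for pattern in exclude):
        return False  # assumption: exclude takes priority
    if include:
        return any(re.search(pattern, name) for pattern in include)
    return True  # no include list means everything not excluded is kept


# Patterns copied from the configuration above.
DISK_INCLUDE = [r"^sd[a-z]+$", r"^nvme[0-9]+n[0-9]+$"]
DISK_EXCLUDE = [r"^loop[0-9]+$", r"^ram[0-9]+$"]
NET_EXCLUDE = [r"^veth.*", r"^docker.*", r"^br-.*"]

for disk in ["sda", "nvme0n1", "nvme0n1p1", "loop0"]:
    print(disk, device_allowed(disk, DISK_INCLUDE, DISK_EXCLUDE))

for iface in ["eth0", "veth1a2b", "docker0", "br-abc123"]:
    print(iface, device_allowed(iface, exclude=NET_EXCLUDE))
```

Note that the anchored disk patterns match whole devices only, so a partition such as nvme0n1p1 is filtered out.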

TLS/mTLS Configuration

Secure the metrics endpoint with TLS and optional mutual TLS (mTLS):

agent:
  nodeexporter:
    enabled: true
    listen_address: ":9100"
    
    # TLS configuration
    tls:
      enabled: true
      
      # Server certificate and key
      cert_file: "/etc/telegen/certs/server.crt"
      key_file: "/etc/telegen/certs/server.key"
      
      # Enable mTLS (client certificate verification)
      client_auth: true
      
      # CA certificate for verifying client certs
      client_ca_file: "/etc/telegen/certs/ca.crt"

Option              Description                             Default
tls.enabled         Enable TLS for metrics endpoint         false
tls.cert_file       Path to server certificate              -
tls.key_file        Path to server private key              -
tls.client_auth     Require client certificates (mTLS)      false
tls.client_ca_file  CA for verifying client certificates    -

Metric Cardinality Controls

Control metric cardinality to prevent explosion from high-cardinality labels:

agent:
  nodeexporter:
    enabled: true
    
    # Cardinality controls
    cardinality:
      enabled: true
      
      # Maximum number of metric families
      max_metrics: 1000
      
      # Include only these metrics (regex patterns)
      include_metrics:
        - "node_cpu_.*"
        - "node_memory_.*"
        - "node_disk_.*"
        - "node_filesystem_.*"
        - "node_network_.*"
        - "node_load.*"
      
      # Exclude these metrics (regex patterns)
      exclude_metrics:
        - "node_scrape_.*"
        - "go_.*"
      
      # Drop these labels from all metrics
      drop_labels:
        - "id"
        - "name"

Option                       Description                              Default
cardinality.enabled          Enable cardinality filtering             false
cardinality.max_metrics      Maximum metric families (0 = unlimited)  0
cardinality.include_metrics  Regex patterns to include                [] (all)
cardinality.exclude_metrics  Regex patterns to exclude                []
cardinality.drop_labels      Labels to remove from all metrics        []
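
The combined effect of the include, exclude, and drop_labels options can be modelled in a few lines. This is a sketch under stated assumptions: patterns are treated as fully anchored (as Prometheus does for its own regexes), exclude is applied after include, and max_metrics is left out. `filter_series` is a hypothetical name, not a Telegen function.

```python
import re


def filter_series(name, labels, include=(), exclude=(), drop_labels=()):
    """Return (name, labels) after filtering, or None if the metric is dropped."""
    if include and not any(re.fullmatch(p, name) for p in include):
        return None  # an include list is set and nothing matched
    if any(re.fullmatch(p, name) for p in exclude):
        return None  # assumption: exclude is checked after include
    kept = {k: v for k, v in labels.items() if k not in drop_labels}
    return name, kept


# Patterns copied from the configuration above.
INCLUDE = [r"node_cpu_.*", r"node_memory_.*", r"node_disk_.*",
           r"node_filesystem_.*", r"node_network_.*", r"node_load.*"]
EXCLUDE = [r"node_scrape_.*", r"go_.*"]
DROP = ["id", "name"]

print(filter_series("node_cpu_seconds_total", {"cpu": "0", "id": "x"},
                    INCLUDE, EXCLUDE, DROP))
print(filter_series("go_goroutines", {}, INCLUDE, EXCLUDE, DROP))
```

With an include list like the one above, the exclude entries are redundant for go_* and node_scrape_* because neither matches any include pattern; they matter only when include_metrics is left empty.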

API Endpoints

The node exporter provides several HTTP endpoints:

Endpoint              Description
/metrics              Prometheus metrics endpoint
/metrics/description  JSON documentation of all metrics with OTEL mappings
/health               Health check (JSON)
/ready                Readiness probe
/live                 Liveness probe

Metric Descriptions Endpoint

The /metrics/description endpoint returns JSON documentation for all available metrics, including OTEL semantic convention mappings:

curl http://localhost:9100/metrics/description

Response:

{
  "categories": [
    {
      "category": "CPU",
      "count": 4,
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "description": "Seconds the CPUs spent in each mode",
          "unit": "s",
          "type": "counter",
          "labels": {"cpu": "cpu", "mode": "cpu.mode"},
          "has_otel_mapping": true
        }
      ]
    }
  ],
  "total": 35,
  "otel_info": {
    "version": "v1.38.0",
    "mapped_count": 35,
    "total_metrics": 35,
    "coverage": "100.0%"
  }
}
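
A response of this shape is easy to turn into a lookup table from Prometheus names to OTEL names. The sketch below works on a trimmed copy of the response shown above; `otel_mappings` is a hypothetical helper, and only the fields that appear in the example are assumed to exist.

```python
import json


def otel_mappings(doc):
    """Map each Prometheus metric name to its OTEL name, where one exists."""
    mapping = {}
    for category in doc.get("categories", []):
        for metric in category.get("metrics", []):
            if metric.get("has_otel_mapping"):
                mapping[metric["name"]] = metric["otel_name"]
    return mapping


# Trimmed copy of the example response above.
SAMPLE = json.loads("""
{
  "categories": [
    {
      "category": "CPU",
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "has_otel_mapping": true
        }
      ]
    }
  ]
}
""")

print(otel_mappings(SAMPLE))
```

In practice the input would be the body returned by the curl command above, passed through json.loads.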

Prometheus Integration

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'host1:9100'
          - 'host2:9100'

Service Discovery (Kubernetes)

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):10250'
        replacement: '${1}:9100'
        target_label: __address__
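
The relabel rule rewrites the kubelet address that Kubernetes service discovery reports so that Prometheus scrapes port 9100 instead. The sketch below mimics that single rule in Python to make the behavior concrete; `rewrite_address` is a hypothetical function, and the addresses are examples.

```python
import re


def rewrite_address(address, regex=r"(.+):10250", port="9100"):
    """Mimic the relabel rule: swap the kubelet port for the exporter port.

    Prometheus anchors relabel regexes at both ends, hence fullmatch.
    When the regex does not match, the label is left unchanged.
    """
    match = re.fullmatch(regex, address)
    if match is None:
        return address
    return f"{match.group(1)}:{port}"


for addr in ["10.0.0.5:10250", "node-1.example:10250", "10.0.0.5:9100"]:
    print(addr, "->", rewrite_address(addr))
```

If the kubelet in a cluster listens on a port other than 10250, the regex will not match and the address passes through untouched, so the target would still point at the kubelet.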

Grafana Dashboards

Telegen is compatible with standard node_exporter dashboards:

Dashboard                     Grafana ID  Description
Node Exporter Full            1860        Comprehensive system metrics
Node Exporter for Prometheus  11074       Clean, modern layout
Linux Server Metrics          180         Classic dashboard

Import Dashboard

  1. Go to Grafana → Dashboards → Import
  2. Enter dashboard ID (e.g., 1860)
  3. Select Prometheus data source
  4. Dashboard works immediately with Telegen

eBPF-Enhanced Metrics

Telegen adds eBPF-based metrics beyond standard node_exporter:

Additional Metrics

Metric                               Description
node_tcp_rtt_microseconds            TCP round-trip time
node_tcp_retransmits_total           TCP retransmissions
node_process_open_fds                Open file descriptors per process
node_cgroup_cpu_usage_seconds_total  Per-cgroup CPU usage
node_cgroup_memory_usage_bytes       Per-cgroup memory usage

Enable eBPF Enhancements

agent:
  nodeexporter:
    enabled: true
    
    # Enable eBPF-enhanced metrics
    ebpf_enhanced: true

Migration from node_exporter

Step 1: Deploy Telegen

Deploy Telegen alongside node_exporter:

# Telegen on different port initially
agent:
  nodeexporter:
    listen_address: ":9101"  # Different port

Step 2: Compare Metrics

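Scrape both exporters from Prometheus, then verify metric compatibility by querying each instance:

# Compare CPU metrics (the default instance label carries host:port)
node_cpu_seconds_total{instance="host:9100"}  # node_exporter
node_cpu_seconds_total{instance="host:9101"}  # Telegen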
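
Checking series one at a time does not scale to every collector, so it can be useful to diff the full set of metric names exposed by the two endpoints. The following is a minimal sketch that compares two /metrics payloads already fetched as text (for example with curl against ports 9100 and 9101). `metric_names` and `compare` are hypothetical helpers, and the sample payloads are illustrative.

```python
def metric_names(exposition_text):
    """Collect the distinct metric names in a Prometheus text payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # The name ends at the first "{" or, for unlabelled series, the first space.
            names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def compare(old_text, new_text):
    """Report names present on only one side."""
    old, new = metric_names(old_text), metric_names(new_text)
    return {"missing": sorted(old - new), "extra": sorted(new - old)}


# Illustrative payloads; real ones come from the two /metrics endpoints.
NODE_EXPORTER = 'node_load1 0.4\nnode_cpu_seconds_total{cpu="0",mode="idle"} 10\n'
TELEGEN = ('node_load1 0.4\n'
           'node_cpu_seconds_total{cpu="0",mode="idle"} 10\n'
           'node_tcp_retransmits_total 3\n')

print(compare(NODE_EXPORTER, TELEGEN))
```

An empty "missing" list means every metric name from node_exporter is also exposed by Telegen; entries under "extra" are expected when eBPF-enhanced metrics are enabled.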

Step 3: Switch Scrape Targets

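Stop node_exporter, set Telegen's listen_address back to ":9100", and keep the existing scrape target so that Prometheus now scrapes Telegen: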

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['host:9100']  # Now points to Telegen

Step 4: Remove node_exporter

Once verified, remove node_exporter.


Textfile Collector

Import custom metrics from files:

agent:
  nodeexporter:
    textfile:
      enabled: true
      directory: "/var/lib/node_exporter/textfile_collector"

Create Custom Metrics

# /var/lib/node_exporter/textfile_collector/custom.prom
# HELP node_custom_metric A custom metric
# TYPE node_custom_metric gauge
node_custom_metric{label="value"} 42
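
Scripts that generate these files should avoid leaving a half-written file where the collector can read it. A common approach with node_exporter's textfile collector is to write to a temporary file in the same directory and rename it into place. The sketch below assumes Telegen, like node_exporter, only reads files ending in .prom; `write_prom_file` is a hypothetical helper.

```python
import os
import tempfile


def write_prom_file(directory, filename, lines):
    """Write metrics to <directory>/<filename> via a temp file and an atomic rename."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    target = os.path.join(directory, filename)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; let the agent read it
        os.replace(tmp_path, target)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        path = write_prom_file(demo_dir, "custom.prom", [
            "# HELP node_custom_metric A custom metric",
            "# TYPE node_custom_metric gauge",
            'node_custom_metric{label="value"} 42',
        ])
        print(open(path).read())
```

In a real job, the directory argument would be the textfile directory configured above.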

Performance

Resource Usage

Metric        Value
CPU overhead  < 0.5%
Memory        ~20MB
Scrape time   < 100ms

Optimization

On large systems with many disks or interfaces, raise the timeout and reduce how much collection work runs at once:

agent:
  nodeexporter:
    # Increase scrape timeout
    timeout: 10s
    
    # Reduce collection frequency
    collector_interval: 30s
    
    # Limit concurrent collectors
    max_procs: 2

Common Queries

CPU Usage

# CPU utilization percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-CPU utilization
1 - avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle"}[5m]))
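
Both queries rest on the same arithmetic: the idle counter grows by at most one second per second of wall time, so its per-second rate is the idle fraction, and one minus that is the busy fraction. A worked sketch for a single CPU follows, with made-up sample values and a hypothetical function name; irate uses the last two samples in the window in the same way.

```python
def cpu_utilization_percent(idle_start, idle_end, elapsed_seconds):
    """Percent busy for one CPU, from two samples of its idle counter."""
    idle_rate = (idle_end - idle_start) / elapsed_seconds  # idle seconds per second
    return 100.0 * (1.0 - idle_rate)


# Idle counter went from 1000.0 s to 1007.5 s across a 15 s scrape interval,
# so the CPU was idle half the time.
print(cpu_utilization_percent(1000.0, 1007.5, 15.0))  # 50.0
```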

Memory Usage

# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Used memory in bytes (total minus free, buffers, and cache)
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes

Disk I/O

# Disk read/write rate
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Disk I/O utilization percentage (share of time the device was busy)
rate(node_disk_io_time_seconds_total[5m]) * 100

Network

# Network throughput in bits per second (bytes * 8)
rate(node_network_receive_bytes_total[5m]) * 8
rate(node_network_transmit_bytes_total[5m]) * 8

# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

Filesystem

# Filesystem usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# Inode usage
(1 - node_filesystem_files_free / node_filesystem_files) * 100

Alerting Examples

groups:
  - name: node
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
      
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is running out of memory"
      
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} disk space is below 10%"
      
      - alert: HostHighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"

Next Steps