Node Exporter Fusion

Telegen includes a drop-in replacement for Prometheus node_exporter, providing full compatibility with existing dashboards and alerts.

Overview

Node Exporter Fusion provides:

120+ system metrics - Full node_exporter compatibility
node_* namespace - Works with existing dashboards
Zero configuration - Automatically enabled
eBPF enhanced - Additional metrics via eBPF

Compatibility

Telegen keeps the `node_*` metric names and the `/metrics` endpoint of node_exporter, so existing scrape configs, dashboards, and alert rules continue to work:

Feature	node_exporter	Telegen
Metric namespace	`node_*`	`node_*` ✅
Grafana dashboards	✅	✅
Alert rules	✅	✅
Prometheus scraping	`/metrics`	`/metrics` ✅
Collectors	50+	50+ ✅

Collectors

P0 Collectors (Always Enabled)

Collector	Metrics	Description
loadavg	3	`node_load1`, `node_load5`, `node_load15`
cpu	15+ per core	CPU time per mode, frequency, info
meminfo	50+	Memory statistics from `/proc/meminfo`
diskstats	17+ per device	Disk I/O statistics
filesystem	8 per mount	Filesystem space and inodes
netdev	25+ per interface	Network device statistics
stat	16	Boot time, context switches, interrupts

Sample Metrics

# Load averages
node_load1
node_load5
node_load15

# CPU usage per mode
node_cpu_seconds_total{mode="user"}
node_cpu_seconds_total{mode="system"}
node_cpu_seconds_total{mode="idle"}
node_cpu_seconds_total{mode="iowait"}

# Memory
node_memory_MemTotal_bytes
node_memory_MemFree_bytes
node_memory_MemAvailable_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
node_memory_SwapTotal_bytes
node_memory_SwapFree_bytes

# Disk I/O
node_disk_read_bytes_total
node_disk_written_bytes_total
node_disk_reads_completed_total
node_disk_writes_completed_total
node_disk_io_time_seconds_total
node_disk_read_time_seconds_total
node_disk_write_time_seconds_total

# Filesystem
node_filesystem_size_bytes
node_filesystem_free_bytes
node_filesystem_avail_bytes
node_filesystem_files
node_filesystem_files_free

# Network
node_network_receive_bytes_total
node_network_transmit_bytes_total
node_network_receive_packets_total
node_network_transmit_packets_total
node_network_receive_errs_total
node_network_transmit_errs_total
node_network_receive_drop_total
node_network_transmit_drop_total

# System
node_boot_time_seconds
node_context_switches_total
node_forks_total
node_intr_total
node_procs_running
node_procs_blocked

Configuration

Enable/Disable Collectors

agent:
  nodeexporter:
    enabled: true
    
    # Listen address for /metrics endpoint
    listen_address: ":9100"
    
    # Metric namespace (default: node)
    namespace: "node"
    
    # Collectors to enable
    collectors:
      # P0 collectors
      loadavg: true
      cpu: true
      meminfo: true
      diskstats: true
      filesystem: true
      netdev: true
      stat: true
      
      # P1 collectors
      netstat: true
      sockstat: true
      vmstat: true
      
      # P2 collectors
      hwmon: false      # Hardware monitoring
      thermal: false    # Thermal zones
      pressure: true    # PSI metrics

Device Filtering

Restrict which filesystems, block devices, and network interfaces are collected. Mount point and device patterns are regular expressions:

agent:
  nodeexporter:
    filesystem:
      # Ignore these filesystem types
      ignored_fs_types:
        - autofs
        - binfmt_misc
        - cgroup
        - configfs
        - debugfs
        - devpts
        - devtmpfs
        - fusectl
        - hugetlbfs
        - mqueue
        - nsfs
        - overlay
        - proc
        - procfs
        - pstore
        - securityfs
        - sysfs
        - tmpfs
        - tracefs
      
      # Ignore these mount points
      ignored_mount_points:
        - "^/(dev|proc|sys|var/lib/docker/.+)($|/)"
    
    diskstats:
      # Only these devices
      device_include:
        - "^sd[a-z]+$"
        - "^nvme[0-9]+n[0-9]+$"
      
      # Ignore these devices
      device_exclude:
        - "^loop[0-9]+$"
        - "^ram[0-9]+$"
    
    netdev:
      # Ignore these interfaces
      device_exclude:
        - "^veth.*"
        - "^docker.*"
        - "^br-.*"

TLS/mTLS Configuration

Secure the metrics endpoint with TLS and optional mutual TLS (mTLS):

agent:
  nodeexporter:
    enabled: true
    listen_address: ":9100"
    
    # TLS configuration
    tls:
      enabled: true
      
      # Server certificate and key
      cert_file: "/etc/telegen/certs/server.crt"
      key_file: "/etc/telegen/certs/server.key"
      
      # Enable mTLS (client certificate verification)
      client_auth: true
      
      # CA certificate for verifying client certs
      client_ca_file: "/etc/telegen/certs/ca.crt"

Option	Description	Default
`tls.enabled`	Enable TLS for metrics endpoint	`false`
`tls.cert_file`	Path to server certificate	-
`tls.key_file`	Path to server private key	-
`tls.client_auth`	Require client certificates (mTLS)	`false`
`tls.client_ca_file`	CA for verifying client certificates	-
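
For testing an mTLS-protected endpoint by hand, a client along the following lines may help. It is a minimal sketch using only the Python standard library; the client certificate and key paths are hypothetical, and `ca_file` must be the CA that signed the server certificate, which is not necessarily the same file as `client_ca_file` above.

```python
import ssl
import urllib.request


def metrics_url(host, port=9100, tls_enabled=True):
    """Build the scrape URL; the scheme depends on whether TLS is enabled."""
    scheme = "https" if tls_enabled else "http"
    return f"{scheme}://{host}:{port}/metrics"


def fetch_metrics(host, ca_file, client_cert, client_key, port=9100):
    """Fetch /metrics while presenting a client certificate (mTLS)."""
    # Trust only the CA that signed the server certificate.
    ctx = ssl.create_default_context(cafile=ca_file)
    # Present the client certificate, needed when tls.client_auth is true.
    ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    with urllib.request.urlopen(metrics_url(host, port), context=ctx, timeout=10) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical paths; adjust to the local PKI layout.
    text = fetch_metrics(
        "localhost",
        ca_file="/etc/telegen/certs/ca.crt",
        client_cert="/etc/telegen/certs/client.crt",
        client_key="/etc/telegen/certs/client.key",
    )
    print(text[:200])
```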

Metric Cardinality Controls

Limit which metrics and labels are exported so that high-cardinality labels do not inflate the number of time series:

agent:
  nodeexporter:
    enabled: true
    
    # Cardinality controls
    cardinality:
      enabled: true
      
      # Maximum number of metric families
      max_metrics: 1000
      
      # Include only these metrics (regex patterns)
      include_metrics:
        - "node_cpu_.*"
        - "node_memory_.*"
        - "node_disk_.*"
        - "node_filesystem_.*"
        - "node_network_.*"
        - "node_load.*"
      
      # Exclude these metrics (regex patterns)
      exclude_metrics:
        - "node_scrape_.*"
        - "go_.*"
      
      # Drop these labels from all metrics
      drop_labels:
        - "id"
        - "name"

Option	Description	Default
`cardinality.enabled`	Enable cardinality filtering	`false`
`cardinality.max_metrics`	Maximum metric families (0 = unlimited)	`0`
`cardinality.include_metrics`	Regex patterns to include	`[]` (all)
`cardinality.exclude_metrics`	Regex patterns to exclude	`[]`
`cardinality.drop_labels`	Labels to remove from all metrics	`[]`
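
The effect of the include, exclude, and drop-label options can be sketched in Python. This is an illustration under stated assumptions, not Telegen's implementation: patterns are assumed to match the whole metric name, and exclude is assumed to take precedence over include. `keep_metric` and `strip_labels` are hypothetical helpers.

```python
import re

# Patterns copied from the cardinality example above.
INCLUDE = [r"node_cpu_.*", r"node_memory_.*", r"node_disk_.*",
           r"node_filesystem_.*", r"node_network_.*", r"node_load.*"]
EXCLUDE = [r"node_scrape_.*", r"go_.*"]
DROP_LABELS = {"id", "name"}


def keep_metric(name, include=INCLUDE, exclude=EXCLUDE):
    """Decide whether a metric survives filtering (ASSUMED: exclude wins)."""
    if any(re.fullmatch(p, name) for p in exclude):
        return False
    # An empty include list means "all metrics", matching the documented default.
    return not include or any(re.fullmatch(p, name) for p in include)


def strip_labels(labels, drop=DROP_LABELS):
    """Return a copy of the label set without the dropped label names."""
    return {k: v for k, v in labels.items() if k not in drop}


if __name__ == "__main__":
    for metric in ("node_cpu_seconds_total", "go_goroutines", "node_boot_time_seconds"):
        print(metric, keep_metric(metric))
    print(strip_labels({"device": "sda", "id": "abc123"}))
```

Note that a non-empty include list acts as an allowlist: under these assumptions `node_boot_time_seconds` is filtered out because no include pattern covers it.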

API Endpoints

The built-in exporter serves the following HTTP endpoints on `listen_address`:

Endpoint	Description
`/metrics`	Prometheus metrics endpoint
`/metrics/description`	JSON documentation of all metrics with OTEL mappings
`/health`	Health check (JSON)
`/ready`	Readiness probe
`/live`	Liveness probe

Metric Descriptions Endpoint

The /metrics/description endpoint returns JSON documentation for all available metrics, including OTEL semantic convention mappings:

curl http://localhost:9100/metrics/description

Response:

{
  "categories": [
    {
      "category": "CPU",
      "count": 4,
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "description": "Seconds the CPUs spent in each mode",
          "unit": "s",
          "type": "counter",
          "labels": {"cpu": "cpu", "mode": "cpu.mode"},
          "has_otel_mapping": true
        }
      ]
    }
  ],
  "total": 35,
  "otel_info": {
    "version": "v1.38.0",
    "mapped_count": 35,
    "total_metrics": 35,
    "coverage": "100.0%"
  }
}
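
A response like the one above can be turned into a lookup table from Prometheus names to OTel names. The sketch below relies only on the fields shown in the sample response (`categories`, `metrics`, `name`, `otel_name`, `has_otel_mapping`); `otel_name_map` is a hypothetical helper, and the embedded JSON is an abbreviated copy of that sample.

```python
import json

# Abbreviated copy of the sample response shown above.
SAMPLE_RESPONSE = """
{
  "categories": [
    {
      "category": "CPU",
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "has_otel_mapping": true
        }
      ]
    }
  ]
}
"""


def otel_name_map(document):
    """Map each Prometheus metric name to its OTel name, where one is declared."""
    mapping = {}
    for category in document.get("categories", []):
        for metric in category.get("metrics", []):
            if metric.get("has_otel_mapping"):
                mapping[metric["name"]] = metric["otel_name"]
    return mapping


if __name__ == "__main__":
    print(otel_name_map(json.loads(SAMPLE_RESPONSE)))
```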

Prometheus Integration

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'host1:9100'
          - 'host2:9100'

Service Discovery (Kubernetes)

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):10250'
        replacement: '${1}:9100'
        target_label: __address__
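
The relabel rule above rewrites the kubelet address discovered for each node so that Prometheus scrapes port 9100 instead. The Python sketch below mimics that rewrite for illustration; `relabel_address` is a hypothetical helper. Prometheus anchors relabel regexes at both ends and leaves the label unchanged when the regex does not match, which `re.fullmatch` and the early return reproduce.

```python
import re


def relabel_address(address, regex=r"(.+):10250", replacement=r"\g<1>:9100"):
    """Mimic the `replace` relabel action on the __address__ label."""
    # Relabel regexes are fully anchored, so the whole value must match.
    match = re.fullmatch(regex, address)
    if match is None:
        return address  # no match: the label keeps its original value
    # \g<1> is Python's spelling of the ${1} capture-group reference.
    return match.expand(replacement)


if __name__ == "__main__":
    print(relabel_address("10.0.0.5:10250"))
    print(relabel_address("10.0.0.5:8080"))
```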

Grafana Dashboards

Because Telegen emits the same `node_*` metrics, standard node_exporter dashboards work without modification.

Recommended Dashboards

Dashboard	Grafana ID	Description
Node Exporter Full	1860	Comprehensive system metrics
Node Exporter for Prometheus	11074	Clean, modern layout
Linux Server Metrics	180	Classic dashboard

Import Dashboard

Go to Grafana → Dashboards → Import
Enter dashboard ID (e.g., 1860)
Select Prometheus data source
Dashboard works immediately with Telegen

eBPF-Enhanced Metrics

Telegen adds eBPF-based metrics beyond the standard node_exporter set.

Additional Metrics

Metric	Description
`node_tcp_rtt_microseconds`	TCP round-trip time
`node_tcp_retransmits_total`	TCP retransmissions
`node_process_open_fds`	Open file descriptors per process
`node_cgroup_cpu_usage_seconds_total`	Per-cgroup CPU usage
`node_cgroup_memory_usage_bytes`	Per-cgroup memory usage

Enable eBPF Enhancements

agent:
  nodeexporter:
    enabled: true
    
    # Enable eBPF-enhanced metrics
    ebpf_enhanced: true

Migration from node_exporter

Step 1: Deploy Telegen

Deploy Telegen alongside node_exporter:

# Telegen on different port initially
agent:
  nodeexporter:
    listen_address: ":9101"  # Different port

Step 2: Compare Metrics

Verify metric compatibility:

# Compare CPU metrics (the scrape port is part of the `instance` label)
node_cpu_seconds_total{instance="host:9100"}  # node_exporter
node_cpu_seconds_total{instance="host:9101"}  # Telegen
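
Beyond spot-checking single series, it can be useful to compare the full set of metric names exposed by both endpoints. The following Python sketch does this on two saved `/metrics` payloads (for example, fetched with `curl` from each port). `metric_names` and `diff_metrics` are hypothetical helpers, and the parser is deliberately simple: it only extracts names and ignores values, labels, and timestamps.

```python
def metric_names(exposition_text):
    """Collect the metric names found in Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first "{" (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def diff_metrics(node_exporter_text, telegen_text):
    """Return (names only in node_exporter, names only in Telegen)."""
    old = metric_names(node_exporter_text)
    new = metric_names(telegen_text)
    return sorted(old - new), sorted(new - old)


if __name__ == "__main__":
    old = 'node_load1 0.5\nnode_forks_total 9\n'
    new = 'node_load1 0.4\nnode_tcp_retransmits_total 3\n'
    missing, extra = diff_metrics(old, new)
    print("missing in Telegen:", missing)
    print("extra in Telegen:", extra)
```

Names that appear only on the Telegen side are expected when eBPF enhancements are enabled; names missing from Telegen are the ones to check against dashboards and alert rules.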

Step 3: Switch Scrape Targets

Update Prometheus to scrape Telegen:

scrape_configs:
  - job_name: 'node'
    static_configs:
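      - targets: ['host:9101']  # Telegen's port from Step 1

Step 4: Remove node_exporter

Once the metrics match and dashboards and alerts behave as expected, stop and remove node_exporter. Telegen can then be moved to the freed port by setting `listen_address: ":9100"` and updating the scrape target to match.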

Textfile Collector

The textfile collector exposes custom metrics read from files in a configured directory:

agent:
  nodeexporter:
    textfile:
      enabled: true
      directory: "/var/lib/node_exporter/textfile_collector"

Create Custom Metrics

# /var/lib/node_exporter/textfile_collector/custom.prom
# HELP node_custom_metric A custom metric
# TYPE node_custom_metric gauge
node_custom_metric{label="value"} 42
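
Scripts that generate such files usually write them atomically, so that a scrape never reads a half-written file. The Python sketch below shows one common approach: write to a temporary file in the same directory, then rename it into place. `write_prom_file` is a hypothetical helper, and it assumes the collector ignores files that do not end in `.prom`, as upstream node_exporter does; whether Telegen behaves the same way should be verified.

```python
import os
import tempfile


def write_prom_file(directory, filename, lines):
    """Atomically write metric lines to directory/filename and return the path."""
    target = os.path.join(directory, filename)
    # The temp file must live in the same directory so the rename stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(lines) + "\n")
        # mkstemp creates the file as 0600; make it readable by the agent's user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # stand-in for the textfile directory
    path = write_prom_file(demo_dir, "custom.prom", [
        "# HELP node_custom_metric A custom metric",
        "# TYPE node_custom_metric gauge",
        'node_custom_metric{label="value"} 42',
    ])
    print(open(path).read())
```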

Performance

Resource Usage

Metric	Value
CPU overhead	< 0.5%
Memory	~20MB
Scrape time	< 100ms

Optimization

On hosts with many disks or network interfaces, collection can be tuned:

agent:
  nodeexporter:
    # Increase scrape timeout
    timeout: 10s
    
    # Reduce collection frequency
    collector_interval: 30s
    
    # Limit concurrent collectors
    max_procs: 2

Common Queries

CPU Usage

# CPU utilization percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-CPU utilization (as a fraction from 0 to 1)
1 - avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle"}[5m]))
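
The arithmetic behind the first query can be worked through by hand. `irate` uses the last two samples in the window to compute how many idle seconds accumulated per second of wall time on each CPU; averaging across CPUs and subtracting from 100 gives the busy percentage. The Python sketch below reproduces that calculation with made-up sample values; `cpu_utilization_percent` is a hypothetical helper and, unlike Prometheus, does not handle counter resets.

```python
def cpu_utilization_percent(idle_samples):
    """Compute host CPU utilization from idle-counter samples.

    idle_samples maps a CPU id to ((t0, idle0), (t1, idle1)): the last two
    samples of node_cpu_seconds_total{mode="idle"} for that CPU.
    """
    idle_rates = []
    for (t0, v0), (t1, v1) in idle_samples.values():
        # Idle seconds gained per second of wall time (what irate returns).
        idle_rates.append((v1 - v0) / (t1 - t0))
    avg_idle = sum(idle_rates) / len(idle_rates)  # avg by(instance)
    return 100 - avg_idle * 100


if __name__ == "__main__":
    samples = {
        "0": ((0, 1000.0), (15, 1012.0)),  # 12 idle seconds in 15 s
        "1": ((0, 2000.0), (15, 2006.0)),  # 6 idle seconds in 15 s
    }
    print(round(cpu_utilization_percent(samples), 2))
```

With these illustrative numbers the two CPUs are idle 80% and 40% of the time, so the host is about 40% busy.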

Memory Usage

# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Memory in use, excluding buffers and page cache
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes

Disk I/O

# Disk read/write rate
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Disk I/O utilization (percentage of time the device was busy)
rate(node_disk_io_time_seconds_total[5m]) * 100

Network

# Network throughput in bits per second (bytes * 8)
rate(node_network_receive_bytes_total[5m]) * 8
rate(node_network_transmit_bytes_total[5m]) * 8

# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

Filesystem

# Filesystem usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# Inode usage
(1 - node_filesystem_files_free / node_filesystem_files) * 100

Alerting Examples

groups:
  - name: node
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
      
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is running out of memory"
      
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} disk space is below 10%"
      
      - alert: HostHighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"
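
Each rule above uses `for: 5m`, which means the expression must stay true across consecutive evaluations for five minutes before the alert fires; until then it is pending, and a single false evaluation resets it. The Python sketch below simulates that state machine for one series. It is a simplification for illustration (`alert_state` is a hypothetical helper; Prometheus tracks this per label set).

```python
def alert_state(evaluations, hold_seconds=300):
    """Return "inactive", "pending", or "firing" after a series of evaluations.

    evaluations is a time-ordered list of (timestamp_seconds, condition_is_true).
    """
    active_since = None
    state = "inactive"
    for ts, is_true in evaluations:
        if not is_true:
            # A false evaluation clears the pending/firing alert.
            active_since, state = None, "inactive"
            continue
        if active_since is None:
            active_since = ts  # first evaluation where the condition held
        # Fire once the condition has held for the full `for` duration.
        state = "firing" if ts - active_since >= hold_seconds else "pending"
    return state


if __name__ == "__main__":
    steady = [(t, True) for t in range(0, 301, 60)]
    flapping = [(0, True), (60, True), (120, False), (180, True), (240, True)]
    print(alert_state(steady))
    print(alert_state(flapping))
```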

Next Steps

Auto Discovery - Automatic service detection
Distributed Tracing - Application tracing
Agent Mode - Full configuration