Performance Tuning

Optimize Telegen for your environment and workload.

Resource Guidelines

Default Resource Requirements

Component               CPU        Memory
Agent (minimal)         0.1 cores  128MB
Agent (full features)   0.5 cores  512MB
Agent (high volume)     1.0 cores  1GB
Collector (SNMP)        0.2 cores  256MB
Collector (storage)     0.3 cores  384MB

Kubernetes Resources

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"

Ring Buffer Tuning

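The ring buffer is the primary channel through which eBPF events reach the Telegen agent. If events arrive faster than the agent drains the buffer, events are lost.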

Sizing

Buffer Size  Use Case              Event Capacity
4MB          Low traffic, testing  ~40K events
16MB         Default, balanced     ~160K events
64MB         High traffic          ~640K events
256MB        Very high volume      ~2.5M events
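
The capacities in the table work out to roughly 105 bytes per event (4MB divided by about 40K events). The sketch below reproduces that arithmetic. The function name and the 105-byte average are assumptions inferred from the table, not Telegen constants, and real event sizes will vary by event type.

```python
# Rough ring buffer capacity estimate.
# AVG_EVENT_BYTES is an assumption inferred from the sizing table
# (4 MiB / ~40K events); it is not a documented Telegen constant.
AVG_EVENT_BYTES = 105


def estimated_event_capacity(ringbuf_size_bytes: int,
                             avg_event_bytes: int = AVG_EVENT_BYTES) -> int:
    """Return roughly how many events fit in a buffer of the given size."""
    if ringbuf_size_bytes <= 0 or avg_event_bytes <= 0:
        raise ValueError("sizes must be positive")
    return ringbuf_size_bytes // avg_event_bytes


for mib in (4, 16, 64, 256):
    size = mib * 1024 * 1024
    print(f"{mib:>3} MiB -> ~{estimated_event_capacity(size):,} events")
```

If your events are larger on average (for example, with long paths or payload captures), pass a larger avg_event_bytes to get a more conservative estimate.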

Configuration

agent:
  ebpf:
    ringbuf_size: 16777216  # 16MB (default)

Signs You Need a Larger Buffer

# High loss rate: more than 100 events lost per second, averaged over 5 minutes
rate(telegen_ebpf_ringbuf_lost_total[5m]) > 100

If events are being lost, increase the buffer size:

agent:
  ebpf:
    ringbuf_size: 67108864  # 64MB
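
Every ringbuf_size value in this guide is a power of two, and Linux BPF ring buffer maps require a power-of-two size. Assuming Telegen passes ringbuf_size through to such a map (not confirmed here), a hypothetical helper for picking the next size when growing the buffer might look like this:

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def next_ringbuf_size(current_bytes: int, factor: int = 4) -> int:
    """Grow the buffer by `factor`, rounding up to the next power of two.

    The default factor of 4 mirrors the 16MB -> 64MB step shown above;
    it is an illustrative choice, not a Telegen recommendation.
    """
    if current_bytes <= 0 or factor < 1:
        raise ValueError("current_bytes must be positive and factor >= 1")
    target = current_bytes * factor
    size = 1
    while size < target:
        size <<= 1
    return size


current = 16 * 1024 * 1024
print(is_power_of_two(current), next_ringbuf_size(current))
```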

CPU Optimization

Reduce Collection Overhead

  1. Limit traced ports
    agent:
      ebpf:
        network:
          include_ports:
            - 80
            - 443
            - 8080
          exclude_ports:
            - 22
            - 2379
    
  2. Reduce syscall tracing
    agent:
      ebpf:
        syscalls:
          enabled: false  # Disable if not needed
    
  3. Limit profiling frequency
    agent:
      profiling:
        sample_rate: 49  # Lower than default 99 Hz
    

Parallel Processing

agent:
  processing:
    workers: 4  # Match available CPU cores
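
In a container, the CPUs visible to the process can exceed the CPU limit it is allowed to use. The sketch below shows one way to derive a worker count that respects both. The function name and the cpu_limit_cores parameter are hypothetical; how Telegen itself picks a default is not covered here.

```python
import math
import os


def suggested_workers(cpu_limit_cores=None) -> int:
    """Suggest a worker count from usable CPUs, capped by an optional CPU limit."""
    try:
        # Linux only: the set of CPUs this process is allowed to run on.
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total CPU count.
        usable = os.cpu_count() or 1
    if cpu_limit_cores is not None:
        # A fractional limit such as 0.5 cores still needs one worker.
        usable = min(usable, max(1, math.ceil(cpu_limit_cores)))
    return max(1, usable)


print(suggested_workers())      # based on the CPUs visible to this process
print(suggested_workers(2.0))   # capped by a 2-core container limit
```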

Memory Optimization

Queue Limits

queues:
  traces:
    mem_limit: "128Mi"
    max_age: "1h"
    batch_size: 256
  
  metrics:
    mem_limit: "64Mi"
    max_age: "5m"
    batch_size: 500
  
  logs:
    mem_limit: "128Mi"
    max_age: "6h"
    batch_size: 500
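
Queue limits add up, so it helps to check their total against the container memory limit. The sketch below sums the three mem_limit values above and compares them with the 1Gi limit from the Kubernetes example. The parser handles only the Ki/Mi/Gi suffixes used in this guide and is an illustrative helper, not part of Telegen.

```python
# Binary suffixes used by the mem_limit values in this guide.
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(quantity: str) -> int:
    """Convert a string such as '128Mi' to bytes (Ki/Mi/Gi suffixes only)."""
    for suffix, multiplier in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * multiplier
    return int(quantity)  # plain byte count


queue_limits = {"traces": "128Mi", "metrics": "64Mi", "logs": "128Mi"}
container_limit = parse_quantity("1Gi")

total = sum(parse_quantity(v) for v in queue_limits.values())
print(f"queues: {total // UNITS['Mi']}Mi "
      f"({total / container_limit:.0%} of the container limit)")
```

Leave headroom beyond the queue total for the ring buffer, connection tracking, and the agent's own working memory.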

Reduce Cardinality

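High-cardinality labels increase memory usage, because each distinct combination of label values is tracked separately. Allow only the labels you need: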

agent:
  kubernetes:
    # Only essential labels
    label_allowlist:
      - "app"
      - "version"
    # NOT: "*"

Limit Active Connections Tracked

agent:
  ebpf:
    network:
      # Limit tracked connections
      max_connections: 50000  # Default: 100000

Network/Export Optimization

Compression

otlp:
  compression: gzip  # Reduce bandwidth

Batching

queues:
  traces:
    batch_size: 512     # Larger batches = fewer requests
    flush_interval: 5s  # Don't wait too long
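
To see how batch size affects export request volume, the sketch below estimates requests per second at a steady span rate. It assumes batches always fill before flush_interval expires, and the 10,000 spans/second rate is only an illustrative input.

```python
import math


def export_requests_per_second(spans_per_second: float, batch_size: int) -> int:
    """Estimate export requests/second, assuming every batch is sent full."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(spans_per_second / batch_size)


for batch_size in (256, 512, 1024):
    rps = export_requests_per_second(10_000, batch_size)
    print(f"batch_size={batch_size}: ~{rps} requests/second")
```

At low span rates the flush interval, not the batch size, determines how often requests are sent, so larger batches stop helping once batches no longer fill.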

Connection Pooling

otlp:
  max_connections: 10  # Connection pool size
  idle_timeout: 60s

Sampling

Head-Based Sampling

Sample at collection time:

otlp:
  traces:
    sample_rate: 0.1  # 10% of traces
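
One common way to implement head-based sampling is to derive the decision from the trace ID, so every span of a trace gets the same verdict. The sketch below illustrates that general technique; it is not Telegen's actual implementation, and the function name is hypothetical.

```python
import random


def keep_trace(trace_id_hex: str, sample_rate: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall under the rate threshold."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low64 < int(sample_rate * (1 << 64))


# Simulate 10,000 random 128-bit trace IDs with a fixed seed.
rng = random.Random(0)
ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(keep_trace(trace_id, 0.1) for trace_id in ids)
print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID, the same trace is kept or dropped consistently wherever the check runs.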

Tail-Based Sampling

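Tail-based sampling decides after a trace has completed, so it can keep every error and slow request while sampling the rest. Configure it in your OpenTelemetry Collector: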

# OTel Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
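
The combined effect of these three policies can be modeled as "keep the trace if any policy matches". The sketch below is a simplified model of that logic for reasoning about what gets retained; it is not the Collector's implementation, and details such as boundary handling and decision timing may differ.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    has_error: bool      # any span ended with an ERROR status
    duration_ms: float   # end-to-end trace duration


def keep(trace: Trace, rng=random.random) -> bool:
    """Simplified model: keep the trace if any of the three policies matches."""
    if trace.has_error:               # "errors" policy
        return True
    if trace.duration_ms >= 1000:     # "slow" policy (threshold_ms: 1000)
        return True
    return rng() < 0.10               # "sample" policy (10% of the rest)


print(keep(Trace(has_error=True, duration_ms=12)))
print(keep(Trace(has_error=False, duration_ms=2500)))
```

Under this model, all errors and slow traces are retained, and roughly 10% of fast, successful traces are kept as a baseline.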

Per-Feature Tuning

Profiling

agent:
  profiling:
    # Lower sample rate for less overhead
    sample_rate: 49  # Hz
    
    # Longer upload interval
    upload_interval: 120s
    
    # Disable unused profile types
    mutex: false
    block: false
    goroutine: false

Security Monitoring

agent:
  security:
    # Focus on critical syscalls only
    syscall_audit:
      syscalls:
        - execve
        - setuid
        - ptrace
      # NOT all syscalls
    
    # Limit file paths
    file_integrity:
      paths:
        - /etc/passwd
        - /etc/shadow
      # NOT: /var/**

Network Monitoring

agent:
  network:
    # Use sampling for high-volume traffic
    tcp:
      sample_rate: 10  # 1-in-N: keep 1 in 10 connections

    # XDP sampling
    xdp:
      sample_rate: 1000  # 1-in-N: keep 1 in 1000 packets (0.1%)
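
Note that these sample_rate values are 1-in-N counts, unlike the otlp trace sample_rate, which is a fraction (0.1 means 10%). A counter-based 1-in-N sampler can be sketched as follows; the class is a hypothetical illustration of the technique, not Telegen's in-kernel logic.

```python
class OneInNSampler:
    """Keep every Nth item using a simple counter."""

    def __init__(self, n: int):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.count = 0

    def should_sample(self) -> bool:
        self.count += 1
        if self.count >= self.n:
            self.count = 0
            return True
        return False


tcp = OneInNSampler(10)
kept = sum(tcp.should_sample() for _ in range(1000))
print(f"kept {kept} of 1000 connections ({kept / 1000:.1%})")
```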

High-Volume Environments

For environments with >10K requests/second:

telegen:
  log_level: warn  # Reduce logging

agent:
  ebpf:
    ringbuf_size: 134217728  # 128MB
    perf_buffer_size: 32768  # 32KB per CPU
    
    network:
      exclude_paths:
        - "/health*"
        - "/ready*"
        - "/metrics"
      exclude_ports:
        - 22
        - 2379
        - 2380
        - 10250
  
  resources:
    cpu_limit: 2.0
    memory_limit: "2Gi"
    rate_limit:
      spans_per_second: 100000
      metrics_per_second: 200000

otlp:
  compression: gzip
  
queues:
  traces:
    mem_limit: "512Mi"
    batch_size: 1024

Low-Resource Environments

Minimal Configuration

For resource-constrained environments:

telegen:
  log_level: error

agent:
  ebpf:
    ringbuf_size: 4194304  # 4MB
    
    network:
      enabled: true
      http: true
      grpc: false
      dns: false
    
    syscalls:
      enabled: false
  
  profiling:
    enabled: false
  
  security:
    enabled: false

queues:
  traces:
    mem_limit: "64Mi"
    batch_size: 128

Kubernetes Resources

resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"

Monitoring Performance

Key Metrics to Watch

# CPU usage
rate(telegen_process_cpu_seconds_total[5m])

# Memory usage
telegen_process_resident_memory_bytes

# Event loss rate
rate(telegen_ebpf_ringbuf_lost_total[5m]) / rate(telegen_ebpf_ringbuf_events_total[5m])

# Export latency
histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m]))

# Queue depth
telegen_export_queue_size
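
The event loss rate query divides the increase in lost events by the increase in total events over the same window. The same calculation from two raw counter samples can be sketched as below. The sample values are placeholders, and the sketch ignores counter resets, which Prometheus's rate() handles for you.

```python
def loss_ratio(lost_start: float, lost_end: float,
               total_start: float, total_end: float) -> float:
    """Fraction of events lost between two counter samples."""
    lost = lost_end - lost_start
    total = total_end - total_start
    if total <= 0:
        return 0.0  # no events in the window, so nothing could be lost
    return lost / total


# Placeholder counter readings taken five minutes apart.
ratio = loss_ratio(lost_start=1_000, lost_end=1_600,
                   total_start=2_000_000, total_end=2_300_000)
print(f"loss ratio: {ratio:.2%}")
```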

Performance Alerts

groups:
  - name: telegen-performance
    rules:
      - alert: TelegenHighCPU
        expr: rate(telegen_process_cpu_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Telegen using high CPU"
      
      - alert: TelegenHighMemory
        expr: telegen_process_resident_memory_bytes > 1.5e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen memory above 1.5GB"
      
      - alert: TelegenExportSlow
        expr: histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen export P99 latency high"

Benchmarking

Test Configuration

Before deploying configuration changes, benchmark them under a repeatable load:

# Generate test load
hey -n 10000 -c 100 http://your-app:8080/api/test

# Monitor Telegen metrics
watch -n 1 'curl -s http://localhost:19090/metrics | grep -E "cpu|memory|lost"'

Compare Before/After

  1. Record a baseline with the current configuration
  2. Apply the changes
  3. Run the same load test
  4. Compare the metrics against the baseline
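
The comparison step can be as simple as computing the percentage change per metric. In the sketch below, the metric names and all numbers are placeholders for illustration, not real measurements.

```python
def percent_change(before: float, after: float) -> float:
    """Percentage change from a baseline value to a new value."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return (after - before) / before * 100.0


# Placeholder numbers for illustration only, not real measurements.
baseline = {"cpu_cores": 0.40, "rss_bytes": 400e6, "lost_per_sec": 120.0}
tuned = {"cpu_cores": 0.30, "rss_bytes": 450e6, "lost_per_sec": 6.0}

for name in baseline:
    change = percent_change(baseline[name], tuned[name])
    print(f"{name}: {change:+.1f}%")
```

A change that trades a modest memory increase for a large drop in lost events, as in these placeholder figures, is usually worth keeping.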

Best Practices Summary

  1. Start conservative - Begin with defaults, tune based on actual needs
  2. Monitor loss rates - If events are being lost, increase the ring buffer size
  3. Use sampling - For high-volume workloads, sample rather than drop events
  4. Filter noise - Exclude health checks, internal traffic
  5. Batch efficiently - Larger batches reduce export overhead
  6. Set limits - Protect against runaway memory usage
  7. Test changes - Benchmark before and after tuning

Next Steps