Performance Tuning

Optimize Telegen for your environment and workload.

Resource Guidelines

Default Resource Requirements

Component	CPU	Memory
Agent (minimal)	0.1 cores	128MB
Agent (full features)	0.5 cores	512MB
Agent (high volume)	1.0 cores	1GB
Collector (SNMP)	0.2 cores	256MB
Collector (storage)	0.3 cores	384MB

Kubernetes Resources

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"

Ring Buffer Tuning

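The ring buffer is the primary channel that carries eBPF events from the kernel to the agent. If it fills faster than the agent drains it, new events are dropped.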

Sizing

Buffer Size	Use Case	Event Capacity
4MB	Low traffic, testing	~40K events
16MB	Default, balanced	~160K events
64MB	High traffic	~640K events
256MB	Very high volume	~2.6M events
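
The capacities in the table work out to roughly 100 bytes per event (4MB / ~40K events). The sketch below uses that ratio to estimate capacity for other sizes. The 105-byte average is inferred from the table, not a documented Telegen constant, and real event sizes vary by event type.

```python
# Rough ring buffer capacity estimate.
# ASSUMPTION: ~105 bytes per event, inferred from the sizing table
# (4 MiB / ~40K events). Measure your own workload before relying on it.
AVG_EVENT_BYTES = 105


def estimated_capacity(ringbuf_size_bytes: int,
                       avg_event_bytes: int = AVG_EVENT_BYTES) -> int:
    """Return the approximate number of events the buffer can hold."""
    if avg_event_bytes <= 0:
        raise ValueError("avg_event_bytes must be positive")
    return ringbuf_size_bytes // avg_event_bytes


for mib in (4, 16, 64, 256):
    size = mib * 1024 * 1024
    print(f"{mib:>3} MiB -> ~{estimated_capacity(size):,} events")
```

If your events are larger on average (for example, with payload capture enabled), pass a larger `avg_event_bytes` and expect proportionally lower capacity.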

Configuration

agent:
  ebpf:
    ringbuf_size: 16777216  # 16MB (default)

Signs You Need a Larger Buffer

# High loss rate: more than 100 lost events per second, averaged over 5 minutes
rate(telegen_ebpf_ringbuf_lost_total[5m]) > 100

If events are being lost, increase the buffer size:

agent:
  ebpf:
    ringbuf_size: 67108864  # 64MB
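
One way to pick the next size is to estimate how many bytes a burst produces while the agent is briefly unable to drain the buffer, then round up to a power of two. Linux BPF ring buffer maps require a size that is a power of two and a multiple of the page size, and every value on this page already is one. The sketch below assumes `ringbuf_size` is passed to such a map unchanged; the function names, the 105-byte event size, and the stall duration are illustrative assumptions.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two greater than or equal to n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def suggest_ringbuf_size(peak_events_per_sec: float,
                         stall_seconds: float,
                         avg_event_bytes: int = 105,          # assumed average
                         min_size: int = 4 * 1024 * 1024) -> int:
    """Bytes needed to absorb a burst, rounded up to a power of two."""
    needed = int(peak_events_per_sec * stall_seconds * avg_event_bytes)
    return next_power_of_two(max(needed, min_size))


# Hypothetical workload: 200K events/s peak, absorb a 2 second stall.
size = suggest_ringbuf_size(200_000, 2.0)
print(f"ringbuf_size: {size}  # {size // (1024 * 1024)}MB")
```

Treat the result as a starting point and confirm with the loss-rate query above after the change.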

CPU Optimization

Reduce Collection Overhead

Limit traced ports

agent:
  ebpf:
    network:
      include_ports:
        - 80
        - 443
        - 8080
      exclude_ports:
        - 22
        - 2379

Reduce syscall tracing

agent:
  ebpf:
    syscalls:
      enabled: false  # Disable if not needed

Limit profiling frequency

agent:
  profiling:
    sample_rate: 49  # Lower than default 99 Hz

Parallel Processing

agent:
  processing:
    workers: 4  # Match available CPU cores
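
In containers, the host core count can be much higher than the CPU limit the agent actually gets, so it is usually safer to derive `workers` from the limit. The sketch below converts a Kubernetes CPU quantity such as `"1000m"` into a worker count. The helper names are hypothetical, and rounding down with a floor of one worker is an assumption, not a Telegen rule.

```python
import math


def cpu_quantity_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("500m", "2", "1.5") to cores."""
    q = quantity.strip()
    if q.endswith("m"):              # millicores: 1000m == 1 core
        return int(q[:-1]) / 1000.0
    return float(q)


def workers_for_limit(cpu_limit: str) -> int:
    """Whole cores available under the limit, never fewer than one worker."""
    return max(1, math.floor(cpu_quantity_to_cores(cpu_limit)))


for limit in ("200m", "1000m", "2500m", "4"):
    print(f"cpu limit {limit:>5} -> workers: {workers_for_limit(limit)}")
```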

Memory Optimization

Queue Limits

queues:
  traces:
    mem_limit: "128Mi"
    max_age: "1h"
    batch_size: 256
  
  metrics:
    mem_limit: "64Mi"
    max_age: "5m"
    batch_size: 500
  
  logs:
    mem_limit: "128Mi"
    max_age: "6h"
    batch_size: 500
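
Queue limits add up, so it helps to check their sum against the container memory limit. The sketch below totals the three `mem_limit` values above and compares them with a 1Gi limit. The quantity parser is a simplified, hypothetical helper that handles only the binary suffixes used on this page.

```python
# Binary suffixes used on this page; other Kubernetes suffixes are not handled.
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as "128Mi" to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


queue_limits = {"traces": "128Mi", "metrics": "64Mi", "logs": "128Mi"}
container_limit = parse_quantity("1Gi")

total = sum(parse_quantity(v) for v in queue_limits.values())
share = total / container_limit
print(f"queues: {total // 1024 ** 2}Mi ({share:.0%} of container limit)")
```

The remainder has to cover the ring buffer, connection tracking, and the process itself, so leave generous headroom.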

Reduce Cardinality

High-cardinality labels increase memory usage:

agent:
  kubernetes:
    # Only essential labels
    label_allowlist:
      - "app"
      - "version"
    # NOT: "*"

Limit Active Connections Tracked

agent:
  ebpf:
    network:
      # Limit tracked connections
      max_connections: 50000  # Default: 100000

Network/Export Optimization

Compression

otlp:
  compression: gzip  # Reduce bandwidth

Batching

queues:
  traces:
    batch_size: 512     # Larger batches = fewer requests
    flush_interval: 5s  # Don't wait too long
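
To see the effect of `batch_size`, estimate the number of export requests per second for a given span rate. The sketch assumes a batch is sent either when it is full or when `flush_interval` elapses, whichever comes first. That semantics is an assumption for illustration and is not confirmed by this page.

```python
def export_requests_per_second(spans_per_second: float,
                               batch_size: int,
                               flush_interval_s: float) -> float:
    """Approximate export request rate under the assumed flush semantics."""
    if spans_per_second <= 0:
        return 0.0
    full_batches = spans_per_second / batch_size   # batches filled per second
    timed_flushes = 1.0 / flush_interval_s         # floor set by the timer
    return max(full_batches, timed_flushes)


for batch in (256, 512, 1024):
    rate = export_requests_per_second(10_000, batch, 5.0)
    print(f"batch_size={batch:>4}: ~{rate:.1f} requests/s")
```

Doubling the batch size roughly halves the request rate until the flush timer becomes the limiting factor.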

Connection Pooling

otlp:
  max_connections: 10  # Connection pool size
  idle_timeout: 60s

Sampling

Head-Based Sampling

Sample at collection time:

otlp:
  traces:
    sample_rate: 0.1  # 10% of traces
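
A common way to implement head-based sampling is to derive the decision from the trace ID, so every service makes the same choice for the same trace. The sketch below follows the approach used by OpenTelemetry's trace-ID ratio samplers. It is an illustration only; this page does not state how Telegen makes the decision internally.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, sample_rate: float) -> bool:
    """Keep the trace if its low 64 bits fall below rate * 2**64."""
    bound = round(sample_rate * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


# Simulate random trace IDs and measure the fraction that is kept.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of traces")
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, never split.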

Tail-Based Sampling

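To decide after a trace completes (for example, keep all errors and slow requests but only a fraction of the rest), configure tail sampling in your OTel Collector: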

# OTel Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
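
The three policies above combine as a logical OR: a trace is kept if any policy selects it. The sketch below models that decision in simplified form. The `Trace` class is hypothetical, the exact boundary behavior at the latency threshold is an assumption, and the real processor derives the probabilistic decision from the trace ID rather than from a random number generator.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    has_error: bool      # any span finished with status ERROR
    duration_ms: float   # end-to-end trace duration


def keep_trace(trace: Trace, rng: random.Random,
               threshold_ms: float = 1000.0,
               sampling_percentage: float = 10.0) -> bool:
    """Return True if any of the three policies selects the trace."""
    if trace.has_error:                      # policy "errors"
        return True
    if trace.duration_ms >= threshold_ms:    # policy "slow"
        return True
    # policy "sample": keep a fixed percentage of everything else
    return rng.random() * 100.0 < sampling_percentage


rng = random.Random(7)
for t in (Trace(False, 120.0), Trace(True, 80.0), Trace(False, 2500.0)):
    print(t, "->", "keep" if keep_trace(t, rng) else "drop")
```

With these policies the exported volume is all errors, all slow traces, and about 10% of the remaining traffic.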

Per-Feature Tuning

Profiling

agent:
  profiling:
    # Lower sample rate for less overhead
    sample_rate: 49  # Hz
    
    # Longer upload interval
    upload_interval: 120s
    
    # Disable unused profile types
    mutex: false
    block: false
    goroutine: false

Security Monitoring

agent:
  security:
    # Focus on critical syscalls only
    syscall_audit:
      syscalls:
        - execve
        - setuid
        - ptrace
      # NOT all syscalls
    
    # Limit file paths
    file_integrity:
      paths:
        - /etc/passwd
        - /etc/shadow
      # NOT: /var/**

Network Monitoring

agent:
  network:
    # Use sampling for high-volume traffic
    tcp:
      sample_rate: 10  # 1 in 10 connections
    
    # XDP sampling
    xdp:
      sample_rate: 1000  # 0.1% of packets
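
Note that `sample_rate` here is a "1 in N" divisor, unlike the fractional `sample_rate` under `otlp.traces` (0.1 = 10%) or the profiling `sample_rate` in Hz. The helper below, a hypothetical convenience rather than part of Telegen, converts a divisor to a fraction and estimates the sampled volume.

```python
def one_in_n_to_fraction(n: int) -> float:
    """Convert a "1 in N" sample rate to a fraction between 0 and 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / n


def sampled_per_second(events_per_second: float, n: int) -> float:
    """Expected number of events kept per second at 1-in-N sampling."""
    return events_per_second * one_in_n_to_fraction(n)


print(f"tcp sample_rate 10   -> {one_in_n_to_fraction(10):.1%} of connections")
print(f"xdp sample_rate 1000 -> {one_in_n_to_fraction(1000):.1%} of packets")
print(f"2M packets/s at 1000 -> ~{sampled_per_second(2_000_000, 1000):,.0f} packets/s")
```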

High-Volume Environments

Recommended Configuration

For environments handling more than 10K requests/second:

telegen:
  log_level: warn  # Reduce logging

agent:
  ebpf:
    ringbuf_size: 134217728  # 128MB
    perf_buffer_size: 32768  # 32KB per CPU
    
    network:
      exclude_paths:
        - "/health*"
        - "/ready*"
        - "/metrics"
      exclude_ports:
        - 22
        - 2379
        - 2380
        - 10250
  
  resources:
    cpu_limit: 2.0
    memory_limit: "2Gi"
    rate_limit:
      spans_per_second: 100000
      metrics_per_second: 200000

otlp:
  compression: gzip
  
queues:
  traces:
    mem_limit: "512Mi"
    batch_size: 1024

Low-Resource Environments

Minimal Configuration

For resource-constrained environments:

telegen:
  log_level: error

agent:
  ebpf:
    ringbuf_size: 4194304  # 4MB
    
    network:
      enabled: true
      http: true
      grpc: false
      dns: false
    
    syscalls:
      enabled: false
  
  profiling:
    enabled: false
  
  security:
    enabled: false

queues:
  traces:
    mem_limit: "64Mi"
    batch_size: 128

Kubernetes Resources

resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"

Monitoring Performance

Key Metrics to Watch

# CPU usage (in cores)
rate(telegen_process_cpu_seconds_total[5m])

# Memory usage
telegen_process_resident_memory_bytes

# Event loss ratio (lost events relative to events processed)
rate(telegen_ebpf_ringbuf_lost_total[5m]) / rate(telegen_ebpf_ringbuf_events_total[5m])

# Export latency
histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m]))

# Queue depth
telegen_export_queue_size

Performance Alerts

groups:
  - name: telegen-performance
    rules:
      - alert: TelegenHighCPU
        expr: rate(telegen_process_cpu_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Telegen CPU usage above 0.8 cores"
      
      - alert: TelegenHighMemory
        expr: telegen_process_resident_memory_bytes > 1.5e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen memory above 1.5GB"
      
      - alert: TelegenExportSlow
        expr: histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen export P99 latency high"

Benchmarking

Test Configuration

Benchmark configuration changes before deploying them:

# Generate test load
hey -n 10000 -c 100 http://your-app:8080/api/test

# Monitor Telegen metrics
watch -n 1 'curl -s http://localhost:19090/metrics | grep -E "cpu|memory|lost"'

Compare Before/After

1. Baseline current configuration
2. Apply changes
3. Run same load test
4. Compare metrics
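
The comparison can be scripted by saving the `/metrics` output before and after the load test and diffing the counters. The sketch below parses two snapshots and computes the event loss ratio for the run. The sample values and the `cpu` label are made up for illustration, and the parser is simplified: it sums all series of a metric and does not handle optional timestamps.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Sum Prometheus text-format samples by metric name, ignoring labels."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP, TYPE
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def loss_ratio(before: dict[str, float], after: dict[str, float]) -> float:
    """Lost events divided by total events between the two snapshots."""
    lost = after["telegen_ebpf_ringbuf_lost_total"] - before["telegen_ebpf_ringbuf_lost_total"]
    total = after["telegen_ebpf_ringbuf_events_total"] - before["telegen_ebpf_ringbuf_events_total"]
    return lost / total if total > 0 else 0.0


# Made-up snapshots; in practice, save the curl output to files and read them.
BEFORE = """
# TYPE telegen_ebpf_ringbuf_lost_total counter
telegen_ebpf_ringbuf_lost_total{cpu="0"} 10
telegen_ebpf_ringbuf_lost_total{cpu="1"} 5
telegen_ebpf_ringbuf_events_total 100000
"""
AFTER = """
telegen_ebpf_ringbuf_lost_total{cpu="0"} 40
telegen_ebpf_ringbuf_lost_total{cpu="1"} 25
telegen_ebpf_ringbuf_events_total 110000
"""

ratio = loss_ratio(parse_metrics(BEFORE), parse_metrics(AFTER))
print(f"loss ratio during test: {ratio:.2%}")
```

Run it once for the baseline and once for the changed configuration, using the same load, and compare the two ratios.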

Best Practices Summary

Start conservative - Begin with defaults, tune based on actual needs
Monitor loss rates - If events are being lost, increase buffer sizes
Use sampling - For high-volume workloads, sample rather than drop events
Filter noise - Exclude health checks and internal traffic
Batch efficiently - Larger batches reduce export overhead
Set limits - Protect against runaway memory usage
Test changes - Benchmark before and after tuning

Next Steps

Monitoring - Set up performance monitoring
Troubleshooting - Diagnose performance issues
Full Reference - All configuration options