Troubleshooting
Common issues and solutions for Telegen.
Quick Diagnostics
Check Telegen Status
# Health check
curl http://localhost:19090/healthz
# Readiness
curl http://localhost:19090/ready
# Full status
curl http://localhost:19090/status
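The three probes above can be wrapped in one helper that reports which endpoint fails. This is a minimal sketch, not a Telegen command: the function name `check_endpoints` and the 5-second timeout are assumptions, while the paths are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper; not shipped with Telegen.

# check_endpoints BASE_URL: probe /healthz, /ready and /status in turn,
# print one line per endpoint, and return non-zero if any probe fails.
check_endpoints() (
    failed=0
    for path in healthz ready status; do
        # -f makes curl return non-zero on HTTP errors (4xx/5xx).
        if curl -fsS -o /dev/null --max-time 5 "$1/$path"; then
            echo "ok   /$path"
        else
            echo "FAIL /$path"
            failed=1
        fi
    done
    [ "$failed" -eq 0 ]
)
```

For example, `check_endpoints http://localhost:19090` prints one line per endpoint, and its exit status can gate a deployment script.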
Check eBPF Programs
# List loaded programs
bpftool prog list | grep -i telegen
# Check if eBPF is working
cat /sys/kernel/debug/tracing/trace_pipe | head -20
Check Logs
# Kubernetes
kubectl logs -l app=telegen -n monitoring --tail=100
# Docker
docker logs telegen --tail=100
# Systemd
journalctl -u telegen -f
Common Issues
eBPF Program Load Failures
Symptom: Telegen starts but shows “eBPF program load failed”
Causes and Solutions:
- Kernel too old
- Minimum: Linux 4.18
- Recommended: Linux 5.8+
uname -r # Check kernel version
- Missing capabilities
# Docker
docker run --privileged ...
# Kubernetes
securityContext:
  privileged: true
- BPF filesystem not mounted
mount | grep bpf
# Should show: bpffs on /sys/fs/bpf type bpf
# Mount if missing
mount -t bpf bpf /sys/fs/bpf
- BTF not available
ls /sys/kernel/btf/vmlinux # Should exist for CO-RE support
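The kernel, bpffs, and BTF checks above can be combined into a single preflight script. This is a sketch under assumptions: the function names `version_ge` and `preflight` are hypothetical, only the major and minor version numbers are compared, and the 4.18 minimum is the one stated above.

```shell
#!/bin/sh
# Hypothetical preflight helper; not shipped with Telegen.

# version_ge A B: succeed when version A >= version B (major.minor only).
version_ge() (
    a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%[!0-9]*}
    b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%[!0-9]*}
    [ "$a_major" -gt "$b_major" ] && exit 0
    [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]
)

# preflight: report on the three host prerequisites listed above.
preflight() {
    kernel=$(uname -r)
    if version_ge "$kernel" 4.18; then
        echo "ok: kernel $kernel >= 4.18"
    else
        echo "fail: kernel $kernel is older than 4.18"
    fi
    if grep -q ' /sys/fs/bpf bpf ' /proc/mounts 2>/dev/null; then
        echo "ok: bpffs mounted at /sys/fs/bpf"
    else
        echo "warn: bpffs not mounted (mount -t bpf bpf /sys/fs/bpf)"
    fi
    if [ -e /sys/kernel/btf/vmlinux ]; then
        echo "ok: BTF available"
    else
        echo "warn: /sys/kernel/btf/vmlinux missing; CO-RE may not work"
    fi
}
```

Run `preflight` on the host (or in a container with the host's `/sys` mounted) before starting Telegen.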
No Traces Being Collected
Symptom: Telegen running but no traces in backend
Diagnostics:
# Check if spans are being collected
curl -s http://localhost:19090/metrics | grep telegen_spans
# Check export status
curl -s http://localhost:19090/metrics | grep telegen_export
Solutions:
- OTLP endpoint unreachable
# Test connectivity
nc -zv otel-collector 4317
# Check DNS
nslookup otel-collector
- Network tracing disabled
agent:
  ebpf:
    network:
      enabled: true # Ensure enabled
- Wrong port configuration
agent:
  ebpf:
    network:
      include_ports:
        - 80
        - 443
        - 8080 # Add your app ports
- TLS issues
otlp:
  endpoint: "otel-collector:4317"
  insecure: true # Try without TLS first
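To tell whether the problem is collection or export, it can help to reduce the two metric families from the diagnostics to a single number each. The helper below is a sketch: `metric_sum` is a hypothetical name, the `telegen_spans` and `telegen_export` prefixes are the ones grepped for above, and it assumes samples carry no trailing timestamp.

```shell
#!/bin/sh
# Hypothetical helper; not shipped with Telegen.

# metric_sum PREFIX: read Prometheus text format on stdin and print the sum
# of all samples whose metric name starts with PREFIX. Comment lines are
# skipped. Assumes the value is the last field (no timestamp column).
metric_sum() {
    awk -v p="$1" '
        /^#/ { next }
        index($1, p) == 1 { sum += $NF }
        END { printf "%d\n", sum }
    '
}
```

For example, `curl -s http://localhost:19090/metrics | metric_sum telegen_spans` prints the total across all label sets. A non-zero spans total with a zero export total points at the exporter or the endpoint rather than at eBPF collection.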
High Memory Usage
Symptom: Telegen using excessive memory
Diagnostics:
# Check memory metrics
curl -s http://localhost:19090/metrics | grep telegen_process_resident_memory
# Check queue sizes
curl -s http://localhost:19090/metrics | grep telegen_export_queue
Solutions:
- Reduce ring buffer size
agent:
  ebpf:
    ringbuf_size: 8388608 # 8MB instead of 16MB
- Limit queue memory
queues:
  traces:
    mem_limit: "128Mi"
  metrics:
    mem_limit: "64Mi"
- Increase export frequency
- Check if backend is slow
- Reduce batch sizes
queues:
  traces:
    batch_size: 256 # Smaller batches
Event Loss (Ring Buffer)
Symptom: telegen_ebpf_ringbuf_lost_total increasing
Diagnostics:
# Check loss rate
curl -s http://localhost:19090/metrics | grep ringbuf_lost
Solutions:
- Increase ring buffer size
agent:
  ebpf:
    ringbuf_size: 67108864 # 64MB
- Reduce event volume
agent:
  ebpf:
    network:
      exclude_ports:
        - 22 # SSH
        - 2379 # etcd
    syscalls:
      exclude:
        - futex
        - nanosleep
- Check CPU bottleneck
- Telegen may not be processing fast enough
- Increase CPU limits
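The `ringbuf_size` values in this guide are byte counts (8388608 is 8 × 1024 × 1024). The Linux kernel requires a BPF ring buffer size to be a power of two and a multiple of the page size; whether Telegen rounds or rejects other values is not stated here, so treat that as an assumption. The helper names below are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for computing a ringbuf_size value.

# mib_to_bytes N: print N MiB expressed in bytes.
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

# is_power_of_two N: succeed when N is a positive power of two.
# A power of two has exactly one bit set, so N & (N - 1) is zero.
is_power_of_two() { [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]; }
```

For example, `mib_to_bytes 64` prints `67108864`, which passes `is_power_of_two`, while 48 MiB (`50331648`) does not.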
Export Errors
Symptom: telegen_export_errors_total increasing
Diagnostics:
# Check specific errors
curl -s http://localhost:19090/metrics | grep export_errors
# Check logs
grep -i "export" /var/log/telegen.log | tail -20
Solutions:
- Connection refused
# Verify endpoint
curl -v http://otel-collector:4317
# Check endpoint config
cat /etc/telegen/config.yaml | grep endpoint
- TLS certificate errors
otlp:
  tls:
    ca_file: "/etc/ssl/certs/ca.crt"
    insecure_skip_verify: false # Ensure CA is correct
- Authentication failures
otlp:
  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"
- Backend overloaded
- Increase retry backoff
- Check backend capacity
backoff:
  initial: "1s"
  max: "60s"
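To get a feel for what the `initial` and `max` backoff settings mean in practice, the sketch below prints a retry schedule. It assumes the delay doubles after each failed attempt, which is a common policy but is not confirmed for Telegen; `backoff_schedule` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical illustration of exponential backoff; the doubling factor is
# an assumption, not documented Telegen behavior.

# backoff_schedule INITIAL MAX ATTEMPTS: print the delay in seconds before
# each retry, doubling every time and capping at MAX.
backoff_schedule() (
    delay=$1; max=$2; n=$3; i=0
    while [ "$i" -lt "$n" ]; do
        echo "$delay"
        delay=$(( delay * 2 ))
        [ "$delay" -gt "$max" ] && delay=$max
        i=$(( i + 1 ))
    done
)
```

With `backoff_schedule 1 60 8` the delays are 1, 2, 4, 8, 16, 32, 60, 60 seconds, so under this assumption an overloaded backend sees at most one retry per minute from each exporter once the cap is reached.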
Missing Kubernetes Metadata
Symptom: Traces lack k8s.pod.name, k8s.namespace labels
Diagnostics:
# Check if running in K8s
kubectl get pods -l app=telegen -n monitoring
# Check RBAC
kubectl auth can-i get pods --as=system:serviceaccount:monitoring:telegen
Solutions:
- Missing RBAC permissions
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: telegen
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "services"]
    verbs: ["get", "list", "watch"]
```
- Downward API not configured
```yaml
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
```
- Discovery disabled
agent:
  discovery:
    detect_kubernetes: true
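The single `kubectl auth can-i` check in the diagnostics covers only `get pods`. The sketch below extends it to every verb and resource in the ClusterRole shown above. The function name `check_rbac` is hypothetical, and the check runs against the current namespace, as the original command does.

```shell
#!/bin/sh
# Hypothetical helper; requires kubectl and permission to impersonate.

# check_rbac NAMESPACE SERVICEACCOUNT: print one line per verb/resource pair
# and return non-zero if any permission is missing.
check_rbac() (
    sa="system:serviceaccount:$1:$2"
    missing=0
    for resource in pods nodes services; do
        for verb in get list watch; do
            # can-i exits 0 when the action is allowed, non-zero otherwise.
            if kubectl auth can-i "$verb" "$resource" --as="$sa" >/dev/null 2>&1; then
                echo "ok      $verb $resource"
            else
                echo "MISSING $verb $resource"
                missing=1
            fi
        done
    done
    [ "$missing" -eq 0 ]
)
```

For the deployment used in this guide, call `check_rbac monitoring telegen`. Any `MISSING` line names a rule to add to the ClusterRole or a binding that is not in place.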
No GPU Metrics
Symptom: GPU metrics not appearing
Diagnostics:
# Check NVML
nvidia-smi
# Check if device is mounted
ls /dev/nvidia*
Solutions:
- NVML not available
- Ensure NVIDIA drivers installed
- Mount NVIDIA device in container
```yaml
volumes:
  - /dev/nvidia0:/dev/nvidia0
  - /dev/nvidiactl:/dev/nvidiactl
```
- Container not GPU-enabled
spec:
  runtimeClassName: nvidia
  containers:
    - name: telegen
      resources:
        limits:
          nvidia.com/gpu: 0 # Access without allocating
- GPU monitoring disabled
agent:
  gpu:
    enabled: true
    nvidia: true
Profiling Not Working
Symptom: No profiles in backend
Diagnostics:
# Check profiling enabled
curl -s http://localhost:19090/metrics | grep profile
# Check perf_event access
cat /proc/sys/kernel/perf_event_paranoid
Solutions:
- perf_event_paranoid too restrictive
# Temporary
sysctl kernel.perf_event_paranoid=1
# Permanent
echo 'kernel.perf_event_paranoid=1' >> /etc/sysctl.conf
- Missing capability
securityContext:
  capabilities:
    add:
      - SYS_ADMIN # or PERFMON on newer kernels
- Profiling disabled
agent:
  profiling:
    enabled: true
Container Not Starting
Symptom: Container exits immediately
Diagnostics:
# Check exit code
docker inspect telegen --format='{{.State.ExitCode}}'
# Check last logs
docker logs telegen 2>&1 | tail -50
Solutions:
- Config file error
# Validate config
telegen --validate-config /etc/telegen/config.yaml
- Required mounts missing
docker run -d \
  -v /sys:/sys:ro \
  -v /proc:/host/proc:ro \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /sys/fs/bpf:/sys/fs/bpf \
  ...
- Kernel version mismatch
- BTF for wrong kernel
- Use -fno-BTF builds or a matching kernel
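The exit code returned by `docker inspect` in the diagnostics narrows down which of these causes applies. The sketch below maps a code to a hint using general shell and Docker conventions (codes above 128 mean 128 plus a signal number). The function name `explain_exit_code` is hypothetical and the hints are not Telegen-specific.

```shell
#!/bin/sh
# Hypothetical helper; interprets exit codes by general convention only.

# explain_exit_code CODE: print a short hint for a container exit code.
explain_exit_code() {
    case "$1" in
        0)   echo "exited cleanly" ;;
        126) echo "command found but not executable" ;;
        127) echo "command not found in the image" ;;
        *)
            if [ "$1" -gt 128 ]; then
                # By convention, 128 + N means the process died from signal N.
                echo "killed by signal $(( $1 - 128 ))"
            else
                echo "application error; check the logs"
            fi ;;
    esac
}
```

For example, `explain_exit_code "$(docker inspect telegen --format='{{.State.ExitCode}}')"` reports "killed by signal 9" for exit code 137, which usually means the container was killed, often by the out-of-memory killer. Low codes such as 1 are more likely a config or mount error that shows up in the logs.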
Debug Mode
Enable comprehensive debugging:
telegen:
  log_level: debug
agent:
  ebpf:
    debug: true
Or via environment:
TELEGEN_LOG_LEVEL=debug \
TELEGEN_AGENT_EBPF_DEBUG=true \
telegen
Getting Help
Collect Diagnostics
# Create diagnostic bundle
telegen diagnostics > telegen-diagnostics.tar.gz
Bundle includes:
- Configuration (sanitized)
- Metrics snapshot
- eBPF program list
- Kernel info
- Recent logs
Log an Issue
When reporting issues, include:
- Telegen version: telegen version
- Kernel version: uname -a
- Distribution: cat /etc/os-release
- Diagnostic bundle
- Steps to reproduce
Next Steps
- Monitoring - Set up monitoring
- Performance Tuning - Optimize performance
- Full Reference - Configuration options