AI/ML Observability
Telegen provides observability for AI/ML workloads, including GPU monitoring and LLM inference metrics.
Overview
AI/ML observability includes:
- GPU monitoring - NVIDIA and AMD GPU metrics
- LLM inference - Token throughput, latency, time to first token (TTFT)
- Model serving - Batch size, queue depth, inference time
- Training metrics - Loss, throughput, GPU utilization
GPU Monitoring
NVIDIA GPU Metrics
Telegen collects NVIDIA GPU metrics via NVML (NVIDIA Management Library):
| Metric | Description |
|---|---|
| gpu_utilization_percent | GPU compute utilization |
| gpu_memory_used_bytes | GPU memory used |
| gpu_memory_total_bytes | GPU memory total |
| gpu_memory_free_bytes | GPU memory free |
| gpu_temperature_celsius | GPU temperature |
| gpu_power_usage_watts | Current power draw |
| gpu_power_limit_watts | Power limit |
| gpu_sm_clock_hz | Streaming multiprocessor clock |
| gpu_memory_clock_hz | Memory clock |
| gpu_pcie_tx_bytes | PCIe transmit throughput |
| gpu_pcie_rx_bytes | PCIe receive throughput |
| gpu_encoder_utilization_percent | Video encoder utilization |
| gpu_decoder_utilization_percent | Video decoder utilization |
Per-Process GPU Metrics
Track GPU usage per process:
| Metric | Description |
|---|---|
| gpu_process_memory_bytes | Memory used by process |
| gpu_process_sm_utilization_percent | SM utilization by process |
Configuration
agent:
  gpu:
    enabled: true
    # NVIDIA support
    nvidia: true
    # AMD support (via ROCm SMI)
    amd: false
    # Polling interval
    poll_interval: 10s
    # Metrics to collect
    metrics:
      utilization: true
      memory: true
      temperature: true
      power: true
      clock: true
      pcie_throughput: true
      encoder_decoder: true
    # Per-process tracking
    per_process: true
AMD GPU Metrics
For AMD GPUs, Telegen uses ROCm SMI:
| Metric | Description |
|---|---|
| gpu_utilization_percent | GPU utilization |
| gpu_memory_used_bytes | VRAM used |
| gpu_memory_total_bytes | VRAM total |
| gpu_temperature_celsius | GPU temperature |
| gpu_power_usage_watts | Power consumption |
| gpu_fan_speed_percent | Fan speed |
Configuration
agent:
  gpu:
    enabled: true
    nvidia: false
    amd: true
    poll_interval: 10s
LLM Inference Metrics
Track LLM inference performance:
Key Metrics
| Metric | Description |
|---|---|
| llm_request_total | Total inference requests |
| llm_request_duration_seconds | End-to-end request duration |
| llm_time_to_first_token_seconds | Time to first token (TTFT) |
| llm_inter_token_latency_seconds | Time between tokens |
| llm_tokens_generated_total | Total tokens generated |
| llm_tokens_per_second | Token generation throughput |
| llm_prompt_tokens_total | Input prompt tokens |
| llm_queue_depth | Requests waiting in queue |
| llm_batch_size | Current batch size |
| llm_kv_cache_usage_bytes | KV cache memory usage |
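The timing metrics in this table can all be derived from per-token timestamps of a streamed response. The following is a minimal sketch of that arithmetic, not Telegen's implementation: `RequestTiming` and `summarize_request` are hypothetical names, and it assumes throughput is measured over the whole request, including the wait for the first token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Timing summary for one streamed inference request (hypothetical helper)."""
    ttft_seconds: float
    mean_inter_token_latency_seconds: float
    tokens_per_second: float


def summarize_request(start: float, token_times: List[float]) -> RequestTiming:
    """Derive TTFT, mean inter-token latency, and throughput.

    start       -- time the request was received, in seconds
    token_times -- arrival time of each generated token, ascending, in seconds
    """
    if not token_times:
        raise ValueError("at least one generated token is required")
    # Time to first token: delay between the request and the first token.
    ttft = token_times[0] - start
    # Inter-token latency: gaps between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    # Throughput over the whole request (assumption: includes the TTFT wait).
    duration = token_times[-1] - start
    tps = len(token_times) / duration if duration > 0 else 0.0
    return RequestTiming(ttft, mean_itl, tps)
```

For example, `summarize_request(0.0, [0.5, 1.0, 1.5, 2.0])` describes a request whose first token arrives after 0.5 s and whose later tokens arrive 0.5 s apart.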
Example Metrics
# P95 time to first token
histogram_quantile(0.95,
  sum(rate(llm_time_to_first_token_seconds_bucket[5m])) by (le, model)
)
# Token throughput
sum(rate(llm_tokens_generated_total[5m])) by (model)
# Request rate by model
sum(rate(llm_request_total[5m])) by (model)
# Queue depth
llm_queue_depth{model="llama-3-70b"}
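To make the `histogram_quantile` query above more concrete, here is a simplified sketch of how a quantile is estimated from cumulative `_bucket` counts: find the bucket that contains the target rank, then interpolate linearly inside it. This is an illustration, not Prometheus's code. It assumes buckets are sorted by upper bound, end with `+Inf`, and that the lowest bucket starts at 0.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with an +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # The +Inf bucket has no finite width, so report the highest
            # finite bound instead of interpolating.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.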
Labels
| Label | Description |
|---|---|
| model | Model name/version |
| instance | Server instance |
| gpu | GPU device index |
Model Serving Frameworks
Supported Frameworks
| Framework | Auto-Instrumentation |
|---|---|
| vLLM | ✅ Full metrics |
| TGI (Text Generation Inference) | ✅ Full metrics |
| NVIDIA Triton | ✅ Full metrics |
| TensorFlow Serving | ✅ Basic metrics |
| TorchServe | ✅ Basic metrics |
| ONNX Runtime | ✅ Basic metrics |
vLLM Integration
agent:
  aiml:
    frameworks:
      vllm:
        enabled: true
        # Collect all vLLM metrics
        metrics:
          - request_duration
          - time_to_first_token
          - tokens_per_second
          - kv_cache_usage
          - batch_size
Triton Integration
agent:
  aiml:
    frameworks:
      triton:
        enabled: true
        metrics_endpoint: "http://localhost:8002/metrics"
Training Observability
Monitor ML training jobs:
Metrics
| Metric | Description |
|---|---|
| training_loss | Current training loss |
| training_step | Current training step |
| training_epoch | Current epoch |
| training_learning_rate | Current learning rate |
| training_throughput_samples_per_second | Training throughput |
| training_gpu_utilization_percent | GPU utilization during training |
| training_gradient_norm | Gradient norm |
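Two of these metrics are simple derived quantities. The sketch below shows one plausible way to compute them. It assumes `training_gradient_norm` is the global L2 norm over all parameter gradients and that throughput is samples processed divided by wall-clock time. The source does not state either definition, and both function names are hypothetical.

```python
import math
from typing import Iterable, Sequence


def global_gradient_norm(gradients: Iterable[Sequence[float]]) -> float:
    """L2 norm over every gradient value of every parameter tensor."""
    return math.sqrt(sum(g * g for tensor in gradients for g in tensor))


def throughput_samples_per_second(samples: int, elapsed_seconds: float) -> float:
    """Samples processed per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return samples / elapsed_seconds
```

A gradient norm that suddenly spikes or collapses toward zero is a common early sign of unstable training, which is why it is worth charting next to `training_loss`.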
Configuration
agent:
  aiml:
    training:
      enabled: true
      # Detect common training frameworks
      detect_frameworks:
        - pytorch
        - tensorflow
        - jax
      # Log training metrics to OTLP
      export_metrics: true
Multi-GPU Monitoring
GPU Labels
All GPU metrics include device labels:
gpu_utilization_percent{
  device="0",
  name="NVIDIA A100-SXM4-80GB",
  uuid="GPU-abc123",
  k8s_pod="llm-server-abc"
} 85.5
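If you need to inspect labeled samples like the one above in a script, a small parser is enough for simple cases. The following is a minimal sketch with a hypothetical `parse_sample` helper. It assumes label values contain no escaped quotes and ignores optional timestamps, so it is not a full parser for the Prometheus text format.

```python
import re
from typing import Dict, Tuple

# metric_name, optional {label block}, then the sample value.
_SAMPLE = re.compile(
    r'^\s*([a-zA-Z_:][a-zA-Z0-9_:]*)\s*(?:\{(.*)\})?\s+(\S+)\s*$', re.DOTALL
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"([^"]*)"')


def parse_sample(text: str) -> Tuple[str, Dict[str, str], float]:
    """Split one sample into (metric name, labels, value)."""
    match = _SAMPLE.match(text)
    if match is None:
        raise ValueError("not a recognizable sample: %r" % text)
    name, label_block, value = match.groups()
    labels = dict(_LABEL.findall(label_block or ""))
    return name, labels, float(value)
```

The `device` and `uuid` labels identify one physical GPU, so they are the natural keys when joining GPU metrics with pod-level data.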
Multi-Node Training
Track distributed training across nodes:
# Average GPU utilization across all training nodes
avg(gpu_utilization_percent{job="distributed-training"})
# GPU memory per node
sum by (node) (gpu_memory_used_bytes{job="distributed-training"})
# Communication overhead (NCCL)
rate(gpu_nccl_send_bytes_total[5m]) + rate(gpu_nccl_recv_bytes_total[5m])
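The `rate()` calls above turn ever-increasing byte counters into bytes per second. As a rough illustration of what that means, the sketch below computes a per-second rate from raw counter samples and tolerates counter resets. It is a simplification: real PromQL `rate()` also extrapolates to the window boundaries, and `simple_rate` is a hypothetical name.

```python
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter across the given samples.

    samples -- (timestamp_seconds, counter_value) pairs in ascending time order
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the new value
        # itself is the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed
```

Adding the send and receive rates, as the query does, gives total NCCL traffic per second for each series.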
Alerting Examples
GPU Alerts
groups:
  - name: gpu
    rules:
      - alert: GPUHighTemperature
        expr: gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.device }} temperature is {{ $value }}°C"
      - alert: GPUOutOfMemory
        expr: gpu_memory_used_bytes / gpu_memory_total_bytes > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.device }} memory is {{ $value | humanizePercentage }}"
      - alert: GPULowUtilization
        expr: gpu_utilization_percent < 10
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.device }} underutilized"
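Each rule above combines a threshold with a `for` duration, so a brief spike does not page anyone. The sketch below illustrates that behavior for a "greater than" rule such as `GPUHighTemperature`: the alert is pending while the condition holds and fires only once it has held continuously for the full duration. `alert_states` is a hypothetical helper that models the semantics and is not how the rule evaluator is implemented.

```python
from typing import List, Optional, Tuple


def alert_states(
    samples: List[Tuple[float, float]], threshold: float, for_seconds: float
) -> List[str]:
    """Return 'inactive', 'pending', or 'firing' for each (timestamp, value)."""
    states = []
    active_since: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any dip below the threshold resets the timer
            states.append("inactive")
    return states
```

With a 60-second evaluation interval and `for: 5m`, a GPU must stay above 85°C for six consecutive evaluations before the alert fires.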
LLM Alerts
groups:
  - name: llm
    rules:
      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency is {{ $value | humanizeDuration }}"
      - alert: LLMHighQueueDepth
        expr: llm_queue_depth > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM queue depth is {{ $value }}"
      - alert: LLMSlowTTFT
        expr: histogram_quantile(0.95, rate(llm_time_to_first_token_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM time-to-first-token P95 is {{ $value | humanizeDuration }}"
Kubernetes GPU Support
NVIDIA GPU Operator
When using the NVIDIA GPU Operator in Kubernetes:
# DaemonSet config
spec:
  containers:
    - name: telegen
      resources:
        limits:
          nvidia.com/gpu: 0  # Don't request GPU, just monitor
      volumeMounts:
        # Mount NVML socket
        - name: nvidia-mps
          mountPath: /var/run/nvidia
  volumes:
    - name: nvidia-mps
      hostPath:
        path: /var/run/nvidia
MIG (Multi-Instance GPU) Support
Monitor MIG partitions:
gpu_utilization_percent{
  device="0",
  mig_device="mig-1g.5gb-0",
  mig_profile="1g.5gb"
} 75.2
Dashboard Examples
GPU Overview
# GPU fleet summary: average utilization per GPU model
avg by (name) (gpu_utilization_percent)
# Memory pressure
sum(gpu_memory_used_bytes) / sum(gpu_memory_total_bytes) * 100
# Power efficiency (tokens per watt)
sum(rate(llm_tokens_generated_total[5m])) / sum(gpu_power_usage_watts)
LLM Performance
# Requests per second
sum(rate(llm_request_total[5m])) by (model)
# Token generation rate
sum(rate(llm_tokens_generated_total[5m])) by (model)
# Latency percentiles
histogram_quantile(0.50, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
histogram_quantile(0.99, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
Best Practices
1. Enable Per-Process Tracking
Identify which processes use GPU resources:
agent:
  gpu:
    per_process: true
2. Monitor KV Cache
KV cache is critical for LLM performance:
# Alert when KV cache is near capacity
llm_kv_cache_usage_bytes / llm_kv_cache_capacity_bytes > 0.9
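KV cache memory grows linearly with the number of tokens held in memory, which is why it tends to be the limiting resource for batch size and context length. As a back-of-the-envelope sketch, the helpers below estimate cache size per token and apply the same 0.9 threshold as the query above. The function names are hypothetical, and the model shape in the usage note is an illustrative assumption rather than a value reported by Telegen.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_element: int = 2
) -> int:
    """Bytes of KV cache needed for one token.

    The factor 2 accounts for storing one key and one value vector
    per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_near_capacity(
    used_bytes: float, capacity_bytes: float, threshold: float = 0.9
) -> bool:
    """True when usage exceeds the given fraction of capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes / capacity_bytes > threshold
```

For a model with 80 layers, 8 KV heads, a head dimension of 128, and 16-bit values, `kv_cache_bytes_per_token(80, 8, 128)` gives 327,680 bytes, so a single 8,192-token sequence needs about 2.5 GiB of cache.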
3. Correlate with Traces
Link inference metrics to traces:
agent:
  aiml:
    # Add trace context to LLM metrics
    trace_correlation: true
Next Steps
- Continuous Profiling - Profile GPU workloads
- Agent Mode - GPU configuration
- Monitoring - GPU dashboards