AI/ML Observability

Telegen provides observability for AI/ML workloads, including GPU monitoring and LLM inference metrics.

Overview

AI/ML observability includes:

  • GPU monitoring - NVIDIA and AMD GPU metrics
  • LLM inference - Token throughput, latency, time to first token (TTFT)
  • Model serving - Batch size, queue depth, inference time
  • Training metrics - Loss, throughput, GPU utilization

GPU Monitoring

NVIDIA GPU Metrics

Telegen collects NVIDIA GPU metrics via NVML (NVIDIA Management Library):

Metric                           Description
gpu_utilization_percent          GPU compute utilization
gpu_memory_used_bytes            GPU memory used
gpu_memory_total_bytes           GPU memory total
gpu_memory_free_bytes            GPU memory free
gpu_temperature_celsius          GPU temperature
gpu_power_usage_watts            Current power draw
gpu_power_limit_watts            Power limit
gpu_sm_clock_hz                  Streaming multiprocessor clock
gpu_memory_clock_hz              Memory clock
gpu_pcie_tx_bytes                PCIe transmit throughput
gpu_pcie_rx_bytes                PCIe receive throughput
gpu_encoder_utilization_percent  Video encoder utilization
gpu_decoder_utilization_percent  Video decoder utilization
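
NVML reports some of these readings in units that differ from the metric names: `nvmlDeviceGetPowerUsage` returns milliwatts and `nvmlDeviceGetClockInfo` returns MHz. The Python sketch below shows one way such readings could be normalized. The keys of the `raw` dictionary and the function name are hypothetical, and the source does not describe how Telegen performs this step internally.

```python
def normalize_nvml_sample(raw: dict) -> dict:
    """Convert raw NVML-style readings into the units the metric names imply.

    The input keys are made up for this sketch; only the unit conversions
    (milliwatts -> watts, MHz -> Hz) reflect documented NVML behavior.
    """
    return {
        "gpu_utilization_percent": float(raw["utilization_gpu"]),
        "gpu_memory_used_bytes": raw["memory_used_bytes"],
        "gpu_memory_total_bytes": raw["memory_total_bytes"],
        "gpu_temperature_celsius": float(raw["temperature_c"]),
        "gpu_power_usage_watts": raw["power_usage_mw"] / 1000.0,
        "gpu_sm_clock_hz": raw["sm_clock_mhz"] * 1_000_000,
    }


sample = normalize_nvml_sample({
    "utilization_gpu": 85,
    "memory_used_bytes": 40 * 1024**3,
    "memory_total_bytes": 80 * 1024**3,
    "temperature_c": 62,
    "power_usage_mw": 250_000,
    "sm_clock_mhz": 1410,
})
print(sample["gpu_power_usage_watts"], sample["gpu_sm_clock_hz"])
```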

Per-Process GPU Metrics

Track GPU usage per process:

Metric                              Description
gpu_process_memory_bytes            Memory used by process
gpu_process_sm_utilization_percent  SM utilization by process

Configuration

agent:
  gpu:
    enabled: true
    
    # NVIDIA support
    nvidia: true
    
    # AMD support (via ROCm SMI)
    amd: false
    
    # Polling interval
    poll_interval: 10s
    
    # Metrics to collect
    metrics:
      utilization: true
      memory: true
      temperature: true
      power: true
      clock: true
      pcie_throughput: true
      encoder_decoder: true
      
    # Per-process tracking
    per_process: true

AMD GPU Metrics

For AMD GPUs, Telegen uses ROCm SMI:

Metric                   Description
gpu_utilization_percent  GPU utilization
gpu_memory_used_bytes    VRAM used
gpu_memory_total_bytes   VRAM total
gpu_temperature_celsius  GPU temperature
gpu_power_usage_watts    Power consumption
gpu_fan_speed_percent    Fan speed

Configuration

agent:
  gpu:
    enabled: true
    nvidia: false
    amd: true
    poll_interval: 10s

LLM Inference Metrics

Track LLM inference performance:

Key Metrics

Metric                           Description
llm_request_total                Total inference requests
llm_request_duration_seconds     End-to-end request duration
llm_time_to_first_token_seconds  Time to first token (TTFT)
llm_inter_token_latency_seconds  Time between tokens
llm_tokens_generated_total       Total tokens generated
llm_tokens_per_second            Token generation throughput
llm_prompt_tokens_total          Input prompt tokens
llm_queue_depth                  Requests waiting in queue
llm_batch_size                   Current batch size
llm_kv_cache_usage_bytes         KV cache memory usage
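
To make the latency metrics concrete, the Python sketch below derives time to first token, inter-token latency, and token throughput for a single request from its token timestamps. The `RequestTiming` type and the `summarize` function are hypothetical, and the exact definitions Telegen uses (for example, whether throughput includes prefill time) are not stated in the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTiming:
    arrival: float             # seconds: when the request was received
    token_times: List[float]   # seconds: when each output token was emitted


def summarize(t: RequestTiming) -> Dict[str, float]:
    """Derive per-request latency figures from token emission timestamps."""
    if not t.token_times:
        raise ValueError("request produced no tokens")
    ttft = t.token_times[0] - t.arrival
    duration = t.token_times[-1] - t.arrival
    # Gaps between consecutive tokens; empty when only one token was emitted.
    gaps = [b - a for a, b in zip(t.token_times, t.token_times[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    # Assumption: throughput is measured over the whole request, prefill included.
    tokens_per_second = len(t.token_times) / duration if duration > 0 else 0.0
    return {
        "time_to_first_token_seconds": ttft,
        "request_duration_seconds": duration,
        "inter_token_latency_seconds": inter_token,
        "tokens_per_second": tokens_per_second,
    }


print(summarize(RequestTiming(arrival=0.0, token_times=[0.5, 1.0, 1.5, 2.0])))
```

In this example the first token arrives after 0.5 s, later tokens follow every 0.5 s, and four tokens over a 2 s request give 2 tokens per second.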

Example Metrics

# P95 time to first token
histogram_quantile(0.95, 
  sum(rate(llm_time_to_first_token_seconds_bucket[5m])) by (le, model)
)

# Token throughput
sum(rate(llm_tokens_generated_total[5m])) by (model)

# Request rate by model
sum(rate(llm_request_total[5m])) by (model)

# Queue depth
llm_queue_depth{model="llama-3-70b"}
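
`histogram_quantile` estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below mirrors that idea for classic histograms; it is a simplified illustration rather than Prometheus's actual implementation, and the bucket boundaries in the example are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The highest bucket must have an upper bound of +Inf, as in Prometheus.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or an empty bucket) to interpolate in:
                # fall back to the last known finite boundary.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound  # not reached: the +Inf bucket always satisfies the check


ttft_buckets = [(0.5, 10), (1.0, 60), (2.0, 90), (float("inf"), 100)]
print(histogram_quantile(0.5, ttft_buckets))
print(histogram_quantile(0.95, ttft_buckets))
```

Because the result is interpolated, its accuracy depends on how finely the buckets cover the range of interest. A quantile that falls in the `+Inf` bucket is reported as the highest finite boundary.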

Labels

Label     Description
model     Model name/version
instance  Server instance
gpu       GPU device index

Model Serving Frameworks

Supported Frameworks

Framework                        Auto-Instrumentation
vLLM                             ✅ Full metrics
TGI (Text Generation Inference)  ✅ Full metrics
NVIDIA Triton                    ✅ Full metrics
TensorFlow Serving               ✅ Basic metrics
TorchServe                       ✅ Basic metrics
ONNX Runtime                     ✅ Basic metrics

vLLM Integration

agent:
  aiml:
    frameworks:
      vllm:
        enabled: true
        # Collect all vLLM metrics
        metrics:
          - request_duration
          - time_to_first_token
          - tokens_per_second
          - kv_cache_usage
          - batch_size

Triton Integration

agent:
  aiml:
    frameworks:
      triton:
        enabled: true
        metrics_endpoint: "http://localhost:8002/metrics"
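
Triton serves its metrics in the Prometheus text exposition format on the configured endpoint. As a rough illustration of what a scraper reads there, the Python sketch below parses a small sample payload. The parser is deliberately minimal (it skips lines that carry timestamps), the sample values are invented, and nothing here describes how Telegen itself scrapes the endpoint.

```python
import re
from typing import Dict, List, Tuple

# name{label="value",...} value
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_exposition(text: str) -> List[Sample]:
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # unsupported line shape, e.g. one with a timestamp
        name, labels_raw, value = match.groups()
        labels = dict(_LABEL.findall(labels_raw or ""))
        samples.append((name, labels, float(value)))
    return samples


SAMPLE = """\
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet50",version="1"} 1024
nv_gpu_utilization{gpu_uuid="GPU-abc123"} 0.85
"""

for name, labels, value in parse_exposition(SAMPLE):
    print(name, labels, value)
```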

Training Observability

Monitor ML training jobs:

Metrics

Metric                                  Description
training_loss                           Current training loss
training_step                           Current training step
training_epoch                          Current epoch
training_learning_rate                  Current learning rate
training_throughput_samples_per_second  Training throughput
training_gpu_utilization_percent        GPU utilization during training
training_gradient_norm                  Gradient norm
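
Two of these metrics are simple derived quantities. The Python sketch below computes a global L2 gradient norm and a samples-per-second throughput; both function names are hypothetical, and the source does not say which norm or measurement window Telegen uses, so the global L2 norm here is an assumption based on common practice.

```python
import math
from typing import Iterable, Sequence


def global_gradient_norm(grads: Iterable[Sequence[float]]) -> float:
    """L2 norm over all gradient values, treating every tensor as flat."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def throughput_samples_per_second(samples: int, elapsed_seconds: float) -> float:
    """Samples processed per second over a measurement window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return samples / elapsed_seconds


print(global_gradient_norm([[3.0], [4.0]]))
print(throughput_samples_per_second(512, 2.0))
```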

Configuration

agent:
  aiml:
    training:
      enabled: true
      
      # Detect common training frameworks
      detect_frameworks:
        - pytorch
        - tensorflow
        - jax
      
      # Log training metrics to OTLP
      export_metrics: true

Multi-GPU Monitoring

GPU Labels

All GPU metrics include device labels:

gpu_utilization_percent{
  device="0",
  name="NVIDIA A100-SXM4-80GB",
  uuid="GPU-abc123",
  k8s_pod="llm-server-abc"
} 85.5

Multi-Node Training

Track distributed training across nodes:

# Average GPU utilization across all training nodes
avg(gpu_utilization_percent{job="distributed-training"})

# GPU memory used per node
sum(gpu_memory_used_bytes{job="distributed-training"}) by (node)

# Communication overhead (NCCL)
rate(gpu_nccl_send_bytes_total[5m]) + rate(gpu_nccl_recv_bytes_total[5m])

Alerting Examples

GPU Alerts

groups:
  - name: gpu
    rules:
      - alert: GPUHighTemperature
        expr: gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.device }} temperature is {{ $value }}°C"
      
      - alert: GPUOutOfMemory
        expr: gpu_memory_used_bytes / gpu_memory_total_bytes > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.device }} memory is {{ $value | humanizePercentage }}"
      
      - alert: GPULowUtilization
        expr: gpu_utilization_percent < 10
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.device }} underutilized"
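
The `for:` clause in these rules means an alert fires only after its expression has stayed true for the whole duration; a single evaluation below the threshold resets the timer. The Python sketch below models that behavior for a simple threshold rule. It is a simplified illustration with a hypothetical function name, not Prometheus's rule evaluator.

```python
from typing import Iterable, Tuple


def is_firing(samples: Iterable[Tuple[float, float]],
              threshold: float,
              hold_seconds: float) -> bool:
    """Return True if the alert is firing after the last sample.

    samples: time-ordered (timestamp_seconds, value) evaluations.
    """
    pending_since = None
    firing = False
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            firing = ts - pending_since >= hold_seconds
        else:
            pending_since = None    # any dip resets the timer
            firing = False
    return firing


steady = [(t, 90.0) for t in range(0, 301, 60)]
dipped = [(0, 90.0), (60, 90.0), (120, 80.0), (180, 90.0), (240, 90.0), (300, 90.0)]
print(is_firing(steady, threshold=85, hold_seconds=300))
print(is_firing(dipped, threshold=85, hold_seconds=300))
```

With a threshold of 85 and a five-minute hold, the steady series fires, while the series that dips to 80 at the two-minute mark does not, because the timer restarts at 180 s.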

LLM Alerts

groups:
  - name: llm
    rules:
      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency is {{ $value | humanizeDuration }}"
      
      - alert: LLMHighQueueDepth
        expr: llm_queue_depth > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM queue depth is {{ $value }}"
      
      - alert: LLMSlowTTFT
        expr: histogram_quantile(0.95, rate(llm_time_to_first_token_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM time-to-first-token P95 is {{ $value | humanizeDuration }}"

Kubernetes GPU Support

NVIDIA GPU Operator

When using NVIDIA GPU Operator in Kubernetes:

# DaemonSet pod spec fragment (the spec under spec.template)
spec:
  containers:
    - name: telegen
      resources:
        limits:
          nvidia.com/gpu: 0  # Don't request GPU, just monitor
      volumeMounts:
        # Mount the host's /var/run/nvidia directory
        - name: nvidia-mps
          mountPath: /var/run/nvidia
  volumes:
    - name: nvidia-mps
      hostPath:
        path: /var/run/nvidia

MIG (Multi-Instance GPU) Support

Monitor MIG partitions:

gpu_utilization_percent{
  device="0",
  mig_device="mig-1g.5gb-0",
  mig_profile="1g.5gb"
} 75.2
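
The `mig_profile` label follows NVIDIA's profile naming, where `1g.5gb` means one compute slice with 5 GB of memory. The Python sketch below splits such a name into its two numbers, which can be handy when grouping partitions by size. The function name is hypothetical, and variant profile names with extra prefixes or suffixes are not handled.

```python
import re
from typing import Tuple

_PROFILE = re.compile(r"^(\d+)g\.(\d+)gb$")


def parse_mig_profile(profile: str) -> Tuple[int, int]:
    """Split a profile such as '1g.5gb' into (compute_slices, memory_gb)."""
    match = _PROFILE.match(profile)
    if match is None:
        raise ValueError(f"unsupported MIG profile name: {profile!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_mig_profile("1g.5gb"))
print(parse_mig_profile("3g.40gb"))
```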

Dashboard Examples

GPU Overview

# Average utilization by GPU model
avg(gpu_utilization_percent) by (name)

# Memory pressure
sum(gpu_memory_used_bytes) / sum(gpu_memory_total_bytes) * 100

# Power efficiency (tokens per second per watt, i.e. tokens per joule)
sum(rate(llm_tokens_generated_total[5m])) / sum(gpu_power_usage_watts)
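
Since one watt is one joule per second, dividing a token rate by a power draw gives tokens per joule. The Python sketch below works through that arithmetic with made-up numbers; the function name is hypothetical.

```python
def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Energy efficiency: (tokens/s) / (J/s) = tokens per joule."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


# Illustrative fleet: 8 GPUs drawing 300 W each, generating 2400 tokens/s in total.
print(tokens_per_joule(2400.0, 8 * 300.0))
```

In this example the fleet draws 2400 W while producing 2400 tokens per second, so each token costs about one joule of GPU energy.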

LLM Performance

# Requests per second
sum(rate(llm_request_total[5m])) by (model)

# Token generation rate
sum(rate(llm_tokens_generated_total[5m])) by (model)

# Latency percentiles
histogram_quantile(0.50, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
histogram_quantile(0.99, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))

Best Practices

1. Enable Per-Process Tracking

Identify which processes use GPU resources:

agent:
  gpu:
    per_process: true

2. Monitor KV Cache

When the KV cache fills up, an inference server typically has to queue or preempt requests, so alert before usage reaches capacity:

# Alert when KV cache is near capacity
llm_kv_cache_usage_bytes / llm_kv_cache_capacity_bytes > 0.9
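
For a sense of scale, KV cache memory for a standard transformer grows linearly with the number of cached tokens: each token stores one key and one value vector per layer and per KV head. The Python sketch below applies that formula. The model dimensions are illustrative values that do not come from the source, and the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             bytes_per_element: int) -> int:
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Illustrative model: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
per_token = kv_cache_bytes_per_token(80, 8, 128, 2)
print(per_token)                     # bytes per cached token
print(per_token * 4096 / 1024**3)    # GiB for a single 4096-token sequence
```

With these assumed dimensions, one 4096-token sequence occupies 1.25 GiB, which is why a few long requests can push the usage ratio above an alert threshold.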

3. Correlate with Traces

Link inference metrics to traces:

agent:
  aiml:
    # Add trace context to LLM metrics
    trace_correlation: true

Next Steps