Network Observability
Telegen provides deep network observability using eBPF.
Overview
Network observability includes:
- DNS tracing - Query/response correlation
- TCP metrics - RTT, retransmits, connection tracking
- HTTP/gRPC tracing - Request/response details
- Flow tracking - Connection topology
- XDP packet analysis - High-performance packet inspection
DNS Tracing
What’s Captured
| Field | Description |
|---|---|
| Query | Domain name, type (A, AAAA, CNAME) |
| Response | Answer records, response code |
| Latency | Query-to-response time |
| Server | DNS server address |
Sample Event
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "attributes": {
    "dns.question.name": "api.example.com",
    "dns.question.type": "A",
    "dns.response_code": "NOERROR",
    "dns.answers": ["10.0.1.100", "10.0.1.101"],
    "dns.latency_ms": 2.5,
    "net.peer.ip": "10.0.0.2",
    "net.peer.port": 53,
    "process.pid": 12345,
    "k8s.pod.name": "my-app-xyz"
  }
}
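As a rough illustration of how a downstream consumer might use an event like the one above, the sketch below parses the JSON and flags failed or slow lookups. The 100 ms threshold and the helper name `classify_dns_event` are assumptions for illustration and are not part of Telegen.

```python
import json

# Trimmed copy of the sample event above.
SAMPLE_EVENT = """
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "attributes": {
    "dns.question.name": "api.example.com",
    "dns.question.type": "A",
    "dns.response_code": "NOERROR",
    "dns.answers": ["10.0.1.100", "10.0.1.101"],
    "dns.latency_ms": 2.5
  }
}
"""

def classify_dns_event(event, slow_ms=100.0):
    """Return 'error', 'slow', or 'ok' for a DNS event (hypothetical helper)."""
    attrs = event["attributes"]
    if attrs.get("dns.response_code") != "NOERROR":
        return "error"
    if attrs.get("dns.latency_ms", 0.0) > slow_ms:
        return "slow"
    return "ok"

event = json.loads(SAMPLE_EVENT)
status = classify_dns_event(event)
print(event["attributes"]["dns.question.name"], status)
```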
Configuration
agent:
  network:
    dns:
      enabled: true
      capture_queries: true
      capture_responses: true
      # Capture query/response content
      capture_content: true
TCP Metrics
Metrics Collected
| Metric | Description |
|---|---|
| `tcp_rtt_us` | Round-trip time in microseconds |
| `tcp_retransmits` | Packet retransmission count |
| `tcp_connections` | Connection count |
| `tcp_bytes_sent` | Bytes transmitted |
| `tcp_bytes_received` | Bytes received |
Connection Tracking
# Metrics example
tcp_rtt_us{
src_ip="10.0.1.50",
dst_ip="10.0.2.100",
dst_port="5432",
k8s_src_pod="api-server",
k8s_dst_service="postgres"
} 1250
tcp_retransmits_total{
src_ip="10.0.1.50",
dst_ip="10.0.2.100",
dst_port="5432"
} 3
Configuration
agent:
  network:
    tcp:
      enabled: true
      rtt: true
      retransmits: true
      connection_tracking: true
      # Flow sampling (1 in N connections)
      sample_rate: 1 # Capture all
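The "1 in N" convention used by `sample_rate` can be sketched as follows. This is a minimal illustration that assumes deterministic every-Nth selection; how Telegen actually picks the sampled connections (random, hashed, or counter-based) is not specified here, and both function names are hypothetical.

```python
def sampled_fraction(sample_rate):
    """Fraction of connections kept for a 1-in-N sample rate."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be >= 1")
    return 1.0 / sample_rate

def should_sample(connection_index, sample_rate):
    """Keep every Nth connection (assumed selection strategy)."""
    return connection_index % sample_rate == 0

# With sample_rate=10, one connection in ten is kept.
kept = sum(should_sample(i, 10) for i in range(1000))
print(kept, sampled_fraction(10))
```

A value of 1 keeps every connection, while 100 keeps roughly 1% of them.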
HTTP/gRPC Tracing
HTTP Details
| Field | Description |
|---|---|
| `http.method` | GET, POST, PUT, DELETE, etc. |
| `http.url` | Full request URL |
| `http.route` | Matched route pattern |
| `http.status_code` | Response status |
| `http.request_content_length` | Request body size |
| `http.response_content_length` | Response body size |
gRPC Details
| Field | Description |
|---|---|
| `rpc.system` | `grpc` |
| `rpc.service` | Service name |
| `rpc.method` | Method name |
| `rpc.grpc.status_code` | gRPC status code |
Configuration
agent:
  ebpf:
    network:
      enabled: true
      http: true
      grpc: true
      # URL/path filtering
      exclude_paths:
        - "/health"
        - "/healthz"
        - "/ready"
        - "/metrics"
        - "/favicon.ico"
      # Capture request/response headers
      capture_headers:
        - "content-type"
        - "user-agent"
        - "x-request-id"
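The effect of `exclude_paths` can be approximated with a small matcher. The sketch below assumes shell-style glob semantics (so that both exact entries such as `/health` and patterns such as `/health*` work) and assumes the query string is ignored; Telegen's real matching rules may differ, and `is_excluded` is a hypothetical name.

```python
from fnmatch import fnmatchcase

EXCLUDE_PATHS = ["/health", "/healthz", "/ready", "/metrics", "/favicon.ico"]

def is_excluded(path, patterns=EXCLUDE_PATHS):
    """Return True if the request path matches any exclusion pattern."""
    path_only = path.split("?", 1)[0]  # drop the query string before matching
    return any(fnmatchcase(path_only, pattern) for pattern in patterns)

for candidate in ["/healthz", "/api/users", "/metrics?format=text"]:
    print(candidate, is_excluded(candidate))
```

Under these assumptions an exact entry such as `/health` does not match `/health/live`; a pattern such as `/health*` would.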
Service Topology
Telegen automatically builds a service dependency map:
flowchart LR
subgraph External
LB["Load Balancer"]
end
subgraph Cluster["Kubernetes Cluster"]
FE["Frontend"]
API["API Gateway"]
US["User Service"]
OS["Order Service"]
PG["PostgreSQL"]
RD["Redis"]
KF["Kafka"]
end
LB -->|HTTP| FE
FE -->|HTTP| API
API -->|gRPC| US
API -->|gRPC| OS
US -->|SQL| PG
OS -->|SQL| PG
API -->|TCP| RD
OS -->|Produce| KF
Topology Data
topology:
  nodes:
    - id: "api-gateway"
      type: "service"
      attributes:
        k8s.deployment: "api-gateway"
        k8s.namespace: "default"
    - id: "user-service"
      type: "service"
      attributes:
        k8s.deployment: "user-service"
        k8s.namespace: "default"
  edges:
    - source: "api-gateway"
      target: "user-service"
      attributes:
        protocol: "grpc"
        requests_per_second: 150
        avg_latency_ms: 12
        error_rate: 0.01
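Topology data in this shape is straightforward to turn into a dependency lookup. The sketch below mirrors the structure above as a Python dictionary and derives each service's downstream dependencies plus an estimate of failing requests per edge. It assumes `error_rate` is a fraction (0.01 meaning 1%); the helper names are illustrative.

```python
from collections import defaultdict

topology = {
    "nodes": [
        {"id": "api-gateway", "type": "service"},
        {"id": "user-service", "type": "service"},
    ],
    "edges": [
        {
            "source": "api-gateway",
            "target": "user-service",
            "attributes": {
                "protocol": "grpc",
                "requests_per_second": 150,
                "avg_latency_ms": 12,
                "error_rate": 0.01,
            },
        }
    ],
}

def downstream_map(topo):
    """Map each source service to the list of services it calls."""
    deps = defaultdict(list)
    for edge in topo["edges"]:
        deps[edge["source"]].append(edge["target"])
    return dict(deps)

def failing_requests_per_second(edge):
    """Estimate failing requests per second (assumes error_rate is a fraction)."""
    attrs = edge["attributes"]
    return attrs["requests_per_second"] * attrs["error_rate"]

deps = downstream_map(topology)
failing = failing_requests_per_second(topology["edges"][0])
print(deps, failing)
```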
XDP Packet Analysis
For high-performance packet inspection at the NIC level, enable XDP:
Configuration
agent:
  network:
    xdp:
      enabled: true
      # Sample rate (1 in N packets)
      sample_rate: 1000 # 0.1% of packets
      # Interfaces to attach
      interfaces:
        - eth0
        - eth1
      # Packet filters
      filters:
        # Only specific ports
        ports:
          - 80
          - 443
          - 8080
        # Only specific protocols
        protocols:
          - tcp
          - udp
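To make the filter semantics concrete, the sketch below shows one plausible reading of the `filters` block: a packet is inspected only if its protocol is listed and either its source or destination port is listed. Whether Telegen combines the two filters with AND, and whether it matches source ports, are assumptions; the `Packet` type is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    protocol: str
    src_port: int
    dst_port: int

PORTS = {80, 443, 8080}
PROTOCOLS = {"tcp", "udp"}

def passes_filters(pkt):
    """Assumed semantics: protocol AND (source or destination port) must match."""
    port_match = pkt.src_port in PORTS or pkt.dst_port in PORTS
    return pkt.protocol in PROTOCOLS and port_match

packets = [
    Packet("tcp", 51514, 443),   # HTTPS request
    Packet("tcp", 51515, 5432),  # PostgreSQL, port not listed
    Packet("icmp", 0, 0),        # protocol not listed
]
inspected = [p for p in packets if passes_filters(p)]
print(inspected)
```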
Use Cases
- DDoS detection - High packet rate anomalies
- Protocol analysis - Non-HTTP traffic inspection
- Network debugging - Low-level packet issues
Network Metrics
RED Metrics (Rate, Errors, Duration)
# Request rate by service
sum(rate(http_server_requests_total[5m])) by (service_name)
# Error rate
sum(rate(http_server_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_server_requests_total[5m]))
# Latency percentiles
histogram_quantile(0.99,
sum(rate(http_server_duration_bucket[5m])) by (le, service_name)
)
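For readers unfamiliar with how a percentile is derived from histogram buckets, the following sketch approximates the idea behind `histogram_quantile`: find the bucket containing the target rank and interpolate linearly inside it. It is a simplification for illustration, not the Prometheus implementation, and the bucket values are made up.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound equal to +inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # fall back to the highest finite bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Illustrative latency buckets in seconds (not real data).
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.5, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
print(p50, p99)
```

Because the result is interpolated, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.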
Connection Metrics
# Active connections by service pair
telegen_tcp_connections{state="established"}
# Connection errors
sum(rate(telegen_tcp_connection_errors_total[5m])) by (error_type)
# Retransmit rate
sum(rate(telegen_tcp_retransmits_total[5m]))
/ sum(rate(telegen_tcp_segments_total[5m]))
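The retransmit-rate query divides one counter's rate by another's. The sketch below reproduces that arithmetic from two raw counter samples taken a known interval apart. It is a simplified stand-in for PromQL's `rate()`, which also extrapolates to window boundaries; the function names and sample values are illustrative.

```python
def counter_rate(prev, curr, seconds):
    """Per-second increase of a counter, treating a decrease as a reset."""
    delta = curr - prev
    if delta < 0:
        delta = curr  # counter restarted from zero
    return delta / seconds

def retransmit_ratio(retrans_prev, retrans_curr, seg_prev, seg_curr, seconds=300.0):
    """Fraction of segments that were retransmitted over the interval."""
    segment_rate = counter_rate(seg_prev, seg_curr, seconds)
    if segment_rate == 0:
        return 0.0
    return counter_rate(retrans_prev, retrans_curr, seconds) / segment_rate

# 30 retransmits out of 3000 segments over 5 minutes.
ratio = retransmit_ratio(10, 40, 1000, 4000)
print(f"{ratio:.2%}")
```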
DNS Metrics
# DNS query rate
sum(rate(telegen_dns_queries_total[5m])) by (domain)
# DNS latency
histogram_quantile(0.95,
sum(rate(telegen_dns_latency_bucket[5m])) by (le)
)
# DNS errors
sum(rate(telegen_dns_queries_total{response_code!="NOERROR"}[5m]))
Interface Filtering
Control which network interfaces are monitored:
agent:
  network:
    # Include specific interfaces
    interfaces:
      - eth0
      - ens5
    # Or exclude interfaces
    exclude_interfaces:
      - lo # Loopback
      - docker0 # Docker bridge
      - veth* # Container veths
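The include/exclude behaviour can be sketched as below. It assumes that an include list, when present, takes precedence over the exclude list, and that patterns such as `veth*` follow shell-style globbing; neither assumption is confirmed by the configuration reference above, and `monitored_interfaces` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def monitored_interfaces(available, include=None, exclude=None):
    """Select interfaces to monitor from those present on the host."""
    if include:
        return [i for i in available if any(fnmatchcase(i, p) for p in include)]
    exclude = exclude or []
    return [i for i in available if not any(fnmatchcase(i, p) for p in exclude)]

host_interfaces = ["lo", "eth0", "docker0", "veth1a2b3c", "ens5"]
selected = monitored_interfaces(host_interfaces, exclude=["lo", "docker0", "veth*"])
print(selected)
```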
Port Filtering
Focus on specific ports:
agent:
  ebpf:
    network:
      # Only trace these ports
      include_ports:
        - 80
        - 443
        - 8080
        - 3000
        - 5432
        - 6379
      # Or exclude ports
      exclude_ports:
        - 22 # SSH
        - 2379 # etcd
        - 2380 # etcd peer
Network Security
Suspicious Connection Detection
agent:
  network:
    security:
      enabled: true
      # Detect connections to unusual ports
      suspicious_ports:
        - 4444 # Common reverse shell
        - 31337 # Elite port
      # Detect connections to external IPs
      external_connection_alerts: true
      # Known bad IP lists
      blocklists:
        - "/etc/telegen/ip-blocklist.txt"
Example Alert
{
  "timestamp": "2024-01-15T10:30:00Z",
  "severity": "WARNING",
  "body": "Suspicious outbound connection to known bad IP",
  "attributes": {
    "network.event_type": "suspicious_connection",
    "net.peer.ip": "198.51.100.50",
    "net.peer.port": 4444,
    "process.pid": 12345,
    "process.executable.path": "/tmp/shell",
    "k8s.pod.name": "compromised-pod"
  }
}
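The checks behind an alert like this can be sketched with Python's standard `ipaddress` module. The blocklist entry below is a made-up example (the on-disk blocklist format is not documented here), and `connection_findings` is a hypothetical helper rather than Telegen's detection logic.

```python
import ipaddress

SUSPICIOUS_PORTS = {4444, 31337}
# Hypothetical blocklist content; real entries would come from the configured files.
BLOCKLIST = [ipaddress.ip_network("198.51.100.0/24")]

def connection_findings(peer_ip, peer_port):
    """Return the list of reasons a connection looks suspicious (may be empty)."""
    addr = ipaddress.ip_address(peer_ip)
    findings = []
    if peer_port in SUSPICIOUS_PORTS:
        findings.append("suspicious_port")
    if any(addr in network for network in BLOCKLIST):
        findings.append("blocklisted_ip")
    return findings

findings = connection_findings("198.51.100.50", 4444)
print(findings)
```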
Performance Considerations
Overhead
| Feature | CPU Impact | Memory Impact |
|---|---|---|
| TCP metrics | ~0.5% | 10 MB |
| DNS tracing | ~0.2% | 5 MB |
| HTTP tracing | ~1% | 20 MB |
| XDP (sampled) | ~0.1% | 5 MB |
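To budget for several features at once, the per-feature figures in the table can be summed as a first approximation. The sketch below assumes the overheads are roughly additive, which the table does not state; the dictionary keys and function name are illustrative only.

```python
# Approximate per-feature overhead from the table above: (CPU %, memory MB).
OVERHEAD = {
    "tcp": (0.5, 10),
    "dns": (0.2, 5),
    "http": (1.0, 20),
    "xdp": (0.1, 5),
}

def estimate_overhead(enabled):
    """Sum CPU and memory figures for the enabled features (assumes additivity)."""
    cpu = sum(OVERHEAD[name][0] for name in enabled)
    mem = sum(OVERHEAD[name][1] for name in enabled)
    return round(cpu, 2), mem

cpu_pct, mem_mb = estimate_overhead(["tcp", "dns", "http", "xdp"])
print(f"~{cpu_pct}% CPU, ~{mem_mb} MB")
```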
Reducing Overhead
agent:
  network:
    # Reduce ring buffer size
    ring_buffer_size: 8388608 # 8MB instead of 16MB
    # Sample fewer connections (a higher N lowers overhead)
    tcp:
      sample_rate: 10 # 1 in 10 connections
    # Limit captured data
    http:
      max_body_capture: 0 # Don't capture bodies
      max_headers: 5 # Limit headers
Best Practices
1. Filter Noisy Traffic
Exclude health checks and internal traffic:
agent:
  ebpf:
    network:
      exclude_paths:
        - "/health*"
        - "/ready*"
        - "/metrics"
      exclude_ports:
        - 2379 # etcd
        - 10250 # kubelet
2. Use Appropriate Sampling
For high-traffic environments:
agent:
  network:
    tcp:
      sample_rate: 100 # 1% of connections
    xdp:
      sample_rate: 10000 # 0.01% of packets
3. Monitor Key Services
Focus on critical paths:
agent:
  network:
    include_ports:
      - 80 # HTTP
      - 443 # HTTPS
      - 5432 # PostgreSQL
      - 6379 # Redis
Messaging Protocols
Telegen captures tracing data for AMQP 0-9-1, CQL (Cassandra), and NATS at the eBPF level, with no SDK instrumentation or configuration changes required.
AMQP 0-9-1 Tracing
AMQP 0-9-1 is the wire protocol used by RabbitMQ and other brokers. Telegen captures publish and consume operations at the channel level.
What’s Captured
| Field | Description |
|---|---|
| `messaging.system` | `rabbitmq` |
| `messaging.operation` | `publish` or `process` |
| `messaging.destination.name` | Exchange name |
| `messaging.rabbitmq.destination.routing_key` | Routing key |
| `messaging.client_id` | AMQP channel ID |
| `net.peer.ip` / `net.peer.port` | Broker address |
Sample Span
{
  "name": "orders.created publish",
  "kind": "PRODUCER",
  "duration_ms": 0.8,
  "attributes": {
    "messaging.system": "rabbitmq",
    "messaging.operation": "publish",
    "messaging.destination.name": "events",
    "messaging.rabbitmq.destination.routing_key": "orders.created",
    "net.peer.ip": "10.0.2.50",
    "net.peer.port": 5672
  }
}
Configuration
agent:
  network:
    protocols:
      amqp:
        enabled: true
        capture_routing_key: true
CQL (Cassandra) Tracing
Telegen parses the Cassandra Query Language binary protocol (CQL v3–v5) to capture query statements, keyspaces, batch operations, and prepared statement execution.
See Database Tracing for the full Cassandra tracing reference.
NATS Tracing
NATS is a lightweight, text-based publish/subscribe messaging system. Telegen captures PUB, MSG, and subscription operations from the NATS wire protocol.
What’s Captured
| Field | Description |
|---|---|
| `messaging.system` | `nats` |
| `messaging.operation` | `publish` or `process` |
| `messaging.destination.name` | Subject name |
| `net.peer.ip` / `net.peer.port` | NATS server address |
Sample Span
{
  "name": "sensor.readings publish",
  "kind": "PRODUCER",
  "duration_ms": 0.2,
  "attributes": {
    "messaging.system": "nats",
    "messaging.operation": "publish",
    "messaging.destination.name": "sensor.readings",
    "net.peer.ip": "10.0.3.10",
    "net.peer.port": 4222
  }
}
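The AMQP and NATS sample spans appear to share a naming pattern of `<destination> <operation>`, where the AMQP destination is the routing key and the NATS destination is the subject. The sketch below reconstructs that pattern from span attributes. The convention is inferred from the two samples only, and mapping `process` to a `CONSUMER` span kind is an assumption not shown in this document.

```python
def span_name(attributes):
    """Build '<destination> <operation>' as in the sample spans (inferred convention)."""
    destination = (
        attributes.get("messaging.rabbitmq.destination.routing_key")
        or attributes.get("messaging.destination.name")
        or "unknown"
    )
    return f"{destination} {attributes['messaging.operation']}"

def span_kind(attributes):
    """'publish' maps to PRODUCER as in the samples; CONSUMER is assumed for 'process'."""
    return "PRODUCER" if attributes["messaging.operation"] == "publish" else "CONSUMER"

amqp_attrs = {
    "messaging.system": "rabbitmq",
    "messaging.operation": "publish",
    "messaging.destination.name": "events",
    "messaging.rabbitmq.destination.routing_key": "orders.created",
}
nats_attrs = {
    "messaging.system": "nats",
    "messaging.operation": "publish",
    "messaging.destination.name": "sensor.readings",
}
print(span_name(amqp_attrs), "|", span_name(nats_attrs))
```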
Configuration
agent:
  network:
    protocols:
      nats:
        enabled: true
        capture_subject: true
Connection Statistics
Telegen tracks byte-level connection statistics via TCP close events, providing a low-overhead measure of throughput per connection without full payload capture.
Metrics Emitted
| Metric | Type | Labels | Description |
|---|---|---|---|
| `telegen.connection.bytes_sent` | Counter | src, dst, port | Bytes sent per connection lifetime |
| `telegen.connection.bytes_received` | Counter | src, dst, port | Bytes received per connection lifetime |
These metrics are emitted when a TCP connection closes and complement the per-request span data produced by the protocol parsers.
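Since each connection reports its totals once, at close, the counters grow by a per-connection amount for every `(src, dst, port)` label set. The sketch below shows that accumulation over a few hypothetical close events; the event field names are assumptions chosen to match the metric labels.

```python
from collections import Counter

# Hypothetical TCP close events; field names mirror the metric labels.
close_events = [
    {"src": "10.0.1.50", "dst": "10.0.2.100", "port": 5432, "bytes_sent": 1200, "bytes_received": 5400},
    {"src": "10.0.1.50", "dst": "10.0.2.100", "port": 5432, "bytes_sent": 800, "bytes_received": 2600},
    {"src": "10.0.1.51", "dst": "10.0.3.10", "port": 4222, "bytes_sent": 300, "bytes_received": 100},
]

def aggregate_bytes(events):
    """Accumulate bytes sent and received per (src, dst, port) label set."""
    sent, received = Counter(), Counter()
    for event in events:
        key = (event["src"], event["dst"], event["port"])
        sent[key] += event["bytes_sent"]
        received[key] += event["bytes_received"]
    return sent, received

sent, received = aggregate_bytes(close_events)
print(sent[("10.0.1.50", "10.0.2.100", 5432)], received[("10.0.1.50", "10.0.2.100", 5432)])
```

One consequence of close-time emission is that long-lived connections contribute nothing to these counters until they terminate.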
Configuration
agent:
  ebpf:
    conn_stats:
      enabled: true
Next Steps
- Database Tracing - Deep database network tracing
- Security Observability - Network security events
- Agent Mode - Network configuration