Observability Setup
Tumult emits OpenTelemetry traces, metrics, and logs for every experiment run. This guide covers the complete span hierarchy, attribute reference, structured events, and how to route telemetry to your backend.
Architecture
Tumult ──OTLP──▶ OTel Collector ──▶ Your Backend
                 (the fan-out)
                       │
                       ├──▶ Jaeger / Tempo (traces)
                       ├──▶ Prometheus / Mimir (metrics)
                       ├──▶ Loki / Elasticsearch (logs)
                       └──▶ SigNoz / Datadog / etc.
Tumult speaks OTLP only. The OTel Collector routes telemetry to your backend of choice. You never need to change Tumult configuration when switching backends.
Quick Start (Development)
The fastest way to see traces locally:
cd docker/
docker compose up -d
This starts:
- OTel Collector on localhost:14317 (gRPC) and localhost:14318 (HTTP)
- SigNoz UI on http://localhost:13301
- Jaeger UI (classic stack) on http://localhost:16686
Then run an experiment:
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:14317 tumult run experiment.toon
Open SigNoz at http://localhost:13301 → Services → tumult, and you’ll see the experiment trace with all phases.
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| TUMULT_OTEL_ENABLED | true | Enable/disable telemetry collection |
| TUMULT_OTEL_CONSOLE | false | Also print spans to stdout |
| TUMULT_MCP_TOKEN | (unset) | Bearer token for MCP server auth (unset = no auth) |
| TUMULT_CLICKHOUSE_URL | (unset) | ClickHouse URL for SigNoz cross-correlation mode |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP collector endpoint |
| OTEL_SERVICE_NAME | tumult | Service name in telemetry |
| OTEL_RESOURCE_ATTRIBUTES | (unset) | Additional resource attributes (e.g., deployment.environment=staging) |
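As an illustration of how these variables combine, the following Python sketch assembles an environment and launches a run. The helper names build_tumult_env and run_experiment and their parameters are hypothetical; only the variable names and the tumult run command come from this guide. OTEL_RESOURCE_ATTRIBUTES uses the standard OpenTelemetry form of comma-separated key=value pairs.

```python
import os
import subprocess


def build_tumult_env(endpoint, service_name="tumult", resource_attrs=None, console=False):
    """Return a copy of the current environment with telemetry variables set.

    Hypothetical helper: only the variable names come from the table above.
    """
    env = dict(os.environ)
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    env["OTEL_SERVICE_NAME"] = service_name
    if resource_attrs:
        # OpenTelemetry expects comma-separated key=value pairs.
        env["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(
            f"{key}={value}" for key, value in resource_attrs.items()
        )
    if console:
        env["TUMULT_OTEL_CONSOLE"] = "true"
    return env


def run_experiment(path, env):
    """Launch `tumult run <path>`; assumes the tumult binary is on PATH."""
    return subprocess.run(["tumult", "run", path], env=env, check=False)
```

For example, build_tumult_env("http://localhost:14317", resource_attrs={"deployment.environment": "staging"}) yields an environment that can be passed to run_experiment("experiment.toon", env).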
Disabling Telemetry
TUMULT_OTEL_ENABLED=false tumult run experiment.toon
Telemetry is still collected internally (for the journal), but nothing is exported via OTLP.
Collector Configurations
Reference configs are provided in the collector/ directory:
| File | Backend | Use Case |
|---|---|---|
| otel-collector-config.yaml | Console (stdout) | Development, debugging |
| otel-collector-dev.yaml | Jaeger | Local development with docker-compose |
| otel-collector-signoz.yaml | SigNoz | All-in-one observability |
| otel-collector-grafana.yaml | Tempo + Mimir + Loki | Grafana stack |
| otel-collector-e2e.yaml | Multi-backend | E2E test environment |
SigNoz
# Start via Docker (recommended — see docker/README.md)
make up-observe
open http://localhost:13301
# Or run the collector standalone:
otelcol --config collector/otel-collector-signoz.yaml
Grafana Stack (Tempo + Mimir + Loki)
# Requires Tempo, Mimir, and Loki running
otelcol --config collector/otel-collector-grafana.yaml
Span Hierarchy
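Every experiment produces the following span tree. The root span is resilience.experiment; every other span is a direct or indirect child of it. Branches such as ssh.*, k8s.*, and mcp.tool.call appear only when the corresponding plugin or entry point is used.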
resilience.experiment (root, tumult-core runner)
├── resilience.hypothesis.before
│   └── resilience.probe (one per hypothesis probe)
│       └── script.execute (tumult-plugin)
│           └── [subprocess spans via TRACEPARENT env var]
├── resilience.action (one per method step)
│   ├── script.execute (for script plugins)
│   │   └── [subprocess spans via TRACEPARENT env var]
│   ├── ssh.connect (tumult-ssh, when target is Ssh)
│   ├── ssh.execute (tumult-ssh, remote command)
│   ├── k8s.pod.delete (tumult-kubernetes)
│   ├── k8s.node.drain
│   ├── k8s.deployment.scale
│   └── k8s.network_policy.apply
├── resilience.hypothesis.after
│   └── resilience.probe
├── resilience.rollback (one per rollback step)
│   └── resilience.action
├── baseline.acquire (tumult-baseline)
│   └── baseline.sample (repeated per interval)
├── resilience.analytics.ingest (tumult-analytics → DuckDB or ClickHouse)
│   ├── resilience.analytics.query
│   └── resilience.analytics.export
└── mcp.tool.call (tumult-mcp, when run via MCP)
Trace Context Propagation
When Tumult executes a script plugin, it injects TRACEPARENT and TRACESTATE environment variables into the subprocess. This allows scripts that emit their own OTel spans to attach as children of the script.execute span:
#!/usr/bin/env bash
# The TRACEPARENT env var is automatically set by Tumult.
# Any OTel-instrumented process launched here inherits the trace context.
my-otel-instrumented-service --do-chaos
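A script that is not itself OTel-instrumented can still read the injected context, for example to tag its own log lines with the trace ID. The following Python sketch parses the version-00 W3C traceparent layout (version-traceid-parentid-flags). The function names are hypothetical and not part of Tumult.

```python
import os
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Split a traceparent string into its fields, or return None if invalid."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def parent_from_env(environ=os.environ):
    """Read TRACEPARENT from the environment; None when absent or malformed."""
    raw = environ.get("TRACEPARENT")
    return parse_traceparent(raw) if raw else None
```

Inside a script plugin, parent_from_env() would return the trace ID and the span ID of the enclosing script.execute span, assuming Tumult set TRACEPARENT as described above.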
When running via the MCP server, you can pass a parent_context in RunConfig to link the experiment root span to the calling agent’s trace.
Span Attributes Reference
resilience.experiment (root span)
| Attribute | Type | Description |
|---|---|---|
| resilience.experiment.id | string | UUID for this experiment run |
| resilience.experiment.name | string | Experiment title |
| resilience.experiment.status | string | Completed, Failed, Aborted, Interrupted |
| resilience.experiment.duration_ms | int | Total experiment wall-clock time |
| resilience.hypothesis.met | bool | Did the steady-state hypothesis hold? |
| resilience.hypothesis.deviations | int | Number of probe deviations detected |
resilience.action / resilience.probe
| Attribute | Type | Description |
|---|---|---|
| resilience.action.name | string | Activity name from experiment definition |
| resilience.probe.name | string | Probe name from experiment definition |
| resilience.plugin.name | string | Plugin executing the activity |
| resilience.activity.duration_ms | int | Activity execution duration |
| resilience.activity.status | string | success, failure, timeout |
| resilience.activity.phase | string | before, method, after, rollback |
script.execute
| Attribute | Type | Description |
|---|---|---|
| script.plugin_name | string | Script plugin name |
| script.function_name | string | Action or probe function name |
| script.exit_code | int | Script process exit code |
| script.duration_ms | int | Script execution duration |
ssh.connect / ssh.execute
| Attribute | Type | Description |
|---|---|---|
| net.peer.name | string | SSH target hostname |
| net.peer.port | int | SSH port (default 22) |
| ssh.user | string | SSH username |
| ssh.auth_method | string | key_file, agent, password |
| ssh.command_exit_code | int | Remote command exit code |
baseline.acquire
| Attribute | Type | Description |
|---|---|---|
| baseline.probe_name | string | Name of the probe being baselined |
| baseline.method | string | mean_stddev, percentile, iqr, error_rate |
| baseline.sample_count | int | Number of samples collected |
| baseline.duration_ms | int | Baseline acquisition wall time |
| baseline.anomaly_detected | bool | Whether the baseline itself was anomalous |
resilience.analytics.ingest
| Attribute | Type | Description |
|---|---|---|
| analytics.backend | string | duckdb or clickhouse |
| analytics.experiment_id | string | Experiment ID being ingested |
| analytics.rows_inserted | int | Number of activity rows written |
Structured Span Events
Tumult emits structured span events (not logs) at key lifecycle points. These appear as timeline markers within spans in Jaeger/SigNoz.
| Event Name | Parent Span | Fields | Description |
|---|---|---|---|
| journal.ingested | resilience.analytics.ingest | experiment_id, activity_count | Journal successfully written to store |
| drain.completed | resilience.experiment | spans_exported, metrics_exported | OTel flush completed at experiment end |
| tolerance.derived | baseline.acquire | probe_name, method, lower, upper | Baseline tolerance bounds calculated |
| anomaly.detected | baseline.acquire | probe_name, reason, cv | Baseline anomaly found before experiment |
| script.completed | script.execute | exit_code, stdout_bytes, stderr_bytes | Script plugin finished execution |
| experiment.started | resilience.experiment | experiment_id, title, triggered_by | Audit event: experiment begins |
| experiment.completed | resilience.experiment | experiment_id, status, duration_ms | Audit event: experiment ends |
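To make the lower, upper, and cv fields more concrete, here is one plausible reading of the mean_stddev method, sketched in Python. This is not taken from Tumult's source: the multiplier k, the use of the sample standard deviation, and both function names are assumptions made for illustration only.

```python
import statistics


def mean_stddev_bounds(samples, k=2.0):
    """Return (lower, upper) as mean -/+ k sample standard deviations.

    The multiplier k is an assumption; this guide does not state Tumult's value.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)
    return mean - k * spread, mean + k * spread


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (undefined for mean 0)."""
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(samples) / mean
```

Under these assumptions, a noisy baseline shows up as a large coefficient of variation, which is the kind of condition an anomaly.detected event with a cv field would plausibly report.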
Audit Events
The experiment.started and experiment.completed events are also emitted as structured tracing::info! log records with fields compatible with SIEM ingestion:
INFO experiment.started experiment_id="abc-123" title="Kill DB connections" triggered_by="cli"
INFO experiment.completed experiment_id="abc-123" status="Completed" duration_ms=45231
These events appear in log aggregators (Loki, Elasticsearch) correlated with the experiment trace via trace_id.
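For ad hoc processing outside a log aggregator, lines in the shape shown above can be split into fields with the standard library. The Python sketch below assumes exactly the layout of the two sample lines (level, event name, then key=value pairs with optional double quotes); real subscriber output may carry timestamps or other prefixes, and parse_audit_line is a hypothetical helper.

```python
import shlex


def parse_audit_line(line):
    """Parse 'LEVEL event key="value" ...' into a dict of strings.

    Assumes the layout of the sample lines above; actual log output
    (timestamps, targets, JSON formatting) may differ.
    """
    tokens = shlex.split(line)  # handles quoted values containing spaces
    if len(tokens) < 2:
        raise ValueError(f"not an audit line: {line!r}")
    record = {"level": tokens[0], "event": tokens[1]}
    for token in tokens[2:]:
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            record[key] = value
    return record
```

All values are returned as strings, so a numeric field such as duration_ms needs an explicit int() conversion by the caller.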
Metrics Reference
All metrics use the resilience. namespace.
Counters
| Metric | Labels | Description |
|---|---|---|
| resilience.experiments.total | status | Experiments run, by outcome |
| resilience.actions.total | plugin, outcome | Actions executed |
| resilience.probes.total | plugin, outcome | Probes executed |
| resilience.hypothesis.deviations.total | experiment | Steady-state violations, by experiment name |
| resilience.script.executions.total | plugin, function, outcome | Script plugin invocations |
| resilience.rollbacks.total | outcome | Rollback executions |
| resilience.rollback.failures | (none) | Rollback steps that failed (non-fatal) |
Histograms
| Metric | Labels | Description |
|---|---|---|
| resilience.action.duration_seconds | plugin | Action execution latency |
| resilience.probe.duration_seconds | plugin | Probe execution latency |
| resilience.experiment.duration_seconds | status | Total experiment duration |
| resilience.baseline.duration_seconds | method | Baseline acquisition time |
Gauges (Store)
| Metric | Description |
|---|---|
| resilience.store.experiments | Total experiments in persistent store |
| resilience.store.activities | Total activity rows in store |
| resilience.store.size_bytes | DuckDB file size in bytes |
| resilience.store.disk_usage_pct | Store disk usage as percentage of volume |
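When these metrics are routed to Prometheus or Mimir, the dotted names usually do not survive as written, because classic Prometheus metric names only allow letters, digits, underscores, and colons. The Python sketch below shows the basic character replacement; it is an approximation, since exporters may also append unit or _total suffixes and newer Prometheus versions can be configured to keep dots. The function name is hypothetical.

```python
import re


def to_prometheus_name(otel_name):
    """Replace characters outside [a-zA-Z0-9_:] with underscores."""
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Classic Prometheus metric names must not start with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name
```

Under this mapping, resilience.experiments.total would be queried as resilience_experiments_total; check your exporter's output for the exact names it produces.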
Trace-to-Metrics Correlation (SigNoz)
When using SigNoz with the ClickHouse backend, experiment data lands in the same database as SigNoz traces and metrics. This lets you join experiment records against traces and metrics in a single SQL query:
-- Find traces for failed experiments
SELECT e.title, e.status, t.traceID, t.serviceName
FROM tumult.experiments e
JOIN signoz_traces.signoz_index_v2 t
  ON e.experiment_id = t.traceID
WHERE e.status = 'Failed';

-- Correlate experiment timing with infrastructure metrics
SELECT e.title, s.unix_milli, s.value AS cpu_pct
FROM tumult.experiments e
JOIN signoz_metrics.samples_v4 s
  ON s.unix_milli BETWEEN e.started_at AND e.completed_at
WHERE s.metric_name = 'system.cpu.utilization';
To enable:
TUMULT_CLICKHOUSE_URL=http://localhost:8123 tumult run experiment.toon
Trace Context from MCP Callers
When an AI agent or orchestration system calls Tumult via MCP (tumult_run_experiment), the experiment’s root span can be linked to the agent’s trace. Pass the W3C trace context in the MCP call metadata:
{
  "_meta": {
    "extra": {
      "authorization": "Bearer <token>",
      "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    }
  }
}
The MCP handler extracts the traceparent value from the call metadata and wires it into RunConfig.parent_context, making the experiment root span a child of the calling agent's span.
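On the caller side, the metadata object can be assembled programmatically. The Python sketch below builds the same _meta shape as the JSON example; new_traceparent generates random IDs purely for demonstration, whereas a real agent would take the IDs from its active span. Both function names are hypothetical.

```python
import secrets


def new_traceparent(sampled=True):
    """Generate a version-00 traceparent with random IDs.

    Demonstration only: a real agent would use the IDs of its active span.
    """
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def build_mcp_meta(traceparent, token=None):
    """Build the _meta object; authorization is included only if a token is given."""
    extra = {"traceparent": traceparent}
    if token is not None:
        extra["authorization"] = f"Bearer {token}"
    return {"_meta": {"extra": extra}}
```

Serializing build_mcp_meta(new_traceparent(), token="...") with json.dumps gives a payload in the shape shown above.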
Troubleshooting
No traces appearing?
- Check that TUMULT_OTEL_ENABLED is not set to false
- Verify the collector is running: curl -v localhost:4317 (use port 14317 for the Docker quick-start stack)
- Check collector logs: docker compose logs otel-collector
- Try TUMULT_OTEL_CONSOLE=true tumult run experiment.toon to dump spans to stdout
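The reachability check can also be scripted, for instance as a pre-flight step before a run. The Python sketch below only tests whether something accepts TCP connections on the port; it does not confirm that the listener speaks OTLP. The function name and defaults are assumptions.

```python
import socket


def collector_reachable(host="localhost", port=4317, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

With the Docker quick-start stack, the call would be collector_reachable(port=14317).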
Traces appear but no metrics?
- Ensure your collector config has a metrics pipeline
- Verify the backend supports OTLP metrics ingestion
resilience.hypothesis.deviations.total not broken down by experiment?
- This metric carries the experiment label. Ensure your metrics backend supports high-cardinality labels, or filter with --experiment <name> in queries.
Subprocess spans not connecting to parent?
- The subprocess must read TRACEPARENT / TRACESTATE from its environment and use them as its OTel parent context. Many OTel SDKs do not do this automatically; in Rust, for example, extract the context with the propagator obtained from opentelemetry::global::get_text_map_propagator.
SigNoz not showing experiment data?
- Confirm TUMULT_CLICKHOUSE_URL is set correctly.
- Check ClickHouse is healthy: curl http://localhost:8123/ping
- The ClickHouse backend retries 3 times with exponential backoff (2s/4s/8s) before failing.
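The retry schedule above can be pictured with a small Python sketch. It is a model of the described behavior, not Tumult's implementation: it assumes one initial attempt followed by one retry after each delay, and the names clickhouse_ping and retry_with_backoff are hypothetical.

```python
import time
import urllib.error
import urllib.request


def clickhouse_ping(base_url, timeout=2.0):
    """Return True if GET <base_url>/ping answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def retry_with_backoff(check, delays=(2, 4, 8), sleep=time.sleep):
    """Run check(); after each failure wait the next delay and try again.

    Assumed semantics: one initial attempt plus one retry per delay.
    Returns True on the first success, False once all retries are used.
    """
    if check():
        return True
    for delay in delays:
        sleep(delay)
        if check():
            return True
    return False
```

For example, retry_with_backoff(lambda: clickhouse_ping("http://localhost:8123")) waits up to 14 seconds in total before giving up, which is a rough guide to how long a run may pause while ClickHouse is unavailable.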