Data-Driven Chaos: SQL Analytics Over Experiment Journals

Part 6 of the Tumult series. ← Part 5: Writing Your First Experiment
Most chaos engineering tools answer one question: did this experiment pass or fail? That is necessary, but it is not sufficient. The questions that matter for a production engineering team are different:
- Is our payment service getting more or less resilient over time?
- Which systems are consistently exceeding their recovery time objectives?
- Where are our predictions furthest from reality — and what does that tell us about our mental models?
- Which experiments have never run against production?
Answering these questions requires treating chaos experiment data as data — not as logs to scroll through, but as structured records to query, aggregate, and visualize. Tumult’s analytics pipeline is built on this premise.
The Pipeline
Every experiment run produces a structured journal in TOON format. The analytics pipeline transforms those journals into queryable data:
Experiment → TOON Journal → Apache Arrow (columnar) → DuckDB (embedded SQL) → Parquet (export)
The key properties of this pipeline:
Arrow: experiments are parsed into Apache Arrow RecordBatch format — columnar, memory-mapped, and interoperable with the entire Arrow ecosystem (Polars, pandas, Spark, BigQuery).
DuckDB: an embedded analytical database. No server, no setup, no network. DuckDB runs in-process alongside Tumult, meaning you can run SQL queries against thousands of journals without any infrastructure.
Parquet: the export format. Parquet is the standard for portable, compressed columnar data. A Parquet file from Tumult can be opened in Jupyter, loaded into Spark, queried in BigQuery, or archived to S3 — by any tool in the data ecosystem.
The Data Model
Tumult analytics exposes two primary tables.
experiments
One row per experiment run.
| Column | Type | Description |
|---|---|---|
experiment_id | VARCHAR | UUID for this run |
title | VARCHAR | Experiment title |
status | VARCHAR | Completed, Deviated, Aborted, Failed |
started_at_ns | BIGINT | Start time (epoch nanoseconds) |
ended_at_ns | BIGINT | End time |
duration_ms | UBIGINT | Total experiment duration |
method_step_count | BIGINT | Method steps executed |
rollback_count | BIGINT | Rollback steps executed |
hypothesis_before_met | BOOLEAN | Was steady state met before fault? |
hypothesis_after_met | BOOLEAN | Was steady state met after fault? |
estimate_accuracy | DOUBLE | Estimate vs actual (0.0–1.0) |
resilience_score | DOUBLE | Overall resilience score (0.0–1.0) |
activity_results
One row per action or probe execution within an experiment.
| Column | Type | Description |
|---|---|---|
experiment_id | VARCHAR | Links to experiments |
name | VARCHAR | Activity name |
activity_type | VARCHAR | Action or Probe |
status | VARCHAR | Succeeded, Failed, Timeout |
started_at_ns | BIGINT | Start time |
duration_ms | UBIGINT | Execution duration |
output | VARCHAR | Activity output |
error | VARCHAR | Error message (if failed) |
phase | VARCHAR | hypothesis_before, method, hypothesis_after, rollback |
Running Analytics
Built-in summary
# Default summary across all journals in a directory
tumult analyze journals/
Output:
Experiments: 47
Completed: 38 (81%)
Deviated: 7 (15%)
Aborted: 2 (4%)
Average duration: 3m 24s
Average resilience score: 0.73
Top deviating experiments:
postgresql-failover 7 deviations / 12 runs
redis-cache-flush 3 deviations / 8 runs
kafka-broker-kill 2 deviations / 5 runs
Custom SQL queries
# Run any SQL query against the journals
tumult analyze journals/ --query "
SELECT status, count(*) as runs, avg(duration_ms) as avg_ms
FROM experiments
GROUP BY status"
# Query a single journal
tumult analyze journal.toon --query "
SELECT name, phase, duration_ms
FROM activity_results
ORDER BY duration_ms DESC"
Useful Queries
Resilience trend over time
SELECT
DATE_TRUNC('week', TIMESTAMP 'epoch' + started_at_ns * INTERVAL '1 nanosecond') AS week,
COUNT(*) AS runs,
AVG(CASE WHEN status = 'Completed' THEN 1.0 ELSE 0.0 END) AS success_rate,
AVG(resilience_score) AS avg_score
FROM experiments
GROUP BY week
ORDER BY week;
Watching success_rate over time tells you whether the engineering organization is making services more resilient or less. A declining trend warrants attention before it becomes an incident.
Slowest-recovering systems
SELECT
title,
COUNT(*) AS runs,
AVG(duration_ms) / 1000.0 AS avg_duration_s,
MAX(duration_ms) / 1000.0 AS worst_s,
MIN(duration_ms) / 1000.0 AS best_s
FROM experiments
WHERE status IN ('Completed', 'Deviated')
GROUP BY title
ORDER BY avg_duration_s DESC
LIMIT 10;
Activities that consistently fail
SELECT
ar.name,
ar.phase,
ar.activity_type,
COUNT(*) AS total_runs,
SUM(CASE WHEN ar.status = 'Succeeded' THEN 0 ELSE 1 END) AS failures,
ROUND(
SUM(CASE WHEN ar.status = 'Succeeded' THEN 0 ELSE 1 END) * 100.0 / COUNT(*), 1
) AS failure_pct
FROM activity_results ar
GROUP BY ar.name, ar.phase, ar.activity_type
HAVING failures > 0
ORDER BY failure_pct DESC;
If a probe is failing 30% of the time in the hypothesis_before phase, it means the system is regularly not healthy before chaos is even injected — a signal worth investigating independently of the chaos experiments.
Estimate accuracy analysis
SELECT
title,
COUNT(*) AS runs,
AVG(estimate_accuracy) AS avg_accuracy,
SUM(CASE WHEN estimate_accuracy = 1.0 THEN 1 ELSE 0 END) AS exact_hits,
SUM(CASE WHEN estimate_accuracy = 0.0 THEN 1 ELSE 0 END) AS complete_misses
FROM experiments
WHERE estimate_accuracy IS NOT NULL
GROUP BY title
ORDER BY avg_accuracy ASC;
Experiments with consistently zero estimate accuracy are the most valuable learning opportunities. They represent scenarios where the team’s mental model of the system is meaningfully wrong.
Hypothesis deviation rate by phase
SELECT
CASE
WHEN NOT hypothesis_before_met THEN 'Failed before fault (Aborted)'
WHEN NOT hypothesis_after_met THEN 'Failed after fault (Deviated)'
ELSE 'Both passed (Completed)'
END AS hypothesis_result,
COUNT(*) AS count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM experiments
GROUP BY hypothesis_result
ORDER BY count DESC;
A high “Failed before fault” rate indicates systemic instability — your systems are frequently unhealthy before the experiment even starts. This is valuable data that pure uptime monitoring would not surface.
Exporting to Parquet
Parquet export makes experiment data portable to any data tool:
# Export a single journal
tumult export journal.toon --format parquet
# Export to CSV (for spreadsheets)
tumult export journal.toon --format csv
# Export to JSON (for compatibility with other tools)
tumult export journal.toon --format json
A Parquet file from Tumult loads directly into pandas:
import pandas as pd
df = pd.read_parquet("journal.parquet")
print(df[["title", "status", "resilience_score", "duration_ms"]].head(20))
Or into Polars for faster processing:
import polars as pl
df = pl.read_parquet("journals/*.parquet")
print(df.group_by("status").agg(pl.count(), pl.col("resilience_score").mean()))
Or directly into a DuckDB session for ad-hoc SQL:
import duckdb
conn = duckdb.connect()
result = conn.execute("""
SELECT title, avg(resilience_score) as score
FROM read_parquet('journals/*.parquet')
GROUP BY title
ORDER BY score ASC
""").fetchdf()
Why Embedded DuckDB?
The choice to embed DuckDB rather than require an external database is deliberate.
No infrastructure. The analytics capability is part of the tumult binary. There is no separate database process, no network connection, no credentials to manage. It works the same on a developer laptop and in a CI container.
Columnar performance. DuckDB is an analytical database — optimized for aggregation queries over many rows. The query patterns in chaos analytics (group by experiment name, aggregate over time, compute percentiles) are exactly what columnar databases excel at.
Parquet-native. DuckDB reads Parquet files directly. Your Tumult journals stored as Parquet in S3 are immediately queryable with read_parquet('s3://your-bucket/journals/*.parquet') without importing or transforming data.
Arrow interoperability. Tumult’s internal analytics builds on Apache Arrow RecordBatches. The Arrow ecosystem — including Polars, pandas, PyArrow, DataFusion, and dozens of other tools — can all consume this data format without conversion.
The Bigger Picture: Chaos Data as Organizational Memory
Individual experiments answer specific questions about specific systems at specific times. Analytics across all experiments answer organizational questions: what is the overall trajectory of system resilience? Which teams are running experiments? Which critical systems have never been chaos-tested?
The Parquet pipeline is the foundation for this. Every experiment produces a structured, portable record. Over time, those records accumulate into an organizational data asset: a queryable history of resilience testing, failure modes, recovery times, and prediction accuracy.
That data asset is what enables the shift from chaos engineering as occasional practice to chaos engineering as continuous discipline. Instead of “we ran some experiments last quarter,” you have “here is the resilience trend for every critical service over the past year, with statistical baselines for each.”
The trend query — is recovery time improving, stable, or degrading? — is the one that matters most:
WITH ranked AS (
SELECT
title,
started_at_ns,
duration_ms,
LAG(duration_ms) OVER (PARTITION BY title ORDER BY started_at_ns) AS prev_ms
FROM experiments
WHERE status IN ('Completed', 'Deviated')
)
SELECT
title,
COUNT(*) AS runs,
AVG(duration_ms - prev_ms) AS avg_delta_ms,
CASE
WHEN AVG(duration_ms - prev_ms) > 5000 THEN 'DEGRADING'
WHEN AVG(duration_ms - prev_ms) < -5000 THEN 'IMPROVING'
ELSE 'STABLE'
END AS trend
FROM ranked
WHERE prev_ms IS NOT NULL
GROUP BY title
ORDER BY avg_delta_ms DESC;
Services trending DEGRADING need attention. Services trending IMPROVING are demonstrating the value of the engineering work invested in resilience.
Next in the series: Part 7 — Kubernetes Chaos: Deep Fault Injection with tumult-kubernetes →