Execution Flow
Tumult experiments follow a five-phase lifecycle. Each phase produces data that feeds into the next. The execution is orchestrated by the run_experiment() function in tumult-core::runner.
Phase Overview
Phase 0: ESTIMATE Record predictions before any measurement
Phase 1: BASELINE Connect, sample, derive statistical thresholds
Phase 2: DURING Inject fault, observe degradation continuously
Phase 3: POST Remove fault, measure recovery
Phase 4: ANALYSIS Compare estimate vs actual, compute scores
Implementation
The runner takes four inputs:
Experiment— the parsed experiment definitionActivityExecutor— trait for executing actions/probes (plugin system)ControlRegistry— lifecycle event handlersRunConfig— rollback strategy, baseline mode, dry-run flag
Returns a Journal containing the complete experiment results with all phases.
Detailed Flow
1. Parse and Validate
tumult run experiment.toon
│
├── Parse TOON → Experiment struct (engine::parse_experiment)
├── Validate: method not empty (engine::validate_experiment)
├── Resolve config (env vars, inline) (engine::resolve_config)
└── Resolve secrets (env vars, files) (engine::resolve_secrets)
2. Initialize
├── CONTROL: BeforeExperiment
├── Generate experiment UUID
├── Record start timestamp (epoch nanoseconds)
└── Record estimate (Phase 0) — preserved in journal as-is
3. Start Load (if configured)
├── Start k6/JMeter in background
└── Wait for load to stabilize
4. Baseline Acquisition (Phase 1)
Baseline acquisition uses tumult-baseline::acquisition::derive_baseline():
├── Warmup: discard first N seconds of samples
├── Sample: collect probe values at interval for duration
├── Per-probe statistics: mean, stddev, p50, p95, p99, min, max
├── Anomaly check: CV > 0.5? range > 10x median? insufficient samples?
├── Derive tolerance bounds (method: mean±Nσ, percentile, IQR)
└── Produce AcquisitionResult with ProbeStats per probe
5. Hypothesis BEFORE
├── CONTROL: BeforeHypothesis
├── For each probe in hypothesis:
│ ├── CONTROL: BeforeActivity{name}
│ ├── Execute probe via ActivityExecutor
│ ├── Evaluate tolerance (exact, range, or regex)
│ └── CONTROL: AfterActivity{name}
├── If ANY probe fails tolerance → HypothesisResult.met = false
├── CONTROL: AfterHypothesis
└── If not met → ABORT → skip method, go to rollbacks
6. Method Execution (Phase 2)
├── CONTROL: BeforeMethod
├── For each activity in method:
│ ├── CONTROL: BeforeActivity{name}
│ ├── Execute via ActivityExecutor
│ ├── Build ActivityResult with status, output, timing, trace IDs
│ └── CONTROL: AfterActivity{name}
└── CONTROL: AfterMethod
7. Hypothesis AFTER
├── CONTROL: BeforeHypothesis
├── Run all hypothesis probes again (same as step 5)
├── CONTROL: AfterHypothesis
└── If ANY probe fails → status = DEVIATED
8. Recovery (Phase 3)
├── Sample probes at interval
├── Detect recovery point (all probes within tolerance)
├── Calculate MTTR
├── Check data integrity
└── Record PostResult
9. Determine Status
├── engine::determine_status(hypothesis_before, hypothesis_after, actions_succeeded)
│ ├── hypothesis_before failed → Aborted
│ ├── actions failed → Failed
│ ├── hypothesis_after failed → Deviated
│ └── all passed → Completed
10. Rollbacks
├── execution::should_rollback(strategy, deviated):
│ ├── Always → execute
│ ├── OnDeviation → execute only if deviated or aborted
│ └── Never → skip
├── CONTROL: BeforeRollback
├── For each rollback action:
│ ├── CONTROL: BeforeActivity{name}
│ ├── Execute via ActivityExecutor
│ └── CONTROL: AfterActivity{name}
└── CONTROL: AfterRollback
11. Stop Load
├── Stop k6/JMeter
└── Collect load metrics (throughput, latency, error rate)
12. Analysis (Phase 4)
├── runner::compute_analysis():
│ ├── Compare estimate.expected_outcome vs actual ExperimentStatus
│ ├── estimate_accuracy: 1.0 if prediction matched, 0.0 otherwise
│ ├── resilience_score: 1.0 if completed, 0.0 otherwise
│ └── estimate_recovery_delta_s: (actual - estimated) recovery time
13. Finalize
├── Record end timestamp
├── Calculate duration_ms
├── CONTROL: AfterExperiment
├── Build Journal with all phases and results
├── Write journal.toon
└── Flush OTel
Controls
Controls receive lifecycle events at each stage. They are registered at experiment start and called synchronously at each hook point. Common uses:
| Control | Purpose |
|---|---|
| Logging | Structured log output at each phase |
| Tracing | Additional span attributes or annotations |
| Safeguards | Abort if safety conditions are breached |
| Notifications | Send alerts on deviation or failure |
Rollback Strategy
| Strategy | Behavior |
|---|---|
always | Execute rollbacks after every experiment regardless of outcome |
on-deviation | Execute rollbacks only when steady-state hypothesis fails (default) |
never | Never execute rollbacks — use when the experiment is naturally idempotent |
Error Handling
- Plugin execution errors are captured in ActivityResult, not propagated — the experiment continues
- SSH connection failures trigger retry with exponential backoff
- OTel export failures are logged but never block execution
- Ctrl+C triggers graceful shutdown: stop method, run rollbacks, flush OTel, write journal