Tumult Metadata Model
Version: 2.0.0 Last Updated: 2026-03-29 Namespace: resilience.*
Abstract
This document defines the Tumult Metadata Model for resilience testing. It provides a structured vocabulary for describing experiments, targets, faults, measurements, outcomes, scoring, and regulatory evidence. The model follows OpenTelemetry Semantic Convention naming rules and is designed to be carried as span attributes, resource attributes, and metric dimensions within OTel-instrumented systems.
The namespace is resilience.*. All attributes defined here live under this single root.
Requirement Tiers
The model is organised into three tiers:
| Tier | Groups | When |
|---|---|---|
| Required | experiment, target, fault, outcome | Every experiment |
| Phase data (required if enabled) | estimate, baseline, during, post, analysis | When five-phase execution is active |
| Optional enrichment | taxonomy, safety, environment, load, regulatory, score | Enhances analytics, compliance, and scoring |
Table of Contents
- 1. Conventions
- 2. Required Attribute Groups
- 3. Phase Data Attribute Groups
- 4. Optional Enrichment Groups
- 5. Fault Subtype Taxonomy
- 6. Time Standard
- 7. Cardinality Rules
- 8. OTel Integration Rules
- 9. Scientific Rules
- 10. Scoring Methodology
- 11. Conformance
1. Conventions
| Convention | Rule |
|---|---|
| Naming | Dot-separated, lowercase, snake_case segments: resilience.<group>.<attribute> |
| Requirement levels | Required – MUST be present on every conforming span. Recommended – SHOULD be present when the information is available. Optional – MAY be present. Conditional – MUST be present when a stated condition is true. |
| Enums | Closed sets. Implementations MUST NOT invent values outside the documented set without proposing an addition to this standard. |
| Arrays | Denoted as string[]. Serialised as JSON arrays in span attributes and as repeated fields in protobuf. |
| Timestamps | Epoch nanoseconds (int64). See Time Standard. |
| Durations | Seconds as float64. |
2. Required Attribute Groups
These four groups MUST be present on every experiment.
2.1 resilience.experiment.*
Identity and lineage of a single experiment run.
| Attribute | Type | Requirement | Description |
|---|---|---|---|
resilience.experiment.id | string | Required | UUID v4 uniquely identifying this experiment run. |
resilience.experiment.name | string | Required | Human-readable experiment name. Stable across runs (e.g. postgresql-failover-recovery). |
resilience.experiment.version | string | Recommended | Semantic version of the experiment definition (e.g. 2.1.0). |
resilience.experiment.suite | string | Optional | Parent campaign or suite that groups related experiments. |
resilience.experiment.tags | string[] | Optional | Free-form tags for filtering and grouping (e.g. ["network", "critical-path", "dora"]). |
resilience.experiment.created_by | string | Recommended | Identity of the user or system that initiated the run. |
resilience.experiment.run_number | int | Recommended | Monotonically increasing run counter for this experiment name. Enables trend analysis across runs. |
2.2 resilience.target.*
The system, service, or component under test.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.target.system | string | Required | Category of the target system. | database, cache, message-broker, api, container, vm, host, kubernetes, network, application, middleware, cloud-service |
resilience.target.technology | string | Required | Specific technology or product name. | Examples: postgresql, mysql, mariadb, oracle, redis, memcached, kafka, rabbitmq, nats, pulsar, docker, podman, kubernetes, nginx, envoy, haproxy, jvm, dotnet, nodejs |
resilience.target.component | string | Required | Specific instance or component identifier (e.g. payments-db-primary, cart-service-pod-3). High-cardinality – see Cardinality Rules. | |
resilience.target.tier | string | Recommended | Architectural tier of the target. | frontend, backend, data, infrastructure, network, platform |
resilience.target.criticality | string | Recommended | Business criticality classification. | critical, high, medium, low |
resilience.target.environment | string | Required | Deployment environment where the experiment runs. | production, staging, development, sandbox |
resilience.target.region | string | Optional | Cloud region or data centre location (e.g. eu-west-1, us-east-1, dc-ams-01). |
2.3 resilience.fault.*
The fault being injected.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.fault.type | string | Required | Top-level fault category. | termination, resource-stress, network, state, dependency, configuration, data |
resilience.fault.subtype | string | Required | Specific fault action within the type. See Fault Subtype Taxonomy for the full closed set per type. | See Section 3 |
resilience.fault.severity | string | Required | Expected severity of the injected fault. | minor, moderate, major, catastrophic |
resilience.fault.duration_s | float | Recommended | Planned duration of fault injection in seconds. | |
resilience.fault.blast_radius | string | Required | Scope of impact of the fault. | single-instance, service, availability-zone, region, global |
resilience.fault.reversible | boolean | Required | Whether the fault can be automatically reversed after the experiment. | |
resilience.fault.plugin | string | Required | Name of the chaos plugin or driver executing the fault (e.g. chaostooling-extension-db, litmus, toxiproxy). | |
resilience.fault.action | string | Required | Fully qualified action identifier within the plugin (e.g. kill_postgresql_process, inject_network_latency). |
2.4 resilience.outcome.*
Final outcome of the experiment run. Moved here as a required group — see original definition below.
3. Phase Data Attribute Groups
These groups are required when five-phase execution is active (estimate → baseline → during → post → analysis).
3.1 resilience.estimate.*
Pre-experiment predictions. These are recorded before fault injection begins and are compared against actual outcomes in resilience.analysis.*.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.estimate.expected_outcome | string | Required | Predicted outcome category. | deviated, recovered, unaffected |
resilience.estimate.expected_recovery_s | float | Recommended | Predicted time to recovery in seconds. | |
resilience.estimate.expected_degradation | string | Recommended | Predicted level of service degradation during fault. | none, minor, moderate, severe |
resilience.estimate.expected_data_loss | boolean | Recommended | Whether data loss is expected. | |
resilience.estimate.confidence | string | Recommended | Confidence level in the prediction. | low, medium, high |
resilience.estimate.rationale | string | Recommended | Free-text explanation of why this outcome is expected. Stored on traces only (high cardinality). | |
resilience.estimate.prior_runs | int | Optional | Number of prior runs of this experiment used to inform the estimate. |
2.5 resilience.baseline.*
Steady-state measurements captured before fault injection. These establish the “normal” against which deviation is measured.
Top-level baseline attributes
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.baseline.method | string | Required | Statistical method used to establish baseline thresholds. | static, percentile, mean-stddev, iqr, learned |
resilience.baseline.started_at | int64 | Required | Epoch nanoseconds when baseline collection began. | |
resilience.baseline.ended_at | int64 | Required | Epoch nanoseconds when baseline collection ended. | |
resilience.baseline.duration_s | float | Required | Total baseline collection duration in seconds. | |
resilience.baseline.warmup_s | float | Recommended | Warm-up period in seconds excluded from statistical calculations. | |
resilience.baseline.samples | int | Required | Total number of samples collected during the baseline window. | |
resilience.baseline.interval_s | float | Required | Sampling interval in seconds. | |
resilience.baseline.confidence | float | Recommended | Statistical confidence level, 0.0 to 1.0 (e.g. 0.95 for 95% confidence). | |
resilience.baseline.sigma | float | Conditional | Number of standard deviations used for threshold. Required when method is mean-stddev. | |
resilience.baseline.source | string | Recommended | Origin of the baseline data. | live, historical, aqe |
resilience.baseline.anomaly_detected | boolean | Recommended | Whether anomalies were detected in the baseline window that may affect reliability. |
Per-probe baseline metrics
For each probe {name}, the following attributes are emitted under resilience.baseline.probe.{name}.*:
| Attribute | Type | Requirement | Description |
|---|---|---|---|
resilience.baseline.probe.{name}.mean | float | Required | Arithmetic mean of probe values during baseline. |
resilience.baseline.probe.{name}.stddev | float | Recommended | Standard deviation of probe values. |
resilience.baseline.probe.{name}.p50 | float | Recommended | 50th percentile (median). |
resilience.baseline.probe.{name}.p95 | float | Recommended | 95th percentile. |
resilience.baseline.probe.{name}.p99 | float | Recommended | 99th percentile. |
resilience.baseline.probe.{name}.min | float | Recommended | Minimum observed value. |
resilience.baseline.probe.{name}.max | float | Recommended | Maximum observed value. |
resilience.baseline.probe.{name}.error_rate | float | Recommended | Proportion of failed probe executions, 0.0 to 1.0. |
{name}is the probe’s stable identifier (e.g.query_latency,connection_count). It MUST be lowercase snake_case.
2.6 resilience.during.*
Measurements captured while the fault is active.
Top-level during-fault attributes
| Attribute | Type | Requirement | Description |
|---|---|---|---|
resilience.during.started_at | int64 | Required | Epoch nanoseconds when fault injection began. |
resilience.during.ended_at | int64 | Required | Epoch nanoseconds when fault injection ended. |
resilience.during.fault_active_s | float | Required | Actual duration the fault was active in seconds. |
resilience.during.sample_interval_s | float | Required | Sampling interval in seconds during the fault window. |
resilience.during.degradation_onset_s | float | Recommended | Seconds from fault start to first observable degradation. |
resilience.during.degradation_peak_s | float | Recommended | Seconds from fault start to peak degradation. |
resilience.during.degradation_magnitude | float | Recommended | Peak degradation magnitude expressed in sigma (standard deviations from baseline mean). |
resilience.during.graceful_degradation | boolean | Recommended | Whether the system degraded gracefully (maintained partial service) rather than failing hard. |
Per-probe during-fault metrics
For each probe {name}:
| Attribute | Type | Requirement | Description |
|---|---|---|---|
resilience.during.probe.{name}.samples | int | Required | Number of samples collected for this probe during the fault window. |
resilience.during.probe.{name}.mean | float | Required | Mean probe value during the fault. |
resilience.during.probe.{name}.max | float | Recommended | Maximum probe value during the fault. |
resilience.during.probe.{name}.min | float | Recommended | Minimum probe value during the fault. |
resilience.during.probe.{name}.error_rate | float | Recommended | Proportion of failed probe executions, 0.0 to 1.0. |
resilience.during.probe.{name}.breached_at | int64 | Recommended | Epoch nanoseconds of the first threshold breach. Absent if no breach occurred. |
resilience.during.probe.{name}.breach_count | int | Recommended | Number of times the probe breached its baseline threshold. |
2.7 resilience.post.*
Measurements captured after fault removal to assess recovery.
Top-level post-fault attributes
| Attribute | Type | Requirement | Description |
|---|---|---|---|
resilience.post.started_at | int64 | Required | Epoch nanoseconds when post-fault observation began. |
resilience.post.ended_at | int64 | Required | Epoch nanoseconds when post-fault observation ended. |
resilience.post.duration_s | float | Required | Total post-fault observation duration in seconds. |
resilience.post.samples | int | Required | Total number of samples collected during the post-fault window. |
resilience.post.recovery_time_s | float | Required | Mean Time to Recovery (MTTR) – seconds from fault removal to all probes returning within baseline thresholds. |
resilience.post.full_recovery | boolean | Required | Whether the system fully returned to baseline levels. |
resilience.post.residual_degradation | float | Recommended | Remaining degradation at end of observation, expressed in sigma from baseline mean. 0.0 means fully recovered. |
resilience.post.data_integrity_verified | boolean | Recommended | Whether a data integrity check was performed after recovery. |
resilience.post.data_loss_detected | boolean | Recommended | Whether data loss was detected after recovery. |
Per-probe post-fault metrics
For each probe {name}:
| Attribute | Type | Requirement | Description |
|---|---|---|---|
resilience.post.probe.{name}.mean | float | Required | Mean probe value during the post-fault window. |
resilience.post.probe.{name}.p95 | float | Recommended | 95th percentile during the post-fault window. |
resilience.post.probe.{name}.error_rate | float | Recommended | Error rate during the post-fault window, 0.0 to 1.0. |
resilience.post.probe.{name}.returned_to_baseline | boolean | Required | Whether this probe returned to within baseline thresholds. |
resilience.post.probe.{name}.recovery_time_s | float | Required | Seconds from fault removal until this probe returned to baseline. |
2.8 resilience.outcome.*
Final outcome of the experiment run.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.outcome.status | string | Required | Terminal status of the experiment run. | completed, deviated, aborted, failed, interrupted |
resilience.outcome.hypothesis_met | boolean | Required | Whether the experiment’s hypothesis was confirmed. | |
resilience.outcome.deviation_detected | boolean | Required | Whether the system deviated from expected behaviour. | |
resilience.outcome.deviation_magnitude | float | Conditional | Magnitude of deviation in sigma. Required when deviation_detected is true. | |
resilience.outcome.recovery_time_s | float | Recommended | Observed total recovery time in seconds. | |
resilience.outcome.mttr_s | float | Recommended | Mean Time to Recovery in seconds. Equivalent to resilience.post.recovery_time_s but included here for query convenience. | |
resilience.outcome.data_loss | boolean | Recommended | Whether data loss occurred during the experiment. | |
resilience.outcome.rollback_executed | boolean | Required | Whether a rollback action was triggered (manual or automatic). | |
resilience.outcome.rollback_success | boolean | Conditional | Whether the rollback succeeded. Required when rollback_executed is true. |
2.9 resilience.safety.*
Safety constraints and guardrails for the experiment.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.safety.abort_conditions | string[] | Recommended | List of conditions that trigger automatic experiment abort (e.g. ["error_rate > 0.5", "p95_latency > 10s", "data_loss detected"]). | |
resilience.safety.max_duration_s | float | Recommended | Maximum allowed experiment duration in seconds. The engine MUST abort if exceeded. | |
resilience.safety.max_blast_radius | string | Recommended | Maximum allowed blast radius. The engine MUST NOT inject faults beyond this scope. | |
resilience.safety.requires_approval | boolean | Optional | Whether this experiment requires manual approval before fault injection. | |
resilience.safety.rollback_strategy | string | Required | Strategy for rollback after fault injection. | always, on-deviation, never, manual |
2.10 resilience.environment.*
Infrastructure context where the experiment runs.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.environment.platform | string | Required | Compute platform type. | bare-metal, vm, container, kubernetes, serverless |
resilience.environment.cloud_provider | string | Optional | Cloud provider, if applicable. | aws, gcp, azure, on-premise |
resilience.environment.cluster | string | Optional | Cluster name or identifier (e.g. prod-eu-01, staging-k8s). | |
resilience.environment.namespace | string | Optional | Kubernetes namespace or equivalent isolation boundary. |
2.11 resilience.regulatory.*
Regulatory and compliance mapping for audit evidence.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.regulatory.frameworks | string[] | Recommended | Regulatory frameworks this experiment provides evidence for. | DORA, NIS2, PCI-DSS, ISO-22301, ISO-27001, SOC2, Basel-III |
resilience.regulatory.requirement_id | string | Recommended | Specific requirement or control identifier within the framework (e.g. DORA-Art.25, PCI-DSS-12.10.2). | |
resilience.regulatory.evidence_type | string | Recommended | Type of evidence this experiment run produces (e.g. recovery-test, failover-validation, backup-verification). | |
resilience.regulatory.compliance_status | string | Recommended | Compliance determination based on experiment outcome. | compliant, non-compliant, partial |
2.12 resilience.analysis.*
Post-experiment analysis comparing predictions to actuals and tracking trends.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.analysis.estimate_accuracy | float | Recommended | Accuracy of the pre-experiment estimate, 0.0 to 1.0. 1.0 means the prediction exactly matched the outcome. | |
resilience.analysis.estimate_recovery_delta | float | Recommended | Difference between predicted and actual recovery time in seconds (predicted - actual). Positive means recovery was faster than predicted. | |
resilience.analysis.trend | string | Recommended | Resilience trend based on comparison with prior runs of the same experiment. | improving, stable, degrading |
resilience.analysis.resilience_score | float | Recommended | Composite resilience score, 0.0 to 1.0. Computed per the platform’s documented score methodology. | |
resilience.analysis.regulatory_evidence | string[] | Recommended | List of regulatory evidence artefact identifiers produced by this run (e.g. report IDs, audit log references). |
3. Fault Subtype Taxonomy
Every resilience.fault.subtype value MUST belong to exactly one resilience.fault.type. The following is the complete closed set.
termination
Faults that stop processes, containers, or hosts.
| Subtype | Description |
|---|---|
process-kill | Kill a specific OS process (SIGKILL or equivalent). |
container-kill | Stop and remove a container. |
vm-stop | Power off a virtual machine. |
node-shutdown | Shut down a physical or virtual node. |
pod-delete | Delete a Kubernetes pod. |
service-stop | Stop a systemd/init service. |
resource-stress
Faults that exhaust compute, memory, storage, or OS resources.
| Subtype | Description |
|---|---|
cpu-stress | Saturate CPU cores to a target utilisation. |
memory-stress | Allocate memory to a target utilisation or OOM threshold. |
io-stress | Generate sustained disk I/O load. |
disk-fill | Fill a filesystem to a target percentage. |
fd-exhaustion | Exhaust available file descriptors. |
thread-exhaustion | Exhaust available threads or goroutines. |
network
Faults that degrade or disrupt network communication.
| Subtype | Description |
|---|---|
latency-injection | Add artificial latency to network packets. |
packet-loss | Drop a percentage of network packets. |
partition | Create a network partition between components. |
dns-failure | Cause DNS resolution failures. |
bandwidth-limit | Throttle network bandwidth. |
corruption | Corrupt network packet payloads. |
port-block | Block traffic on specific ports. |
connection-reset | Force TCP connection resets. |
state
Faults that manipulate internal system state.
| Subtype | Description |
|---|---|
connection-kill | Terminate active database or service connections. |
lock-injection | Inject lock contention or deadlocks. |
replication-lag | Introduce artificial replication delay. |
failover-trigger | Force a failover from primary to replica. |
leader-election | Trigger a leader re-election in a consensus group. |
cache-flush | Flush in-memory caches, forcing cold-start behaviour. |
dependency
Faults that affect upstream or downstream dependencies.
| Subtype | Description |
|---|---|
upstream-timeout | Cause upstream service calls to time out. |
downstream-error | Make a downstream dependency return errors. |
circuit-break | Force a circuit breaker to open state. |
service-unavailable | Make a dependency completely unreachable. |
configuration
Faults that alter runtime configuration.
| Subtype | Description |
|---|---|
config-change | Modify a runtime configuration parameter. |
feature-toggle | Flip a feature flag to an unexpected state. |
resource-limit | Reduce resource limits (connection pools, memory caps, rate limits). |
permission-revoke | Revoke permissions or credentials at runtime. |
data
Faults that affect data integrity or ordering.
| Subtype | Description |
|---|---|
corruption | Corrupt stored or in-flight data. |
deletion | Delete data records or files. |
reorder | Reorder messages or events. |
duplication | Duplicate messages or records. |
schema-drift | Introduce schema changes that cause incompatibility. |
4. Time Standard
All temporal values in this standard follow a single set of rules to eliminate ambiguity and conversion overhead.
| Context | Format | Example |
|---|---|---|
| Timestamps in span attributes and journals | Epoch nanoseconds (int64) | 1711670400000000000 |
| Durations in span attributes and journals | Seconds (float64) | 12.345 |
| Experiment definitions (human-authored JSON/YAML) | Human-friendly strings | 120s, 5m, 1h |
| Internal storage and analytics | Epoch nanoseconds | Native in ClickHouse, DuckDB, OTel |
Rationale:
- OTel native: OpenTelemetry’s protobuf encoding uses
fixed64nanosecond timestamps. No conversion is needed at the exporter boundary. - Database native: ClickHouse
DateTime64(9)and DuckDBTIMESTAMP_NSstore nanosecond precision natively. - Zero conversion: A single canonical format eliminates an entire class of timezone, precision, and parsing bugs.
Experiment engines MUST convert human-friendly durations in experiment definitions to float64 seconds before emitting span attributes. The human-friendly format is a convenience for authors, not a wire format.
5. Cardinality Rules
Controlling attribute cardinality is essential for metric backends (Prometheus, ClickHouse, etc.) that create time series per unique label combination.
| Rule | Detail |
|---|---|
| Enum attributes use finite, documented values only | Implementations MUST NOT add ad-hoc enum values. New values require a standard revision. |
| No free-text in metric dimensions | Attributes like resilience.estimate.rationale are high-cardinality and MUST NOT be used as metric labels. They belong on spans only. |
resilience.target.component is high-cardinality | Use on traces and logs. NEVER use as a metric dimension. |
| Metric-safe fault dimensions | resilience.fault.type and resilience.fault.subtype are bounded enums and are safe for metric dimensions. |
| Metric-safe target dimensions | resilience.target.system, resilience.target.technology, and resilience.target.environment are bounded and metric-safe. |
| Per-probe attributes are inherently high-cardinality | The probe.{name}.* pattern creates attributes dynamically. Use these on traces. For metrics, emit per-probe data as separate metric instruments (e.g. resilience.probe.latency{probe.name="query_latency"}). |
6. OTel Integration Rules
Span Attributes
All resilience.* attributes are carried as span attributes on the spans emitted by the chaos engine during experiment execution.
Resource Attributes
resilience.experiment.* attributes are set as resource attributes on the root span of the experiment trace. This ensures they propagate to all child spans and are available for trace-level queries.
Span Propagation
resilience.target.* attributes are set on every span within the experiment trace. This enables filtering any span in the trace by target system, technology, or environment.
Metric Dimensions
The following attributes are approved for use as metric label dimensions:
| Metric Dimension | Cardinality | Source Group |
|---|---|---|
resilience.fault.type | 7 values | resilience.fault.* |
resilience.fault.subtype | ~35 values | resilience.fault.* |
resilience.target.system | 12 values | resilience.target.* |
resilience.target.technology | Open but bounded | resilience.target.* |
resilience.target.environment | 4 values | resilience.target.* |
resilience.outcome.status | 5 values | resilience.outcome.* |
Recommended Metric Instruments
| Metric Name | Type | Unit | Dimensions |
|---|---|---|---|
resilience.experiment.duration | Histogram | s | fault.type, target.system, outcome.status |
resilience.experiment.count | Counter | {experiment} | fault.type, target.system, outcome.status |
resilience.recovery.duration | Histogram | s | fault.type, target.system |
resilience.deviation.count | Counter | {deviation} | fault.type, target.system, fault.severity |
7. Scientific Rules
Resilience testing produces evidence. Evidence must meet scientific standards to be credible, reproducible, and useful for regulatory audit.
1. Reproducibility
Experiment metadata MUST be sufficient to re-run the experiment identically. This means:
- All
resilience.experiment.*,resilience.target.*,resilience.fault.*,resilience.safety.*, andresilience.environment.*attributes are populated. - The experiment definition (JSON/YAML) is version-controlled and referenced by
resilience.experiment.version. - Any external state dependency (e.g. “requires 3-node cluster”) is documented.
2. Falsifiability
Every hypothesis MUST have measurable bounds:
resilience.estimate.expected_outcomedeclares what the experimenter expects.resilience.estimate.expected_recovery_ssets a numeric threshold.resilience.outcome.hypothesis_metprovides a boolean verdict.
A hypothesis of “the system should be resilient” is not falsifiable. A hypothesis of “the system will recover to baseline within 30 seconds after primary database failover” is.
3. Baseline Integrity
Raw samples MUST be preserved, not just derived thresholds:
resilience.baseline.samplesrecords the count.- Per-probe
.mean,.stddev,.p50,.p95,.p99,.min,.maxprovide distributional summary. - Implementations SHOULD additionally store raw sample arrays in the experiment journal for full reproducibility.
4. Controlled Variables
The environment MUST be recorded so that changes between runs are visible:
resilience.environment.*captures platform, provider, cluster, and namespace.resilience.target.*captures the specific component under test.- Differences between runs (e.g. changed cluster size) surface automatically in attribute comparison.
5. Blast Radius Accounting
Scope MUST be declared before injection and enforced during:
resilience.fault.blast_radiusdeclares intended scope.resilience.safety.max_blast_radiussets the hard limit.- The engine MUST abort if observed impact exceeds the declared blast radius.
6. Prediction Tracking
Estimate vs actual MUST be compared to enable organisational learning:
resilience.estimate.*records predictions before injection.resilience.analysis.estimate_accuracyandresilience.analysis.estimate_recovery_deltacompare predictions to outcomes.- Over time, improving estimate accuracy demonstrates growing system understanding.
resilience.analysis.trendtracks whether resilience is improving, stable, or degrading.
8. Conformance
An implementation is conformant with this standard if:
- All attributes marked Required are present on every experiment span.
- All attributes marked Conditional are present when their stated condition is true.
- All enum values are drawn from the documented closed sets.
- Timestamps use epoch nanoseconds (
int64). - Durations use seconds (
float64). - The fault subtype taxonomy in Section 3 is respected – subtypes are not mixed across types.
- Per-probe attributes follow the
probe.{name}.*naming pattern with lowercase snake_case probe names. - High-cardinality attributes (
resilience.target.component,resilience.estimate.rationale, per-probe attributes) are not used as metric dimensions.
Implementations MAY extend this standard with additional attributes under a vendor-specific namespace (e.g. resilience.vendor.mycompany.*), provided the core resilience.* attributes remain conformant.
10. Scoring Methodology (Optional — 3 Layers)
The scoring model is optional but enables deep analytics when populated. Scores are computed in DuckDB from journal data — not during experiment execution.
Layer 1 — Experiment Quality (resilience.score.quality.*)
Six orthogonal signals computed per experiment. Scientific basis: cyclomatic complexity (McCabe, 1976), FMEA severity (IEC 60812:2018), chaos engineering principles (Basiri et al., 2016).
| Attribute | Type | Description | Values |
|---|---|---|---|
resilience.score.quality.complexity | string | Structural complexity based on step count, config params, rollback presence. | low, medium, high, critical |
resilience.score.quality.blast_radius | string | Fault propagation scope based on fault type analysis. | contained, moderate, broad, systemic |
resilience.score.quality.risk | string | Operational risk combining hypothesis absence, rollback quality, deviation history. | low, medium, high, critical |
resilience.score.quality.rollback | string | Rollback completeness: action present, verification probe, idempotent pattern. | none, weak, good, excellent |
resilience.score.quality.grade | string | Run history grade based on success rate over trailing 10 runs. | A (>=90%), B (>=70%), C (>=50%), D (<50%) |
resilience.score.quality.otel_score | float | Continuous 0-100 score for time-series trending. Weighted combination of complexity, risk, probe count, rollback ratio. | 0.0 to 100.0 |
resilience.score.quality.needs_review | boolean | Automatically set when risk is HIGH/CRITICAL AND blast_radius is BROAD/SYSTEMIC. |
FMEA Extension (resilience.fault.fmea_*)
Optional FMEA (Failure Mode and Effects Analysis) attributes per IEC 60812:2018:
| Attribute | Type | Description |
|---|---|---|
resilience.fault.fmea_severity | int | Severity rating 1-10 (10 = most severe). |
resilience.fault.fmea_detectability | int | Detectability rating 1-10 (10 = hardest to detect). |
resilience.fault.fmea_rpn | int | Risk Priority Number = severity x detectability. |
Layer 2 — DevOps Performance Metrics / Accelerate Four Keys (resilience.score.devops.*)
Operational performance metrics from the Accelerate / Four Keys research programme. Requires external data (CI/CD events, incident management). This layer is an integration point — Tumult does not compute these values directly but accepts them from external systems for correlation.
Scientific basis: Forsgren, Humble, Kim (2018), Accelerate; State of DevOps Reports.
Note: These DevOps performance metrics are distinct from DORA (EU 2022/2554) — the Digital Operational Resilience Act — which is a regulatory compliance framework covered in Layer 3 and the Regulatory Mapping document.
| Attribute | Type | Description |
|---|---|---|
resilience.score.devops.deployment_frequency_per_day | float | Deployments per calendar day. |
resilience.score.devops.lead_time_hours | float | Mean time from commit to production deploy. |
resilience.score.devops.change_failure_rate | float | Proportion of deployments causing incidents (0.0-1.0). |
resilience.score.devops.mttr_hours | float | Mean time to recovery from incidents. |
resilience.score.devops.chaos_coverage_pct | float | Proportion of services with experiment runs in period (0.0-1.0). |
Layer 3 — Regulatory Confidence (resilience.score.regulatory.*)
Per-control compliance confidence. Scientific basis: ISO 31000:2018 risk treatment confidence model. The multiplicative formula ensures a single failing factor reduces overall confidence proportionally.
| Attribute | Type | Description |
|---|---|---|
resilience.score.regulatory.framework | string | Framework identifier (DORA, NIS2, PCI-DSS, ISO-22301, ISO-27001, SOC2). |
resilience.score.regulatory.control_id | string | Specific control (e.g. DORA-Art25.1, PCI-DSS-12.10.2). |
resilience.score.regulatory.confidence | float | Confidence score 0.0-1.0, computed as: base_weight x quality_score x complexity_factor x rollback_factor x min(probe_count/threshold, 1.0) x outcome_factor. |
resilience.score.regulatory.outcome_factor | float | PASSED=1.0, FAILED_ROLLBACK_OK=0.85, FAILED=0.30. A failed experiment with successful rollback is stronger evidence than untested. |
Scoring References
- McCabe, T.J. (1976). A Complexity Measure. IEEE TSE, 2(4), 308-320.
- IEC 60812:2018. Failure modes and effects analysis (FMEA and FMECA).
- Forsgren, N., Humble, J. & Kim, G. (2018). Accelerate. IT Revolution Press.
- ISO 31000:2018. Risk management — Guidelines.
- Basiri, A. et al. (2016). Chaos Engineering. IEEE Software, 33(3), 35-41.
- EU Regulation 2022/2554. Digital Operational Resilience Act (DORA).
Appendix A: resilience.taxonomy.* (Optional Classification)
Domain classification for slicing experiments by what is being tested. Optional — enhances analytics but is not required for experiment execution.
| Attribute | Type | Requirement | Description | Enum Values |
|---|---|---|---|---|
resilience.taxonomy.track | string | Optional | Testing track. | chaos, security, performance, continuity |
resilience.taxonomy.category | string | Optional | Fault injection category. | fault_injection, latency, resource_stress, state_manipulation, dependency_failure |
resilience.taxonomy.domain | string | Optional | Infrastructure domain. | compute, network, storage, data, platform, application |
resilience.taxonomy.component | string | Optional | Specific component class. | cpu, memory, io, disk, broker, primary, replica, pod, node, gateway |
resilience.taxonomy.scenario | string | Optional | Named scenario pattern. | cpu_stress, memory_pressure, io_saturation, partition, failover, leader_election, connection_kill |
Analytics Example
SELECT
taxonomy_domain,
taxonomy_scenario,
AVG(post_recovery_time_s) as avg_mttr,
COUNT(*) as runs
FROM journals
WHERE taxonomy_track = 'chaos'
GROUP BY taxonomy_domain, taxonomy_scenario
ORDER BY avg_mttr DESC
Appendix B: resilience.load.* (Optional Load Integration)
Attributes for correlating experiments with load testing tools (k6, JMeter).
| Attribute | Type | Requirement | Description |
|---|---|---|---|
resilience.load.tool | string | Optional | Load testing tool. Values: k6, jmeter. |
resilience.load.vus | int | Optional | Virtual users / threads. |
resilience.load.duration_s | float | Optional | Load test duration in seconds. |
resilience.load.throughput_rps | float | Optional | Observed requests per second. |
resilience.load.latency_p95_ms | float | Optional | 95th percentile response time in milliseconds. |
resilience.load.error_rate | float | Optional | Error rate 0.0-1.0 during load. |
resilience.load.thresholds_met | boolean | Optional | Whether all load thresholds passed. |
Tumult Metadata Model v2.0. The resilience.* namespace is designed for OTel compatibility and can be adopted by any resilience testing platform.