The Plugin System: Chaos Engineering Without a Rust Toolchain

Part 4 of the Tumult series. ← Part 3: Built-In Proof: Native Observability

Extensibility is where chaos engineering platforms have historically been most limiting. You want to inject a fault specific to your infrastructure — say, a Kafka broker partition or a Redis cache flush — and you find yourself either writing a complex Python extension, relying on a community plugin that is two years out of date, or resorting to shell scripts bolted to the side of your experiment definition.

Tumult approaches this differently. The plugin system has two layers: script plugins that anyone can write, and native Rust plugins that integrate deeply with SDKs. You choose the right layer for your use case, and the engine handles the rest.

Two Layers, One Interface

From the experiment definition’s perspective, a script plugin and a native plugin look identical:

# Using a native Rust plugin (tumult-kubernetes)
- name: delete-api-pod
  activity_type: action
  provider:
    type: native
    plugin: tumult-kubernetes
    function: delete_pod
    arguments:
      namespace: production
      name: api-server-7b8c9d-xk2p1

# Using a script plugin (tumult-network)
- name: inject-latency
  activity_type: action
  provider:
    type: process
    path: plugins/tumult-network/actions/add-latency.sh
    env:
      TUMULT_INTERFACE: eth0
      TUMULT_DELAY_MS: "200"
      TUMULT_JITTER_MS: "20"

The experiment author does not need to know whether the plugin is compiled Rust or a bash script. The engine executes both, captures stdout/stderr, records timing, and emits OTel spans with identical structure. The journal entry for a script action and a native action are indistinguishable.

Script Plugins: The Community Layer

Script plugins are the primary path for community contributions and team-specific extensions. The contract is simple: a directory with a manifest and executable scripts.

Directory structure

tumult-nginx/
├── plugin.toon              # manifest: declares actions and probes
├── actions/
│   ├── kill-worker.sh       # action: kill an nginx worker process
│   └── reload-config.sh     # action: force config reload
├── probes/
│   └── connection-count.sh  # probe: count active connections
└── README.md

The manifest

The plugin.toon manifest declares what the plugin can do:

{
  "name": "tumult-nginx",
  "version": "0.1.0",
  "description": "Nginx chaos actions and probes",
  "actions": [
    {
      "name": "kill-worker",
      "script": "actions/kill-worker.sh",
      "description": "Kill an nginx worker process"
    },
    {
      "name": "reload-config",
      "script": "actions/reload-config.sh",
      "description": "Force nginx config reload"
    }
  ],
  "probes": [
    {
      "name": "connection-count",
      "script": "probes/connection-count.sh",
      "description": "Count active nginx connections"
    }
  ]
}

The script contract

Scripts communicate with the engine through a simple, consistent interface:

Input: Arguments are passed as environment variables with the TUMULT_ prefix, uppercased. An argument worker_pid in the experiment becomes TUMULT_WORKER_PID in the script.

Output:

stdout — the result (captured by the engine, stored in the journal)
stderr — diagnostic messages (captured, logged, not stored as result)
Exit code 0 = success, non-zero = failure

Example action script (kill an nginx worker):

#!/bin/bash
set -euo pipefail

WORKER_PID="${TUMULT_WORKER_PID:-$(pgrep -f 'nginx: worker' | head -1)}"

if [ -z "$WORKER_PID" ]; then
    echo "no nginx worker process found" >&2
    exit 1
fi

kill -9 "$WORKER_PID"
echo "killed nginx worker pid=$WORKER_PID"

Example probe script (count nginx connections):

#!/bin/bash
set -euo pipefail

CONNECTIONS=$(curl -s http://localhost:8080/nginx_status \
  | grep 'Active connections' \
  | awk '{print $3}')

if [ -z "$CONNECTIONS" ]; then
    echo "failed to read nginx status" >&2
    exit 1
fi

echo "$CONNECTIONS"

The probe outputs a number. The experiment’s tolerance definition determines whether that number represents a healthy state:

- name: check-connections
  activity_type: probe
  provider:
    type: process
    path: plugins/tumult-nginx/probes/connection-count.sh
  tolerance:
    type: range
    from: 0
    to: 1000

What Script Plugins Enable

The script approach means that any infrastructure capability expressible in a shell script is a Tumult plugin. Consider what that covers:

Existing plugins in the ecosystem:

Plugin	Fault	Probe
`tumult-stress`	`stress-ng` CPU/memory/IO stress	CPU utilization, memory pressure
`tumult-containers`	Docker/Podman kill, stop, pause, resource limits	Container health, restart count
`tumult-process`	Kill/suspend/resume process by PID or name	Process status, CPU usage
`tumult-network`	tc netem latency/loss/corruption, DNS block, iptables partition	Ping latency, DNS resolution
`tumult-db-postgres`	Kill connections, lock tables, exhaust connection pool	Active connections, query latency
`tumult-db-redis`	FLUSHALL, CLIENT PAUSE, DEBUG SLEEP	Memory usage, connected clients
`tumult-kafka`	Kill broker, add partition latency	Consumer lag, broker availability

And anything you can write:

HAProxy backend removal
Vault secret revocation
S3 bucket policy modification
Custom health endpoint failures
Feature flag manipulation via your internal API

If you can express it in a script, it is a Tumult plugin.

A Complete Script Plugin Example: HAProxy Backend Removal

Here is a realistic custom plugin that removes a backend server from HAProxy’s active pool — simulating a graceful backend removal — and verifies the remaining backends are healthy.

Directory structure:

tumult-haproxy/
├── plugin.toon
├── actions/
│   ├── disable-backend.sh
│   └── enable-backend.sh
└── probes/
    └── active-backends.sh

plugin.toon:

{
  "name": "tumult-haproxy",
  "version": "0.1.0",
  "description": "HAProxy chaos actions and probes",
  "actions": [
    {
      "name": "disable-backend",
      "script": "actions/disable-backend.sh",
      "description": "Disable a backend server in HAProxy"
    },
    {
      "name": "enable-backend",
      "script": "actions/enable-backend.sh",
      "description": "Re-enable a backend server in HAProxy"
    }
  ],
  "probes": [
    {
      "name": "active-backends",
      "script": "probes/active-backends.sh",
      "description": "Count active backends in a HAProxy backend pool"
    }
  ]
}

actions/disable-backend.sh:

#!/bin/bash
set -euo pipefail

BACKEND="${TUMULT_BACKEND:?TUMULT_BACKEND is required}"
SERVER="${TUMULT_SERVER:?TUMULT_SERVER is required}"
SOCKET="${TUMULT_SOCKET:-/run/haproxy/admin.sock}"

echo "set server ${BACKEND}/${SERVER} state maint" \
  | socat stdio "${SOCKET}"

echo "disabled ${BACKEND}/${SERVER}"

probes/active-backends.sh:

#!/bin/bash
set -euo pipefail

BACKEND="${TUMULT_BACKEND:?TUMULT_BACKEND is required}"
SOCKET="${TUMULT_SOCKET:-/run/haproxy/admin.sock}"

COUNT=$(echo "show servers state ${BACKEND}" \
  | socat stdio "${SOCKET}" \
  | awk 'NR>2 && $6==2 {count++} END {print count+0}')

echo "$COUNT"

Experiment using the plugin:

title: HAProxy backend removal maintains service
description: Remove one backend and verify requests continue

tags[2]: haproxy, resilience

steady_state_hypothesis:
  title: All backends healthy
  probes[1]:
    - name: backends-available
      activity_type: probe
      provider:
        type: process
        path: plugins/tumult-haproxy/probes/active-backends.sh
        env:
          TUMULT_BACKEND: web-servers
      tolerance:
        type: range
        from: 2
        to: 10

method[1]:
  - name: disable-web-03
    activity_type: action
    provider:
      type: process
      path: plugins/tumult-haproxy/actions/disable-backend.sh
      env:
        TUMULT_BACKEND: web-servers
        TUMULT_SERVER: web-03
    pause_after_s: 10.0

rollbacks[1]:
  - name: re-enable-web-03
    activity_type: action
    provider:
      type: process
      path: plugins/tumult-haproxy/actions/enable-backend.sh
      env:
        TUMULT_BACKEND: web-servers
        TUMULT_SERVER: web-03

Native Plugins: When You Need SDK Depth

Script plugins cover most use cases. But some integrations require deep SDK access that bash cannot easily provide: Kubernetes API calls with authentication, cloud provider SDK operations, database driver-level fault injection.

For these, Tumult supports native Rust plugins compiled as feature-flagged crates:

# Build with Kubernetes and SSH support
cargo install tumult --features kubernetes,ssh

# Build with all native plugins
cargo install tumult --features kubernetes,ssh,analytics

The native plugin interface is the ChaosPlugin trait in tumult-plugin. A native plugin implements actions and probes as async Rust functions with typed arguments. The tumult-kubernetes plugin, for example, uses kube-rs — a full async Kubernetes client — for operations that would be difficult to express reliably in bash:

Pod deletion with configurable grace periods
Deployment scaling with wait-for-convergence
Network policy application and removal
Node drain with proper eviction handling

Plugin Discovery

Tumult discovers plugins from three locations, in order:

./plugins/           — local to the experiment directory
~/.tumult/plugins/   — user-global (persistent across projects)
$TUMULT_PLUGIN_PATH  — colon-separated custom paths

First-found-wins. If the same plugin name appears in multiple paths, the first-discovered version is used.

Verify what is discovered:

# List all discovered plugins
tumult discover

# Show details for a specific plugin
tumult discover --plugin tumult-haproxy

Output:

tumult-haproxy v0.1.0
  HAProxy chaos actions and probes
  Path: ./plugins/tumult-haproxy

  Actions:
    disable-backend  Disable a backend server in HAProxy
    enable-backend   Re-enable a backend server in HAProxy

  Probes:
    active-backends  Count active backends in a HAProxy backend pool

The No-Rust Requirement for Community Plugins

The design principle here is worth stating directly: anyone who can write a bash script can write a Tumult plugin. There is no Rust toolchain requirement, no build step, no compilation. You write scripts, write a manifest, and the plugin works.

This is the same philosophy that made Chaos Toolkit’s Python extension model successful — lower the barrier to contribution and the ecosystem grows. But Tumult’s script model goes further: because the engine is a single Rust binary with no runtime dependencies, plugins run in pristine environments without Python path conflicts or dependency collisions. The script executes, the engine captures the result, and the journal records it.

For platform teams building shared plugin libraries for their engineering organization, this model means plugins can live in a git repository alongside the experiments that use them, with no build infrastructure required for contributors.

Next in the series: Part 5 — Writing Your First Experiment: The TOON Format in Depth →