Kubernetes Chaos: Deep Fault Injection with tumult-kubernetes


Part 7 of the Tumult series. ← Part 6: Data-Driven Chaos: Analytics Pipeline


Kubernetes has become the dominant platform for running production workloads, and chaos engineering for Kubernetes-native systems requires first-class Kubernetes API access. Shell scripts wrapping kubectl work up to a point, but they introduce fragility: dependency on the kubectl binary version, error handling through text parsing, and no access to the Kubernetes watch API for precise timing.

tumult-kubernetes is a native Rust plugin using kube-rs — a full async Kubernetes client — for deep, typed fault injection without the kubectl dependency.


What tumult-kubernetes Can Do

Capability         Action                  Description
Pod chaos          delete_pod              Immediate or graceful pod deletion
Deployment chaos   scale_deployment        Scale replicas to zero, down, or up
Node chaos         cordon_node             Mark node unschedulable
                   uncordon_node           Restore node schedulability
                   drain_node              Cordon + evict all non-DaemonSet pods
Network chaos      apply_network_policy    Create network isolation policies
                   delete_network_policy   Remove network isolation
Probes             pod_is_ready            Is a specific pod ready?
                   all_pods_ready          Are all pods matching a label selector ready?
                   deployment_is_ready     Is a deployment fully available?
                   node_status             Node conditions and schedulability
                   service_has_endpoints   Does a service have healthy backends?
                   count_pods_in_phase     Count pods in a specific phase
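Every capability in the table is invoked through the same provider shape. As an illustration, here is a probe using node_status, the one function not exercised by the scenarios below (the `name` argument is assumed by analogy with the other node functions, and the node name is made up):

```
- name: check-worker-node
  activity_type: probe
  provider:
    type: native
    plugin: tumult-kubernetes
    function: node_status
    arguments:
      name: worker-01        # illustrative node name
```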

Authentication

tumult-kubernetes supports every authentication method kube-rs provides:

# Use ~/.kube/config (default)
tumult run experiment.toon

# Specify a custom kubeconfig
KUBECONFIG=/path/to/cluster.yaml tumult run experiment.toon

# In-cluster (running inside Kubernetes)
# Automatically detected when KUBERNETES_SERVICE_HOST is set
tumult run experiment.toon

No kubectl required. The Kubernetes API calls happen directly from the Tumult binary using the cluster’s service account or kubeconfig credentials.
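When running in-cluster, the service account the Tumult pod runs under must hold RBAC permissions covering the faults and probes your experiments use. A sketch of a suitable ClusterRole (the name is illustrative, and the rules should be trimmed to the actions you actually invoke):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tumult-chaos                  # hypothetical name
rules:
  # pod chaos and pod/service probes
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints"]
    verbs: ["get", "list", "watch", "delete"]
  # drain_node evicts pods via the eviction subresource
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  # cordon/uncordon patch node.spec.unschedulable
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "patch"]
  # scale_deployment and deployment_is_ready
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "list", "watch", "update", "patch"]
  # apply/delete network policies
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["get", "create", "delete"]
```

Bind it with a ClusterRoleBinding, or a namespaced RoleBinding if your experiments stay inside a single namespace.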


Scenario 1: Pod Deletion — The Most Common Kubernetes Chaos Test

Pod deletion is the “hello world” of Kubernetes chaos testing. Every Kubernetes workload should survive the deletion of individual pods — that is the entire premise of ReplicaSets and Deployments. But many teams discover edge cases only when they run the test: slow readiness probes, missing pod disruption budgets, sticky sessions that break on pod replacement.

title: API deployment survives pod deletion
description: |
  Delete an API pod and verify the deployment recovers within 30 seconds.
  Validates ReplicaSet behavior, readiness probe configuration, and
  load balancer endpoint updates.

tags[3]: kubernetes, pod-chaos, resilience

estimate:
  expected_outcome: recovered
  expected_recovery_s: 30.0
  expected_degradation: minor
  expected_data_loss: false
  confidence: high
  rationale: Deployment has 3 replicas; single pod loss should trigger immediate replacement
  prior_runs: 8

steady_state_hypothesis:
  title: All API pods ready and service endpoints populated
  probes[2]:
    - name: api-deployment-ready
      activity_type: probe
      provider:
        type: native
        plugin: tumult-kubernetes
        function: deployment_is_ready
        arguments:
          namespace: production
          name: api-server
      tolerance:
        type: exact
        value: true

    - name: api-service-has-endpoints
      activity_type: probe
      provider:
        type: native
        plugin: tumult-kubernetes
        function: service_has_endpoints
        arguments:
          namespace: production
          name: api-service
      tolerance:
        type: exact
        value: true

method[1]:
  - name: delete-api-pod
    activity_type: action
    provider:
      type: native
      plugin: tumult-kubernetes
      function: delete_pod
      arguments:
        namespace: production
        name: api-server-7b8c9d-xk2p1
        grace_period_seconds: 0      # immediate kill, no graceful shutdown
    pause_after_s: 5.0

rollbacks[1]:
  - name: ensure-deployment-scaled
    activity_type: action
    provider:
      type: native
      plugin: tumult-kubernetes
      function: scale_deployment
      arguments:
        namespace: production
        name: api-server
        replicas: 3

Scenario 2: Deployment Scale-to-Zero — Validating Health Check Propagation

Scale-to-zero chaos tests a different failure mode: not a sudden pod death, but a graceful drain. This validates that your load balancer (or the Kubernetes endpoint controller) correctly removes pod endpoints from rotation as pods terminate.

title: Payments service survives scale-to-zero and recovery
description: |
  Scale the payments deployment to zero replicas and verify traffic
  is correctly shed before scaling back up to validate full recovery.
  Tests endpoint propagation, circuit breaker behavior, and graceful
  upstream handling.

tags[3]: kubernetes, scale-chaos, payments

estimate:
  expected_outcome: deviated
  expected_recovery_s: 45.0
  expected_degradation: severe
  expected_data_loss: false
  confidence: medium
  rationale: Scale-to-zero will cause HTTP 503s during the window; recovery depends on Kubernetes scheduler and readiness probes
  prior_runs: 2

steady_state_hypothesis:
  title: Payments API responds successfully
  probes[1]:
    - name: payments-health
      activity_type: probe
      provider:
        type: http
        method: GET
        url: http://payments-service.production.svc.cluster.local/health
        timeout_s: 5.0
      tolerance:
        type: exact
        value: 200

method[2]:
  - name: scale-payments-to-zero
    activity_type: action
    provider:
      type: native
      plugin: tumult-kubernetes
      function: scale_deployment
      arguments:
        namespace: production
        name: payments-api
        replicas: 0
    pause_after_s: 10.0

  - name: check-pods-terminated
    activity_type: probe
    provider:
      type: native
      plugin: tumult-kubernetes
      function: count_pods_in_phase
      arguments:
        namespace: production
        label_selector: app=payments-api
        phase: Running
    tolerance:
      type: exact
      value: 0

rollbacks[1]:
  - name: restore-payments-replicas
    activity_type: action
    provider:
      type: native
      plugin: tumult-kubernetes
      function: scale_deployment
      arguments:
        namespace: production
        name: payments-api
        replicas: 3

regulatory:
  frameworks[1]: DORA
  requirements[1]:
    - id: DORA-Art25
      description: ICT resilience testing
      evidence: Recovery from complete service outage within declared RTO

Scenario 3: Node Drain — Testing Cluster-Level Resilience

Node drain has a higher blast radius than pod deletion. Draining a node evicts all non-DaemonSet pods on that node, which may include pods from multiple deployments. This tests whether the cluster has sufficient capacity to accommodate all evicted workloads on the remaining nodes.
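Draining evicts pods through the Kubernetes eviction API, which respects PodDisruptionBudgets: if an eviction would drop a workload below its budget, the API refuses it and the drain stalls until replacement pods become ready elsewhere. Before running this scenario, it is worth confirming that critical workloads declare a PDB; an illustrative example (names assumed):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb                # illustrative name
  namespace: production
spec:
  minAvailable: 2                     # eviction refused if it would leave fewer than 2 ready pods
  selector:
    matchLabels:
      tier: api
```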

title: Cluster survives node drain
description: |
  Drain one worker node and verify all workloads reschedule successfully.
  Tests node affinity rules, PodDisruptionBudgets, resource requests vs
  available capacity, and scheduling latency.

tags[3]: kubernetes, node-chaos, cluster

estimate:
  expected_outcome: recovered
  expected_recovery_s: 120.0
  expected_degradation: moderate
  expected_data_loss: false
  confidence: medium
  rationale: 3-node cluster; 1 node drain should be absorbed by remaining capacity
  prior_runs: 1

configuration:
  drain_target:
    type: env
    key: CHAOS_NODE_NAME

steady_state_hypothesis:
  title: All critical deployments healthy
  probes[3]:
    - name: api-ready
      activity_type: probe
      provider:
        type: native
        plugin: tumult-kubernetes
        function: all_pods_ready
        arguments:
          namespace: production
          label_selector: tier=api
      tolerance:
        type: exact
        value: true

    - name: worker-ready
      activity_type: probe
      provider:
        type: native
        plugin: tumult-kubernetes
        function: all_pods_ready
        arguments:
          namespace: production
          label_selector: tier=worker
      tolerance:
        type: exact
        value: true

    - name: db-ready
      activity_type: probe
      provider:
        type: native
        plugin: tumult-kubernetes
        function: all_pods_ready
        arguments:
          namespace: production
          label_selector: tier=database
      tolerance:
        type: exact
        value: true

method[1]:
  - name: drain-worker-node
    activity_type: action
    provider:
      type: native
      plugin: tumult-kubernetes
      function: drain_node
      arguments:
        name: ""                     # supplied via configuration.drain_target (CHAOS_NODE_NAME)
        grace_period_seconds: 30
    pause_after_s: 30.0

rollbacks[1]:
  - name: uncordon-node
    activity_type: action
    provider:
      type: native
      plugin: tumult-kubernetes
      function: uncordon_node
      arguments:
        name: ""                     # supplied via configuration.drain_target (CHAOS_NODE_NAME)

Scenario 4: Network Policy — Simulating Network Partitions

Network chaos at the Kubernetes level uses NetworkPolicy resources to create selective network partitions between services. This tests whether your services degrade gracefully when a dependency becomes unreachable, rather than cascading the failure onward.
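NetworkPolicy is an additive whitelist: once any policy selects a pod for ingress, only traffic matching some rule is admitted. The bluntest partition is therefore a policy that selects the dependency and allows nothing, as in this default-deny sketch (the name is illustrative, and enforcement requires a CNI that implements NetworkPolicy, such as Calico or Cilium):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tumult-deny-all-ingress       # illustrative name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: inventory-service
  policyTypes: ["Ingress"]
  # no ingress rules listed: all inbound traffic to matching pods is denied
```

The scenario below achieves the same effect with a whitelist that matches nothing.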

title: Checkout service degrades gracefully when inventory unreachable
description: |
  Apply a NetworkPolicy that blocks traffic from checkout to inventory.
  Verify checkout falls back to cached inventory and continues processing
  orders without complete failure.

tags[3]: kubernetes, network-chaos, checkout

method[1]:
  - name: partition-inventory
    activity_type: action
    provider:
      type: native
      plugin: tumult-kubernetes
      function: apply_network_policy
      arguments:
        namespace: production
        policy:
          apiVersion: networking.k8s.io/v1
          kind: NetworkPolicy
          metadata:
            name: tumult-partition-inventory
          spec:
            podSelector:
              matchLabels:
                app: inventory-service
            ingress:
              - from:
                  - podSelector:
                      matchLabels:
                        app: NOT-checkout-service   # sentinel label no pod carries: the whitelist matches nothing, cutting off checkout
    pause_after_s: 15.0

rollbacks[1]:
  - name: remove-partition
    activity_type: action
    provider:
      type: native
      plugin: tumult-kubernetes
      function: delete_network_policy
      arguments:
        namespace: production
        name: tumult-partition-inventory

Label Selector Targeting

Rather than targeting specific pod names (which change on every deployment), tumult-kubernetes supports label selector targeting for most actions:

# Target any pod matching the label selector
- name: delete-api-pod-by-label
  activity_type: action
  provider:
    type: native
    plugin: tumult-kubernetes
    function: delete_pod
    arguments:
      namespace: production
      label_selector: app=api-server,version=v2
      # Deletes the first matching pod; use with care for multi-pod selections

This makes experiments stable across deployments: the experiment targets whatever pod currently carries app=api-server, not a pod name generated by the ReplicaSet.


Running Against Multiple Environments

Tumult experiments are parameterized through configuration, so the same experiment can run against staging and production with different settings:
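The mechanism is the configuration block shown in the node-drain scenario: each value declares an environment-variable source that the runner resolves at execution time. A fragment of the same shape (key names illustrative):

```
configuration:
  target_namespace:
    type: env
    key: CHAOS_NAMESPACE
  drain_target:
    type: env
    key: CHAOS_NODE_NAME
```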

# Run against staging
CHAOS_NAMESPACE=staging \
  CHAOS_NODE_NAME=staging-worker-02 \
  tumult run node-drain.toon

# Run against production (with approval gate in CI)
CHAOS_NAMESPACE=production \
  CHAOS_NODE_NAME=prod-worker-05 \
  tumult run node-drain.toon --rollback-strategy always

The --rollback-strategy always flag ensures rollbacks execute regardless of outcome — essential for production chaos experiments where leaving the system in a modified state is unacceptable.


What to Watch in the Journal

After a Kubernetes chaos experiment, the journal contains:

status: completed
hypothesis_before_met: true
hypothesis_after_met: true

method_results[1]:
  - name: delete-api-pod
    status: succeeded
    duration_ms: 18
    output: "deleted pod api-server-7b8c9d-xk2p1"

hypothesis_after_results[2]:
  - name: api-deployment-ready
    status: succeeded
    duration_ms: 28734      # 28 seconds to full deployment recovery
    output: "true"
  - name: api-service-has-endpoints
    status: succeeded
    duration_ms: 31201
    output: "true"

The duration of the hypothesis_after probes tells you the actual recovery time — the time from the probe check starting until the deployment fully recovered. This is the real MTTR: not the time until the replacement pod was scheduled, but the time until it was ready to serve traffic.


Next in the series: Part 8 — Statistical Baselines: From Magic Numbers to Data-Derived Tolerances →


Tumult is open source under the Apache-2.0 license.