ADR-005: Embedded Analytics with Arrow, DuckDB, and Persistent Storage

Status

Accepted

Context

Tumult produces structured journal data from every experiment. Users need SQL analytics for trend analysis, compliance reporting, and operational insights — without requiring external database infrastructure. Analytics must work both ad-hoc (single query) and persistently (accumulated history across all runs).

Decision

Data Pipeline

Journal (.toon) → Arrow RecordBatch → DuckDB (SQL) → Parquet/CSV (export)

Apache Arrow as Interchange Format

Arrow RecordBatch is the in-memory columnar representation. Two schemas:

  • experiments — one row per journal (12 columns: id, title, status, timestamps, counts, hypothesis results, analysis metrics)
  • activity_results — one row per activity execution (9 columns including phase label)

Zero-copy ingestion via DuckDB’s appender-arrow feature.

DuckDB as Embedded SQL Engine

DuckDB (bundled build) compiles into the Tumult binary. No server installation, no connection management, no ports.

  • In-memory mode for ad-hoc queries (tumult analyze journals/)
  • Persistent mode at ~/.tumult/analytics.duckdb for accumulated history

Persistent Store

Every tumult run automatically ingests the journal into the persistent store (disable with --no-ingest). Key properties:

  • Incremental ingestion — dedup by experiment_id via unique index
  • WAL mode — crash safety for file-backed DuckDB (default behavior)
  • Schema versioningschema_meta table tracks version for future migrations
  • Retentiontumult store purge --older-than-days N prevents unbounded growth

Backup and Restore

  • tumult store backup exports both tables to ZSTD-compressed Parquet files
  • tumult import <dir> restores from Parquet backup
  • Round-trip integrity verified by tests

Export Formats

Format Compression Use Case
Parquet ZSTD Long-term storage, BI tools (Spark, Polars, pandas)
Arrow IPC None Language interop, streaming
CSV None Spreadsheets, simple downstream

Consequences

  • Binary size increases ~15-20MB for DuckDB bundled + Arrow
  • Users get SQL analytics without external dependencies
  • Persistent store grows over time — retention policy is essential
  • Parquet export enables integration with any data ecosystem
  • tumult analyze without a path queries accumulated history directly

Tumult is open source under the Apache-2.0 license.