Database Seeding Tool #

The BonsAI database seeding tool is a fast, SQL-based utility for populating the development database with test extraction data. It creates extractions and their associated data records for performance testing and development.

Overview #

The seed tool is located at /tools/seed and provides:

Pure SQL execution - No TypeScript runtime, extremely fast
Configurable via YAML - Edit .config.yaml to set defaults
CLI overrides - Override any setting via command line flags
Interactive entity selection - Choose which entity to seed
Cleanup support - Optionally delete existing seeded data first
No S3 required - Links to existing document pages in the database

Requirements #

Running database (via mise run dev)
At least one entity with a document in the database
mise command available

Quick Start #

Basic Usage #

# Interactive mode - select entity from list
mise db-seed

# Seed specific entity, skip prompts
mise db-seed -e <entity-uuid> -y

# Seed with cleanup (delete existing seeded data first)
mise db-seed --cleanup -y

Extraction Types #

The seed tool creates six extraction types:

Type	Description
AP Bill	Accounts Payable invoices
AR Invoice	Accounts Receivable invoices
Bank Statement	Bank statements with transactions
AP Credit Note	AP credit notes
AR Credit Note	AR credit notes
Direct Expense	Direct expense records

Each extraction includes:

Extraction record with status, timestamps, metadata
Associated data record (invoice_data, bank_statement_data, etc.)
Line items (1-100 per extraction, configurable)
Link to document page

Configuration #

Config File (`.config.yaml`) #

All seed settings are under the seed: section in .config.yaml:

seed:
  # Entity selection (leave empty for interactive)
  entity_id: ""

  # Document page ID (auto-detected if empty)
  document_page_id: ""

  # Number of extractions per type
  counts:
    ap_bill: 5000
    ar_invoice: 5000
    bank_statement: 5000
    ap_credit_note: 2500
    ar_credit_note: 2500
    direct_expense: 5000

  # Date range for created_at timestamps
  date_range:
    min_days_ago: 10
    max_days_ago: 1

  # Extraction statuses distribution (percentages, must sum to 100)
  status_distribution:
    VERIFIED: 60
    NEEDS_REVIEW: 25
    REVIEW_LATER: 10
    PENDING: 5

  # Currencies (weighted random selection)
  currencies:
    - code: USD
      weight: 50
    - code: JPY
      weight: 30
    - code: EUR
      weight: 10
    - code: GBP
      weight: 10

  # Data records per extraction
  data_count:
    min: 1
    max: 10

  # Line items per extraction
  line_items:
    min: 1
    max: 100

CLI Options #

Override any config setting via command line:

mise db-seed [OPTIONS]

Options:
  -e, --entity-id ID        Entity ID to seed
  -d, --document-page ID    Document page ID to link extractions
  -y, --yes                 Skip confirmation prompts
  --cleanup                 Delete existing seeded data first
  --ap-bills N              Number of AP bills (default: 50)
  --ar-invoices N           Number of AR invoices (default: 50)
  --bank-statements N       Number of bank statements (default: 20)
  --ap-credit-notes N       Number of AP credit notes (default: 10)
  --ar-credit-notes N       Number of AR credit notes (default: 10)
  --direct-expenses N       Number of direct expenses (default: 20)
  --min-days N              Oldest records N days ago (default: 90)
  --max-days N              Newest records N days ago (default: 1)
  -h, --help                Show help message

Usage Patterns #

Performance Testing #

Create large dataset for load testing:

# Edit .config.yaml with high counts, then:
mise db-seed -e <entity-uuid> -y --cleanup

Feature Development #

Create small dataset for quick iteration:

mise db-seed --ap-bills 10 --ar-invoices 10 -y

Balanced Test Data #

mise db-seed \
  --ap-bills 100 \
  --ar-invoices 100 \
  --bank-statements 50 \
  --ap-credit-notes 25 \
  --ar-credit-notes 25 \
  --direct-expenses 50 \
  -y

How It Works #

Architecture #

The seed tool uses pure SQL for maximum performance:

00_init.sql - Creates temporary functions for data generation
01_cleanup.sql - Deletes existing seeded data (if --cleanup)
02_seed_extractions.sql - Main seeding logic using generate_series
drop_seed_schema.sql - Cleans up temporary functions

Data Generation #

Uses PostgreSQL’s generate_series for bulk inserts
Random data generated via random() and array selections
UUIDs generated with gen_random_uuid()
Timestamps calculated relative to current time
All records tagged with seeded: true in metadata for easy cleanup

Document Page Linking #

Since we don’t have S3 access in local dev:

Tool finds an existing document_page for the entity
Falls back to any document_page if none found for entity
All extractions link to this page via extraction_document_page

Database Reset #

To completely reset the database (not just seeded data):

mise db-reset

This will:

Check Doppler config is dev_local (safety check)
Prompt for confirmation (type “yes”)
Stop database container and delete volume
Restart database and run migrations
Sync with Clerk

Troubleshooting #

No Entities Found #

# Sync with Clerk to create entities
mise run clerk-sync

No Document Pages Found #

The seed tool requires at least one document page. Options:

Upload a document via the webapp

Connection Errors #

Ensure database is running:

docker ps | grep database
# If not running:
mise run dev

Cleanup Fails #

Manually clean up seeded data:

mise db-sql "DELETE FROM extraction WHERE metadata->>'seeded' = 'true';"

File Structure #

tools/seed/
├── seed.sh                    # Main entry point
├── src/
│   └── db-sql.ts             # SQL execution utility
└── sql/
    ├── 00_init.sql           # Function definitions
    ├── 01_cleanup.sql        # Cleanup queries
    ├── 02_seed_extractions.sql # Main seeding logic
    ├── drop_seed_schema.sql  # Cleanup functions
    ├── get_entity.sql        # Entity lookup
    ├── list_entities.sql     # Entity listing
    ├── find_document_page.sql # Document page search
    └── ...

Command	Description
`mise db-seed`	Seed database with test data
`mise db-reset`	Reset database completely
`mise db-sql "..."`	Execute raw SQL
`mise run clerk-sync`	Sync with Clerk
`mise run dev`	Start development environment

Database Migration - Schema management
Development Workflow - Overall development process
E2E Testing - Using seeded data in tests