Database Seeding Tool

Database Seeding Tool #

The BonsAI database seeding tool is a fast, SQL-based utility for populating the development database with test extraction data. It creates extractions and their associated data records for performance testing and development.

Overview #

The seed tool is located at /tools/seed and provides:

  • Pure SQL execution - No TypeScript runtime, extremely fast
  • Configurable via YAML - Edit .config.yaml to set defaults
  • CLI overrides - Override any setting via command line flags
  • Interactive entity selection - Choose which entity to seed
  • Cleanup support - Optionally delete existing seeded data first
  • No S3 required - Links to existing document pages in the database

Requirements #

  • Running database (via mise run dev)
  • At least one entity with a document in the database
  • mise command available

Quick Start #

Basic Usage #

# Interactive mode - select entity from list
mise db-seed

# Seed specific entity, skip prompts
mise db-seed -e <entity-uuid> -y

# Seed with cleanup (delete existing seeded data first)
mise db-seed --cleanup -y

Extraction Types #

The seed tool creates six extraction types:

Type Description
AP Bill Accounts Payable invoices
AR Invoice Accounts Receivable invoices
Bank Statement Bank statements with transactions
AP Credit Note AP credit notes
AR Credit Note AR credit notes
Direct Expense Direct expense records

Each extraction includes:

  • Extraction record with status, timestamps, metadata
  • Associated data record (invoice_data, bank_statement_data, etc.)
  • Line items (1-100 per extraction, configurable)
  • Link to document page

Configuration #

Config File (.config.yaml) #

All seed settings are under the seed: section in .config.yaml:

seed:
  # Entity selection (leave empty for interactive)
  entity_id: ""

  # Document page ID (auto-detected if empty)
  document_page_id: ""

  # Number of extractions per type
  counts:
    ap_bill: 5000
    ar_invoice: 5000
    bank_statement: 5000
    ap_credit_note: 2500
    ar_credit_note: 2500
    direct_expense: 5000

  # Date range for created_at timestamps
  date_range:
    min_days_ago: 10
    max_days_ago: 1

  # Extraction statuses distribution (percentages, must sum to 100)
  status_distribution:
    VERIFIED: 60
    NEEDS_REVIEW: 25
    REVIEW_LATER: 10
    PENDING: 5

  # Currencies (weighted random selection)
  currencies:
    - code: USD
      weight: 50
    - code: JPY
      weight: 30
    - code: EUR
      weight: 10
    - code: GBP
      weight: 10

  # Data records per extraction
  data_count:
    min: 1
    max: 10

  # Line items per extraction
  line_items:
    min: 1
    max: 100

CLI Options #

Override any config setting via command line:

mise db-seed [OPTIONS]

Options:
  -e, --entity-id ID        Entity ID to seed
  -d, --document-page ID    Document page ID to link extractions
  -y, --yes                 Skip confirmation prompts
  --cleanup                 Delete existing seeded data first
  --ap-bills N              Number of AP bills (default: 50)
  --ar-invoices N           Number of AR invoices (default: 50)
  --bank-statements N       Number of bank statements (default: 20)
  --ap-credit-notes N       Number of AP credit notes (default: 10)
  --ar-credit-notes N       Number of AR credit notes (default: 10)
  --direct-expenses N       Number of direct expenses (default: 20)
  --min-days N              Oldest records N days ago (default: 90)
  --max-days N              Newest records N days ago (default: 1)
  -h, --help                Show help message

Usage Patterns #

Performance Testing #

Create large dataset for load testing:

# Edit .config.yaml with high counts, then:
mise db-seed -e <entity-uuid> -y --cleanup

Feature Development #

Create small dataset for quick iteration:

mise db-seed --ap-bills 10 --ar-invoices 10 -y

Balanced Test Data #

mise db-seed \
  --ap-bills 100 \
  --ar-invoices 100 \
  --bank-statements 50 \
  --ap-credit-notes 25 \
  --ar-credit-notes 25 \
  --direct-expenses 50 \
  -y

How It Works #

Architecture #

The seed tool uses pure SQL for maximum performance:

  1. 00_init.sql - Creates temporary functions for data generation
  2. 01_cleanup.sql - Deletes existing seeded data (if --cleanup)
  3. 02_seed_extractions.sql - Main seeding logic using generate_series
  4. drop_seed_schema.sql - Cleans up temporary functions

Data Generation #

  • Uses PostgreSQL’s generate_series for bulk inserts
  • Random data generated via random() and array selections
  • UUIDs generated with gen_random_uuid()
  • Timestamps calculated relative to current time
  • All records tagged with seeded: true in metadata for easy cleanup

Document Page Linking #

Since we don’t have S3 access in local dev:

  1. Tool finds an existing document_page for the entity
  2. Falls back to any document_page if none found for entity
  3. All extractions link to this page via extraction_document_page

Database Reset #

To completely reset the database (not just seeded data):

mise db-reset

This will:

  1. Check Doppler config is dev_local (safety check)
  2. Prompt for confirmation (type “yes”)
  3. Stop database container and delete volume
  4. Restart database and run migrations
  5. Sync with Clerk

Troubleshooting #

No Entities Found #

# Sync with Clerk to create entities
mise run clerk-sync

No Document Pages Found #

The seed tool requires at least one document page. Options:

  1. Upload a document via the webapp

Connection Errors #

Ensure database is running:

docker ps | grep database
# If not running:
mise run dev

Cleanup Fails #

Manually clean up seeded data:

mise db-sql "DELETE FROM extraction WHERE metadata->>'seeded' = 'true';"

File Structure #

tools/seed/
├── seed.sh                    # Main entry point
├── src/
│   └── db-sql.ts             # SQL execution utility
└── sql/
    ├── 00_init.sql           # Function definitions
    ├── 01_cleanup.sql        # Cleanup queries
    ├── 02_seed_extractions.sql # Main seeding logic
    ├── drop_seed_schema.sql  # Cleanup functions
    ├── get_entity.sql        # Entity lookup
    ├── list_entities.sql     # Entity listing
    ├── find_document_page.sql # Document page search
    └── ...
Command Description
mise db-seed Seed database with test data
mise db-reset Reset database completely
mise db-sql "..." Execute raw SQL
mise run clerk-sync Sync with Clerk
mise run dev Start development environment