Database Seeding Tool #
The BonsAI database seeding tool is a fast, SQL-based utility for populating the development database with test extraction data. It creates extractions and their associated data records for performance testing and development.
Overview #
The seed tool is located at /tools/seed and provides:
- Pure SQL execution - No TypeScript runtime, extremely fast
- Configurable via YAML - Edit
.config.yamlto set defaults - CLI overrides - Override any setting via command line flags
- Interactive entity selection - Choose which entity to seed
- Cleanup support - Optionally delete existing seeded data first
- No S3 required - Links to existing document pages in the database
Requirements #
- Running database (via
mise run dev) - At least one entity with a document in the database
misecommand available
Quick Start #
Basic Usage #
# Interactive mode - select entity from list
mise db-seed
# Seed specific entity, skip prompts
mise db-seed -e <entity-uuid> -y
# Seed with cleanup (delete existing seeded data first)
mise db-seed --cleanup -y
Extraction Types #
The seed tool creates six extraction types:
| Type | Description |
|---|---|
| AP Bill | Accounts Payable invoices |
| AR Invoice | Accounts Receivable invoices |
| Bank Statement | Bank statements with transactions |
| AP Credit Note | AP credit notes |
| AR Credit Note | AR credit notes |
| Direct Expense | Direct expense records |
Each extraction includes:
- Extraction record with status, timestamps, metadata
- Associated data record (invoice_data, bank_statement_data, etc.)
- Line items (1-100 per extraction, configurable)
- Link to document page
Configuration #
Config File (.config.yaml)
#
All seed settings are under the seed: section in .config.yaml:
seed:
# Entity selection (leave empty for interactive)
entity_id: ""
# Document page ID (auto-detected if empty)
document_page_id: ""
# Number of extractions per type
counts:
ap_bill: 5000
ar_invoice: 5000
bank_statement: 5000
ap_credit_note: 2500
ar_credit_note: 2500
direct_expense: 5000
# Date range for created_at timestamps
date_range:
min_days_ago: 10
max_days_ago: 1
# Extraction statuses distribution (percentages, must sum to 100)
status_distribution:
VERIFIED: 60
NEEDS_REVIEW: 25
REVIEW_LATER: 10
PENDING: 5
# Currencies (weighted random selection)
currencies:
- code: USD
weight: 50
- code: JPY
weight: 30
- code: EUR
weight: 10
- code: GBP
weight: 10
# Data records per extraction
data_count:
min: 1
max: 10
# Line items per extraction
line_items:
min: 1
max: 100
CLI Options #
Override any config setting via command line:
mise db-seed [OPTIONS]
Options:
-e, --entity-id ID Entity ID to seed
-d, --document-page ID Document page ID to link extractions
-y, --yes Skip confirmation prompts
--cleanup Delete existing seeded data first
--ap-bills N Number of AP bills (default: 50)
--ar-invoices N Number of AR invoices (default: 50)
--bank-statements N Number of bank statements (default: 20)
--ap-credit-notes N Number of AP credit notes (default: 10)
--ar-credit-notes N Number of AR credit notes (default: 10)
--direct-expenses N Number of direct expenses (default: 20)
--min-days N Oldest records N days ago (default: 90)
--max-days N Newest records N days ago (default: 1)
-h, --help Show help message
Usage Patterns #
Performance Testing #
Create large dataset for load testing:
# Edit .config.yaml with high counts, then:
mise db-seed -e <entity-uuid> -y --cleanup
Feature Development #
Create small dataset for quick iteration:
mise db-seed --ap-bills 10 --ar-invoices 10 -y
Balanced Test Data #
mise db-seed \
--ap-bills 100 \
--ar-invoices 100 \
--bank-statements 50 \
--ap-credit-notes 25 \
--ar-credit-notes 25 \
--direct-expenses 50 \
-y
How It Works #
Architecture #
The seed tool uses pure SQL for maximum performance:
00_init.sql- Creates temporary functions for data generation01_cleanup.sql- Deletes existing seeded data (if--cleanup)02_seed_extractions.sql- Main seeding logic usinggenerate_seriesdrop_seed_schema.sql- Cleans up temporary functions
Data Generation #
- Uses PostgreSQL’s
generate_seriesfor bulk inserts - Random data generated via
random()and array selections - UUIDs generated with
gen_random_uuid() - Timestamps calculated relative to current time
- All records tagged with
seeded: truein metadata for easy cleanup
Document Page Linking #
Since we don’t have S3 access in local dev:
- Tool finds an existing
document_pagefor the entity - Falls back to any
document_pageif none found for entity - All extractions link to this page via
extraction_document_page
Database Reset #
To completely reset the database (not just seeded data):
mise db-reset
This will:
- Check Doppler config is
dev_local(safety check) - Prompt for confirmation (type “yes”)
- Stop database container and delete volume
- Restart database and run migrations
- Sync with Clerk
Troubleshooting #
No Entities Found #
# Sync with Clerk to create entities
mise run clerk-sync
No Document Pages Found #
The seed tool requires at least one document page. Options:
- Upload a document via the webapp
Connection Errors #
Ensure database is running:
docker ps | grep database
# If not running:
mise run dev
Cleanup Fails #
Manually clean up seeded data:
mise db-sql "DELETE FROM extraction WHERE metadata->>'seeded' = 'true';"
File Structure #
tools/seed/
├── seed.sh # Main entry point
├── src/
│ └── db-sql.ts # SQL execution utility
└── sql/
├── 00_init.sql # Function definitions
├── 01_cleanup.sql # Cleanup queries
├── 02_seed_extractions.sql # Main seeding logic
├── drop_seed_schema.sql # Cleanup functions
├── get_entity.sql # Entity lookup
├── list_entities.sql # Entity listing
├── find_document_page.sql # Document page search
└── ...
Related Commands #
| Command | Description |
|---|---|
mise db-seed |
Seed database with test data |
mise db-reset |
Reset database completely |
mise db-sql "..." |
Execute raw SQL |
mise run clerk-sync |
Sync with Clerk |
mise run dev |
Start development environment |
Related Documentation #
- Database Migration - Schema management
- Development Workflow - Overall development process
- E2E Testing - Using seeded data in tests