Database Seeding Tool #
The BonsAI database seeding tool is a TypeScript utility for populating the development database with realistic test data. It creates organizations, users, entities, and various document types to support development and testing workflows.
Overview #
The seed tool lives in tools/seed, relative to the monorepo root, and provides:
- Automated creation of test organizations, users, and entities
- Multiple document types: invoices, bank statements, credit notes, and direct expenses
- Real file uploads to S3-compatible storage
- Proper relationships between all records
- Configurable data volumes via environment variables
- Support for incremental seeding of existing entities
Quick Start #
Using mise (Recommended) #
For Docker environment:
mise run seed-db
For local development:
mise run seed-db-local
Using pnpm #
From monorepo root:
pnpm seed-db
From the seed package:
cd tools/seed
pnpm seed
Document Types #
The seed tool supports four main document extraction types:
1. Invoices #
Creates complete invoice records with:
- Invoice header data (invoice number, date, due date, contact)
- Line items (1-5 per invoice)
- Random extraction statuses (PENDING, EXTRACTED, etc.)
- Links to document pages
Environment Variable: SEED_INVOICES_PER_ENTITY (default: 1000)
2. Bank Statements #
Creates bank statement records with:
- Account information (bank name, account number)
- Transaction data (3-10 transactions per statement)
- Currency and language information
- Extraction metadata
Environment Variable: SEED_BANK_STATEMENTS_PER_ENTITY (default: 1000)
3. Credit Notes #
Creates credit note records with:
- Credit note header data (number, date, contact, reason)
- Line items (1-5 per note)
- References to original invoices
- Tax calculations
Environment Variable: SEED_CREDIT_NOTES_PER_ENTITY (default: 1000)
4. Direct Expenses #
Creates direct expense records with:
- Expense header data (title, description, date, contact)
- Line items (1-5 per expense)
- Payment and spend type classification
- Tax information
Environment Variable: SEED_DIRECT_EXPENSES_PER_ENTITY (default: 1000)
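The per-document ranges listed above (1-5 line items, 3-10 transactions) could be produced by a helper along the following lines. This is a minimal sketch, not the tool's actual generator: the real DataGenerator uses Faker.js, whereas this version swaps in a small seeded PRNG (mulberry32) to stay dependency-free. The names makeRng, intBetween, generateLineItems, and the LineItem fields are hypothetical.

```typescript
// Hypothetical sketch: seeded random line items (not the tool's real generator).
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number; // major currency units, two decimals
  total: number;
}

// mulberry32-style PRNG: deterministic floats in [0, 1) for a given seed.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Inclusive random integer in [min, max].
function intBetween(rng: () => number, min: number, max: number): number {
  return min + Math.floor(rng() * (max - min + 1));
}

// Generates between `min` and `max` rows; defaults match the 1-5 range above.
function generateLineItems(rng: () => number, min = 1, max = 5): LineItem[] {
  const count = intBetween(rng, min, max);
  return Array.from({ length: count }, (_, i) => {
    const quantity = intBetween(rng, 1, 10);
    const unitPriceCents = intBetween(rng, 100, 50000);
    return {
      description: `Item ${i + 1}`,
      quantity,
      unitPrice: unitPriceCents / 100,
      total: (quantity * unitPriceCents) / 100,
    };
  });
}

const rng = makeRng(42);
const invoiceLines = generateLineItems(rng); // 1-5 rows
const statementRows = generateLineItems(rng, 3, 10); // 3-10 rows
console.log(invoiceLines.length, statementRows.length);
```

Seeding the generator makes a run reproducible, which helps when a failing test depends on specific seeded values.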
Configuration #
Environment Variables #
Database Connection #
DB_HOST=localhost # Database host
DB_PORT=5432 # Database port
DB_USER=postgres # Database user
DB_PASSWORD=postgres # Database password
DB_NAME=bonsai # Database name
S3 Storage #
S3_ENDPOINT=http://localhost:4566 # S3 endpoint (LocalStack for dev)
AWS_REGION=us-east-1 # AWS region
AWS_ACCESS_KEY_ID=test # AWS access key
AWS_SECRET_ACCESS_KEY=test # AWS secret key
S3_BUCKET_NAME=bonsai-app # S3 bucket name
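As an illustration of how these variables might be turned into client options, the sketch below builds a plain object shaped like the AWS SDK for JavaScript v3 S3 client configuration (endpoint, region, credentials, forcePathStyle). Which SDK the seed tool actually uses is an assumption, as are the fallback values and the s3SettingsFromEnv name. Path-style addressing is enabled whenever a custom endpoint is set, which is the usual requirement for LocalStack.

```typescript
// Hypothetical sketch: derive S3 client options from the variables above.
type Env = Record<string, string | undefined>;

interface S3Settings {
  bucket: string;
  clientOptions: {
    endpoint?: string;
    region: string;
    forcePathStyle: boolean;
    credentials: { accessKeyId: string; secretAccessKey: string };
  };
}

function s3SettingsFromEnv(env: Env): S3Settings {
  const endpoint = env.S3_ENDPOINT || undefined;
  return {
    // Fallbacks mirror the example values above (assumed, not confirmed defaults).
    bucket: env.S3_BUCKET_NAME || "bonsai-app",
    clientOptions: {
      endpoint,
      region: env.AWS_REGION || "us-east-1",
      // Custom endpoints such as LocalStack generally need path-style URLs.
      forcePathStyle: endpoint !== undefined,
      credentials: {
        accessKeyId: env.AWS_ACCESS_KEY_ID || "test",
        secretAccessKey: env.AWS_SECRET_ACCESS_KEY || "test",
      },
    },
  };
}

// In the real tool the argument would likely be process.env.
console.log(s3SettingsFromEnv({ S3_ENDPOINT: "http://localhost:4566" }));
```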
Organization Setup #
SEED_ORGANIZATIONS=2 # Number of organizations to create
SEED_USERS_PER_ORG=3 # Users per organization
SEED_ENTITIES_PER_ORG=2 # Entities per organization
Document Volumes #
SEED_DOCS_PER_ENTITY=1000 # Base documents (S3 uploads)
SEED_INVOICES_PER_ENTITY=1000 # Invoice extractions
SEED_BANK_STATEMENTS_PER_ENTITY=1000 # Bank statement extractions
SEED_CREDIT_NOTES_PER_ENTITY=1000 # Credit note extractions
SEED_DIRECT_EXPENSES_PER_ENTITY=1000 # Direct expense extractions
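A rough sketch of how these count variables could be read with defaults is shown below. It follows the parseInt-with-fallback pattern described under Adding New Document Types, but adds validation that the real config.ts may not perform. The readCount and loadCounts names, the SeedCounts shape, and the treatment of the Organization Setup values as defaults are all assumptions.

```typescript
// Hypothetical sketch: read seed counts from an environment-like object.
type Env = Record<string, string | undefined>;

interface SeedCounts {
  organizations: number;
  usersPerOrg: number;
  entitiesPerOrg: number;
  docsPerEntity: number;
  invoicesPerEntity: number;
  bankStatementsPerEntity: number;
  creditNotesPerEntity: number;
  directExpensesPerEntity: number;
}

// Returns the fallback when the variable is unset or blank; rejects
// anything that is not a non-negative integer (0 disables a type).
function readCount(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadCounts(env: Env): SeedCounts {
  return {
    organizations: readCount(env, "SEED_ORGANIZATIONS", 2),
    usersPerOrg: readCount(env, "SEED_USERS_PER_ORG", 3),
    entitiesPerOrg: readCount(env, "SEED_ENTITIES_PER_ORG", 2),
    docsPerEntity: readCount(env, "SEED_DOCS_PER_ENTITY", 1000),
    invoicesPerEntity: readCount(env, "SEED_INVOICES_PER_ENTITY", 1000),
    bankStatementsPerEntity: readCount(env, "SEED_BANK_STATEMENTS_PER_ENTITY", 1000),
    creditNotesPerEntity: readCount(env, "SEED_CREDIT_NOTES_PER_ENTITY", 1000),
    directExpensesPerEntity: readCount(env, "SEED_DIRECT_EXPENSES_PER_ENTITY", 1000),
  };
}

// In the real tool the argument would likely be process.env.
console.log(loadCounts({ SEED_INVOICES_PER_ENTITY: "10", SEED_CREDIT_NOTES_PER_ENTITY: "0" }));
```

Failing fast on a malformed value avoids silently seeding with NaN counts.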
Usage Patterns #
Creating Test Data for a Specific Feature #
When working on invoice-related features:
SEED_DOCS_PER_ENTITY=10 \
SEED_INVOICES_PER_ENTITY=10 \
SEED_BANK_STATEMENTS_PER_ENTITY=0 \
SEED_CREDIT_NOTES_PER_ENTITY=0 \
SEED_DIRECT_EXPENSES_PER_ENTITY=0 \
pnpm seed
Creating Balanced Test Data #
For general development with all document types:
SEED_DOCS_PER_ENTITY=20 \
SEED_INVOICES_PER_ENTITY=15 \
SEED_BANK_STATEMENTS_PER_ENTITY=10 \
SEED_CREDIT_NOTES_PER_ENTITY=5 \
SEED_DIRECT_EXPENSES_PER_ENTITY=10 \
pnpm seed
Performance Testing #
For large dataset testing:
SEED_DOCS_PER_ENTITY=5000 \
SEED_INVOICES_PER_ENTITY=2000 \
SEED_BANK_STATEMENTS_PER_ENTITY=1500 \
SEED_CREDIT_NOTES_PER_ENTITY=1000 \
SEED_DIRECT_EXPENSES_PER_ENTITY=1500 \
pnpm seed
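Before launching a large run it can help to estimate how many rows the settings imply. The sketch below multiplies the per-entity counts by the number of entities and applies the line-item ranges from the Document Types section. It assumes one record per extraction and that bank statement transactions are stored as line rows; estimateVolumes and VolumeConfig are hypothetical names, not part of the tool.

```typescript
// Hypothetical sketch: estimate row counts implied by the seed settings.
interface VolumeConfig {
  organizations: number;
  entitiesPerOrg: number;
  docsPerEntity: number;
  invoicesPerEntity: number;
  bankStatementsPerEntity: number;
  creditNotesPerEntity: number;
  directExpensesPerEntity: number;
}

interface VolumeEstimate {
  entities: number;
  documents: number;
  extractions: number;
  minLineRows: number;
  maxLineRows: number;
}

function estimateVolumes(c: VolumeConfig): VolumeEstimate {
  const entities = c.organizations * c.entitiesPerOrg;
  // Invoices, credit notes, and direct expenses carry 1-5 line items each.
  const lineItemDocs =
    c.invoicesPerEntity + c.creditNotesPerEntity + c.directExpensesPerEntity;
  const extractionsPerEntity = lineItemDocs + c.bankStatementsPerEntity;
  return {
    entities,
    documents: entities * c.docsPerEntity,
    extractions: entities * extractionsPerEntity,
    // Bank statements carry 3-10 transactions each.
    minLineRows: entities * (lineItemDocs * 1 + c.bankStatementsPerEntity * 3),
    maxLineRows: entities * (lineItemDocs * 5 + c.bankStatementsPerEntity * 10),
  };
}

// Performance-testing values from above, with 2 organizations x 2 entities.
console.log(
  estimateVolumes({
    organizations: 2,
    entitiesPerOrg: 2,
    docsPerEntity: 5000,
    invoicesPerEntity: 2000,
    bankStatementsPerEntity: 1500,
    creditNotesPerEntity: 1000,
    directExpensesPerEntity: 1500,
  })
);
```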
Automated Seeding (CI/CD) #
Skip interactive prompts:
pnpm seed -- --no-prompt
Architecture #
Key Components #
- DatabaseSeeder (database-seeder.ts)
  - Main orchestrator
  - Handles workflow logic (create new vs. use existing entities)
  - Manages transaction rollback on errors
- DataSeeder (data-seeder.ts)
  - Creates actual database records
  - Implements methods for each entity type
  - Handles S3 file uploads
- DataGenerator (data-generator.ts)
  - Generates realistic fake data using Faker.js
  - Creates IDs, names, amounts, dates, etc.
  - Ensures data consistency
- EntityManager (entity-manager.ts)
  - Manages existing entities
  - Handles data cleanup
  - Provides entity selection prompts
- ReferenceDataManager (reference-data-manager.ts)
  - Loads reference data (countries, currencies, languages)
  - Caches reference data for performance
- S3FileManager (s3-file-manager.ts)
  - Uploads sample files to S3
  - Ensures bucket exists
  - Generates proper file paths
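The rollback-on-error behaviour attributed to DatabaseSeeder above typically follows a BEGIN / COMMIT / ROLLBACK wrapper. The sketch below shows that general pattern against a minimal client interface; it is not taken from database-seeder.ts. SqlClient, withTransaction, RecordingClient, and the SQL in the demo are hypothetical and exist only for illustration.

```typescript
// Hypothetical sketch: run seeding work inside a transaction.
interface SqlClient {
  query(sql: string): Promise<unknown>;
}

// Commits if `work` resolves, rolls back and rethrows if it rejects.
async function withTransaction<T>(
  client: SqlClient,
  work: (client: SqlClient) => Promise<T>
): Promise<T> {
  await client.query("BEGIN");
  try {
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (error) {
    await client.query("ROLLBACK");
    throw error;
  }
}

// In-memory stand-in that records the statements it receives.
class RecordingClient implements SqlClient {
  readonly statements: string[] = [];
  async query(sql: string): Promise<unknown> {
    this.statements.push(sql);
    return undefined;
  }
}

const demoClient = new RecordingClient();
withTransaction(demoClient, async (tx) => {
  await tx.query("INSERT INTO organization DEFAULT VALUES");
  throw new Error("simulated failure");
}).catch(() => {
  // The recorded statements end with ROLLBACK instead of COMMIT.
  console.log(demoClient.statements);
});
```

Wrapping each seeding run this way means a failure partway through leaves no half-created organizations or entities behind.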
Data Flow #
- Check for existing organizations/entities
- If none exist, create from scratch:
  - Organizations → Users → Entities → Documents → Document Types → Permissions
- If they exist, optionally:
  - Select specific entity
  - Process all entities
  - Clear and reseed
- For each entity:
  - Create base documents (uploaded to S3)
  - Create extractions for each document type
  - Create document type records (invoice, bank statement, etc.)
  - Create associated data and line items
  - Link documents via extraction_document_page
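The branching in the flow above can be pictured as a small pure function that turns the current state and the CLI flags into a plan. This is a sketch under assumptions, not the orchestrator's real logic: planSeeding, SeedOptions, and SeedPlan are hypothetical names, and treating --no-prompt on its own as "process all entities" is a guess.

```typescript
// Hypothetical sketch: decide what a seeding run should do.
interface SeedOptions {
  createNew: boolean;
  processAll: boolean;
  noPrompt: boolean;
  entityId?: string;
}

type SeedPlan =
  | { kind: "create-from-scratch" }
  | { kind: "seed-entities"; entityIds: string[] }
  | { kind: "prompt-for-entity"; candidates: string[] };

function planSeeding(existingEntityIds: string[], options: SeedOptions): SeedPlan {
  // Nothing in the database yet, or a reset was requested.
  if (existingEntityIds.length === 0 || options.createNew) {
    return { kind: "create-from-scratch" };
  }
  // A single entity was targeted explicitly.
  if (options.entityId !== undefined) {
    if (!existingEntityIds.includes(options.entityId)) {
      throw new Error(`Unknown entity: ${options.entityId}`);
    }
    return { kind: "seed-entities", entityIds: [options.entityId] };
  }
  // Assumption: without a prompt there is no way to pick, so process all.
  if (options.processAll || options.noPrompt) {
    return { kind: "seed-entities", entityIds: [...existingEntityIds] };
  }
  // Interactive mode: let the user choose from the existing entities.
  return { kind: "prompt-for-entity", candidates: [...existingEntityIds] };
}

console.log(
  planSeeding(["e1", "e2"], { createNew: false, processAll: true, noPrompt: true })
);
```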
Database Schema #
Each document type follows a similar pattern:
extraction (type: INVOICE|BANK_STATEMENT|AR_CREDIT_NOTE|DIRECT_EXPENSE)
↓
[document_type] (invoice, bank_statement, credit_note, direct_expense)
↓
[document_type]_data (current version of data)
↓
[document_type]_line (line items)
Links to uploaded documents:
extraction → extraction_document_page → document_page → document
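The naming pattern above can be expressed as a small lookup, which is handy when writing cleanup queries or assertions against seeded data. This is an illustrative sketch: the pairing of extraction types with table prefixes is inferred from the order in which they are listed above, and tableChain, DocumentKind, and the two constants are hypothetical names. The actual tables should be checked against the migrations.

```typescript
// Hypothetical sketch: derive table names from the schema pattern above.
type DocumentKind = "invoice" | "bank_statement" | "credit_note" | "direct_expense";

// Extraction type per document kind, paired by listing order (assumed).
const EXTRACTION_TYPE: Record<DocumentKind, string> = {
  invoice: "INVOICE",
  bank_statement: "BANK_STATEMENT",
  credit_note: "AR_CREDIT_NOTE",
  direct_expense: "DIRECT_EXPENSE",
};

// extraction -> [kind] -> [kind]_data -> [kind]_line
function tableChain(kind: DocumentKind): string[] {
  return ["extraction", kind, `${kind}_data`, `${kind}_line`];
}

// Tables linking an extraction back to its uploaded file.
const DOCUMENT_LINK_CHAIN = [
  "extraction",
  "extraction_document_page",
  "document_page",
  "document",
];

console.log(EXTRACTION_TYPE.credit_note, tableChain("credit_note"));
console.log(DOCUMENT_LINK_CHAIN.join(" -> "));
```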
Sample Files #
The tool uses real sample files from the repository:
- libs/python/bonsai-hinoki/hinoki/tests/data/invoice_abc_company.png
- libs/python/bonsai-hinoki/hinoki/tests/data/Sample Invoice.pdf
These files are uploaded to S3 and referenced by the created documents.
Common Use Cases #
Reset Database Completely #
# This will clear all data and recreate from scratch
pnpm seed -- --create-new --no-prompt
Add Data to Existing Entity #
# Interactive mode - select entity from list
pnpm seed
Process All Entities #
# Clear and reseed all existing entities
pnpm seed -- --process-all --no-prompt
Target Specific Entity #
# Seed a specific entity by ID
pnpm seed -- --entity-id <entity-uuid>
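The four flags used in these commands (--create-new, --process-all, --entity-id, --no-prompt) could be parsed as sketched below. The tool's real argument handling is not shown in this document, so parseSeedArgs and CliFlags are hypothetical, as is the choice to reject unknown arguments.

```typescript
// Hypothetical sketch: parse the seed tool's command-line flags.
interface CliFlags {
  createNew: boolean;
  processAll: boolean;
  noPrompt: boolean;
  entityId?: string;
}

function parseSeedArgs(argv: string[]): CliFlags {
  const flags: CliFlags = { createNew: false, processAll: false, noPrompt: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--":
        // Separator that a package manager may forward; ignore it.
        break;
      case "--create-new":
        flags.createNew = true;
        break;
      case "--process-all":
        flags.processAll = true;
        break;
      case "--no-prompt":
        flags.noPrompt = true;
        break;
      case "--entity-id": {
        const value = argv[i + 1];
        if (value === undefined || value.startsWith("--")) {
          throw new Error("--entity-id requires a value");
        }
        flags.entityId = value;
        i++; // skip the consumed value
        break;
      }
      default:
        throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return flags;
}

// In the real tool the input would likely be process.argv.slice(2).
console.log(parseSeedArgs(["--process-all", "--no-prompt"]));
```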
Troubleshooting #
Connection Issues #
If you see database connection errors:
- Ensure the database is running: mise run dev
- Check environment variables match your setup
- For Docker: use DB_HOST=database instead of localhost
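When debugging a connection error it is useful to print the settings the process actually sees, with the password masked. The helper below is a sketch for that purpose only. It assumes the values listed under Database Connection act as fallbacks, and describeDbConnection is a hypothetical name rather than part of the tool.

```typescript
// Hypothetical sketch: summarize effective DB settings without leaking secrets.
type Env = Record<string, string | undefined>;

function describeDbConnection(env: Env): string {
  // Fallbacks mirror the example values in Database Connection (assumed).
  const host = env.DB_HOST || "localhost";
  const port = env.DB_PORT || "5432";
  const user = env.DB_USER || "postgres";
  const name = env.DB_NAME || "bonsai";
  const password = env.DB_PASSWORD ? "password set" : "password not set";
  return `${user}@${host}:${port}/${name} (${password})`;
}

// In the real tool the argument would likely be process.env.
console.log(describeDbConnection({ DB_HOST: "database", DB_PASSWORD: "postgres" }));
```

Seeing localhost here while running inside Docker points directly at the DB_HOST fix listed above.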
S3 Upload Failures #
If documents fail to upload:
- Ensure LocalStack is running (part of mise run dev)
- Verify S3_ENDPOINT is correct
- Check that sample files exist in the expected location
TypeScript Errors #
If the build fails:
cd tools/seed
pnpm install
pnpm build
Out of Memory #
For very large datasets:
NODE_OPTIONS="--max-old-space-size=4096" pnpm seed
Development #
Adding New Document Types #
To add a new document type:
- Add the count field to types.ts:
    counts: {
      // ...
      newDocumentTypePerEntity: number;
    }
- Add environment variable to config.ts:
    newDocumentTypePerEntity: parseInt(
      process.env.SEED_NEW_DOCUMENT_TYPE_PER_ENTITY || "1000",
      10
    ),
- Create generator methods in data-generator.ts:
    generateNewDocumentType(entityId: string, extractionId: string) { ... }
    generateNewDocumentTypeData(...) { ... }
    generateNewDocumentTypeLine(...) { ... }
- Add creation method to data-seeder.ts:
    async createNewDocumentTypes(): Promise<void> { ... }
- Call the method in database-seeder.ts in all three locations:
  - After createCreditNotes() in the main flow
  - In the processAllEntities loop
  - In the single entity processing
- Update documentation in README.md and this file
Running Tests #
cd tools/seed
pnpm test
Code Style #
The project uses:
- ESLint for linting
- Prettier for formatting
- TypeScript strict mode
Format code:
pnpm format
Lint code:
pnpm lint
Best Practices #
- Start Small: Use small counts initially to verify everything works
- Use Environment Variables: Don’t hardcode counts in code
- Clear Before Reseed: The tool clears existing data by default
- Monitor Progress: Watch the console output for errors
- Test Incrementally: Test each document type independently
- Use mise Commands: Follow monorepo conventions with mise
Related Documentation #
- Tool README - Package-specific documentation
- Development Workflow - Overall development process
- Database Migrations - Database schema details
- E2E Testing - Using seeded data in tests