Database Seeding Tool #

The BonsAI database seeding tool is a TypeScript utility for populating the development database with realistic test data. It creates organizations, users, entities, and various document types to support development and testing workflows.

Overview #

The seed tool lives in tools/seed (relative to the monorepo root) and provides:

  • Automated creation of test organizations, users, and entities
  • Multiple document types: invoices, bank statements, credit notes, and direct expenses
  • Real file uploads to S3-compatible storage
  • Proper relationships between all records
  • Configurable data volumes via environment variables
  • Support for incremental seeding of existing entities

Quick Start #

For the Docker environment:

mise run seed-db

For local development:

mise run seed-db-local

Using pnpm #

From the monorepo root:

pnpm seed-db

From the seed package:

cd tools/seed
pnpm seed

Document Types #

The seed tool supports four main document extraction types:

1. Invoices #

Creates complete invoice records with:

  • Invoice header data (invoice number, date, due date, contact)
  • Line items (1-5 per invoice)
  • Random extraction statuses (PENDING, EXTRACTED, etc.)
  • Links to document pages

Environment Variable: SEED_INVOICES_PER_ENTITY (default: 1000)
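
As a rough illustration of the shape described above, the following sketch generates an invoice with 1-5 line items and a random status. It is not the tool's generator: the Invoice and InvoiceLine shapes, the generateInvoice name, and the invoice number format are hypothetical, and a small seeded PRNG stands in for Faker.js so the example has no dependencies.

```typescript
// Hypothetical shapes; the real records in data-generator.ts may differ.
type ExtractionStatus = "PENDING" | "EXTRACTED";

interface InvoiceLine {
  description: string;
  quantity: number;
  unitPrice: number;
}

interface Invoice {
  invoiceNumber: string;
  status: ExtractionStatus;
  lines: InvoiceLine[];
}

// Small deterministic PRNG (mulberry32-style) returning values in [0, 1).
// Used here instead of Faker.js so the sketch is reproducible and standalone.
function createRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Inclusive random integer in [min, max].
function randomInt(rand: () => number, min: number, max: number): number {
  return min + Math.floor(rand() * (max - min + 1));
}

function generateInvoice(rand: () => number, index: number): Invoice {
  const lineCount = randomInt(rand, 1, 5); // 1-5 line items per invoice
  const lines: InvoiceLine[] = [];
  for (let i = 0; i < lineCount; i++) {
    lines.push({
      description: `Item ${i + 1}`,
      quantity: randomInt(rand, 1, 10),
      unitPrice: randomInt(rand, 100, 10000) / 100, // 1.00 to 100.00
    });
  }
  return {
    // Zero-padded sequence number (format is an assumption).
    invoiceNumber: "INV-" + ("00000" + (index + 1)).slice(-5),
    status: rand() < 0.5 ? "PENDING" : "EXTRACTED",
    lines,
  };
}
```

Seeding the PRNG keeps repeated runs comparable, which can help when debugging a UI against a known dataset.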

2. Bank Statements #

Creates bank statement records with:

  • Account information (bank name, account number)
  • Transaction data (3-10 transactions per statement)
  • Currency and language information
  • Extraction metadata

Environment Variable: SEED_BANK_STATEMENTS_PER_ENTITY (default: 1000)

3. Credit Notes #

Creates credit note records with:

  • Credit note header data (number, date, contact, reason)
  • Line items (1-5 per note)
  • References to original invoices
  • Tax calculations

Environment Variable: SEED_CREDIT_NOTES_PER_ENTITY (default: 1000)

4. Direct Expenses #

Creates direct expense records with:

  • Expense header data (title, description, date, contact)
  • Line items (1-5 per expense)
  • Payment and spend type classification
  • Tax information

Environment Variable: SEED_DIRECT_EXPENSES_PER_ENTITY (default: 1000)

Configuration #

Environment Variables #

Database Connection #

DB_HOST=localhost          # Database host
DB_PORT=5432              # Database port
DB_USER=postgres          # Database user
DB_PASSWORD=postgres      # Database password
DB_NAME=bonsai            # Database name
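
A minimal sketch of how these variables could be read into a connection configuration. It assumes the values listed above act as defaults, which the tool may or may not do; loadDbConfig, toConnectionString, and the DbConfig shape are hypothetical names, not the tool's API.

```typescript
// Hypothetical helper types; not taken from the seed tool's source.
type Env = Record<string, string | undefined>;

interface DbConfig {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
}

// Reads the DB_* variables, falling back to the values shown above
// (assumed defaults) when a variable is unset.
function loadDbConfig(env: Env): DbConfig {
  const rawPort = env["DB_PORT"] ?? "5432";
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`Invalid DB_PORT: ${rawPort}`);
  }
  return {
    host: env["DB_HOST"] ?? "localhost",
    port,
    user: env["DB_USER"] ?? "postgres",
    password: env["DB_PASSWORD"] ?? "postgres",
    database: env["DB_NAME"] ?? "bonsai",
  };
}

// Builds a PostgreSQL connection URI, escaping credentials and database name.
function toConnectionString(config: DbConfig): string {
  const user = encodeURIComponent(config.user);
  const password = encodeURIComponent(config.password);
  const database = encodeURIComponent(config.database);
  return `postgres://${user}:${password}@${config.host}:${config.port}/${database}`;
}

// Usage (in Node.js): toConnectionString(loadDbConfig(process.env))
```

Passing the environment in as an argument, rather than reading process.env inside the function, keeps the helper easy to test with different settings such as DB_HOST=database for Docker.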

S3 Storage #

S3_ENDPOINT=http://localhost:4566  # S3 endpoint (LocalStack for dev)
AWS_REGION=us-east-1               # AWS region
AWS_ACCESS_KEY_ID=test             # AWS access key
AWS_SECRET_ACCESS_KEY=test         # AWS secret key
S3_BUCKET_NAME=bonsai-app          # S3 bucket name

Organization Setup #

SEED_ORGANIZATIONS=2        # Number of organizations to create
SEED_USERS_PER_ORG=3       # Users per organization
SEED_ENTITIES_PER_ORG=2    # Entities per organization

Document Volumes #

SEED_DOCS_PER_ENTITY=1000              # Base documents (S3 uploads)
SEED_INVOICES_PER_ENTITY=1000          # Invoice extractions
SEED_BANK_STATEMENTS_PER_ENTITY=1000   # Bank statement extractions
SEED_CREDIT_NOTES_PER_ENTITY=1000      # Credit note extractions
SEED_DIRECT_EXPENSES_PER_ENTITY=1000   # Direct expense extractions
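
The volume variables can be parsed with a small validating helper. The sketch below is an assumption about how such parsing might look, not the contents of config.ts: readCount, loadCounts, and the SeedCounts field names are hypothetical. It treats 0 as a valid count, which matters for the feature-focused runs that disable a document type.

```typescript
// Hypothetical helper types; field names are illustrative only.
type Env = Record<string, string | undefined>;

interface SeedCounts {
  docsPerEntity: number;
  invoicesPerEntity: number;
  bankStatementsPerEntity: number;
  creditNotesPerEntity: number;
  directExpensesPerEntity: number;
}

// Reads a non-negative integer from the environment.
// Unset or empty values fall back to the default; anything else must parse cleanly.
function readCount(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadCounts(env: Env): SeedCounts {
  return {
    docsPerEntity: readCount(env, "SEED_DOCS_PER_ENTITY", 1000),
    invoicesPerEntity: readCount(env, "SEED_INVOICES_PER_ENTITY", 1000),
    bankStatementsPerEntity: readCount(env, "SEED_BANK_STATEMENTS_PER_ENTITY", 1000),
    creditNotesPerEntity: readCount(env, "SEED_CREDIT_NOTES_PER_ENTITY", 1000),
    directExpensesPerEntity: readCount(env, "SEED_DIRECT_EXPENSES_PER_ENTITY", 1000),
  };
}

// Usage (in Node.js): const counts = loadCounts(process.env);
```

Failing fast on a malformed value is a design choice in this sketch; silently falling back to 1000 would hide typos and could trigger an unexpectedly large seeding run.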

Usage Patterns #

Creating Test Data for a Specific Feature #

When working on invoice-related features:

SEED_DOCS_PER_ENTITY=10 \
SEED_INVOICES_PER_ENTITY=10 \
SEED_BANK_STATEMENTS_PER_ENTITY=0 \
SEED_CREDIT_NOTES_PER_ENTITY=0 \
SEED_DIRECT_EXPENSES_PER_ENTITY=0 \
pnpm seed

Creating Balanced Test Data #

For general development with all document types:

SEED_DOCS_PER_ENTITY=20 \
SEED_INVOICES_PER_ENTITY=15 \
SEED_BANK_STATEMENTS_PER_ENTITY=10 \
SEED_CREDIT_NOTES_PER_ENTITY=5 \
SEED_DIRECT_EXPENSES_PER_ENTITY=10 \
pnpm seed

Performance Testing #

For large dataset testing:

SEED_DOCS_PER_ENTITY=5000 \
SEED_INVOICES_PER_ENTITY=2000 \
SEED_BANK_STATEMENTS_PER_ENTITY=1500 \
SEED_CREDIT_NOTES_PER_ENTITY=1000 \
SEED_DIRECT_EXPENSES_PER_ENTITY=1500 \
pnpm seed

Automated Seeding (CI/CD) #

Skip interactive prompts:

pnpm seed -- --no-prompt

Architecture #

Key Components #

  1. DatabaseSeeder (database-seeder.ts)
    • Main orchestrator
    • Handles workflow logic (create new vs. use existing entities)
    • Manages transaction rollback on errors
  2. DataSeeder (data-seeder.ts)
    • Creates actual database records
    • Implements methods for each entity type
    • Handles S3 file uploads
  3. DataGenerator (data-generator.ts)
    • Generates realistic fake data using Faker.js
    • Creates IDs, names, amounts, dates, etc.
    • Ensures data consistency
  4. EntityManager (entity-manager.ts)
    • Manages existing entities
    • Handles data cleanup
    • Provides entity selection prompts
  5. ReferenceDataManager (reference-data-manager.ts)
    • Loads reference data (countries, currencies, languages)
    • Caches reference data for performance
  6. S3FileManager (s3-file-manager.ts)
    • Uploads sample files to S3
    • Ensures bucket exists
    • Generates proper file paths
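
The rollback behavior attributed to DatabaseSeeder above follows a common pattern: run the work inside a transaction, commit on success, and roll back if anything throws. The sketch below illustrates that pattern only. The Transaction interface and runInTransaction are hypothetical and do not reflect the tool's actual database client or method names.

```typescript
// Hypothetical minimal transaction interface; a real client exposes more.
interface Transaction {
  commit(): Promise<void>;
  rollback(): Promise<void>;
}

// Runs `work` inside a transaction obtained from `begin`.
// Commits when `work` resolves; rolls back and rethrows when anything fails
// (including a failing commit).
async function runInTransaction<T>(
  begin: () => Promise<Transaction>,
  work: (tx: Transaction) => Promise<T>,
): Promise<T> {
  const tx = await begin();
  try {
    const result = await work(tx);
    await tx.commit();
    return result;
  } catch (error) {
    await tx.rollback();
    throw error;
  }
}
```

Rethrowing after the rollback lets the caller still see the original error, so a failed seeding run can exit with a non-zero status.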

Data Flow #

  1. Check for existing organizations/entities
  2. If none exist, create from scratch:
    • Organizations → Users → Entities → Documents → Document Types → Permissions
  3. If they exist, optionally:
    • Select specific entity
    • Process all entities
    • Clear and reseed
  4. For each entity:
    • Create base documents (uploaded to S3)
    • Create extractions for each document type
    • Create document type records (invoice, bank statement, etc.)
    • Create associated data and line items
    • Link documents via extraction_document_page

Database Schema #

Each document type follows a similar pattern:

extraction (type: INVOICE|BANK_STATEMENT|AR_CREDIT_NOTE|DIRECT_EXPENSE)
  ↓
[document_type] (invoice, bank_statement, credit_note, direct_expense)
  ↓
[document_type]_data (current version of data)
  ↓
[document_type]_line (line items)

Links to uploaded documents:

extraction → extraction_document_page → document_page → document
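
The naming pattern above can be captured as a lookup from extraction type to table names. This is a sketch for illustration: tablesFor and BASE_TABLE are hypothetical, and the pairing of each extraction type with its table (for example AR_CREDIT_NOTE with credit_note) is inferred from the order of the two lists in the diagram.

```typescript
type ExtractionType = "INVOICE" | "BANK_STATEMENT" | "AR_CREDIT_NOTE" | "DIRECT_EXPENSE";

// Base table per extraction type, paired by position in the diagram above (assumption).
const BASE_TABLE: Record<ExtractionType, string> = {
  INVOICE: "invoice",
  BANK_STATEMENT: "bank_statement",
  AR_CREDIT_NOTE: "credit_note",
  DIRECT_EXPENSE: "direct_expense",
};

interface DocumentTables {
  record: string; // [document_type]
  data: string;   // [document_type]_data
  line: string;   // [document_type]_line
}

// Derives the three table names that follow the [document_type] pattern.
function tablesFor(type: ExtractionType): DocumentTables {
  const base = BASE_TABLE[type];
  return { record: base, data: `${base}_data`, line: `${base}_line` };
}

// Example: tablesFor("INVOICE") yields invoice, invoice_data, invoice_line.
```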

Sample Files #

The tool uses real sample files from the repository:

  • libs/python/bonsai-hinoki/hinoki/tests/data/invoice_abc_company.png
  • libs/python/bonsai-hinoki/hinoki/tests/data/Sample Invoice.pdf

These files are uploaded to S3 and referenced by the created documents.

Common Use Cases #

Reset Database Completely #

# This will clear all data and recreate from scratch
pnpm seed -- --create-new --no-prompt

Add Data to Existing Entity #

# Interactive mode - select entity from list
pnpm seed

Process All Entities #

# Clear and reseed all existing entities
pnpm seed -- --process-all --no-prompt

Target Specific Entity #

# Seed a specific entity by ID
pnpm seed -- --entity-id <entity-uuid>
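
The flags used in this section could be parsed along the following lines. This is a sketch, not the tool's argument parser: parseSeedArgs and SeedCliOptions are hypothetical names, and rejecting unknown arguments is a design choice made here for illustration.

```typescript
// Hypothetical result shape for the flags shown above.
interface SeedCliOptions {
  noPrompt: boolean;
  createNew: boolean;
  processAll: boolean;
  entityId?: string;
}

function parseSeedArgs(argv: readonly string[]): SeedCliOptions {
  const options: SeedCliOptions = { noPrompt: false, createNew: false, processAll: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--no-prompt":
        options.noPrompt = true;
        break;
      case "--create-new":
        options.createNew = true;
        break;
      case "--process-all":
        options.processAll = true;
        break;
      case "--entity-id": {
        // The entity ID is the next argument and must not be another flag.
        const value = argv[i + 1];
        if (value === undefined || value.indexOf("--") === 0) {
          throw new Error("--entity-id requires a value");
        }
        options.entityId = value;
        i++; // skip the consumed value
        break;
      }
      default:
        throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return options;
}

// Usage (in Node.js): const options = parseSeedArgs(process.argv.slice(2));
```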

Troubleshooting #

Connection Issues #

If you see database connection errors:

  1. Ensure the database is running: mise run dev
  2. Check environment variables match your setup
  3. For Docker: use DB_HOST=database instead of localhost

S3 Upload Failures #

If documents fail to upload:

  1. Ensure LocalStack is running (part of mise run dev)
  2. Verify S3_ENDPOINT is correct
  3. Check that sample files exist in the expected location

TypeScript Errors #

If the build fails with TypeScript errors, reinstall dependencies and rebuild:

cd tools/seed
pnpm install
pnpm build

Out of Memory #

For very large datasets, raise the Node.js heap limit (here to 4096 MB):

NODE_OPTIONS="--max-old-space-size=4096" pnpm seed

Development #

Adding New Document Types #

To add a new document type:

  1. Add the count field to types.ts:

    counts: {
      // ...
      newDocumentTypePerEntity: number;
    }
    
  2. Add environment variable to config.ts:

    newDocumentTypePerEntity: parseInt(
      process.env.SEED_NEW_DOCUMENT_TYPE_PER_ENTITY || "1000",
      10
    ),
    
  3. Create generator methods in data-generator.ts:

    generateNewDocumentType(entityId: string, extractionId: string) { ... }
    generateNewDocumentTypeData(...) { ... }
    generateNewDocumentTypeLine(...) { ... }
    
  4. Add creation method to data-seeder.ts:

    async createNewDocumentTypes(): Promise<void> { ... }
    
  5. Call the method in database-seeder.ts in all three locations:

    • After createCreditNotes() in the main flow
    • In the processAllEntities loop
    • In the single entity processing
  6. Update documentation in README.md and this file

Running Tests #

cd tools/seed
pnpm test

Code Style #

The project uses:

  • ESLint for linting
  • Prettier for formatting
  • TypeScript strict mode

Format code:

pnpm format

Lint code:

pnpm lint

Best Practices #

  1. Start Small: Use small counts initially to verify everything works
  2. Use Environment Variables: Don’t hardcode counts in code
  3. Clear Before Reseed: The tool clears existing data by default, so do not run it against data you need to keep
  4. Monitor Progress: Watch the console output for errors
  5. Test Incrementally: Test each document type independently
  6. Use mise Commands: Prefer the mise tasks (mise run seed-db, mise run seed-db-local) to follow monorepo conventions