Hinoki ML Platform Overview #

Hinoki is the machine learning platform that powers the document intelligence capabilities of the BonsAI system.

Architecture #

Hinoki has a modular architecture made up of five components:

  • Document Processor: Handles document preprocessing and optimization
  • OCR Engine: Extracts text from images
  • ML Models: Various models for entity extraction and document understanding
  • Training Pipeline: Infrastructure for training and improving models
  • Inference API: Serves models for real-time inference
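
As an illustration of how modular components like these can be chained, here is a minimal sketch in Python. It is not Hinoki's actual code: the `Document` class, the stage functions, and `run_pipeline` are hypothetical names, and each stage body is a placeholder for the real component.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container passed from one stage to the next."""
    image: bytes
    text: str = ""
    entities: Dict[str, str] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


# Every stage takes a Document and returns a Document, so stages are swappable.
Stage = Callable[[Document], Document]


def document_processor(doc: Document) -> Document:
    # Placeholder for preprocessing and optimization of the raw input.
    doc.trace.append("document_processor")
    return doc


def ocr_engine(doc: Document) -> Document:
    # Placeholder OCR: pretends the image bytes are already UTF-8 text.
    doc.text = doc.image.decode("utf-8", errors="ignore")
    doc.trace.append("ocr_engine")
    return doc


def entity_model(doc: Document) -> Document:
    # Placeholder model: treats "key: value" lines as extracted entities.
    for line in doc.text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            doc.entities[key.strip()] = value.strip()
    doc.trace.append("entity_model")
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Apply the stages in order, feeding each output into the next stage.
    for stage in stages:
        doc = stage(doc)
    return doc


PIPELINE: List[Stage] = [document_processor, ocr_engine, entity_model]
```

Calling `run_pipeline(Document(image=b"Invoice No: 42"), PIPELINE)` yields a document whose `entities` is `{"Invoice No": "42"}`. Because every stage shares one signature, a component can be replaced without touching the others.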

Core Capabilities #

Hinoki provides the following core capabilities:

Document Understanding #

  • Document classification
  • Layout analysis
  • Table detection and extraction
  • Form field detection

Text Processing #

  • Named entity recognition
  • Key-value pair extraction
  • Contextual entity linking
  • Semantic understanding

Computer Vision #

  • Image preprocessing
  • Document normalization
  • Visual feature extraction
  • Object detection within documents

Machine Learning Models #

Hinoki uses a variety of machine learning models:

  • Classification Models: Determine document types and categories
  • Segmentation Models: Identify regions and layouts within documents
  • NER Models: Extract named entities from text
  • OCR Post-processing Models: Correct OCR errors and improve text quality
  • Relationship Models: Connect related entities within documents
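
To make the OCR post-processing item concrete, the sketch below corrects letters that OCR commonly confuses with digits. Hinoki's own correction method is not described here and is presumably model-based; this rule-based version is a stand-in, and `DIGIT_LOOKALIKES`, `NUMERIC_LIKE`, and `fix_numeric_tokens` are hypothetical names.

```python
import re

# Letters that OCR output often contains in place of digits (assumed mapping).
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# A token is treated as numeric if it has at least one real digit and consists
# only of digits, lookalike letters, and common numeric separators.
NUMERIC_LIKE = re.compile(r"^(?=.*\d)[\dOolISB.,/-]+$")


def fix_numeric_tokens(text: str) -> str:
    """Replace digit lookalikes, but only inside tokens that look numeric."""
    fixed = []
    for token in text.split(" "):
        if NUMERIC_LIKE.match(token):
            token = token.translate(DIGIT_LOOKALIKES)
        fixed.append(token)
    return " ".join(fixed)
```

For example, `fix_numeric_tokens("Total 1O5.OO USD")` returns `"Total 105.00 USD"`, leaving `Total` and `USD` untouched because they contain no digit. A rule like this also rewrites legitimate tokens such as `B2B`, which is one reason a learned model is usually preferred over fixed rules.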

Technology Stack #

Hinoki is built using:

  • Model Training: PyTorch, TensorFlow
  • Model Serving: ONNX Runtime, TensorRT
  • Data Storage: S3, PostgreSQL
  • Orchestration: Kubernetes
  • Languages: Python, Rust

Development Workflow #

The Hinoki development workflow follows six steps:

  1. Data Collection: Gathering and annotating training data
  2. Model Training: Training and fine-tuning models
  3. Evaluation: Assessing model performance
  4. Deployment: Deploying models to production
  5. Monitoring: Tracking model performance in production
  6. Retraining: Continuously improving models with new data
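
The evaluation and deployment steps above are often connected by a gate that promotes a retrained model only if it beats the one in production. The following is a minimal sketch of that idea, assuming entity-level precision, recall, and F1 as the metric. `EvalResult`, `evaluate`, `should_deploy`, and the `min_gain` threshold are hypothetical and not taken from Hinoki.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    precision: float
    recall: float

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall; 0.0 when both are zero.
        total = self.precision + self.recall
        return 0.0 if total == 0 else 2 * self.precision * self.recall / total


def evaluate(predicted: List[str], expected: List[str]) -> EvalResult:
    """Score predicted entities against the annotated (expected) entities."""
    pred, gold = set(predicted), set(expected)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return EvalResult(precision, recall)


def should_deploy(candidate: EvalResult, production: EvalResult,
                  min_gain: float = 0.01) -> bool:
    # Promote the candidate only if it improves F1 by at least min_gain.
    return candidate.f1 >= production.f1 + min_gain
```

The `min_gain` margin keeps a model from being swapped for an improvement that is within evaluation noise. Its value here is arbitrary and would need tuning against the size of the evaluation set.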

Integration #

Hinoki integrates with the rest of the BonsAI platform through:

  • API Integration: REST APIs for model inference
  • Event-Based Processing: Processing documents via message queues
  • Batch Processing: Handling bulk document processing jobs
  • Interactive Corrections: Learning from user corrections
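
As a rough illustration of event-based processing, the sketch below consumes JSON messages from a queue and hands each one to an inference handler. It uses Python's in-process `queue.Queue` as a stand-in for a real message broker. The message shape (`document_id`), `run_inference`, and `consume` are assumptions rather than Hinoki's actual interface.

```python
import json
import queue
from typing import Callable, Dict, List, Optional

STOP = None  # sentinel message that tells the worker to exit


def run_inference(payload: Dict) -> Dict:
    # Stand-in for a call to the inference API.
    return {"document_id": payload["document_id"], "status": "processed"}


def consume(events: "queue.Queue[Optional[str]]",
            handler: Callable[[Dict], Dict]) -> List[Dict]:
    """Process messages until the STOP sentinel arrives."""
    results: List[Dict] = []
    while True:
        message = events.get()
        if message is STOP:
            break
        try:
            results.append(handler(json.loads(message)))
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # A malformed message is recorded instead of crashing the worker.
            results.append({"status": "rejected", "reason": type(exc).__name__})
    return results
```

The same handler can serve a batch job by iterating over a list of payloads instead of reading from a queue, so the event-based and batch paths share one code path.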

Performance Monitoring #

Hinoki includes tools for monitoring model performance:

  • Accuracy Metrics: Tracking extraction accuracy
  • Latency Monitoring: Measuring processing time
  • Error Analysis: Identifying and categorizing errors
  • User Feedback Loop: Incorporating user corrections
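
Two of these metrics are simple to state precisely. The sketch below computes field-level extraction accuracy against user-confirmed values and a nearest-rank latency percentile. Which definitions Hinoki uses is not stated here, so both functions and their names are illustrative assumptions.

```python
import math
from typing import Dict, Sequence


def field_accuracy(predictions: Sequence[Dict[str, str]],
                   confirmed: Sequence[Dict[str, str]]) -> float:
    """Fraction of confirmed fields whose predicted value matches exactly."""
    total = 0
    correct = 0
    for predicted, truth in zip(predictions, confirmed):
        for name, value in truth.items():
            total += 1
            if predicted.get(name) == value:
                correct += 1
    return correct / total if total else 0.0


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

With latencies of `[120, 80, 95, 400, 110]` milliseconds, `percentile(..., 50)` returns `110` and `percentile(..., 95)` returns `400`. The single slow request shows up in the tail percentile but would be hidden by the median alone.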

Data Security #

Hinoki implements several measures to ensure data security:

  • Data Encryption: Encryption of data at rest and in transit
  • Access Control: Fine-grained access control for models and data
  • Data Retention: Policies for data deletion and retention
  • Audit Logging: Tracking all access to sensitive data
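
As one way to picture audit logging combined with access control, the sketch below wraps a data-access function so that every call, whether allowed or denied, is recorded. It is a hypothetical example: `AUDIT_LOG`, `PERMISSIONS`, `audited`, and `read_document` are invented for illustration, and an in-memory list stands in for a real append-only audit store.

```python
import functools
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set

AUDIT_LOG: List[str] = []  # stand-in for an append-only audit sink
PERMISSIONS: Dict[str, Set[str]] = {"alice": {"doc-1"}}  # hypothetical grants


def audited(action: str) -> Callable:
    """Record who did what to which resource, including failed attempts."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(user: str, resource: str, *args, **kwargs):
            entry = {
                "time": datetime.now(timezone.utc).isoformat(),
                "user": user,
                "action": action,
                "resource": resource,
                "outcome": "error",
            }
            try:
                result = func(user, resource, *args, **kwargs)
                entry["outcome"] = "allowed"
                return result
            except PermissionError:
                entry["outcome"] = "denied"
                raise
            finally:
                # Runs on success and on failure, so no access goes unlogged.
                AUDIT_LOG.append(json.dumps(entry))
        return wrapper
    return decorator


@audited("read_document")
def read_document(user: str, resource: str) -> str:
    if resource not in PERMISSIONS.get(user, set()):
        raise PermissionError(f"{user} may not read {resource}")
    return f"contents of {resource}"
```

Writing the entry in a `finally` block means denied and failed attempts are logged as well as successful reads. Those attempts are usually the entries an audit is looking for.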

Future Roadmap #

Planned improvements to the Hinoki platform include:

  • Enhanced multilingual support
  • Improvements to handwritten text recognition
  • More sophisticated table extraction
  • Self-supervised learning capabilities
  • Model customization for specific document types