System Architecture Overview
============================

Introduction
------------

Architectural Diagram (AS-IS)
-----------------------------

.. mermaid::

   graph TB
       subgraph "Web Layer"
           A["FastAPI REST API"]
           C["Middleware"]
           B["API Routers"]
       end

       subgraph "Service Layer"
           D["Resource Services"]
           E["Engine Services"]
           F["Shared Services"]
           G["Bulk Services"]
       end

       subgraph "Asynchronous Tasks"
           N[("Redis Broker")]
           M["Celery Workers"]
       end

       subgraph "AI Services"
           H["Haystack Pipelines"]
           HE["Haystack Enterprise"]
       end

       subgraph "Data Layer"
           L["Repositories"]
           K["SQLAlchemy ORM"]
           O[("PostgreSQL")]
       end

       Q["LLM APIs"]
       MR["Model Registry"]
       CL["EU Cellar"]

       %% Web Layer Flow
       A -->|"Request"| C
       C -->|"OIDC/JWT Auth"| B
       B -->|"Resource Ops"| D
       B -->|"Pipeline Ops"| E
       B -->|"Shared Ops"| F
       B -->|"Bulk Ops"| G

       %% Engine dispatches pipeline as task
       E -->|"Execute Pipeline"| N

       %% Broker delivers to workers
       N <-->|"Broker"| M

       %% Worker decides execution path
       M -->|"Execute Locally"| H
       M -->|"Deploy"| HE

       %% LLM calls via Haystack or Haystack Enterprise
       H -->|"LLM Calls"| Q
       HE -->|"LLM Calls"| Q

       %% Model Registry - bidirectional with both Haystack and Haystack Enterprise
       H <-->|"Register/Retrieve"| MR
       HE <-->|"Register/Retrieve"| MR

       %% Cellar - read only via SPARQL and REST
       D -->|"SPARQL/REST"| CL

       %% Data Access
       D -->|"CRUD"| L
       E -->|"CRUD"| L
       F -->|"CRUD"| L
       G -->|"CRUD"| L
       L -->|"ORM"| K
       K -->|"Query"| O

       %% Styling
       style O fill:#e1f5ff,stroke:#333
       style K fill:#fff4e1,stroke:#333
       style Q fill:#f0f8ff,stroke:#333
       style N fill:#ffe0b2,stroke:#333
       style H fill:#e8f5e9,stroke:#333
       style HE fill:#bbdefb,stroke:#333
       style MR fill:#f3e5f5,stroke:#333
       style CL fill:#f0fff0,stroke:#333

Target Architectural Diagram (TO-BE)
------------------------------------

.. mermaid::

   graph TB
       subgraph "Web Layer"
           A["FastAPI REST API"]
           C["Middleware"]
           B["API Routers"]
       end

       subgraph "Service Layer"
           D["Resource Services"]
           E["Engine Services"]
           F["Shared Services"]
           G["Bulk Services"]
       end

       subgraph "Data Layer"
           RR["Resource Repositories"]
           VTS[("Vector-Capable Triple Store")]
           SR["Repositories"]
           K["SQLAlchemy ORM"]
           O[("PostgreSQL")]
       end

       subgraph "Asynchronous Tasks"
           N[("Redis Broker")]
           M["Celery Workers"]
       end

       subgraph "AI Services"
           H["Haystack Pipelines"]
           HE["Haystack Enterprise"]
       end

       CL["EU Cellar"]
       Q["LLM APIs"]
       MR["Model Registry"]

       %% Web Layer Flow
       A -->|"Request"| C
       C -->|"OIDC/JWT Auth"| B
       B -->|"Resource Ops"| D
       B -->|"Pipeline Ops"| E
       B -->|"Shared Ops"| F
       B -->|"Bulk Ops"| G

       %% Data Access - Resource Services -> Resource Repositories -> Vector Triple Store -> Cellar
       D -->|"CRUD"| RR
       RR -->|"SPARQL"| VTS
       VTS <-->|"REST/SPARQL"| CL

       %% Data Access - Engine & Shared -> Repositories -> PostgreSQL
       E -->|"CRUD"| SR
       F -->|"CRUD"| SR
       G -->|"CRUD"| SR
       SR -->|"ORM"| K
       K -->|"Query"| O

       %% Engine dispatches pipeline as task
       E -->|"Execute Pipeline"| N

       %% Broker delivers to workers
       N <-->|"Broker"| M

       %% Worker decides execution path
       M -->|"Execute Locally"| H
       M -->|"Deploy"| HE

       %% LLM calls via Haystack or Haystack Enterprise
       H -->|"LLM Calls"| Q
       HE -->|"LLM Calls"| Q

       %% Model Registry - bidirectional with both Haystack and Haystack Enterprise
       H <-->|"Register/Retrieve"| MR
       HE <-->|"Register/Retrieve"| MR

       %% AI Services -> Vector-capable triple store
       H -->|"SPARQL/Vector Search"| VTS
       HE -->|"SPARQL/Vector Search"| VTS

       %% Styling
       style O fill:#e1f5ff,stroke:#333
       style VTS fill:#e8f5e9,stroke:#333
       style K fill:#fff4e1,stroke:#333
       style Q fill:#f0f8ff,stroke:#333
       style HE fill:#bbdefb,stroke:#333
       style N fill:#ffe0b2,stroke:#333
       style H fill:#e8f5e9,stroke:#333
       style MR fill:#f3e5f5,stroke:#333
       style CL fill:#f0fff0,stroke:#333
       style RR fill:#fce4ec,stroke:#333
       style SR fill:#ede7f6,stroke:#333

Layer Descriptions
------------------

Web Layer
~~~~~~~~~

**Responsibility**: HTTP request handling, response formatting, authentication

**Components**:

- ``api.py`` - FastAPI application initialization
- ``routers/`` - Endpoint definitions organized by domain
- ``dependencies.py`` - Dependency injection (auth, database sessions)
- ``schemas/`` - Domain-organized Pydantic request/response models
- Middleware - CORS, security headers, logging

Service Layer
~~~~~~~~~~~~~

**Responsibility**: Business logic, orchestration, data validation

**Components**:

Resource Services (``services/resources/``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``document_collection_service.py`` - Document collection from EU Cellar
- ``document_parsing_service.py`` - Document parsing with tulit
- ``document_metadata_service.py`` - SPARQL-based document discovery, metadata enrichment, and CELEX metadata extraction
- ``document_conversion.py`` - Document format conversion
- ``document_utils.py`` - Document processing utilities
- ``legal_resource_service.py`` - Legal resource CRUD
- ``provision_service.py`` - Legal provision CRUD
- ``classification_service.py`` - Legal Provision Classification CRUD
- ``analysis_service.py`` - Analysis CRUD
- ``category_service.py`` - Category CRUD
- ``statement_service.py`` - Statement generation
- ``exceptions.py`` - Resource service exceptions

Engine Services (``services/engine/``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``pipeline_service.py`` - Pipeline lookup, orchestration and execution
- ``token_usage_service.py`` - LLM token usage tracking
- ``embedding_service.py`` - Document embedding generation
- ``haystack_enterprise_service.py`` - Haystack Enterprise API integration
- ``exceptions.py`` - Engine service exceptions

Haystack Integration (``services/engine/haystack/``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``components/`` - Haystack components used in pipelines (retrievers, classifiers, parsers, custom processors)

  - ``base.py`` - Base component class
  - ``dependency_parsing_classifier.py`` - Dependency parsing classifier
  - ``json_extractor.py`` - JSON extractor
  - ``llm_analysis_parser.py`` - LLM analysis parser
  - ``llm_classification_parser.py`` - LLM classification parser
  - ``model_loader.py`` - Model loader
  - ``single_classifier.py`` - Single classifier

- ``streaming.py`` - Utilities for streaming Haystack pipeline results

All the workflows are implemented as Haystack pipelines composed from the above components and orchestrated by ``pipeline_service``. This consolidates Haystack-specific logic under ``services/engine/haystack/`` while keeping orchestration and execution concerns in ``services/engine``.

Shared Services (``services/shared/``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``user_service.py`` - User CRUD
- ``refresh_token_service.py`` - Token lifecycle management
- ``statistics_service.py`` - System statistics and dashboards
- ``token_usage_service.py`` - LLM token usage tracking
- ``task_service.py`` - Task lifecycle management
- ``mlflow_service.py`` - MLflow tracking and GitLab Model Registry integration
- ``classifier_discovery.py`` - Classifier discovery from registry and local filesystem
- ``model_cache.py`` - Model loading with LRU cache and registry integration

Bulk Services (``services/bulk/``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``orchestrator.py`` - Bulk operation orchestration (create, execute, update, delete)
- ``types.py`` - Bulk operation type definitions
- ``resource_handlers/`` - Resource-specific bulk handlers

  - ``analysis.py`` - Analysis bulk operations
  - ``base.py`` - Base bulk handler class
  - ``classification.py`` - Classification bulk operations
  - ``legal_resource.py`` - Legal resource bulk operations

Data Layer
~~~~~~~~~~

**Responsibility**: Data access, ORM mapping, database operations

**Components**:

- ``db/models/`` - SQLAlchemy models organized by domain
- ``db/repositories/`` - Data access patterns
- ``db/migrations/`` - Alembic migration scripts
- ``db/database.py`` - Database connection and session management
- ``db/base.py`` - SQLAlchemy base model and type exports

Task Queue & Workers
~~~~~~~~~~~~~~~~~~~~

**Responsibility**: Asynchronous job processing, background tasks

**Components**:

- ``tasks/celery_worker.py`` - Celery application configuration
- ``tasks/tasks.py`` - Task definitions
- ``tasks/handler.py`` - Task execution handlers
- ``tasks/factory.py`` - Task factory pattern
- ``tasks/types.py`` - Task status and record types

Authentication & Security
~~~~~~~~~~~~~~~~~~~~~~~~~

**Responsibility**: User authentication, authorization, security

**Components**:

- ``auth/security.py`` - JWT token generation/validation
- ``auth/backends/`` - Authentication backends (JWT, OIDC)
- ``auth/providers/`` - Identity providers (standard, EULogin)

Utilities Layer
~~~~~~~~~~~~~~~

**Responsibility**: Cross-cutting concerns, helper functions

**Components**:

- ``utils/sparql_utils.py`` - SPARQL query execution
- ``utils/refresh_token_utils.py`` - Token utilities
- ``utils/serialization.py`` - Serialization helpers
- ``utils/utils.py`` - General utilities

Technology Integration Points
-----------------------------

External APIs
~~~~~~~~~~~~~

- **OpenAI-compatible API**: LLM-based text analysis and annotation
- **SPARQL Endpoints**: Knowledge graph queries (e.g., Cellar)

Databases
~~~~~~~~~

- **PostgreSQL**: Primary data store

Message Queues
~~~~~~~~~~~~~~

- **Celery + Redis**: Asynchronous task processing

Configuration Management
------------------------

Configuration is managed through:

1. **Environment Variables** (``.env`` file)
2. **Config JSON** (``config.json`` for paths)
3. **Database Configuration** (Alembic for migrations)

Logging & Monitoring
--------------------

Logging
~~~~~~~

- Structured logging to ``logs/ai4drpm.log``
- Log rotation (via ``logrotate.conf``)
- Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL

Deployment Architecture
-----------------------

Docker Compose Deployment
~~~~~~~~~~~~~~~~~~~~~~~~~

- Multi-container setup
- Separate containers for: API, Worker, PostgreSQL, Redis
- Volume mounts for persistence
- Network isolation
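
As a closing illustration of the layering described under Layer Descriptions (routers call services, services call repositories), the flow can be sketched in plain Python. This is a minimal sketch and not the project's code: ``LegalResource``, ``InMemoryLegalResourceRepository``, ``LegalResourceService``, ``ResourceNotFound`` and ``get_legal_resource_handler`` are hypothetical names, and an in-memory dictionary stands in for SQLAlchemy and PostgreSQL.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class LegalResource:
    """Hypothetical domain object (assumption; real models live in db/models/)."""
    celex_id: str
    title: str


class LegalResourceRepository(Protocol):
    """Data-layer contract: the service depends on this, not on the ORM."""

    def get(self, celex_id: str) -> Optional[LegalResource]: ...

    def add(self, resource: LegalResource) -> None: ...


class InMemoryLegalResourceRepository:
    """Stand-in for a SQLAlchemy-backed repository (assumption)."""

    def __init__(self) -> None:
        self._rows: dict[str, LegalResource] = {}

    def get(self, celex_id: str) -> Optional[LegalResource]:
        return self._rows.get(celex_id)

    def add(self, resource: LegalResource) -> None:
        self._rows[resource.celex_id] = resource


class ResourceNotFound(Exception):
    """Hypothetical service-layer exception."""


class LegalResourceService:
    """Service layer: validation and orchestration, no HTTP or SQL details."""

    def __init__(self, repository: LegalResourceRepository) -> None:
        self._repository = repository

    def create(self, celex_id: str, title: str) -> LegalResource:
        if not celex_id.strip():
            raise ValueError("celex_id must not be empty")
        resource = LegalResource(celex_id=celex_id.strip(), title=title.strip())
        self._repository.add(resource)
        return resource

    def get(self, celex_id: str) -> LegalResource:
        resource = self._repository.get(celex_id)
        if resource is None:
            raise ResourceNotFound(celex_id)
        return resource


def get_legal_resource_handler(
    service: LegalResourceService, celex_id: str
) -> tuple[int, dict]:
    """Web-layer stand-in: map service results to a status code and body."""
    try:
        resource = service.get(celex_id)
    except ResourceNotFound:
        return 404, {"detail": f"Legal resource {celex_id} not found"}
    return 200, {"celex_id": resource.celex_id, "title": resource.title}
```

Because the service only sees the repository protocol, the same service could in principle be exercised against an in-memory store in tests and against a database-backed repository in production.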
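
The split described under Configuration Management (environment variables for deployment settings, ``config.json`` for paths) could be combined into one settings object roughly as follows. This is a sketch under stated assumptions: the variable names ``DATABASE_URL`` and ``REDIS_URL``, the ``log_path`` key, the default Redis URL and the ``Settings`` and ``load_settings`` names are all hypothetical, and the ``.env`` file is assumed to have been loaded into the process environment already (for example by Docker Compose).

```python
import json
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names are assumptions."""
    database_url: str
    redis_url: str
    log_path: Path


def load_settings(env: Mapping[str, str], config_file: Path) -> Settings:
    """Merge environment variables with path settings from a JSON file."""
    # Paths come from the JSON file; a missing file falls back to defaults.
    paths = {}
    if config_file.exists():
        paths = json.loads(config_file.read_text(encoding="utf-8"))

    # Fail early with a clear message if a required variable is absent.
    try:
        database_url = env["DATABASE_URL"]
    except KeyError as exc:
        raise RuntimeError("DATABASE_URL must be set") from exc

    return Settings(
        database_url=database_url,
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        log_path=Path(paths.get("log_path", "logs/ai4drpm.log")),
    )


if __name__ == "__main__":
    print(load_settings(os.environ, Path("config.json")))
```

Passing the environment in as a mapping, rather than reading ``os.environ`` inside the function, keeps the loader easy to test with a plain dictionary.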
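
The logging setup listed under Logging (a structured log file rotated externally by ``logrotate``) might look like the sketch below. The JSON line format, the logger name ``ai4drpm`` and the ``JsonFormatter`` and ``configure_logging`` helpers are assumptions made for illustration. ``WatchedFileHandler`` from the standard library is used because it reopens the file after an external tool has rotated it; it is intended for Unix-like systems.

```python
import json
import logging
from logging.handlers import WatchedFileHandler
from pathlib import Path


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (assumed format)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(log_path, level: int = logging.INFO) -> logging.Logger:
    """Attach a JSON file handler to a named logger and return it."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    # WatchedFileHandler notices when logrotate moves the file and reopens it.
    handler = WatchedFileHandler(path, encoding="utf-8")
    handler.setFormatter(JsonFormatter())

    logger = logging.getLogger("ai4drpm")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = configure_logging("logs/ai4drpm.log")
    log.info("API started")
```

Records below the configured level are dropped by the logger, so with the default ``INFO`` level a ``DEBUG`` message never reaches the file.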