chinna vemareddy committed
Commit d56c6ae · 0 Parent(s)
.dockerignore ADDED
@@ -0,0 +1,26 @@
# Python
__pycache__/
*.pyc
*.pyo
.venv/
venv/

# Git
.git/

# Secrets
.env

# Generated / runtime files
uploads/
*.pdf
*.png
*.jpg
*.jpeg

# Streamlit
.streamlit/

# OS
.DS_Store
Thumbs.db
.gitignore ADDED
@@ -0,0 +1,98 @@
# ==============================
# Python
# ==============================
__pycache__/
*.py[cod]

# ==============================
# Virtual Environments
# ==============================
.venv/
venv/
env/
ENV/

# ==============================
# Environment / Secrets
# ==============================
.env
.env.*
!.env.example

# ==============================
# Docker
# ==============================
*.log
docker-compose.override.yml

# ==============================
# Build / Distribution
# ==============================
build/
dist/
*.egg-info/

# ==============================
# IDE / Editor
# ==============================
.vscode/
.idea/
*.swp
*.swo

# ==============================
# OS Files
# ==============================
.DS_Store
Thumbs.db

# ==============================
# Streamlit
# ==============================
.streamlit/

# ==============================
# Cache / Runtime Data
# ==============================
.cache/
logs/

# ==============================
# Project-specific
# ==============================
uploads/
*.png
*.jpg
*.jpeg
*.pdf

# ==============================
# HuggingFace / Transformers Cache
# ==============================
.huggingface/
.cache/huggingface/
.cache/torch/
.cache/transformers/

# ==============================
# Jupyter / Notebooks (if any)
# ==============================
.ipynb_checkpoints/

# ==============================
# Misc
# ==============================
*.tmp
*.bak
*.old
Dockerfile ADDED
@@ -0,0 +1,68 @@
# -----------------------------
# Base Image
# -----------------------------
FROM python:3.10

# -----------------------------
# Environment
# -----------------------------
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# -----------------------------
# System Dependencies
# -----------------------------
RUN apt-get update && apt-get install -y \
    build-essential \
    libgl1 \
    poppler-utils \
    curl \
    && rm -rf /var/lib/apt/lists/*

# -----------------------------
# Working Directory
# -----------------------------
WORKDIR /app

# -----------------------------
# Upgrade pip tools
# -----------------------------
RUN pip install --upgrade pip setuptools wheel

# -----------------------------
# Install PyTorch CPU FIRST
# -----------------------------
RUN pip install --no-cache-dir \
    torch==2.1.2+cpu \
    torchvision==0.16.2+cpu \
    torchaudio==2.1.2+cpu \
    --index-url https://download.pytorch.org/whl/cpu

# -----------------------------
# Copy requirements
# -----------------------------
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# -----------------------------
# Copy project files
# -----------------------------
COPY . .

# -----------------------------
# Runtime directories
# -----------------------------
RUN mkdir -p uploads/images

# -----------------------------
# Hugging Face PUBLIC PORT
# -----------------------------
EXPOSE 7860

# -----------------------------
# Start FastAPI (internal) + Streamlit (public)
# -----------------------------
CMD ["bash", "-c", "uvicorn src.main:app --host 0.0.0.0 --port 8000 & exec streamlit run frontend/app.py --server.port=7860 --server.address=0.0.0.0 --server.headless=true --server.enableXsrfProtection=false"]
README.md ADDED
@@ -0,0 +1,219 @@
---
title: DocVision IQ
emoji: 🏆
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# 📄 DocVision IQ

**Employee Name:** `<EmpID>`

---

## 1. Research Question / Hypothesis

### Research Question
Can a hybrid pipeline combining OCR, Vision Large Language Models (Vision LLMs), and visual cue detection accurately classify and understand real-world document images and PDFs in a production-style API system?

### Hypothesis
Integrating high-quality OCR (LlamaParse), Vision LLM reasoning, and logo/seal detection will improve document classification robustness compared to OCR-only or vision-only approaches, especially for visually distinctive documents.

---

## 2. Motivation and Relevance

Organizations handle large volumes of unstructured documents such as invoices, identity cards, certificates, and contracts. Traditional OCR-only systems struggle with:

- Diverse document layouts
- Poor scan quality
- Visual identifiers (logos, seals, emblems)
- Contextual ambiguity

**DocVision** addresses these challenges using a **multi-modal document understanding pipeline**, closely reflecting real-world enterprise document intelligence systems used in **fintech, compliance, onboarding, and automation workflows**.

---

## 3. System Architecture

DocVision is implemented as a **modular, production-style system** with clear separation of concerns.

### High-Level Architecture

```
User (UI / API)
       ↓
FastAPI Backend
       ↓
Validation & Hashing
       ↓
PDF → Image Conversion (PyMuPDF)
       ↓
OCR (LlamaParse)
       ↓
Vision LLM Classification (OpenRouter)
       ↓
Visual Cue Detection (Logos / Seals)
       ↓
Caching Layer
       ↓
Structured JSON Output
```

**Backend:** FastAPI (`src/main.py`)
**Frontend:** Streamlit (`frontend/app.py`)
**Deployment:** Docker
**Experiments:** Declarative YAML (`experiments/`)

---

## 4. Models and Versions Used

### OCR
- **LlamaParse (Llama Cloud Services)**
- Used for high-quality text extraction from images and PDFs

### Vision LLM
- **nvidia/nemotron-nano-12b-v2-vl** (via OpenRouter)
- Used for document classification and reasoning using combined text and image inputs

### Visual Cue Detection
- **ellabettison/Logo-Detection-finetune**
- Transformer-based object detection model for detecting logos and seals

---

## 5. Prompting and/or Fine-Tuning Strategy

- **Zero-shot prompting** (no fine-tuning)
- Carefully designed **instruction-based prompt** that:
  - Enforces strict JSON output
  - Prioritizes strong document-specific identifiers
  - Includes explicit classification constraints (e.g., Aadhaar rules)
  - Combines OCR text with image context

**Rationale:**
Zero-shot prompting ensures better generalization and aligns with real-world Vision LLM API usage without introducing dataset-specific bias.
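
The sketch below condenses how `src/vision.py` assembles the zero-shot request: the instruction prompt and OCR text travel in one text part, and the page image travels alongside as a base64 data URL. `classify` here is a simplified stand-in for the module's `classify_image`:

```python
import base64
import os

from openai import OpenAI

# Condensed from src/vision.py: the OpenAI SDK pointed at OpenRouter.
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.getenv("OPENAI_API_KEY"))

def classify(prompt: str, ocr_text: str, image_path: str) -> str:
    """Send instruction + OCR text + page image in a single user message."""
    with open(image_path, "rb") as f:
        image_base64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="nvidia/nemotron-nano-12b-v2-vl:free",
        temperature=0.1,  # low temperature keeps output near-deterministic
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + "\n\nOCR TEXT:\n" + ocr_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```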

---

## 6. Evaluation Protocol

Evaluation is performed using a combination of **automated** and **human** methods.

### Automated Evaluation
- JSON schema validation
- Rule-based checks (e.g., Aadhaar number presence; a sketch follows this list)
- Field extraction completeness
- End-to-end latency measurement

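As an illustration, a minimal rule-based check of the kind listed above might look like the following. This is a sketch, not code from the repository: the response shape matches the `/analyze` output in `src/main.py`, while `REQUIRED_KEYS` and `validate_result` are hypothetical names.

```python
import re

# Hypothetical helper (sketch only): validates one /analyze result dict.
REQUIRED_KEYS = {"document_type", "reasoning", "extracted_textfields"}

def validate_result(result: dict) -> list[str]:
    """Return a list of rule violations for a single classification result."""
    errors = []

    # Schema check: every required key must be present.
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")

    # Rule check mirroring the prompt constraint: an aadhaar_card
    # classification requires a 12-digit number among extracted fields.
    if result.get("document_type") == "aadhaar_card":
        fields = " ".join(str(v) for v in result.get("extracted_textfields", {}).values())
        if not re.search(r"\b\d{4}\s?\d{4}\s?\d{4}\b", fields):
            errors.append("aadhaar_card classified without a 12-digit number")

    return errors
```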

### Human Evaluation
- Manual inspection of document type correctness
- Assessment of reasoning quality and plausibility
- Evaluation of visual cue relevance

Experiments are defined declaratively in YAML files
(`experiments/exp_01.yaml`, `experiments/exp_02.yaml`) to ensure reproducibility.

---

## 7. Key Results

- Consistent document classification across common document categories
- Improved robustness when visual cue detection is enabled
- Stable performance on scanned images and PDFs
- Deterministic preprocessing and bounded runtime

(Refer to the **Experiments** and **Reproducibility Statement** for detailed analysis.)

---

## 8. Known Limitations and Ethical Considerations

### Limitations
- Performance degrades on extremely low-resolution or heavily occluded documents
- Dependence on external APIs for OCR and Vision LLM inference
- Field-level extraction accuracy is not benchmarked against labeled datasets

### Ethical Considerations
- Handles potentially sensitive personal documents
- No data is permanently stored beyond processing
- API keys are required and must be securely managed
- System outputs should not be used for identity verification without human review

---

## 9. Exact Instructions to Reproduce Results

### 9.1 Prerequisites
- Python 3.10+
- Docker installed
- Internet access (required for external APIs)

### 9.2 Environment Configuration

Create a `.env` file in the project root to securely store API credentials:

```env
LLAMA_API_KEY=llx-xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-or-xxxxxxxxxxxxxxxxxxxx
```

### 9.3 Project Structure

```text
DOCVISION_IQ/
├── experiments/
│   ├── exp_01.yaml
│   └── exp_02.yaml
├── frontend/
│   └── app.py
├── src/
│   ├── config.py
│   ├── main.py
│   ├── pdfconverter.py
│   ├── textextraction.py
│   ├── vision.py
│   └── visual_cues.py
├── uploads/
├── Dockerfile
├── project.yaml
├── reproducibility.md
├── requirements.txt
└── README.md
```

### 9.4 Docker Execution

Build the image, then run it with the ports declared in the `Dockerfile` (7860 for the public Streamlit UI, 8000 for the internal FastAPI service):

```bash
docker build -t docvision .
docker run -p 7860:7860 -p 8000:8000 --env-file .env docvision
```

### 9.5 Access

| Component | URL |
|-----------|-----|
| **Streamlit UI** | http://localhost:7860 |
| **FastAPI Docs** | http://localhost:8000/docs |
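
You can also call the backend directly. The snippet below mirrors the multipart upload performed by `frontend/app.py` (the file name `sample.png` is just a placeholder):

```python
import requests

# Upload one document to the /analyze endpoint, exactly as the
# Streamlit frontend does (the multipart field name must be "files").
with open("sample.png", "rb") as f:  # placeholder file name
    files = [("files", ("sample.png", f.read(), "image/png"))]

response = requests.post("http://localhost:8000/analyze", files=files, timeout=300)
response.raise_for_status()

for item in response.json():
    print(item["file"], "->", item.get("document_type"), item.get("error", ""))
```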

---

## 👨‍💻 Author

**DocVision**
An end-to-end AI-powered document understanding system built for
**real-world applications, interviews, and scalable deployments**.
experiments/exp_01.yaml ADDED
@@ -0,0 +1,18 @@
name: docvision_ocr_vision_baseline
task: document_classification
model: nvidia/nemotron-nano-12b-v2-vl:free

ocr_engine: llamaparse
use_visual_cues: false

max_pages: 1
image_resize: [1024, 1024]

temperature: 0.1
seed: 42

description: >
  Baseline DocVision experiment using OCR (LlamaParse) combined with
  a Vision LLM for document classification. Visual cue detection
  (logos/seals) is disabled to evaluate the contribution of textual
  and layout understanding alone.
experiments/exp_02.yaml ADDED
@@ -0,0 +1,20 @@
name: docvision_full_pipeline
task: document_classification
model: nvidia/nemotron-nano-12b-v2-vl:free

ocr_engine: llamaparse
use_visual_cues: true
logo_detection_model: ellabettison/Logo-Detection-finetune

max_pages: 1
max_logos_per_page: 4
image_resize: [1024, 1024]

temperature: 0.1
seed: 42

description: >
  Full DocVision pipeline experiment combining OCR, Vision LLM
  reasoning, and visual cue detection. Logos and seals extracted
  from documents are used to support document classification
  and improve robustness on visually distinctive documents.
frontend/app.py ADDED
@@ -0,0 +1,207 @@
import streamlit as st
import requests
import base64
from PIL import Image
import io
import os

# --------------------------------------------------
# INTERNAL API (HF SAFE)
# --------------------------------------------------
API_URL = "http://localhost:8000"

# --------------------------------------------------
# PAGE CONFIG
# --------------------------------------------------
st.set_page_config(
    page_title="DocVision IQ",
    layout="wide",
    initial_sidebar_state="collapsed"
)

# --------------------------------------------------
# LIMITS (MUST MATCH BACKEND)
# --------------------------------------------------
MAX_TOTAL_FILES = 5
MAX_IMAGES = 5
MAX_PDFS = 3

# --------------------------------------------------
# STYLES
# --------------------------------------------------
st.markdown("""
<style>
.stApp {
    background-color: #ffffff;
    color: #1a1a1a;
}
.file-header {
    background-color: #000000;
    color: #ffffff;
    padding: 12px 16px;
    border-radius: 10px;
    font-weight: bold;
    font-size: 16px;
    margin-bottom: 12px;
}
.section-gap {
    margin-bottom: 18px;
}
.logo-card {
    background-color: #fafafa;
    border-radius: 10px;
    padding: 12px;
    border: 1px solid #cccccc;
    text-align: center;
}
.confidence {
    color: #0b6623;
    font-size: 12px;
    margin-top: 6px;
}
</style>
""", unsafe_allow_html=True)

# --------------------------------------------------
# HEADER
# --------------------------------------------------
st.title("📄 DocVision IQ")
st.caption("AI-powered Document Understanding & Visual Cue Extraction")

# --------------------------------------------------
# FILE UPLOAD
# --------------------------------------------------
uploaded_files = st.file_uploader(
    "Upload Images or PDFs",
    type=["png", "jpg", "jpeg", "pdf"],
    accept_multiple_files=True
)

extract_visual = st.checkbox("🧿 Extract Visual Cues (Logos / Seals)")

# --------------------------------------------------
# CLIENT-SIDE VALIDATION
# --------------------------------------------------
if uploaded_files:
    total = len(uploaded_files)
    pdfs = sum(1 for f in uploaded_files if f.name.lower().endswith(".pdf"))
    images = total - pdfs

    if total > MAX_TOTAL_FILES:
        st.error(f"❌ Maximum {MAX_TOTAL_FILES} files allowed")
        st.stop()

    if pdfs > MAX_PDFS:
        st.error(f"❌ Maximum {MAX_PDFS} PDFs allowed")
        st.stop()

    if images > MAX_IMAGES:
        st.error(f"❌ Maximum {MAX_IMAGES} images allowed")
        st.stop()

    files = [
        ("files", (file.name, file.getvalue(), file.type))
        for file in uploaded_files
    ]

    # --------------------------------------------------
    # ANALYZE DOCUMENTS
    # --------------------------------------------------
    with st.spinner("🔍 Analyzing documents..."):
        response = requests.post(
            f"{API_URL}/analyze",
            files=files,
            timeout=300
        )

    if response.status_code != 200:
        try:
            st.error(response.json().get("message", "Analyze API failed"))
        except Exception:
            st.error("Analyze API failed")
        st.stop()

    analysis_data = response.json()

    # --------------------------------------------------
    # VISUAL CUES
    # --------------------------------------------------
    visual_map = {}

    if extract_visual:
        with st.spinner("🧿 Extracting visual cues..."):
            visual_response = requests.post(
                f"{API_URL}/visual_cues",
                files=files,
                timeout=300
            )

        if visual_response.status_code != 200:
            try:
                st.error(visual_response.json().get("message", "Visual cues API failed"))
            except Exception:
                st.error("Visual cues API failed")
            st.stop()

        visual_data = visual_response.json()
        for item in visual_data:
            visual_map[item["file"]] = item.get("visual_cues", [])

    # --------------------------------------------------
    # RENDER RESULTS
    # --------------------------------------------------
    for item in analysis_data:
        filename = item.get("file", "Unknown File")

        st.markdown(
            f"<div class='file-header'>📄 {filename}</div>",
            unsafe_allow_html=True
        )

        if "error" in item:
            st.error(item["error"])
            continue

        st.markdown(
            f"<div class='section-gap'><strong>📌 Document Type:</strong> {item.get('document_type')}</div>",
            unsafe_allow_html=True
        )

        st.markdown(
            f"<div class='section-gap'><strong>Reasoning:</strong><br>{item.get('reasoning')}</div>",
            unsafe_allow_html=True
        )

        st.markdown(
            "<div class='section-gap'><strong>Extracted Text Fields:</strong></div>",
            unsafe_allow_html=True
        )

        fields = item.get("extracted_textfields", {})
        if not fields:
            st.info("No text fields extracted")
        else:
            for k, v in fields.items():
                st.markdown(f"- **{k}**: {v}")

        if extract_visual and filename in visual_map:
            logos = []
            for page in visual_map[filename]:
                logos.extend(page.get("logos", []))

            if logos:
                cols = st.columns(min(len(logos), 4))
                for col, logo in zip(cols, logos):
                    with col:
                        img = Image.open(
                            io.BytesIO(base64.b64decode(logo["image_base64"]))
                        )
                        st.markdown("<div class='logo-card'>", unsafe_allow_html=True)
                        st.image(img)
                        st.markdown(
                            f"<div class='confidence'>Confidence: {logo['confidence']}</div>",
                            unsafe_allow_html=True
                        )
                        st.markdown("</div>", unsafe_allow_html=True)
            else:
                st.info("No visual cues found")
project.yaml ADDED
@@ -0,0 +1,30 @@
project_name: "DocVision"
employee_name: "<EmpID>"

models_used:
  - nvidia/nemotron-nano-12b-v2-vl
  - ellabettison/Logo-Detection-finetune
  - LlamaParse

data:
  - real_world_documents
  - scanned_images
  - pdf_documents

tools:
  - fastapi
  - streamlit
  - llamaparse
  - openrouter
  - transformers
  - pymupdf
  - pillow
  - docker

evaluation:
  - automated
  - human

compute:
  platform: cloud
  gpu: none
reproducibility.md ADDED
@@ -0,0 +1,76 @@
# Reproducibility Statement

This document describes the assumptions, constraints, and known sources of
variation affecting the reproducibility of the DocVision system.

---

## Hardware Assumptions

- CPU-only execution environment
- No GPU required
- Minimum 8 GB RAM recommended
- Stable internet connectivity required for external API access
- Tested on cloud-based Linux and local desktop environments

---

## Runtime Estimates

Approximate end-to-end runtime per document (single page):

| Pipeline Stage             | Avg Time (seconds) |
|----------------------------|--------------------|
| OCR (LlamaParse)           | 3.0 – 5.0          |
| Vision LLM Inference       | 1.5 – 3.5          |
| Logo Detection (CPU)       | < 10               |
| **Total (Full Pipeline)**  | **14.5 – 18.0**    |

Multi-page PDFs process only the first page by default, keeping runtime bounded.

Actual latency may vary depending on API load and network conditions.

---

## Random Seed Handling

- Temperature for Vision LLM inference is fixed at `0.1`
- Experiments declare a fixed seed value (`seed: 42`) for documentation purposes
- No stochastic sampling is intentionally introduced in the pipeline
- Preprocessing (PDF rendering, resizing, hashing) is deterministic; a sketch of the hashing step follows this list
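
The hashing step is what makes caching deterministic: identical bytes always map to the same cache key. A minimal sketch, matching `file_hash` and the cache-key construction in `src/main.py`:

```python
import hashlib

def file_hash(data: bytes) -> str:
    """Deterministic content hash used as part of the cache key."""
    return hashlib.md5(data).hexdigest()

# Re-uploading byte-identical files yields the same key, so cached
# OCR / classification results are reused instead of re-calling APIs.
data = b"%PDF-1.4 ..."  # raw uploaded bytes (illustrative placeholder)
cache_key = f"invoice.pdf_{file_hash(data)}"
```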
41
+
42
+ ---
43
+
44
+ ## Known Sources of Nondeterminism
45
+
46
+ Despite fixed configuration, some nondeterminism remains due to:
47
+
48
+ - Cloud-hosted Vision LLM inference
49
+ - OCR service variability across identical inputs
50
+ - Network latency fluctuations
51
+ - Concurrent execution order of asynchronous tasks
52
+
53
+ These factors may cause minor variations in extracted fields or reasoning text,
54
+ but document classification remains stable in most cases.
55
+
56
+ ---
57
+
58
+ ## Cost Considerations
59
+
60
+ - OCR and Vision inference rely on external APIs (LlamaParse, OpenRouter)
61
+ - Cost scales linearly with number of processed documents
62
+ - No GPU compute cost is incurred
63
+ - Logo detection and preprocessing run locally on CPU
64
+ - Caching mechanisms reduce redundant API calls for repeated inputs
65
+
66
+ Users should be aware of API usage limits and associated costs when processing
67
+ large batches of documents.
68
+
69
+ ---
70
+
71
+ ## Summary
72
+
73
+ DocVision prioritizes reproducibility through deterministic preprocessing,
74
+ fixed inference parameters, and declarative experiment definitions. Remaining
75
+ sources of nondeterminism stem primarily from external AI services rather than
76
+ internal system design.
requirements.txt ADDED
@@ -0,0 +1,34 @@
# -----------------------------
# Core Backend
# -----------------------------
fastapi==0.110.0
uvicorn==0.29.0
python-multipart==0.0.9
python-dotenv==1.0.1
requests==2.31.0

# -----------------------------
# Vision / OCR
# -----------------------------
pillow==10.1.0
pymupdf==1.23.8

# -----------------------------
# LlamaIndex
# -----------------------------
llama-index==0.12.4
llama-cloud-services==0.6.91

# -----------------------------
# OpenAI SDK
# -----------------------------
openai==1.59.7

# -----------------------------
# Transformers (CPU)
# -----------------------------
transformers==4.36.2

# -----------------------------
# Frontend
# -----------------------------
streamlit==1.31.1
src/config.py ADDED
@@ -0,0 +1,44 @@
# Maximum number of concurrent OCR requests
MAX_CONCURRENT_OCR: int = 5

# Vision LLM model name (OpenRouter)
VISION_MODEL_NAME: str = "nvidia/nemotron-nano-12b-v2-vl:free"

# Logo detection model
LOGO_DETECTION_MODEL: str = "ellabettison/Logo-Detection-finetune"

# PDF to image conversion settings
PDF_IMAGE_DPI: int = 302
PDF_IMAGE_BASE_DIR: str = "uploads/images"

# -------------------------------
# FILE UPLOAD LIMITS
# -------------------------------
UPLOAD_DIR: str = "uploads"

ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}

MAX_TOTAL_FILES: int = 5
MAX_PDFS: int = 3
MAX_IMAGES: int = 5

MAX_IMAGE_MB: int = 5
MAX_PDF_MB: int = 10

# -------------------------------
# IMAGE VALIDATION
# -------------------------------
MIN_WIDTH: int = 300
MIN_HEIGHT: int = 300
MAX_WIDTH: int = 6000
MAX_HEIGHT: int = 6000

# -------------------------------
# VISUAL CUES
# -------------------------------
MAX_VISUAL_PAGES: int = 1
MAX_LOGOS_PER_PAGE: int = 4
MAX_IMAGE_RESIZE = (1024, 1024)
src/main.py ADDED
@@ -0,0 +1,245 @@
import os
import io
import hashlib
import asyncio
from typing import List, Dict, Any

from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from PIL import Image

from src.pdfconverter import pdf_to_images
from src.vision import classify_image
from src.visual_cues import detect_logos_from_bytes
from src.config import (
    UPLOAD_DIR,
    ALLOWED_EXTENSIONS,
    MAX_TOTAL_FILES,
    MAX_PDFS,
    MAX_IMAGES,
    MAX_IMAGE_MB,
    MAX_PDF_MB,
    MIN_WIDTH,
    MIN_HEIGHT,
    MAX_WIDTH,
    MAX_HEIGHT,
    MAX_VISUAL_PAGES,
    MAX_LOGOS_PER_PAGE,
    MAX_IMAGE_RESIZE,
)


# --------------------------------------------------
# FASTAPI APPLICATION
# --------------------------------------------------
app = FastAPI(title="DocVision API")


# --------------------------------------------------
# CORS (REQUIRED FOR HUGGING FACE)
# --------------------------------------------------
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# --------------------------------------------------
# HEALTH CHECK
# --------------------------------------------------
@app.get("/")
def health() -> Dict[str, str]:
    """Health check endpoint for routing and monitoring."""
    return {"status": "ok"}


# --------------------------------------------------
# DIRECTORIES
# --------------------------------------------------
os.makedirs(UPLOAD_DIR, exist_ok=True)


# --------------------------------------------------
# IN-MEMORY CACHES
# --------------------------------------------------
TEXT_CACHE: Dict[str, Dict[str, Any]] = {}
VISUAL_CACHE: Dict[str, Dict[str, Any]] = {}


# --------------------------------------------------
# HELPER FUNCTIONS
# --------------------------------------------------
def file_hash(data: bytes) -> str:
    """Generate a deterministic hash for file contents."""
    return hashlib.md5(data).hexdigest()


def read_file(file: UploadFile) -> bytes:
    """Read file contents, then rewind the stream so it can be read again."""
    data = file.file.read()
    file.file.seek(0)
    return data


def validate_file(file: UploadFile, contents: bytes) -> str | None:
    """
    Validate file type, size, and image resolution.

    Returns an error message if invalid, otherwise None.
    """
    ext = os.path.splitext(file.filename)[1].lower()
    size_mb = len(contents) / (1024 * 1024)

    if ext not in ALLOWED_EXTENSIONS:
        return "Unsupported file format"

    if ext == ".pdf" and size_mb > MAX_PDF_MB:
        return f"PDF exceeds {MAX_PDF_MB} MB"

    if ext != ".pdf" and size_mb > MAX_IMAGE_MB:
        return f"Image exceeds {MAX_IMAGE_MB} MB"

    if ext != ".pdf":
        try:
            image = Image.open(io.BytesIO(contents))
            width, height = image.size

            if width < MIN_WIDTH or height < MIN_HEIGHT:
                return f"Image too small ({width}x{height})"

            if width > MAX_WIDTH or height > MAX_HEIGHT:
                return f"Image too large ({width}x{height})"

        except Exception:
            return "Invalid image file"

    return None


# --------------------------------------------------
# DOCUMENT ANALYSIS ENDPOINT
# --------------------------------------------------
@app.post("/analyze")
async def analyze(files: List[UploadFile] = File(...)) -> JSONResponse:
    """
    Perform OCR + Vision-based document classification.
    """
    if len(files) > MAX_TOTAL_FILES:
        return JSONResponse(
            {"error": f"Maximum {MAX_TOTAL_FILES} files allowed"},
            status_code=400,
        )

    pdf_count = sum(f.filename.lower().endswith(".pdf") for f in files)
    img_count = len(files) - pdf_count

    async def process_file(file: UploadFile) -> Dict[str, Any]:
        contents = read_file(file)
        fid = f"{file.filename}_{file_hash(contents)}"

        if file.filename.lower().endswith(".pdf") and pdf_count > MAX_PDFS:
            return {"file": file.filename, "error": f"Maximum {MAX_PDFS} PDFs allowed"}

        if not file.filename.lower().endswith(".pdf") and img_count > MAX_IMAGES:
            return {"file": file.filename, "error": f"Maximum {MAX_IMAGES} images allowed"}

        if fid in TEXT_CACHE:
            return TEXT_CACHE[fid]

        error = validate_file(file, contents)
        if error:
            return {"file": file.filename, "error": error}

        path = os.path.join(UPLOAD_DIR, file.filename)
        with open(path, "wb") as f:
            f.write(contents)

        try:
            if file.filename.lower().endswith(".pdf"):
                pdf_name = await asyncio.to_thread(pdf_to_images, path)
                base_dir = os.path.join("uploads", "images", pdf_name)
                first_page = sorted(os.listdir(base_dir))[0]
                analysis = await classify_image(os.path.join(base_dir, first_page))
            else:
                analysis = await classify_image(path)

            result = {
                "file": file.filename,
                "document_type": analysis.get("document_type"),
                "reasoning": analysis.get("reasoning"),
                "extracted_textfields": analysis.get("extracted_textfields", {}),
            }

            TEXT_CACHE[fid] = result
            return result

        except Exception as exc:
            return {"file": file.filename, "error": f"Processing failed: {exc}"}

    results = await asyncio.gather(*[process_file(f) for f in files])
    return JSONResponse(content=results)


# --------------------------------------------------
# VISUAL CUES ENDPOINT
# --------------------------------------------------
@app.post("/visual_cues")
async def visual_cues(files: List[UploadFile] = File(...)) -> JSONResponse:
    """
    Detect logos, seals, and visual symbols from documents.
    """

    async def process_visual(file: UploadFile) -> Dict[str, Any]:
        contents = read_file(file)
        fid = f"{file.filename}_{file_hash(contents)}"

        if fid in VISUAL_CACHE:
            return VISUAL_CACHE[fid]

        error = validate_file(file, contents)
        if error:
            return {"file": file.filename, "error": error}

        path = os.path.join(UPLOAD_DIR, file.filename)
        with open(path, "wb") as f:
            f.write(contents)

        visuals = []

        try:
            if file.filename.lower().endswith(".pdf"):
                pdf_name = await asyncio.to_thread(pdf_to_images, path)
                base_dir = os.path.join("uploads", "images", pdf_name)

                for img_name in sorted(os.listdir(base_dir))[:MAX_VISUAL_PAGES]:
                    with open(os.path.join(base_dir, img_name), "rb") as img_file:
                        logos = await asyncio.to_thread(
                            detect_logos_from_bytes,
                            img_file.read(),
                            MAX_IMAGE_RESIZE,
                            MAX_LOGOS_PER_PAGE,
                        )
                    visuals.append({"page": img_name, "logos": logos})
            else:
                logos = await asyncio.to_thread(
                    detect_logos_from_bytes,
                    contents,
                    MAX_IMAGE_RESIZE,
                    MAX_LOGOS_PER_PAGE,
                )
                visuals.append({"page": "image", "logos": logos})

            result = {"file": file.filename, "visual_cues": visuals}
            VISUAL_CACHE[fid] = result
            return result

        except Exception as exc:
            return {"file": file.filename, "error": f"Visual processing failed: {exc}"}

    results = await asyncio.gather(*[process_visual(f) for f in files])
    return JSONResponse(content=results)
src/pdfconverter.py ADDED
@@ -0,0 +1,63 @@
import os
from typing import Optional

import fitz  # PyMuPDF

from src.config import PDF_IMAGE_DPI, PDF_IMAGE_BASE_DIR


# --------------------------------------------------
# PDF TO IMAGE CONVERSION
# --------------------------------------------------
def pdf_to_images(
    pdf_path: str,
    base_dir: Optional[str] = None
) -> str:
    """
    Convert a multi-page PDF into individual PNG images.

    Each page of the PDF is rendered at a fixed DPI and
    saved as a separate image file inside a directory
    named after the PDF.

    Parameters
    ----------
    pdf_path : str
        Path to the input PDF file.
    base_dir : str, optional
        Base directory where page images will be stored.
        Defaults to the configured PDF_IMAGE_BASE_DIR.

    Returns
    -------
    str
        Name of the PDF file (without extension), used
        as the output folder name.
    """

    # Resolve base output directory
    output_base: str = base_dir or PDF_IMAGE_BASE_DIR

    # Extract PDF name (without extension)
    pdf_name: str = os.path.splitext(os.path.basename(pdf_path))[0]

    # Create output directory for this PDF
    output_dir: str = os.path.join(output_base, pdf_name)
    os.makedirs(output_dir, exist_ok=True)

    # Open PDF document
    document = fitz.open(pdf_path)

    # Render each page as a high-resolution PNG image
    for page_index, page in enumerate(document, start=1):
        pixmap = page.get_pixmap(dpi=PDF_IMAGE_DPI)
        pixmap.save(
            os.path.join(output_dir, f"page_{page_index}.png")
        )

    # Close document to release resources
    document.close()

    return pdf_name
src/textextraction.py ADDED
@@ -0,0 +1,77 @@
import os
import asyncio
from typing import Dict

from llama_cloud_services import LlamaParse
from llama_index.core import SimpleDirectoryReader

from src.config import MAX_CONCURRENT_OCR

from dotenv import load_dotenv

load_dotenv()

# --------------------------------------------------
# CONCURRENCY CONTROL
# --------------------------------------------------
# Limits the number of simultaneous OCR operations
ocr_semaphore: asyncio.Semaphore = asyncio.Semaphore(
    MAX_CONCURRENT_OCR
)


# --------------------------------------------------
# OCR PARSER INITIALIZATION
# --------------------------------------------------
# The API key is read from the LLAMA_API_KEY environment variable
parser: LlamaParse = LlamaParse(
    api_key=os.getenv("LLAMA_API_KEY"),
    result_type="text"
)


# --------------------------------------------------
# FILE EXTENSION HANDLERS
# --------------------------------------------------
file_extractor: Dict[str, LlamaParse] = {
    ".jpg": parser,
    ".jpeg": parser,
    ".png": parser,
    ".pdf": parser,
}


# --------------------------------------------------
# ASYNCHRONOUS OCR EXTRACTION
# --------------------------------------------------
async def extract_text_from_image_async(file_path: str) -> str:
    """
    Extract text asynchronously from an image or PDF using LlamaParse.

    Concurrency is limited using a semaphore to prevent excessive
    parallel OCR requests.

    Parameters
    ----------
    file_path : str
        Path to the image or PDF file.

    Returns
    -------
    str
        Extracted text content, or an empty string on failure.
    """
    async with ocr_semaphore:
        try:
            documents = await asyncio.to_thread(
                lambda: SimpleDirectoryReader(
                    input_files=[file_path],
                    file_extractor=file_extractor
                ).load_data()
            )

            return "\n".join(doc.text for doc in documents).strip()

        except Exception as exc:
            print(f"OCR failed for {file_path}: {exc}")
            return ""
src/vision.py ADDED
@@ -0,0 +1,195 @@
import os
import base64
import json
import re
from typing import Dict, Any

from dotenv import load_dotenv
from openai import OpenAI

from src.textextraction import extract_text_from_image_async
from src.config import VISION_MODEL_NAME

# --------------------------------------------------
# ENVIRONMENT SETUP
# --------------------------------------------------
# Load environment variables for API keys
load_dotenv()


# --------------------------------------------------
# OPENROUTER CLIENT INITIALIZATION
# --------------------------------------------------
client: OpenAI = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENAI_API_KEY")
)


# --------------------------------------------------
# STRICT JSON PROMPT (UNCHANGED ACROSS EXPERIMENTS)
# --------------------------------------------------
PROMPT = """
You are an intelligent document understanding system.
**Prompt:** You are an advanced document classification AI tasked with accurately identifying the specific type of document presented to you. Your objective is to analyze the visual layout and textual content of the document while adhering to the following guidelines:
1. **Visual Layout Analysis**: Examine the structural elements of the document such as logos, headers, footers, and any unique formatting that may indicate the document type. Pay attention to layout patterns that are characteristic of each document type.
2. **Textual Evidence Extraction**: Extract and analyze textual information from the document. Look for key phrases, terms, and identifiers that are strongly associated with each document type. This includes: - For passports: Look for terms like "Passport", "Nationality", "Date of Birth", and country-specific formatting. - For aadhaar cards: Identify "Aadhaar Number", "Biometric Data", and any unique UIDAI branding. - For pan cards: Search for "Permanent Account Number" and tax-related keywords. - For contracts: Identify terms like "Agreement", "Parties", "Terms and Conditions". - For invoices: Look for "Invoice Number", "Billing Address", "Total Amount".
3. **Prioritization of Strong Identifiers**: Focus on strong, document-specific identifiers rather than generic text. If there is ambiguity or multiple potential matches, prioritize the identifiers that are most definitive and unique to the document type.
4. **Avoiding Guesses**: Do not make assumptions or guesses if the evidence is insufficient. If the document does not clearly fit into one of the specified categories based on the analysis, return a response indicating that the document type cannot be determined with confidence.
5. **Output Format**: Provide your classification result in the following format: - "Document Type: [identified type]".

Important: Your analysis must be thorough and based on concrete evidence from the document's content and layout.
Note that there is a difference between an invoice and a receipt, and similar document pairs.

Extract each and every piece of information from the document,
and provide correct field names and values.

Important: Never classify a document as aadhaar_card unless a clear 12-digit Aadhaar number OR UIDAI reference is present.
Extract every important piece of information as fields.

Provide a brief 2-to-3-line reasoning explaining how the document type was determined, highlighting the key evidence.

Base the document type decision on that reasoning.

OUTPUT FORMAT (JSON – STRICT):

Return VALID JSON ONLY.
Do not include any text outside JSON.
Do not include markdown or code blocks.

{
  "document_type": "<Document Type>",
  "reasoning": "<brief 2 to 3 lines explanation of how the document type was determined, highlighting the key visual or textual features that influenced the decision>",
  "extracted_textfields": {
    "<field_name>": "<value>",
    "<field_name>": "<value>"
  }
}
"""


# --------------------------------------------------
# HELPER FUNCTIONS
# --------------------------------------------------
def image_to_base64(path: str) -> str:
    """
    Convert an image file to a base64-encoded string.

    Parameters
    ----------
    path : str
        Path to the image file.

    Returns
    -------
    str
        Base64-encoded image content.
    """
    with open(path, "rb") as file:
        return base64.b64encode(file.read()).decode()


def extract_json_from_text(text: str) -> Dict[str, Any]:
    """
    Safely extract JSON content from model output.

    This function handles cases where the model may
    accidentally include extra text around the JSON.

    Parameters
    ----------
    text : str
        Raw model output.

    Returns
    -------
    dict
        Parsed JSON object.

    Raises
    ------
    ValueError
        If valid JSON cannot be extracted.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{[\s\S]*\}", text)
        if match:
            return json.loads(match.group())
        raise ValueError("LLM did not return valid JSON")


# --------------------------------------------------
# ASYNC DOCUMENT CLASSIFICATION
# --------------------------------------------------
async def classify_image(image_path: str) -> Dict[str, Any]:
    """
    Perform document classification using OCR + Vision LLM.

    Steps:
    1. Extract text using OCR
    2. Encode image as base64
    3. Send text + image to Vision LLM
    4. Parse and normalize JSON output

    Parameters
    ----------
    image_path : str
        Path to the image or PDF page.

    Returns
    -------
    dict
        Structured classification result with document type,
        reasoning, and extracted fields.
    """
    # OCR extraction
    ocr_text: str = await extract_text_from_image_async(image_path)

    # Image encoding
    image_base64: str = image_to_base64(image_path)

    # Vision LLM request
    response = client.chat.completions.create(
        model=VISION_MODEL_NAME,
        temperature=0.1,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": PROMPT + "\n\nOCR TEXT:\n" + ocr_text
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_base64}"
                        }
                    }
                ]
            }
        ]
    )

    raw_output: str = response.choices[0].message.content.strip()

    # Safe JSON parsing
    try:
        result: Dict[str, Any] = extract_json_from_text(raw_output)
    except Exception:
        result = {
            "document_type": "unknown",
            "reasoning": "Model output could not be parsed as JSON",
            "extracted_textfields": {}
        }

    # Ensure required keys are always present
    return {
        "document_type": result.get("document_type", "unknown"),
        "reasoning": result.get("reasoning", ""),
        "extracted_textfields": result.get("extracted_textfields", {}),
    }
src/visual_cues.py ADDED
@@ -0,0 +1,89 @@
import io
import base64
from typing import List, Dict, Tuple

from PIL import Image
from transformers import pipeline

from src.config import LOGO_DETECTION_MODEL


# --------------------------------------------------
# MODEL INITIALIZATION (LOAD ONCE)
# --------------------------------------------------
# Object detection pipeline for logo / seal detection
detector = pipeline(
    task="object-detection",
    model=LOGO_DETECTION_MODEL,
    device=-1  # CPU
)


# --------------------------------------------------
# LOGO DETECTION
# --------------------------------------------------
def detect_logos_from_bytes(
    image_bytes: bytes,
    resize: Tuple[int, int] = (1024, 1024),
    max_logos: int = 3
) -> List[Dict[str, str | float]]:
    """
    Detect logos or visual emblems from raw image bytes.

    The function resizes the image for faster inference,
    detects logo regions, crops them, and returns the
    cropped logo images encoded in base64 along with
    confidence scores.

    Parameters
    ----------
    image_bytes : bytes
        Raw image data.
    resize : tuple[int, int], optional
        Maximum image size for inference (default: 1024x1024).
    max_logos : int, optional
        Maximum number of detected logos to return.

    Returns
    -------
    list[dict]
        List of detected logos with:
        - confidence: float
        - image_base64: str
    """

    # Load image from bytes
    image: Image.Image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    # Resize image for performance optimization
    image.thumbnail(resize)

    # Run object detection
    detections = detector(image)

    results: List[Dict[str, str | float]] = []

    # Process top detections only
    for det in detections[:max_logos]:
        box = det["box"]
        score: float = float(det["score"])

        xmin: int = int(box["xmin"])
        ymin: int = int(box["ymin"])
        xmax: int = int(box["xmax"])
        ymax: int = int(box["ymax"])

        # Crop detected logo region
        cropped = image.crop((xmin, ymin, xmax, ymax))

        # Convert cropped logo to base64
        buffer = io.BytesIO()
        cropped.save(buffer, format="PNG")

        results.append({
            "confidence": round(score, 3),
            "image_base64": base64.b64encode(buffer.getvalue()).decode()
        })

    return results