---
title: DocVision IQ
emoji: πŸ†
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

πŸ“„ DocVision IQ

Employee Name: <EmpID>


1. Research Question / Hypothesis

Research Question

Can a hybrid pipeline combining OCR, Vision Large Language Models (Vision LLMs), and visual cue detection accurately classify and understand real-world document images and PDFs in a production-style API system?

Hypothesis

Integrating high-quality OCR (LlamaParse), Vision LLM reasoning, and logo/seal detection will improve document classification robustness compared to OCR-only or vision-only approaches, especially for visually distinctive documents.


2. Motivation and Relevance

Organizations handle large volumes of unstructured documents such as invoices, identity cards, certificates, and contracts. Traditional OCR-only systems struggle with:

  • Diverse document layouts
  • Poor scan quality
  • Visual identifiers (logos, seals, emblems)
  • Contextual ambiguity

DocVision addresses these challenges using a multi-modal document understanding pipeline, closely reflecting real-world enterprise document intelligence systems used in fintech, compliance, onboarding, and automation workflows.


3. System Architecture

DocVision is implemented as a modular, production-style system with clear separation of concerns.

High-Level Architecture


User (UI / API)
↓
FastAPI Backend
↓
Validation & Hashing
↓
PDF β†’ Image Conversion (PyMuPDF)
↓
OCR (LlamaParse)
↓
Vision LLM Classification (OpenRouter)
↓
Visual Cue Detection (Logos / Seals)
↓
Caching Layer
↓
Structured JSON Output
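
To make the early stages concrete, here is a minimal sketch of the hashing and PDF-to-image steps, assuming PyMuPDF (the fitz module) and a SHA-256 cache key; the function names are illustrative, not the actual src/ implementation.

import hashlib
from pathlib import Path

import fitz  # PyMuPDF

def cache_key(path: str) -> str:
    # Hash the raw bytes so identical uploads can be served from the cache.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def pdf_to_images(path: str, out_dir: str, dpi: int = 200) -> list[str]:
    # Render each PDF page to a PNG for the OCR and vision stages.
    pages = []
    with fitz.open(path) as doc:
        for i, page in enumerate(doc):
            png = f"{out_dir}/page_{i}.png"
            page.get_pixmap(dpi=dpi).save(png)
            pages.append(png)
    return pages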

Backend: FastAPI (src/main.py)
Frontend: Streamlit (frontend/app.py)
Deployment: Docker
Experiments: Declarative YAML (experiments/)
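
The glue between these components can be pictured as a single FastAPI route. The skeleton below is a hypothetical sketch; the route name and the stub are placeholders, not the real src/main.py.

from fastapi import FastAPI, UploadFile

app = FastAPI(title="DocVision IQ")

def run_pipeline(raw: bytes) -> dict:
    # Stub standing in for hashing, PDF conversion, OCR, Vision LLM
    # classification, and visual cue detection (see the diagram above);
    # the real logic lives across the src/ modules.
    return {"document_type": "unknown", "fields": {}, "visual_cues": []}

@app.post("/classify")  # hypothetical route; check /docs for the real path
async def classify(file: UploadFile) -> dict:
    return run_pipeline(await file.read())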


4. Models and Versions Used

OCR

  • LlamaParse (Llama Cloud Services)
    • Used for high-quality text extraction from images and PDFs
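
A minimal sketch of calling LlamaParse from Python, assuming the llama-parse client package and the LLAMA_API_KEY from .env (the file path is a placeholder):

import os

from llama_parse import LlamaParse

parser = LlamaParse(api_key=os.environ["LLAMA_API_KEY"], result_type="text")
docs = parser.load_data("uploads/sample.pdf")  # placeholder path
ocr_text = "\n".join(d.text for d in docs)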

Vision LLM

  • nvidia/nemotron-nano-12b-v2-vl (via OpenRouter)
    • Used for document classification and reasoning using combined text and image inputs
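
OpenRouter exposes an OpenAI-compatible API, so a call combining OCR text with a page image might look like the sketch below; the prompt text and image path are illustrative, not the project's src/vision.py.

import base64
import os

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENAI_API_KEY"])  # the sk-or-... key

with open("uploads/page_0.png", "rb") as f:  # placeholder page image
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify this document. OCR text: ..."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)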

Visual Cue Detection

  • ellabettison/Logo-Detection-finetune
    • Transformer-based object detection model for detecting logos and seals
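
Assuming the checkpoint is compatible with the Transformers object-detection pipeline, cue detection can be sketched as follows (the score threshold is an illustrative choice):

from transformers import pipeline

detector = pipeline("object-detection",
                    model="ellabettison/Logo-Detection-finetune")
# Keep only confident detections.
cues = [d for d in detector("uploads/page_0.png") if d["score"] > 0.5]
for cue in cues:
    print(cue["label"], cue["score"], cue["box"])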

5. Prompting and / or Fine-Tuning Strategy

  • Zero-shot prompting (no fine-tuning)
  • Carefully designed instruction-based prompt (sketched below) that:
    • Enforces strict JSON output
    • Prioritizes strong document-specific identifiers
    • Includes explicit classification constraints (e.g., Aadhaar rules)
    • Combines OCR text with image context

Rationale:
Zero-shot prompting generalizes across unseen document types and mirrors real-world Vision LLM API usage, without introducing dataset-specific bias through fine-tuning.
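
For illustration, a condensed version of what such a prompt might look like; this is a paraphrase of the design goals above, not the exact production prompt.

PROMPT = """You are a document classification assistant.
Using the OCR text and the attached page image, respond with ONLY valid JSON:
{"document_type": "...", "confidence": 0.0, "reasoning": "..."}

Rules:
- Prioritize strong document-specific identifiers (logos, seals, ID numbers).
- Classify as "aadhaar" only when a 12-digit Aadhaar number is present.
- Output nothing outside the JSON object.

OCR text:
<OCR text inserted here>
"""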


6. Evaluation Protocol

Evaluation is performed using a combination of automated and human methods.

Automated Evaluation

  • JSON schema validation
  • Rule-based checks (e.g., Aadhaar number presence; see the sketch after this list)
  • Field extraction completeness
  • End-to-end latency measurement
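
The automated checks can be pictured as a small validator like the sketch below; the expected fields and the Aadhaar rule are illustrative, not the exact schema used in the experiments.

import json
import re

def validate_output(raw: str) -> list[str]:
    # Returns a list of rule violations; an empty list means the output passed.
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    errors = [f"missing field: {k}"
              for k in ("document_type", "reasoning") if k not in out]
    if out.get("document_type") == "aadhaar":
        number = str(out.get("fields", {}).get("aadhaar_number", ""))
        if not re.fullmatch(r"\d{4}\s?\d{4}\s?\d{4}", number.strip()):
            errors.append("Aadhaar claimed but no 12-digit number extracted")
    return errors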

Human Evaluation

  • Manual inspection of document type correctness
  • Assessment of reasoning quality and plausibility
  • Evaluation of visual cue relevance

Experiments are defined declaratively in YAML files
(experiments/exp_01.yaml, experiments/exp_02.yaml) to ensure reproducibility.
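
Loading an experiment config is then a one-liner; the keys inside exp_01.yaml are whatever the experiment defines, so this sketch only shows the mechanics:

import yaml

with open("experiments/exp_01.yaml") as f:
    config = yaml.safe_load(f)
print(config)  # dict of experiment parameters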


7. Key Results

  • Consistent document classification across common document categories
  • Improved robustness when visual cue detection is enabled
  • Stable performance on scanned images and PDFs
  • Deterministic preprocessing and bounded runtime

(Refer to the Experiments and Reproducibility Statement for detailed analysis.)


8. Known Limitations and Ethical Considerations

Limitations

  • Performance degrades on extremely low-resolution or heavily occluded documents
  • Dependence on external APIs for OCR and Vision LLM inference
  • Field-level extraction accuracy is not benchmarked against labeled datasets

Ethical Considerations

  • Handles potentially sensitive personal documents
  • No data is permanently stored beyond processing
  • API keys are required and must be securely managed
  • System outputs should not be used for identity verification without human review

9. Exact Instructions to Reproduce Results

9.1 Prerequisites

  • Python 3.10+
  • Docker installed
  • Internet access (required for external APIs)

9.2 Environment Configuration

Create a .env file in the project root to securely store API credentials:

LLAMA_API_KEY=llx-xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-or-xxxxxxxxxxxxxxxxxxxx
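
A sketch of how src/config.py might read these values using python-dotenv (the actual module may differ):

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

LLAMA_API_KEY = os.environ["LLAMA_API_KEY"]    # LlamaParse (OCR)
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # OpenRouter (Vision LLM)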

9.3 Project Structure

DOCVISION_IQ/
β”‚
β”œβ”€β”€ experiments/
β”‚   β”œβ”€β”€ exp_01.yaml
β”‚   └── exp_02.yaml
β”‚
β”œβ”€β”€ frontend/
β”‚   └── app.py
β”‚
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 01_exploration.ipynb
β”‚   └── 02_evalution.ipynb
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ config.py
β”‚   β”œβ”€β”€ main.py
β”‚   β”œβ”€β”€ pdfconverter.py
β”‚   β”œβ”€β”€ textextraction.py
β”‚   β”œβ”€β”€ vision.py
β”‚   └── visual_cues.py
β”‚
β”œβ”€β”€ uploads/
β”‚
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ project.yaml
β”œβ”€β”€ reproducibility.md
β”œβ”€β”€ requirements.txt
└── README.md

9.4 Docker Execution

Build the Docker image and run the container:

docker build -t docvision .
docker run -p 8000:8000 -p 8501:8501 --env-file .env docvision

9.5 Access

Component       URL
Streamlit UI    http://localhost:8501
FastAPI Docs    http://localhost:8000/docs
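
As a quick smoke test once the container is running (the route name here is a guess; open http://localhost:8000/docs to find the actual upload endpoint):

import requests

with open("sample.pdf", "rb") as f:  # placeholder document
    resp = requests.post("http://localhost:8000/classify",  # hypothetical route
                         files={"file": f})
print(resp.status_code, resp.json())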

πŸ‘¨β€πŸ’» Author

DocVision
An end-to-end AI-powered document understanding system built for
real-world applications, interviews, and scalable deployments.