---
title: DocVision IQ
emoji: 📄
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# 📄 DocVision IQ
Employee Name: <EmpID>
## 1. Research Question / Hypothesis

### Research Question
Can a hybrid pipeline combining OCR, Vision Large Language Models (Vision LLMs), and visual cue detection accurately classify and understand real-world document images and PDFs in a production-style API system?
### Hypothesis
Integrating high-quality OCR (LlamaParse), Vision LLM reasoning, and logo/seal detection will improve document classification robustness compared to OCR-only or vision-only approaches, especially for visually distinctive documents.
## 2. Motivation and Relevance
Organizations handle large volumes of unstructured documents such as invoices, identity cards, certificates, and contracts. Traditional OCR-only systems struggle with:
- Diverse document layouts
- Poor scan quality
- Visual identifiers (logos, seals, emblems)
- Contextual ambiguity
DocVision addresses these challenges using a multi-modal document understanding pipeline, closely reflecting real-world enterprise document intelligence systems used in fintech, compliance, onboarding, and automation workflows.
## 3. System Architecture
DocVision is implemented as a modular, production-style system with clear separation of concerns.
### High-Level Architecture
```
User (UI / API)
      ↓
FastAPI Backend
      ↓
Validation & Hashing
      ↓
PDF → Image Conversion (PyMuPDF)
      ↓
OCR (LlamaParse)
      ↓
Vision LLM Classification (OpenRouter)
      ↓
Visual Cue Detection (Logos / Seals)
      ↓
Caching Layer
      ↓
Structured JSON Output
```
- **Backend:** FastAPI (`src/main.py`)
- **Frontend:** Streamlit (`frontend/app.py`)
- **Deployment:** Docker
- **Experiments:** Declarative YAML (`experiments/`)
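
The flow above maps onto a single FastAPI endpoint. A minimal sketch follows, assuming a hypothetical `run_pipeline` helper that stands in for the OCR, Vision LLM, and visual-cue stages; the real orchestration in `src/main.py` may differ:

```python
# Minimal sketch of the request flow; run_pipeline is a hypothetical
# stand-in for the OCR -> Vision LLM -> visual-cue stages in src/.
import hashlib

from fastapi import FastAPI, UploadFile

app = FastAPI(title="DocVision IQ")

_cache: dict[str, dict] = {}  # simple in-memory cache keyed by content hash


def run_pipeline(data: bytes) -> dict:
    """Placeholder for PDF conversion, OCR, visual cues, and classification."""
    return {"document_type": "unknown", "key_fields": {}, "visual_cues": []}


@app.post("/analyze")
async def analyze(file: UploadFile) -> dict:
    data = await file.read()

    # Validation & hashing stage: the digest doubles as the cache key,
    # so re-uploads of the same document skip the expensive stages.
    doc_hash = hashlib.sha256(data).hexdigest()
    if doc_hash not in _cache:
        _cache[doc_hash] = run_pipeline(data)
    return _cache[doc_hash]
```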
## 4. Models and Versions Used

### OCR
- LlamaParse (Llama Cloud Services)
- Used for high-quality text extraction from images and PDFs
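
For illustration, a typical LlamaParse call looks like the sketch below; the exact options used in `src/textextraction.py` may differ:

```python
# Typical LlamaParse usage; reads the key from the environment as in .env.
import os

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key=os.environ["LLAMA_API_KEY"],
    result_type="text",  # "markdown" is also supported
)

# load_data accepts image and PDF paths and returns parsed documents.
documents = parser.load_data("sample_invoice.pdf")
ocr_text = "\n".join(doc.text for doc in documents)
print(ocr_text[:500])
```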
### Vision LLM
- `nvidia/nemotron-nano-12b-v2-vl` (via OpenRouter)
- Used for document classification and reasoning using combined text and image inputs
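
OpenRouter exposes an OpenAI-compatible API, so a minimal classification call can be sketched as below. The prompt and file names are placeholders; the production prompt in `src/vision.py` is more elaborate:

```python
# Sketch of a multimodal call via OpenRouter's OpenAI-compatible endpoint.
import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENAI_API_KEY"],  # OpenRouter key (sk-or-...)
)

ocr_text = "INVOICE NO: 2024-001 ..."  # placeholder OCR output
with open("page_1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Classify this document.\n\nOCR text:\n" + ocr_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```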
### Visual Cue Detection
- `ellabettison/Logo-Detection-finetune`
- Transformer-based object detection model for detecting logos and seals
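
A minimal detection sketch, assuming the checkpoint is compatible with the standard `transformers` object-detection pipeline (`src/visual_cues.py` may load the model differently):

```python
# Detect logos/seals on a rendered page image.
from PIL import Image
from transformers import pipeline

detector = pipeline("object-detection",
                    model="ellabettison/Logo-Detection-finetune")

image = Image.open("page_1.png").convert("RGB")
detections = detector(image)

# Keep confident hits only; each entry carries a label, score, and box.
for det in detections:
    if det["score"] >= 0.5:
        print(det["label"], round(det["score"], 2), det["box"])
```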
## 5. Prompting and/or Fine-Tuning Strategy
- Zero-shot prompting (no fine-tuning)
- Carefully designed instruction-based prompt that:
  - Enforces strict JSON output
  - Prioritizes strong document-specific identifiers
  - Includes explicit classification constraints (e.g., Aadhaar rules)
  - Combines OCR text with image context
**Rationale:** Zero-shot prompting ensures better generalization and aligns with real-world Vision LLM API usage without introducing dataset-specific bias.
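
For illustration, a condensed version of such a prompt might read as follows; the production prompt's exact wording is an assumption here:

```python
# Condensed sketch of the instruction-style classification prompt.
CLASSIFICATION_PROMPT = """\
You are a document classification assistant.

Rules:
- Respond with STRICT JSON only:
  {"document_type": str, "confidence": float, "key_fields": dict, "reasoning": str}
- Prefer strong document-specific identifiers (e.g., a 12-digit Aadhaar number,
  a passport MRZ) over generic layout cues.
- If no rule matches confidently, use document_type "unknown".

The OCR text and the page image follow.
"""
```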
## 6. Evaluation Protocol
Evaluation is performed using a combination of automated and human methods.
### Automated Evaluation
- JSON schema validation
- Rule-based checks (e.g., Aadhaar number presence)
- Field extraction completeness
- End-to-end latency measurement
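
A sketch of what the first two checks could look like, using `pydantic` as an illustrative validator (the project's actual checks may be implemented differently):

```python
# Schema validation plus one rule-based check on a pipeline output.
import re

from pydantic import BaseModel, ValidationError


class DocResult(BaseModel):
    document_type: str
    confidence: float
    key_fields: dict
    reasoning: str


def check_output(raw: dict) -> list[str]:
    """Return the names of failed checks for one pipeline output."""
    try:
        result = DocResult(**raw)
    except ValidationError:
        return ["schema"]
    failures = []
    # Rule-based check: an Aadhaar card must carry a 12-digit number.
    if result.document_type == "aadhaar_card":
        number = str(result.key_fields.get("aadhaar_number", ""))
        if not re.fullmatch(r"\d{12}", number.replace(" ", "")):
            failures.append("aadhaar_number")
    return failures
```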
### Human Evaluation
- Manual inspection of document type correctness
- Assessment of reasoning quality and plausibility
- Evaluation of visual cue relevance
Experiments are defined declaratively in YAML files (`experiments/exp_01.yaml`, `experiments/exp_02.yaml`) to ensure reproducibility.
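
For illustration, a minimal runner could load an experiment file like this; the keys shown are assumptions, not the actual schema of `exp_01.yaml`:

```python
# Load a declarative experiment definition (keys are hypothetical).
import yaml

with open("experiments/exp_01.yaml") as f:
    experiment = yaml.safe_load(f)

print("experiment:", experiment.get("name"))
for doc_path in experiment.get("documents", []):
    print("would evaluate:", doc_path)
```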
## 7. Key Results
- Consistent document classification across common document categories
- Improved robustness when visual cue detection is enabled
- Stable performance on scanned images and PDFs
- Deterministic preprocessing and bounded runtime
(Refer to the Experiments and Reproducibility Statement for detailed analysis.)
## 8. Known Limitations and Ethical Considerations

### Limitations
- Performance degrades on extremely low-resolution or heavily occluded documents
- Dependence on external APIs for OCR and Vision LLM inference
- Field-level extraction accuracy is not benchmarked against labeled datasets
### Ethical Considerations
- Handles potentially sensitive personal documents
- No data is permanently stored beyond processing
- API keys are required and must be securely managed
- System outputs should not be used for identity verification without human review
## 9. Exact Instructions to Reproduce Results

### 9.1 Prerequisites
- Python 3.10+
- Docker installed
- Internet access (required for external APIs)
### 9.2 Environment Configuration
Create a .env file in the project root to securely store API credentials:
```
LLAMA_API_KEY=llx-xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-or-xxxxxxxxxxxxxxxxxxxx
```
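
Note that `OPENAI_API_KEY` holds the OpenRouter key (`sk-or-...` prefix). At startup the backend can pick up both credentials with `python-dotenv`, as in this sketch (`src/config.py` may handle configuration differently):

```python
# Load credentials from .env at startup.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

LLAMA_API_KEY = os.environ["LLAMA_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # OpenRouter key
```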
### 9.3 Project Structure
```
DOCVISION_IQ/
│
├── experiments/
│   ├── exp_01.yaml
│   └── exp_02.yaml
│
├── frontend/
│   └── app.py
│
├── notebooks/
│   ├── 01_exploration.ipynb
│   └── 02_evalution.ipynb
│
├── src/
│   ├── config.py
│   ├── main.py
│   ├── pdfconverter.py
│   ├── textextraction.py
│   ├── vision.py
│   └── visual_cues.py
│
├── uploads/
│
├── Dockerfile
├── project.yaml
├── reproducibility.md
├── requirements.txt
└── README.md
```
### 9.4 Docker Execution
Build the Docker image:

```bash
docker build -t docvision .
```

Run the container with both service ports exposed:

```bash
docker run -p 8000:8000 -p 8501:8501 --env-file .env docvision
```
### 9.5 Access
| Component | URL |
|---|---|
| Streamlit UI | http://localhost:8501 |
| FastAPI Docs | http://localhost:8000/docs |
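
Once the container is up, the API can also be exercised from Python. The endpoint path and form-field name below are assumptions; check http://localhost:8000/docs for the actual routes:

```python
# Upload a document to the running API (endpoint name is hypothetical).
import requests

with open("sample_invoice.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/analyze",
        files={"file": ("sample_invoice.pdf", f, "application/pdf")},
        timeout=120,
    )
resp.raise_for_status()
print(resp.json())
```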
## 👨‍💻 Author
**DocVision**
An end-to-end AI-powered document understanding system built for real-world applications, interviews, and scalable deployments.