---
title: DocVision IQ
emoji: 🏆
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# 📄 DocVision IQ

**Employee Name:** ``

---

## 1. Research Question / Hypothesis

### Research Question

Can a hybrid pipeline combining OCR, Vision Large Language Models (Vision LLMs), and visual cue detection accurately classify and understand real-world document images and PDFs in a production-style API system?

### Hypothesis

Integrating high-quality OCR (LlamaParse), Vision LLM reasoning, and logo/seal detection will improve document classification robustness compared to OCR-only or vision-only approaches, especially for visually distinctive documents.

---

## 2. Motivation and Relevance

Organizations handle large volumes of unstructured documents such as invoices, identity cards, certificates, and contracts. Traditional OCR-only systems struggle with:

- Diverse document layouts
- Poor scan quality
- Visual identifiers (logos, seals, emblems)
- Contextual ambiguity

**DocVision** addresses these challenges using a **multi-modal document understanding pipeline**, closely reflecting real-world enterprise document intelligence systems used in **fintech, compliance, onboarding, and automation workflows**.

---

## 3. System Architecture

DocVision is implemented as a **modular, production-style system** with clear separation of concerns.

### High-Level Architecture

```
User (UI / API)
      ↓
FastAPI Backend
      ↓
Validation & Hashing
      ↓
PDF → Image Conversion (PyMuPDF)
      ↓
OCR (LlamaParse)
      ↓
Vision LLM Classification (OpenRouter)
      ↓
Visual Cue Detection (Logos / Seals)
      ↓
Caching Layer
      ↓
Structured JSON Output
```

**Backend:** FastAPI (`src/main.py`)
**Frontend:** Streamlit (`frontend/app.py`)
**Deployment:** Docker
**Experiments:** Declarative YAML (`experiments/`)

---

## 4. Models and Versions Used

### OCR

- **LlamaParse (Llama Cloud Services)**
- Used for high-quality text extraction from images and PDFs

### Vision LLM

- **nvidia/nemotron-nano-12b-v2-vl** (via OpenRouter)
- Used for document classification and reasoning over combined text and image inputs

### Visual Cue Detection

- **ellabettison/Logo-Detection-finetune**
- Transformer-based object detection model for detecting logos and seals

---

## 5. Prompting and / or Fine-Tuning Strategy

- **Zero-shot prompting** (no fine-tuning)
- A carefully designed **instruction-based prompt** that:
  - Enforces strict JSON output
  - Prioritizes strong document-specific identifiers
  - Includes explicit classification constraints (e.g., Aadhaar rules)
  - Combines OCR text with image context

**Rationale:** Zero-shot prompting generalizes better across document types and mirrors real-world Vision LLM API usage without introducing dataset-specific bias.

---

## 6. Evaluation Protocol

Evaluation combines **automated** and **human** methods.

### Automated Evaluation

- JSON schema validation
- Rule-based checks (e.g., Aadhaar number presence)
- Field extraction completeness
- End-to-end latency measurement

### Human Evaluation

- Manual inspection of document type correctness
- Assessment of reasoning quality and plausibility
- Evaluation of visual cue relevance

Experiments are defined declaratively in YAML files (`experiments/exp_01.yaml`, `experiments/exp_02.yaml`) to ensure reproducibility; illustrative sketches of an experiment file and an automated check follow below.
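Purely for illustration, an experiment definition might look like the following. The field names in this sketch are assumptions, not the project's actual schema; the authoritative format is whatever `experiments/exp_01.yaml` defines.

```yaml
# Illustrative sketch only -- field names are assumptions, not the real
# schema used by experiments/exp_01.yaml.
name: exp_01
description: Baseline pipeline with visual cue detection enabled
pipeline:
  ocr: llamaparse
  vision_model: nvidia/nemotron-nano-12b-v2-vl
  visual_cues: true
inputs:
  - uploads/sample_invoice.pdf
checks:
  json_schema: true
  rules:
    - aadhaar_number_presence
```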
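The automated checks can likewise be sketched as assertions over the model's JSON output. The minimal example below assumes a hypothetical output schema with `document_type`, `fields`, and `reasoning` keys; the pipeline's real schema is the one enforced by the classification prompt.

```python
# Minimal sketch of the automated checks: JSON schema validation plus a
# rule-based Aadhaar check. Requires `pip install jsonschema`.
# The schema below is a hypothetical stand-in for the pipeline's real
# output format.
import re

from jsonschema import ValidationError, validate

# Hypothetical output schema -- not the project's actual contract.
OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["document_type", "fields", "reasoning"],
    "properties": {
        "document_type": {"type": "string"},
        "fields": {"type": "object"},
        "reasoning": {"type": "string"},
    },
}

# Aadhaar numbers are 12 digits, commonly grouped as 4-4-4.
AADHAAR_PATTERN = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")


def check_output(result: dict) -> list[str]:
    """Return a list of failed checks (empty means all checks passed)."""
    failures = []
    try:
        validate(instance=result, schema=OUTPUT_SCHEMA)
    except ValidationError as exc:
        failures.append(f"schema: {exc.message}")
    # Rule-based check: an Aadhaar classification must carry an Aadhaar number.
    if result.get("document_type") == "aadhaar_card":
        field_values = " ".join(str(v) for v in result.get("fields", {}).values())
        if not AADHAAR_PATTERN.search(field_values):
            failures.append("rule: Aadhaar document without a detectable Aadhaar number")
    return failures
```

A checker like this can run after every pipeline call, with failures aggregated per experiment to produce the automated portion of the evaluation.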
---

## 7. Key Results

- Consistent document classification across common document categories
- Improved robustness when visual cue detection is enabled
- Stable performance on scanned images and PDFs
- Deterministic preprocessing and bounded runtime

(Refer to the **Experiments** and **Reproducibility Statement** for detailed analysis.)

---

## 8. Known Limitations and Ethical Considerations

### Limitations

- Performance degrades on extremely low-resolution or heavily occluded documents
- Dependence on external APIs for OCR and Vision LLM inference
- Field-level extraction accuracy is not benchmarked against labeled datasets

### Ethical Considerations

- Handles potentially sensitive personal documents
- No data is permanently stored beyond processing
- API keys are required and must be securely managed
- System outputs should not be used for identity verification without human review

---

## 9. Exact Instructions to Reproduce Results

### 9.1 Prerequisites

- Python 3.10+
- Docker installed
- Internet access (required for external APIs)

### 9.2 Environment Configuration

Create a `.env` file in the project root to securely store API credentials:

```env
LLAMA_API_KEY=llx-xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-or-xxxxxxxxxxxxxxxxxxxx
```

### 9.3 Project Structure

```text
DOCVISION_IQ/
│
├── experiments/
│   ├── exp_01.yaml
│   └── exp_02.yaml
│
├── frontend/
│   └── app.py
│
├── notebooks/
│   ├── 01_exploration.ipynb
│   └── 02_evalution.ipynb
│
├── src/
│   ├── config.py
│   ├── main.py
│   ├── pdfconverter.py
│   ├── textextraction.py
│   ├── vision.py
│   └── visual_cues.py
│
├── uploads/
│
├── Dockerfile
├── project.yaml
├── reproducibility.md
├── requirements.txt
└── README.md
```

### 9.4 Docker Execution

Build the Docker image:

```bash
docker build -t docvision .
```

Run the container:

```bash
docker run -p 8000:8000 -p 8501:8501 --env-file .env docvision
```

### 9.5 Access

| Component | URL |
|----------|-----|
| **Streamlit UI** | http://localhost:8501 |
| **FastAPI Docs** | http://localhost:8000/docs |

For a programmatic end-to-end check, see the example request at the end of this README.

---

## 👨‍💻 Author

**DocVision**
An end-to-end AI-powered document understanding system built for **real-world applications, interviews, and scalable deployments**.
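---

For a quick end-to-end check once the container is running, a request can be sent directly to the backend. The endpoint path `/classify` and the form field name `file` below are assumptions for this sketch; confirm the actual route in the FastAPI docs at http://localhost:8000/docs.

```python
# Smoke test against the running API. Requires `pip install requests`.
# The "/classify" route and "file" field are assumptions -- check
# http://localhost:8000/docs for the actual endpoint exposed by src/main.py.
import requests

with open("sample_document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/classify",
        files={"file": ("sample_document.pdf", f, "application/pdf")},
    )

response.raise_for_status()
print(response.json())  # structured classification output
```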