| title: DocVision IQ | |
| emoji: π | |
| colorFrom: yellow | |
| colorTo: indigo | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| # DocVision IQ | |
| **Employee Name:** `<EmpID>` | |
| --- | |
| ## 1. Research Question / Hypothesis | |
| ### Research Question | |
| Can a hybrid pipeline combining OCR, Vision Large Language Models (Vision LLMs), and visual cue detection accurately classify and understand real-world document images and PDFs in a production-style API system? | |
| ### Hypothesis | |
| Integrating high-quality OCR (LlamaParse), Vision LLM reasoning, and logo/seal detection will improve document classification robustness compared to OCR-only or vision-only approaches, especially for visually distinctive documents. | |
| --- | |
| ## 2. Motivation and Relevance | |
| Organizations handle large volumes of unstructured documents such as invoices, identity cards, certificates, and contracts. Traditional OCR-only systems struggle with: | |
| - Diverse document layouts | |
| - Poor scan quality | |
| - Visual identifiers (logos, seals, emblems) | |
| - Contextual ambiguity | |
| **DocVision** addresses these challenges using a **multi-modal document understanding pipeline**, closely reflecting real-world enterprise document intelligence systems used in **fintech, compliance, onboarding, and automation workflows**. | |
| --- | |
| ## 3. System Architecture | |
| DocVision is implemented as a **modular, production-style system** with clear separation of concerns. | |
| ### High-Level Architecture | |
| ``` | |
| User (UI / API) | |
| ↓ | |
| FastAPI Backend | |
| ↓ | |
| Validation & Hashing | |
| ↓ | |
| PDF → Image Conversion (PyMuPDF) | |
| ↓ | |
| OCR (LlamaParse) | |
| ↓ | |
| Vision LLM Classification (OpenRouter) | |
| ↓ | |
| Visual Cue Detection (Logos / Seals) | |
| ↓ | |
| Caching Layer | |
| ↓ | |
| Structured JSON Output | |
| ``` | |
| **Backend:** FastAPI (`src/main.py`) | |
| **Frontend:** Streamlit (`frontend/app.py`) | |
| **Deployment:** Docker | |
| **Experiments:** Declarative YAML (`experiments/`) | |
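The validation, hashing, and caching stages of the pipeline can be illustrated with a minimal sketch. It assumes the cache is keyed by a SHA-256 digest of the uploaded bytes, so that re-uploading the same file skips the OCR and Vision LLM calls. The names `validate_upload`, `content_hash`, `analyze_with_cache`, the allowed extensions, and the in-memory dictionary are all hypothetical and may differ from what `src/main.py` actually does.

```python
import hashlib
import os
from typing import Callable, Dict

# Assumed set of accepted upload types; the real list is not stated in this README.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}

# In-memory stand-in for the caching layer (hypothetical).
_cache: Dict[str, dict] = {}


def validate_upload(filename: str, data: bytes) -> None:
    """Reject unsupported extensions and empty files before any API call."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    if not data:
        raise ValueError("empty file")


def content_hash(data: bytes) -> str:
    """Hash the file content, not the name, so renamed duplicates share a key."""
    return hashlib.sha256(data).hexdigest()


def analyze_with_cache(filename: str, data: bytes,
                       analyze: Callable[[bytes], dict]) -> dict:
    """Run `analyze` only when this exact content has not been seen before."""
    validate_upload(filename, data)
    key = content_hash(data)
    if key not in _cache:
        _cache[key] = analyze(data)
    return _cache[key]
```

Keying on content rather than filename means two uploads of the same scan under different names resolve to one cached result.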
| --- | |
| ## 4. Models and Versions Used | |
| ### OCR | |
| - **LlamaParse (Llama Cloud Services)** | |
| - Used for high-quality text extraction from images and PDFs | |
| ### Vision LLM | |
| - **nvidia/nemotron-nano-12b-v2-vl** (via OpenRouter) | |
| - Used for document classification and reasoning using combined text and image inputs | |
| ### Visual Cue Detection | |
| - **ellabettison/Logo-Detection-finetune** | |
| - Transformer-based object detection model for detecting logos and seals | |
| --- | |
| ## 5. Prompting and / or Fine-Tuning Strategy | |
| - **Zero-shot prompting** (no fine-tuning) | |
| - Carefully designed **instruction-based prompt** that: | |
| - Enforces strict JSON output | |
| - Prioritizes strong document-specific identifiers | |
| - Includes explicit classification constraints (e.g., Aadhaar rules) | |
| - Combines OCR text with image context | |
| **Rationale:** | |
| Zero-shot prompting avoids tuning the classifier to any single dataset, which reduces the risk of dataset-specific bias, and it matches how Vision LLM APIs are typically called in real-world deployments. | |
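The prompting strategy above could look roughly like the following sketch, which builds an instruction prompt around the OCR text and then parses the model reply as strict JSON. The label set, the wording of the Aadhaar constraint, the output keys, and the function names are illustrative assumptions, not the project's actual prompt in `src/vision.py`.

```python
import json

# Hypothetical label set; the real categories are not listed in this README.
ALLOWED_TYPES = ["aadhaar_card", "invoice", "certificate", "contract", "other"]


def build_prompt(ocr_text: str, max_chars: int = 4000) -> str:
    """Compose a zero-shot instruction prompt; the image is sent alongside it."""
    return (
        "You are a document classifier. Use the attached image and the OCR text.\n"
        f"Choose exactly one document_type from: {', '.join(ALLOWED_TYPES)}.\n"
        # Assumed form of the 'Aadhaar rules' constraint mentioned above.
        "Classify as aadhaar_card only if a 12-digit Aadhaar number or explicit "
        "Aadhaar wording is present.\n"
        'Respond with JSON only: {"document_type": "...", "reasoning": "..."}\n\n'
        f"OCR text:\n{ocr_text[:max_chars]}"
    )


def parse_response(raw: str) -> dict:
    """Extract the outermost JSON object and enforce the allowed labels."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in model response")
    result = json.loads(raw[start:end + 1])  # raises ValueError on bad JSON
    if result.get("document_type") not in ALLOWED_TYPES:
        raise ValueError(f"unexpected document_type: {result.get('document_type')!r}")
    return result
```

Rejecting replies that fall outside the label set keeps downstream consumers from having to handle free-form model output.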
| --- | |
| ## 6. Evaluation Protocol | |
| Evaluation is performed using a combination of **automated** and **human** methods. | |
| ### Automated Evaluation | |
| - JSON schema validation | |
| - Rule-based checks (e.g., Aadhaar number presence) | |
| - Field extraction completeness | |
| - End-to-end latency measurement | |
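The four automated checks listed above can be sketched as small, independent functions. This is a simplified illustration: the result schema (`document_type`, `reasoning`, `fields`), the `aadhaar_card` label, and all function names are assumptions, and the project's own checks may be stricter. The regular expression relies only on the fact that an Aadhaar number has 12 digits, commonly printed in groups of four.

```python
import re
import time
from typing import Any, Callable, Dict, List, Tuple

# Assumed output schema: key name mapped to its expected Python type.
REQUIRED_FIELDS = {"document_type": str, "reasoning": str, "fields": dict}

# 12 digits, optionally separated into groups of four by single spaces.
AADHAAR_PATTERN = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")


def schema_errors(result: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the result is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in result:
            errors.append(f"missing: {name}")
        elif not isinstance(result[name], expected_type):
            errors.append(f"wrong type: {name}")
    return errors


def aadhaar_rule_ok(result: Dict[str, Any], ocr_text: str) -> bool:
    """An Aadhaar classification must be backed by a 12-digit number in the OCR text."""
    if result.get("document_type") != "aadhaar_card":
        return True
    return AADHAAR_PATTERN.search(ocr_text) is not None


def field_completeness(result: Dict[str, Any], expected: List[str]) -> float:
    """Fraction of expected fields that were extracted with a non-empty value."""
    if not expected:
        return 1.0
    fields = result.get("fields", {})
    filled = sum(1 for name in expected if fields.get(name) not in (None, ""))
    return filled / len(expected)


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call `fn` and return its result together with elapsed wall-clock seconds."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start
```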
| ### Human Evaluation | |
| - Manual inspection of document type correctness | |
| - Assessment of reasoning quality and plausibility | |
| - Evaluation of visual cue relevance | |
| Experiments are defined declaratively in YAML files | |
| (`experiments/exp_01.yaml`, `experiments/exp_02.yaml`) to ensure reproducibility. | |
| --- | |
| ## 7. Key Results | |
| - Consistent document classification across common document categories | |
| - Improved robustness when visual cue detection is enabled | |
| - Stable performance on scanned images and PDFs | |
| - Deterministic preprocessing and bounded runtime | |
| (Refer to the **Experiments** and **Reproducibility Statement** for detailed analysis.) | |
| --- | |
| ## 8. Known Limitations and Ethical Considerations | |
| ### Limitations | |
| - Performance degrades on extremely low-resolution or heavily occluded documents | |
| - Dependence on external APIs for OCR and Vision LLM inference | |
| - Field-level extraction accuracy is not benchmarked against labeled datasets | |
| ### Ethical Considerations | |
| - Handles potentially sensitive personal documents | |
| - No data is permanently stored beyond processing | |
| - API keys are required and must be securely managed | |
| - System outputs should not be used for identity verification without human review | |
| --- | |
| ## 9. Exact Instructions to Reproduce Results | |
| ### 9.1 Prerequisites | |
| - Python 3.10+ | |
| - Docker installed | |
| - Internet access (required for external APIs) | |
| ### 9.2 Environment Configuration | |
| Create a `.env` file in the project root to securely store API credentials: | |
| ```env | |
| LLAMA_API_KEY=llx-xxxxxxxxxxxxxxxxxxxxxxxx | |
| OPENAI_API_KEY=sk-or-xxxxxxxxxxxxxxxxxxxx | |
| ``` | |
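A startup check along these lines can fail fast when a key is missing, instead of surfacing the problem on the first API call. This is only a sketch: it assumes the two variable names shown above are read from the process environment (for example via `--env-file .env` in Docker), and `missing_keys` is a hypothetical helper, not necessarily what `src/config.py` does.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("LLAMA_API_KEY", "OPENAI_API_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required keys that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit(f"Missing API keys: {', '.join(missing)}")
```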
| ### 9.3 Project Structure | |
| ```text | |
| DOCVISION_IQ/ | |
| │ | |
| ├── experiments/ | |
| │   ├── exp_01.yaml | |
| │   └── exp_02.yaml | |
| │ | |
| ├── frontend/ | |
| │   └── app.py | |
| │ | |
| ├── notebooks/ | |
| │   ├── 01_exploration.ipynb | |
| │   └── 02_evalution.ipynb | |
| │ | |
| ├── src/ | |
| │   ├── config.py | |
| │   ├── main.py | |
| │   ├── pdfconverter.py | |
| │   ├── textextraction.py | |
| │   ├── vision.py | |
| │   └── visual_cues.py | |
| │ | |
| ├── uploads/ | |
| │ | |
| ├── Dockerfile | |
| ├── project.yaml | |
| ├── reproducibility.md | |
| ├── requirements.txt | |
| └── README.md | |
| ``` | |
| ### 9.4 Docker Execution | |
| Build the Docker image, then run the container with the API and UI ports published: | |
| ```bash | |
| docker build -t docvision . | |
| docker run -p 8000:8000 -p 8501:8501 --env-file .env docvision | |
| ``` | |
| ### 9.5 Access | |
| | Component | URL | | |
| |----------|-----| | |
| | **Streamlit UI** | http://localhost:8501 | | |
| | **FastAPI Docs** | http://localhost:8000/docs | | |
| --- | |
| ## Author | |
| **DocVision** | |
| An end-to-end AI-powered document understanding system built for | |
| **real-world applications, interviews, and scalable deployments**. | |