chinna vemareddy committed
Commit d56c6ae · 0 Parent(s)
.dockerignore ADDED
@@ -0,0 +1,26 @@
# Python
__pycache__/
*.pyc
*.pyo
.venv/
venv/

# Git
.git/

# Secrets
.env

# Generated / runtime files
uploads/
*.pdf
*.png
*.jpg
*.jpeg

# Streamlit
.streamlit/

# OS
.DS_Store
Thumbs.db
.gitignore ADDED
@@ -0,0 +1,98 @@
# ==============================
# Python
# ==============================
__pycache__/
*.py[cod]

# ==============================
# Virtual Environments
# ==============================
.venv/
venv/
env/
ENV/

# ==============================
# Environment / Secrets
# ==============================
.env
.env.*
!.env.example

# ==============================
# Docker
# ==============================
*.log
docker-compose.override.yml

# ==============================
# Build / Distribution
# ==============================
build/
dist/
*.egg-info/

# ==============================
# IDE / Editor
# ==============================
.vscode/
.idea/
*.swp
*.swo

# ==============================
# OS Files
# ==============================
.DS_Store
Thumbs.db

# ==============================
# Streamlit
# ==============================
.streamlit/

# ==============================
# Cache / Runtime Data
# ==============================
.cache/
logs/

# ==============================
# Project-specific
# ==============================
uploads/
*.png
*.jpg
*.jpeg
*.pdf

# ==============================
# HuggingFace / Transformers Cache
# ==============================
.huggingface/
.cache/huggingface/
.cache/torch/
.cache/transformers/

# ==============================
# Jupyter / Notebooks (if any)
# ==============================
.ipynb_checkpoints/

# ==============================
# Misc
# ==============================
*.tmp
*.bak
*.old
Dockerfile ADDED
@@ -0,0 +1,68 @@
# -----------------------------
# Base Image
# -----------------------------
FROM python:3.10

# -----------------------------
# Environment
# -----------------------------
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# -----------------------------
# System Dependencies
# -----------------------------
RUN apt-get update && apt-get install -y \
    build-essential \
    libgl1 \
    poppler-utils \
    curl \
    && rm -rf /var/lib/apt/lists/*

# -----------------------------
# Working Directory
# -----------------------------
WORKDIR /app

# -----------------------------
# Upgrade pip tools
# -----------------------------
RUN pip install --upgrade pip setuptools wheel

# -----------------------------
# Install PyTorch CPU FIRST
# -----------------------------
RUN pip install --no-cache-dir \
    torch==2.1.2+cpu \
    torchvision==0.16.2+cpu \
    torchaudio==2.1.2+cpu \
    --index-url https://download.pytorch.org/whl/cpu

# -----------------------------
# Copy requirements
# -----------------------------
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# -----------------------------
# Copy project files
# -----------------------------
COPY . .

# -----------------------------
# Runtime directories
# -----------------------------
RUN mkdir -p uploads/images

# -----------------------------
# Hugging Face PUBLIC PORT
# -----------------------------
EXPOSE 7860

# -----------------------------
# Start FastAPI (internal) + Streamlit (public)
# -----------------------------
CMD ["bash", "-c", "uvicorn src.main:app --host 0.0.0.0 --port 8000 & exec streamlit run frontend/app.py --server.port=7860 --server.address=0.0.0.0 --server.headless=true --server.enableXsrfProtection=false"]
README.md ADDED
@@ -0,0 +1,219 @@
---
title: DocVision IQ
emoji: 🏆
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# 📄 DocVision IQ

**Employee Name:** `<EmpID>`

---

## 1. Research Question / Hypothesis

### Research Question
Can a hybrid pipeline combining OCR, Vision Large Language Models (Vision LLMs), and visual cue detection accurately classify and understand real-world document images and PDFs in a production-style API system?

### Hypothesis
Integrating high-quality OCR (LlamaParse), Vision LLM reasoning, and logo/seal detection will improve document classification robustness compared to OCR-only or vision-only approaches, especially for visually distinctive documents.

---

## 2. Motivation and Relevance

Organizations handle large volumes of unstructured documents such as invoices, identity cards, certificates, and contracts. Traditional OCR-only systems struggle with:

- Diverse document layouts
- Poor scan quality
- Visual identifiers (logos, seals, emblems)
- Contextual ambiguity

**DocVision** addresses these challenges using a **multi-modal document understanding pipeline**, closely reflecting real-world enterprise document intelligence systems used in **fintech, compliance, onboarding, and automation workflows**.

---

## 3. System Architecture

DocVision is implemented as a **modular, production-style system** with clear separation of concerns.

### High-Level Architecture

```
User (UI / API)
       ↓
FastAPI Backend
       ↓
Validation & Hashing
       ↓
PDF → Image Conversion (PyMuPDF)
       ↓
OCR (LlamaParse)
       ↓
Vision LLM Classification (OpenRouter)
       ↓
Visual Cue Detection (Logos / Seals)
       ↓
Caching Layer
       ↓
Structured JSON Output
```

**Backend:** FastAPI (`src/main.py`)
**Frontend:** Streamlit (`frontend/app.py`)
**Deployment:** Docker
**Experiments:** Declarative YAML (`experiments/`)

---

## 4. Models and Versions Used

### OCR
- **LlamaParse (Llama Cloud Services)**
- Used for high-quality text extraction from images and PDFs

### Vision LLM
- **nvidia/nemotron-nano-12b-v2-vl** (via OpenRouter)
- Used for document classification and reasoning using combined text and image inputs

### Visual Cue Detection
- **ellabettison/Logo-Detection-finetune**
- Transformer-based object detection model for detecting logos and seals

---

## 5. Prompting and/or Fine-Tuning Strategy

- **Zero-shot prompting** (no fine-tuning)
- Carefully designed **instruction-based prompt** that:
  - Enforces strict JSON output
  - Prioritizes strong document-specific identifiers
  - Includes explicit classification constraints (e.g., Aadhaar rules)
  - Combines OCR text with image context

**Rationale:**
Zero-shot prompting ensures better generalization and aligns with real-world Vision LLM API usage without introducing dataset-specific bias.
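
The sketch below condenses how `src/vision.py` assembles the zero-shot request: the instruction prompt and OCR text travel in one text part, and the page image travels alongside as a base64 data URL. `classify` here is a simplified stand-in for the module's `classify_image`:

```python
import base64
import os

from openai import OpenAI

# Condensed from src/vision.py: the OpenAI SDK pointed at OpenRouter.
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.getenv("OPENAI_API_KEY"))

def classify(prompt: str, ocr_text: str, image_path: str) -> str:
    """Send instruction + OCR text + page image in a single user message."""
    with open(image_path, "rb") as f:
        image_base64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="nvidia/nemotron-nano-12b-v2-vl:free",
        temperature=0.1,  # low temperature keeps output near-deterministic
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + "\n\nOCR TEXT:\n" + ocr_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```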

---

## 6. Evaluation Protocol

Evaluation is performed using a combination of **automated** and **human** methods.

### Automated Evaluation
- JSON schema validation
- Rule-based checks (e.g., Aadhaar number presence; a sketch follows this list)
- Field extraction completeness
- End-to-end latency measurement

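As an illustration, a minimal rule-based check of the kind listed above might look like the following. This is a sketch, not code from the repository: the response shape matches the `/analyze` output in `src/main.py`, while `REQUIRED_KEYS` and `validate_result` are hypothetical names.

```python
import re

# Hypothetical helper (sketch only): validates one /analyze result dict.
REQUIRED_KEYS = {"document_type", "reasoning", "extracted_textfields"}

def validate_result(result: dict) -> list[str]:
    """Return a list of rule violations for a single classification result."""
    errors = []

    # Schema check: every required key must be present.
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")

    # Rule check mirroring the prompt constraint: an aadhaar_card
    # classification requires a 12-digit number among extracted fields.
    if result.get("document_type") == "aadhaar_card":
        fields = " ".join(str(v) for v in result.get("extracted_textfields", {}).values())
        if not re.search(r"\b\d{4}\s?\d{4}\s?\d{4}\b", fields):
            errors.append("aadhaar_card classified without a 12-digit number")

    return errors
```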

### Human Evaluation
- Manual inspection of document type correctness
- Assessment of reasoning quality and plausibility
- Evaluation of visual cue relevance

Experiments are defined declaratively in YAML files
(`experiments/exp_01.yaml`, `experiments/exp_02.yaml`) to ensure reproducibility.

---

## 7. Key Results

- Consistent document classification across common document categories
- Improved robustness when visual cue detection is enabled
- Stable performance on scanned images and PDFs
- Deterministic preprocessing and bounded runtime

(Refer to the **Experiments** and **Reproducibility Statement** for detailed analysis.)

---

## 8. Known Limitations and Ethical Considerations

### Limitations
- Performance degrades on extremely low-resolution or heavily occluded documents
- Dependence on external APIs for OCR and Vision LLM inference
- Field-level extraction accuracy is not benchmarked against labeled datasets

### Ethical Considerations
- Handles potentially sensitive personal documents
- No data is permanently stored beyond processing
- API keys are required and must be securely managed
- System outputs should not be used for identity verification without human review

---

## 9. Exact Instructions to Reproduce Results

### 9.1 Prerequisites
- Python 3.10+
- Docker installed
- Internet access (required for external APIs)

### 9.2 Environment Configuration

Create a `.env` file in the project root to securely store API credentials:

```env
LLAMA_API_KEY=llx-xxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-or-xxxxxxxxxxxxxxxxxxxx
```

### 9.3 Project Structure

```text
DOCVISION_IQ/
├── experiments/
│   ├── exp_01.yaml
│   └── exp_02.yaml
├── frontend/
│   └── app.py
├── src/
│   ├── config.py
│   ├── main.py
│   ├── pdfconverter.py
│   ├── textextraction.py
│   ├── vision.py
│   └── visual_cues.py
├── uploads/
├── Dockerfile
├── project.yaml
├── reproducibility.md
├── requirements.txt
└── README.md
```

### 9.4 Docker Execution

Build the image, then run it with the ports declared in the `Dockerfile` (7860 for the public Streamlit UI, 8000 for the internal FastAPI service):

```bash
docker build -t docvision .
docker run -p 7860:7860 -p 8000:8000 --env-file .env docvision
```

### 9.5 Access

| Component | URL |
|-----------|-----|
| **Streamlit UI** | http://localhost:7860 |
| **FastAPI Docs** | http://localhost:8000/docs |
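
You can also call the backend directly. The snippet below mirrors the multipart upload performed by `frontend/app.py` (the file name `sample.png` is just a placeholder):

```python
import requests

# Upload one document to the /analyze endpoint, exactly as the
# Streamlit frontend does (the multipart field name must be "files").
with open("sample.png", "rb") as f:  # placeholder file name
    files = [("files", ("sample.png", f.read(), "image/png"))]

response = requests.post("http://localhost:8000/analyze", files=files, timeout=300)
response.raise_for_status()

for item in response.json():
    print(item["file"], "->", item.get("document_type"), item.get("error", ""))
```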

---

## 👨‍💻 Author

**DocVision**
An end-to-end AI-powered document understanding system built for
**real-world applications, interviews, and scalable deployments**.
experiments/exp_01.yaml ADDED
@@ -0,0 +1,18 @@
name: docvision_ocr_vision_baseline
task: document_classification
model: nvidia/nemotron-nano-12b-v2-vl:free

ocr_engine: llamaparse
use_visual_cues: false

max_pages: 1
image_resize: [1024, 1024]

temperature: 0.1
seed: 42

description: >
  Baseline DocVision experiment using OCR (LlamaParse) combined with
  a Vision LLM for document classification. Visual cue detection
  (logos/seals) is disabled to evaluate the contribution of textual
  and layout understanding alone.
experiments/exp_02.yaml ADDED
@@ -0,0 +1,20 @@
name: docvision_full_pipeline
task: document_classification
model: nvidia/nemotron-nano-12b-v2-vl:free

ocr_engine: llamaparse
use_visual_cues: true
logo_detection_model: ellabettison/Logo-Detection-finetune

max_pages: 1
max_logos_per_page: 4
image_resize: [1024, 1024]

temperature: 0.1
seed: 42

description: >
  Full DocVision pipeline experiment combining OCR, Vision LLM
  reasoning, and visual cue detection. Logos and seals extracted
  from documents are used to support document classification
  and improve robustness on visually distinctive documents.
frontend/app.py ADDED
@@ -0,0 +1,207 @@
import streamlit as st
import requests
import base64
from PIL import Image
import io
import os

# --------------------------------------------------
# INTERNAL API (HF SAFE)
# --------------------------------------------------
API_URL = "http://localhost:8000"

# --------------------------------------------------
# PAGE CONFIG
# --------------------------------------------------
st.set_page_config(
    page_title="DocVision IQ",
    layout="wide",
    initial_sidebar_state="collapsed"
)

# --------------------------------------------------
# LIMITS (MUST MATCH BACKEND)
# --------------------------------------------------
MAX_TOTAL_FILES = 5
MAX_IMAGES = 5
MAX_PDFS = 3

# --------------------------------------------------
# STYLES
# --------------------------------------------------
st.markdown("""
<style>
.stApp {
    background-color: #ffffff;
    color: #1a1a1a;
}
.file-header {
    background-color: #000000;
    color: #ffffff;
    padding: 12px 16px;
    border-radius: 10px;
    font-weight: bold;
    font-size: 16px;
    margin-bottom: 12px;
}
.section-gap {
    margin-bottom: 18px;
}
.logo-card {
    background-color: #fafafa;
    border-radius: 10px;
    padding: 12px;
    border: 1px solid #cccccc;
    text-align: center;
}
.confidence {
    color: #0b6623;
    font-size: 12px;
    margin-top: 6px;
}
</style>
""", unsafe_allow_html=True)

# --------------------------------------------------
# HEADER
# --------------------------------------------------
st.title("📄 DocVision IQ")
st.caption("AI-powered Document Understanding & Visual Cue Extraction")

# --------------------------------------------------
# FILE UPLOAD
# --------------------------------------------------
uploaded_files = st.file_uploader(
    "Upload Images or PDFs",
    type=["png", "jpg", "jpeg", "pdf"],
    accept_multiple_files=True
)

extract_visual = st.checkbox("🧿 Extract Visual Cues (Logos / Seals)")

# --------------------------------------------------
# CLIENT-SIDE VALIDATION
# --------------------------------------------------
if uploaded_files:
    total = len(uploaded_files)
    pdfs = sum(1 for f in uploaded_files if f.name.lower().endswith(".pdf"))
    images = total - pdfs

    if total > MAX_TOTAL_FILES:
        st.error(f"❌ Maximum {MAX_TOTAL_FILES} files allowed")
        st.stop()

    if pdfs > MAX_PDFS:
        st.error(f"❌ Maximum {MAX_PDFS} PDFs allowed")
        st.stop()

    if images > MAX_IMAGES:
        st.error(f"❌ Maximum {MAX_IMAGES} images allowed")
        st.stop()

    files = [
        ("files", (file.name, file.getvalue(), file.type))
        for file in uploaded_files
    ]

    # --------------------------------------------------
    # ANALYZE DOCUMENTS
    # --------------------------------------------------
    with st.spinner("🔍 Analyzing documents..."):
        response = requests.post(
            f"{API_URL}/analyze",
            files=files,
            timeout=300
        )

    if response.status_code != 200:
        try:
            st.error(response.json().get("message", "Analyze API failed"))
        except Exception:
            st.error("Analyze API failed")
        st.stop()

    analysis_data = response.json()

    # --------------------------------------------------
    # VISUAL CUES
    # --------------------------------------------------
    visual_map = {}

    if extract_visual:
        with st.spinner("🧿 Extracting visual cues..."):
            visual_response = requests.post(
                f"{API_URL}/visual_cues",
                files=files,
                timeout=300
            )

        if visual_response.status_code != 200:
            try:
                st.error(visual_response.json().get("message", "Visual cues API failed"))
            except Exception:
                st.error("Visual cues API failed")
            st.stop()

        visual_data = visual_response.json()
        for item in visual_data:
            visual_map[item["file"]] = item.get("visual_cues", [])

    # --------------------------------------------------
    # RENDER RESULTS
    # --------------------------------------------------
    for item in analysis_data:
        filename = item.get("file", "Unknown File")

        st.markdown(
            f"<div class='file-header'>📄 {filename}</div>",
            unsafe_allow_html=True
        )

        if "error" in item:
            st.error(item["error"])
            continue

        st.markdown(
            f"<div class='section-gap'><strong>📌 Document Type:</strong> {item.get('document_type')}</div>",
            unsafe_allow_html=True
        )

        st.markdown(
            f"<div class='section-gap'><strong>Reasoning:</strong><br>{item.get('reasoning')}</div>",
            unsafe_allow_html=True
        )

        st.markdown(
            "<div class='section-gap'><strong>Extracted Text Fields:</strong></div>",
            unsafe_allow_html=True
        )

        fields = item.get("extracted_textfields", {})
        if not fields:
            st.info("No text fields extracted")
        else:
            for k, v in fields.items():
                st.markdown(f"- **{k}**: {v}")

        if extract_visual and filename in visual_map:
            logos = []
            for page in visual_map[filename]:
                logos.extend(page.get("logos", []))

            if logos:
                cols = st.columns(min(len(logos), 4))
                for col, logo in zip(cols, logos):
                    with col:
                        img = Image.open(
                            io.BytesIO(base64.b64decode(logo["image_base64"]))
                        )
                        st.markdown("<div class='logo-card'>", unsafe_allow_html=True)
                        st.image(img)
                        st.markdown(
                            f"<div class='confidence'>Confidence: {logo['confidence']}</div>",
                            unsafe_allow_html=True
                        )
                        st.markdown("</div>", unsafe_allow_html=True)
            else:
                st.info("No visual cues found")
project.yaml ADDED
@@ -0,0 +1,30 @@
project_name: "DocVision"
employee_name: "<EmpID>"

models_used:
  - nvidia/nemotron-nano-12b-v2-vl
  - ellabettison/Logo-Detection-finetune
  - LlamaParse

data:
  - real_world_documents
  - scanned_images
  - pdf_documents

tools:
  - fastapi
  - streamlit
  - llamaparse
  - openrouter
  - transformers
  - pymupdf
  - pillow
  - docker

evaluation:
  - automated
  - human

compute:
  platform: cloud
  gpu: none
reproducibility.md ADDED
@@ -0,0 +1,76 @@
# Reproducibility Statement

This document describes the assumptions, constraints, and known sources of
variation affecting the reproducibility of the DocVision system.

---

## Hardware Assumptions

- CPU-only execution environment
- No GPU required
- Minimum 8 GB RAM recommended
- Stable internet connectivity required for external API access
- Tested on cloud-based Linux and local desktop environments

---

## Runtime Estimates

Approximate end-to-end runtime per document (single page):

| Pipeline Stage             | Avg Time (seconds) |
|----------------------------|--------------------|
| OCR (LlamaParse)           | 3.0 – 5.0          |
| Vision LLM Inference       | 1.5 – 3.5          |
| Logo Detection (CPU)       | < 10               |
| **Total (Full Pipeline)**  | **14.5 – 18.0**    |

Multi-page PDFs process only the first page by default, keeping runtime bounded.

Actual latency may vary depending on API load and network conditions.

---

## Random Seed Handling

- Temperature for Vision LLM inference is fixed at `0.1`
- Experiments declare a fixed seed value (`seed: 42`) for documentation purposes
- No stochastic sampling is intentionally introduced in the pipeline
- Preprocessing (PDF rendering, resizing, hashing) is deterministic; a sketch of the hashing step follows this list
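
The hashing step is what makes caching deterministic: identical bytes always map to the same cache key. A minimal sketch, matching `file_hash` and the cache-key construction in `src/main.py`:

```python
import hashlib

def file_hash(data: bytes) -> str:
    """Deterministic content hash used as part of the cache key."""
    return hashlib.md5(data).hexdigest()

# Re-uploading byte-identical files yields the same key, so cached
# OCR / classification results are reused instead of re-calling APIs.
data = b"%PDF-1.4 ..."  # raw uploaded bytes (illustrative placeholder)
cache_key = f"invoice.pdf_{file_hash(data)}"
```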
41
+
42
+ ---
43
+
44
+ ## Known Sources of Nondeterminism
45
+
46
+ Despite fixed configuration, some nondeterminism remains due to:
47
+
48
+ - Cloud-hosted Vision LLM inference
49
+ - OCR service variability across identical inputs
50
+ - Network latency fluctuations
51
+ - Concurrent execution order of asynchronous tasks
52
+
53
+ These factors may cause minor variations in extracted fields or reasoning text,
54
+ but document classification remains stable in most cases.
55
+
56
+ ---
57
+
58
+ ## Cost Considerations
59
+
60
+ - OCR and Vision inference rely on external APIs (LlamaParse, OpenRouter)
61
+ - Cost scales linearly with number of processed documents
62
+ - No GPU compute cost is incurred
63
+ - Logo detection and preprocessing run locally on CPU
64
+ - Caching mechanisms reduce redundant API calls for repeated inputs
65
+
66
+ Users should be aware of API usage limits and associated costs when processing
67
+ large batches of documents.
68
+
69
+ ---
70
+
71
+ ## Summary
72
+
73
+ DocVision prioritizes reproducibility through deterministic preprocessing,
74
+ fixed inference parameters, and declarative experiment definitions. Remaining
75
+ sources of nondeterminism stem primarily from external AI services rather than
76
+ internal system design.
requirements.txt ADDED
@@ -0,0 +1,34 @@
# -----------------------------
# Core Backend
# -----------------------------
fastapi==0.110.0
uvicorn==0.29.0
python-multipart==0.0.9
python-dotenv==1.0.1
requests==2.31.0

# -----------------------------
# Vision / OCR
# -----------------------------
pillow==10.1.0
pymupdf==1.23.8

# -----------------------------
# LlamaIndex
# -----------------------------
llama-index==0.12.4
llama-cloud-services==0.6.91

# -----------------------------
# OpenAI SDK
# -----------------------------
openai==1.59.7

# -----------------------------
# Transformers (CPU)
# -----------------------------
transformers==4.36.2

# -----------------------------
# Frontend
# -----------------------------
streamlit==1.31.1
src/config.py ADDED
@@ -0,0 +1,44 @@
# Maximum number of concurrent OCR requests
MAX_CONCURRENT_OCR: int = 5

# Vision LLM model name (OpenRouter)
VISION_MODEL_NAME: str = "nvidia/nemotron-nano-12b-v2-vl:free"

# Logo detection model
LOGO_DETECTION_MODEL: str = "ellabettison/Logo-Detection-finetune"

# PDF to image conversion settings
PDF_IMAGE_DPI: int = 302
PDF_IMAGE_BASE_DIR: str = "uploads/images"

# -------------------------------
# FILE UPLOAD LIMITS
# -------------------------------
UPLOAD_DIR: str = "uploads"

ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}

MAX_TOTAL_FILES: int = 5
MAX_PDFS: int = 3
MAX_IMAGES: int = 5

MAX_IMAGE_MB: int = 5
MAX_PDF_MB: int = 10

# -------------------------------
# IMAGE VALIDATION
# -------------------------------
MIN_WIDTH: int = 300
MIN_HEIGHT: int = 300
MAX_WIDTH: int = 6000
MAX_HEIGHT: int = 6000

# -------------------------------
# VISUAL CUES
# -------------------------------
MAX_VISUAL_PAGES: int = 1
MAX_LOGOS_PER_PAGE: int = 4
MAX_IMAGE_RESIZE = (1024, 1024)
src/main.py ADDED
@@ -0,0 +1,245 @@
import os
import io
import hashlib
import asyncio
from typing import List, Dict, Any

from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from PIL import Image

from src.pdfconverter import pdf_to_images
from src.vision import classify_image
from src.visual_cues import detect_logos_from_bytes
from src.config import (
    UPLOAD_DIR,
    ALLOWED_EXTENSIONS,
    MAX_TOTAL_FILES,
    MAX_PDFS,
    MAX_IMAGES,
    MAX_IMAGE_MB,
    MAX_PDF_MB,
    MIN_WIDTH,
    MIN_HEIGHT,
    MAX_WIDTH,
    MAX_HEIGHT,
    MAX_VISUAL_PAGES,
    MAX_LOGOS_PER_PAGE,
    MAX_IMAGE_RESIZE,
)


# --------------------------------------------------
# FASTAPI APPLICATION
# --------------------------------------------------
app = FastAPI(title="DocVision API")


# --------------------------------------------------
# CORS (REQUIRED FOR HUGGING FACE)
# --------------------------------------------------
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# --------------------------------------------------
# HEALTH CHECK
# --------------------------------------------------
@app.get("/")
def health() -> Dict[str, str]:
    """Health check endpoint for routing and monitoring."""
    return {"status": "ok"}


# --------------------------------------------------
# DIRECTORIES
# --------------------------------------------------
os.makedirs(UPLOAD_DIR, exist_ok=True)


# --------------------------------------------------
# IN-MEMORY CACHES
# --------------------------------------------------
TEXT_CACHE: Dict[str, Dict[str, Any]] = {}
VISUAL_CACHE: Dict[str, Dict[str, Any]] = {}


# --------------------------------------------------
# HELPER FUNCTIONS
# --------------------------------------------------
def file_hash(data: bytes) -> str:
    """Generate a deterministic hash for file contents."""
    return hashlib.md5(data).hexdigest()


def read_file(file: UploadFile) -> bytes:
    """Read file contents, then rewind the stream so it can be read again."""
    data = file.file.read()
    file.file.seek(0)
    return data


def validate_file(file: UploadFile, contents: bytes) -> str | None:
    """
    Validate file type, size, and image resolution.

    Returns an error message if invalid, otherwise None.
    """
    ext = os.path.splitext(file.filename)[1].lower()
    size_mb = len(contents) / (1024 * 1024)

    if ext not in ALLOWED_EXTENSIONS:
        return "Unsupported file format"

    if ext == ".pdf" and size_mb > MAX_PDF_MB:
        return f"PDF exceeds {MAX_PDF_MB} MB"

    if ext != ".pdf" and size_mb > MAX_IMAGE_MB:
        return f"Image exceeds {MAX_IMAGE_MB} MB"

    if ext != ".pdf":
        try:
            image = Image.open(io.BytesIO(contents))
            width, height = image.size

            if width < MIN_WIDTH or height < MIN_HEIGHT:
                return f"Image too small ({width}x{height})"

            if width > MAX_WIDTH or height > MAX_HEIGHT:
                return f"Image too large ({width}x{height})"

        except Exception:
            return "Invalid image file"

    return None


# --------------------------------------------------
# DOCUMENT ANALYSIS ENDPOINT
# --------------------------------------------------
@app.post("/analyze")
async def analyze(files: List[UploadFile] = File(...)) -> JSONResponse:
    """
    Perform OCR + Vision-based document classification.
    """
    if len(files) > MAX_TOTAL_FILES:
        return JSONResponse(
            {"error": f"Maximum {MAX_TOTAL_FILES} files allowed"},
            status_code=400,
        )

    pdf_count = sum(f.filename.lower().endswith(".pdf") for f in files)
    img_count = len(files) - pdf_count

    async def process_file(file: UploadFile) -> Dict[str, Any]:
        contents = read_file(file)
        fid = f"{file.filename}_{file_hash(contents)}"

        if file.filename.lower().endswith(".pdf") and pdf_count > MAX_PDFS:
            return {"file": file.filename, "error": f"Maximum {MAX_PDFS} PDFs allowed"}

        if not file.filename.lower().endswith(".pdf") and img_count > MAX_IMAGES:
            return {"file": file.filename, "error": f"Maximum {MAX_IMAGES} images allowed"}

        if fid in TEXT_CACHE:
            return TEXT_CACHE[fid]

        error = validate_file(file, contents)
        if error:
            return {"file": file.filename, "error": error}

        path = os.path.join(UPLOAD_DIR, file.filename)
        with open(path, "wb") as f:
            f.write(contents)

        try:
            if file.filename.lower().endswith(".pdf"):
                pdf_name = await asyncio.to_thread(pdf_to_images, path)
                base_dir = os.path.join("uploads", "images", pdf_name)
                first_page = sorted(os.listdir(base_dir))[0]
                analysis = await classify_image(os.path.join(base_dir, first_page))
            else:
                analysis = await classify_image(path)

            result = {
                "file": file.filename,
                "document_type": analysis.get("document_type"),
                "reasoning": analysis.get("reasoning"),
                "extracted_textfields": analysis.get("extracted_textfields", {}),
            }

            TEXT_CACHE[fid] = result
            return result

        except Exception as exc:
            return {"file": file.filename, "error": f"Processing failed: {exc}"}

    results = await asyncio.gather(*[process_file(f) for f in files])
    return JSONResponse(content=results)


# --------------------------------------------------
# VISUAL CUES ENDPOINT
# --------------------------------------------------
@app.post("/visual_cues")
async def visual_cues(files: List[UploadFile] = File(...)) -> JSONResponse:
    """
    Detect logos, seals, and visual symbols from documents.
    """

    async def process_visual(file: UploadFile) -> Dict[str, Any]:
        contents = read_file(file)
        fid = f"{file.filename}_{file_hash(contents)}"

        if fid in VISUAL_CACHE:
            return VISUAL_CACHE[fid]

        error = validate_file(file, contents)
        if error:
            return {"file": file.filename, "error": error}

        path = os.path.join(UPLOAD_DIR, file.filename)
        with open(path, "wb") as f:
            f.write(contents)

        visuals = []

        try:
            if file.filename.lower().endswith(".pdf"):
                pdf_name = await asyncio.to_thread(pdf_to_images, path)
                base_dir = os.path.join("uploads", "images", pdf_name)

                for img_name in sorted(os.listdir(base_dir))[:MAX_VISUAL_PAGES]:
                    with open(os.path.join(base_dir, img_name), "rb") as img_file:
                        logos = await asyncio.to_thread(
                            detect_logos_from_bytes,
                            img_file.read(),
                            MAX_IMAGE_RESIZE,
                            MAX_LOGOS_PER_PAGE,
                        )
                    visuals.append({"page": img_name, "logos": logos})
            else:
                logos = await asyncio.to_thread(
                    detect_logos_from_bytes,
                    contents,
                    MAX_IMAGE_RESIZE,
                    MAX_LOGOS_PER_PAGE,
                )
                visuals.append({"page": "image", "logos": logos})

            result = {"file": file.filename, "visual_cues": visuals}
            VISUAL_CACHE[fid] = result
            return result

        except Exception as exc:
            return {"file": file.filename, "error": f"Visual processing failed: {exc}"}

    results = await asyncio.gather(*[process_visual(f) for f in files])
    return JSONResponse(content=results)
src/pdfconverter.py ADDED
@@ -0,0 +1,63 @@
import os
from typing import Optional

import fitz  # PyMuPDF

from src.config import PDF_IMAGE_DPI, PDF_IMAGE_BASE_DIR


# --------------------------------------------------
# PDF TO IMAGE CONVERSION
# --------------------------------------------------
def pdf_to_images(
    pdf_path: str,
    base_dir: Optional[str] = None
) -> str:
    """
    Convert a multi-page PDF into individual PNG images.

    Each page of the PDF is rendered at a fixed DPI and
    saved as a separate image file inside a directory
    named after the PDF.

    Parameters
    ----------
    pdf_path : str
        Path to the input PDF file.
    base_dir : str, optional
        Base directory where page images will be stored.
        Defaults to the configured PDF_IMAGE_BASE_DIR.

    Returns
    -------
    str
        Name of the PDF file (without extension), used
        as the output folder name.
    """

    # Resolve base output directory
    output_base: str = base_dir or PDF_IMAGE_BASE_DIR

    # Extract PDF name (without extension)
    pdf_name: str = os.path.splitext(os.path.basename(pdf_path))[0]

    # Create output directory for this PDF
    output_dir: str = os.path.join(output_base, pdf_name)
    os.makedirs(output_dir, exist_ok=True)

    # Open PDF document
    document = fitz.open(pdf_path)

    # Render each page as a high-resolution PNG image
    for page_index, page in enumerate(document, start=1):
        pixmap = page.get_pixmap(dpi=PDF_IMAGE_DPI)
        pixmap.save(
            os.path.join(output_dir, f"page_{page_index}.png")
        )

    # Close document to release resources
    document.close()

    return pdf_name
src/textextraction.py ADDED
@@ -0,0 +1,77 @@
import os
import asyncio
from typing import Dict

from llama_cloud_services import LlamaParse
from llama_index.core import SimpleDirectoryReader

from src.config import MAX_CONCURRENT_OCR

from dotenv import load_dotenv

load_dotenv()

# --------------------------------------------------
# CONCURRENCY CONTROL
# --------------------------------------------------
# Limits the number of simultaneous OCR operations
ocr_semaphore: asyncio.Semaphore = asyncio.Semaphore(
    MAX_CONCURRENT_OCR
)


# --------------------------------------------------
# OCR PARSER INITIALIZATION
# --------------------------------------------------
# The API key is read from the LLAMA_API_KEY environment variable
parser: LlamaParse = LlamaParse(
    api_key=os.getenv("LLAMA_API_KEY"),
    result_type="text"
)


# --------------------------------------------------
# FILE EXTENSION HANDLERS
# --------------------------------------------------
file_extractor: Dict[str, LlamaParse] = {
    ".jpg": parser,
    ".jpeg": parser,
    ".png": parser,
    ".pdf": parser,
}


# --------------------------------------------------
# ASYNCHRONOUS OCR EXTRACTION
# --------------------------------------------------
async def extract_text_from_image_async(file_path: str) -> str:
    """
    Extract text asynchronously from an image or PDF using LlamaParse.

    Concurrency is limited using a semaphore to prevent excessive
    parallel OCR requests.

    Parameters
    ----------
    file_path : str
        Path to the image or PDF file.

    Returns
    -------
    str
        Extracted text content, or an empty string on failure.
    """
    async with ocr_semaphore:
        try:
            documents = await asyncio.to_thread(
                lambda: SimpleDirectoryReader(
                    input_files=[file_path],
                    file_extractor=file_extractor
                ).load_data()
            )

            return "\n".join(doc.text for doc in documents).strip()

        except Exception as exc:
            print(f"OCR failed for {file_path}: {exc}")
            return ""
src/vision.py ADDED
@@ -0,0 +1,195 @@
import os
import base64
import json
import re
from typing import Dict, Any

from dotenv import load_dotenv
from openai import OpenAI

from src.textextraction import extract_text_from_image_async
from src.config import VISION_MODEL_NAME

# --------------------------------------------------
# ENVIRONMENT SETUP
# --------------------------------------------------
# Load environment variables for API keys
load_dotenv()


# --------------------------------------------------
# OPENROUTER CLIENT INITIALIZATION
# --------------------------------------------------
client: OpenAI = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENAI_API_KEY")
)


# --------------------------------------------------
# STRICT JSON PROMPT (UNCHANGED ACROSS EXPERIMENTS)
# --------------------------------------------------
PROMPT = """
You are an intelligent document understanding system.
**Prompt:** You are an advanced document classification AI tasked with accurately identifying the specific type of document presented to you. Your objective is to analyze the visual layout and textual content of the document while adhering to the following guidelines:
1. **Visual Layout Analysis**: Examine the structural elements of the document such as logos, headers, footers, and any unique formatting that may indicate the document type. Pay attention to layout patterns that are characteristic of each document type.
2. **Textual Evidence Extraction**: Extract and analyze textual information from the document. Look for key phrases, terms, and identifiers that are strongly associated with each document type. This includes: - For passports: Look for terms like "Passport", "Nationality", "Date of Birth", and country-specific formatting. - For aadhaar cards: Identify "Aadhaar Number", "Biometric Data", and any unique UIDAI branding. - For pan cards: Search for "Permanent Account Number" and tax-related keywords. - For contracts: Identify terms like "Agreement", "Parties", "Terms and Conditions". - For invoices: Look for "Invoice Number", "Billing Address", "Total Amount".
3. **Prioritization of Strong Identifiers**: Focus on strong, document-specific identifiers rather than generic text. If there is ambiguity or multiple potential matches, prioritize the identifiers that are most definitive and unique to the document type.
4. **Avoiding Guesses**: Do not make assumptions or guesses if the evidence is insufficient. If the document does not clearly fit into one of the specified categories based on the analysis, return a response indicating that the document type cannot be determined with confidence.
5. **Output Format**: Provide your classification result in the following format: - "Document Type: [identified type]".

Important: Your analysis must be thorough and based on concrete evidence from the document's content and layout.
Note that there is a difference between an invoice and a receipt, and similar document pairs.

Extract each and every piece of information from the document,
and provide correct field names and values.

Important: Never classify a document as aadhaar_card unless a clear 12-digit Aadhaar number OR UIDAI reference is present.
Extract every important piece of information as fields.

Provide a brief 2-to-3-line reasoning explaining how the document type was determined, highlighting the key evidence.

Base the document type decision on that reasoning.

OUTPUT FORMAT (JSON – STRICT):

Return VALID JSON ONLY.
Do not include any text outside JSON.
Do not include markdown or code blocks.

{
  "document_type": "<Document Type>",
  "reasoning": "<brief 2 to 3 lines explanation of how the document type was determined, highlighting the key visual or textual features that influenced the decision>",
  "extracted_textfields": {
    "<field_name>": "<value>",
    "<field_name>": "<value>"
  }
}
"""


# --------------------------------------------------
# HELPER FUNCTIONS
# --------------------------------------------------
def image_to_base64(path: str) -> str:
    """
    Convert an image file to a base64-encoded string.

    Parameters
    ----------
    path : str
        Path to the image file.

    Returns
    -------
    str
        Base64-encoded image content.
    """
    with open(path, "rb") as file:
        return base64.b64encode(file.read()).decode()


def extract_json_from_text(text: str) -> Dict[str, Any]:
    """
    Safely extract JSON content from model output.

    This function handles cases where the model may
    accidentally include extra text around the JSON.

    Parameters
    ----------
    text : str
        Raw model output.

    Returns
    -------
    dict
        Parsed JSON object.

    Raises
    ------
    ValueError
        If valid JSON cannot be extracted.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{[\s\S]*\}", text)
        if match:
            return json.loads(match.group())
        raise ValueError("LLM did not return valid JSON")


# --------------------------------------------------
# ASYNC DOCUMENT CLASSIFICATION
# --------------------------------------------------
async def classify_image(image_path: str) -> Dict[str, Any]:
    """
    Perform document classification using OCR + Vision LLM.

    Steps:
    1. Extract text using OCR
    2. Encode image as base64
    3. Send text + image to Vision LLM
    4. Parse and normalize JSON output

    Parameters
    ----------
    image_path : str
        Path to the image or PDF page.

    Returns
    -------
    dict
        Structured classification result with document type,
        reasoning, and extracted fields.
    """
    # OCR extraction
    ocr_text: str = await extract_text_from_image_async(image_path)

    # Image encoding
    image_base64: str = image_to_base64(image_path)

    # Vision LLM request
    response = client.chat.completions.create(
        model=VISION_MODEL_NAME,
        temperature=0.1,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": PROMPT + "\n\nOCR TEXT:\n" + ocr_text
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_base64}"
                        }
                    }
                ]
            }
        ]
    )

    raw_output: str = response.choices[0].message.content.strip()

    # Safe JSON parsing
    try:
        result: Dict[str, Any] = extract_json_from_text(raw_output)
    except Exception:
        result = {
            "document_type": "unknown",
            "reasoning": "Model output could not be parsed as JSON",
            "extracted_textfields": {}
        }

    # Ensure required keys are always present
    return {
        "document_type": result.get("document_type", "unknown"),
        "reasoning": result.get("reasoning", ""),
        "extracted_textfields": result.get("extracted_textfields", {}),
    }
src/visual_cues.py ADDED
@@ -0,0 +1,89 @@
import io
import base64
from typing import List, Dict, Tuple

from PIL import Image
from transformers import pipeline

from src.config import LOGO_DETECTION_MODEL


# --------------------------------------------------
# MODEL INITIALIZATION (LOAD ONCE)
# --------------------------------------------------
# Object detection pipeline for logo / seal detection
detector = pipeline(
    task="object-detection",
    model=LOGO_DETECTION_MODEL,
    device=-1  # CPU
)


# --------------------------------------------------
# LOGO DETECTION
# --------------------------------------------------
def detect_logos_from_bytes(
    image_bytes: bytes,
    resize: Tuple[int, int] = (1024, 1024),
    max_logos: int = 3
) -> List[Dict[str, str | float]]:
    """
    Detect logos or visual emblems from raw image bytes.

    The function resizes the image for faster inference,
    detects logo regions, crops them, and returns the
    cropped logo images encoded in base64 along with
    confidence scores.

    Parameters
    ----------
    image_bytes : bytes
        Raw image data.
    resize : tuple[int, int], optional
        Maximum image size for inference (default: 1024x1024).
    max_logos : int, optional
        Maximum number of detected logos to return.

    Returns
    -------
    list[dict]
        List of detected logos with:
        - confidence: float
        - image_base64: str
    """

    # Load image from bytes
    image: Image.Image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    # Resize image for performance optimization
    image.thumbnail(resize)

    # Run object detection
    detections = detector(image)

    results: List[Dict[str, str | float]] = []

    # Process top detections only
    for det in detections[:max_logos]:
        box = det["box"]
        score: float = float(det["score"])

        xmin: int = int(box["xmin"])
        ymin: int = int(box["ymin"])
        xmax: int = int(box["xmax"])
        ymax: int = int(box["ymax"])

        # Crop detected logo region
        cropped = image.crop((xmin, ymin, xmax, ymax))

        # Convert cropped logo to base64
        buffer = io.BytesIO()
        cropped.save(buffer, format="PNG")

        results.append({
            "confidence": round(score, 3),
            "image_base64": base64.b64encode(buffer.getvalue()).decode()
        })

    return results