---
title: Saudi Law AI Assistant
emoji: ⚖️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# Law Document RAG Chat Application

A web application that allows users to ask questions about indexed legal documents using Retrieval Augmented Generation (RAG) techniques.

## Features

- 🤖 **RAG-powered Q&A**: Ask questions about your legal documents and get answers extracted directly from the context
- 📚 **Document Indexing**: Automatically index PDF, TXT, DOCX, and DOC files from a folder
- 🎨 **Modern React Frontend**: Beautiful, responsive chat interface
- ⚡ **FastAPI Backend**: High-performance API with LangChain and FAISS
- 🔍 **Exact Context Extraction**: Answers are extracted directly from documents, not generated
- 🔀 **Hybrid Search**: Combines BM25 (keyword-based) and semantic search for improved retrieval accuracy
- 🤗 **Qwen Model Support**: Uses Qwen/Qwen3-32B model via HuggingFace router for high-quality Arabic language understanding
- 🚀 **Hugging Face Spaces Ready**: Configured for easy deployment

## Tech Stack

- **Frontend**: React 18
- **Backend**: FastAPI
- **RAG**: LangChain + FAISS with Hybrid Search (BM25 + Semantic)
- **Vector Database**: FAISS
- **Embeddings**: Qwen/Qwen3-Embedding-8B (HuggingFace) or OpenAI embeddings (configurable)
- **LLM**: Qwen/Qwen3-32B via HuggingFace router (default) or OpenAI API (configurable)
- **Hybrid Search**: BM25 + Semantic search using EnsembleRetriever
- **Python**: 3.10 or 3.11 (required for faiss-cpu compatibility)

## Project Structure

```
KSAlaw-document-agent/
├── backend/
│   ├── main.py              # FastAPI application
│   ├── rag_system.py        # RAG implementation
│   ├── document_processor.py # Document processing logic
│   ├── embeddings.py        # OpenAI embeddings wrapper
│   └── chat_history.py     # Chat history management
├── frontend/
│   ├── src/
│   │   ├── App.js           # Main React component
│   │   ├── App.css          # Styles
│   │   ├── index.js         # React entry point
│   │   └── index.css        # Global styles
│   ├── build/               # Built React app (for deployment)
│   ├── public/
│   │   └── index.html       # HTML template
│   └── package.json         # Node dependencies
├── documents/               # Place your PDF documents here
├── vectorstore/            # FAISS vectorstore (auto-generated)
├── app.py                   # Hugging Face Spaces entry point
├── Dockerfile               # Docker configuration
├── pyproject.toml           # Python dependencies (uv)
├── uv.lock                  # Locked dependencies
├── processed_documents.json # Processed document summaries
└── README.md                # This file
```

## Quick Start

**Local Development:**
1. Install dependencies: `uv sync` and `cd frontend && npm install`
2. Create `.env` file in the project root with required environment variables:
   - `HF_TOKEN`: Your HuggingFace API token (required for Qwen model)
   - `OPENAI_API_KEY`: Your OpenAI API key (required for document processing)
   - Optionally set `USE_HYBRID_SEARCH=true` to enable hybrid search (BM25 + Semantic)
3. Add documents to `documents/` folder
4. Run backend: `uv run python backend/main.py`
5. Run frontend: `cd frontend && npm start`

**Deployment to Hugging Face Spaces:**
1. Build frontend: `cd frontend && npm run build`
2. Set up Xet storage (recommended) or prepare to upload PDFs via UI
3. Push to Hugging Face: `git push hf main`
4. Set required environment variables in Space secrets:
   - `HF_TOKEN`: Your HuggingFace API token
   - `OPENAI_API_KEY`: Your OpenAI API key
   - Optionally set `USE_HYBRID_SEARCH=true` to enable hybrid search

## API Endpoints

- `GET /api/` - Health check
- `GET /api/health` - Health status
- `POST /api/index` - Index documents from a folder
  ```json
  {
    "folder_path": "documents"
  }
  ```
- `POST /api/ask` - Ask a question
  ```json
  {
    "question": "What is the law about X?",
    "use_history": true,
    "context_mode": "chunks",
    "model_provider": "qwen"
  }
  ```
  - `question` (required): The question to ask
  - `use_history` (optional): Whether to use chat history (default: `true`)
  - `context_mode` (optional): Context mode - `"full"` (entire document) or `"chunks"` (top semantic chunks, default)
  - `model_provider` (optional): Model provider - `"qwen"` (default) or `"openai"`
  
  **Note**: The default `model_provider` is `"qwen"` which uses Qwen/Qwen3-32B via HuggingFace router. When using `context_mode="chunks"` with hybrid search enabled, the system combines BM25 and semantic search for improved retrieval accuracy.
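
A minimal client sketch for the `/api/ask` endpoint, assuming the backend is running locally on its default port 8000 (the `ask` helper below is illustrative and stdlib-only, not part of the project):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/api/ask"  # assumes the backend's default port

# Request body as documented above; only "question" is required.
payload = {
    "question": "What is the law about X?",
    "use_history": True,
    "context_mode": "chunks",
    "model_provider": "qwen",
}

def ask(url: str = API_URL) -> dict:
    """POST the question to /api/ask and return the parsed JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    print(ask())
```

The same pattern works for `POST /api/index` by swapping the URL and sending `{"folder_path": "documents"}` as the payload.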

## Environment Variables

### Required Variables

- `HF_TOKEN`: Your HuggingFace API token (required for Qwen model and HuggingFace embeddings)
- `OPENAI_API_KEY`: Your OpenAI API key (required for document processing; also used when OpenAI is selected for embeddings or the LLM)

### Optional Configuration

- `QWEN_MODEL`: Qwen model to use (default: `Qwen/Qwen3-32B:nscale`)
- `EMBEDDINGS_PROVIDER`: Embeddings provider - `"openai"` or `"hf"`/`"huggingface"` (default: `"openai"`)
- `HF_EMBEDDING_MODEL`: HuggingFace embedding model (default: `Qwen/Qwen3-Embedding-8B`)
- `OPENAI_LLM_MODEL`: OpenAI LLM model to use (default: `gpt-4o-mini`)
- `OPENAI_EMBEDDING_MODEL`: OpenAI embedding model (default: `text-embedding-ada-002`)
- `USE_HYBRID_SEARCH`: Enable hybrid search combining BM25 and semantic search (default: `"false"`, set to `"true"` to enable)
- `HYBRID_BM25_WEIGHT`: Weight for BM25 component in hybrid search (default: `0.5`)
- `HYBRID_SEMANTIC_WEIGHT`: Weight for semantic component in hybrid search (default: `0.5`)
- `CHAT_HISTORY_TURNS`: Number of conversation turns to keep in history (default: `10`)
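
For local development, these variables can be collected in a `.env` file at the project root; a minimal example (all values are placeholders):

```
# .env (project root); values below are placeholders
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxx

# Optional overrides
USE_HYBRID_SEARCH=true
HYBRID_BM25_WEIGHT=0.5
HYBRID_SEMANTIC_WEIGHT=0.5
CHAT_HISTORY_TURNS=10
```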

## Notes

- The system returns exact text extracted from the documents rather than freely generated responses
- Supported document formats: PDF, TXT, DOCX, DOC
- The vectorstore is saved locally and persists between sessions
- Documents are automatically processed on startup (no manual indexing needed)
- **Default Model**: The system uses Qwen/Qwen3-32B via HuggingFace router by default for better Arabic language understanding
- **Hybrid Search**: When enabled (`USE_HYBRID_SEARCH=true`), combines BM25 keyword search with semantic search for improved retrieval accuracy
- For Hugging Face Spaces, the frontend automatically uses `/api` as the API URL
- This project uses `uv` for Python package management - dependencies are defined in `pyproject.toml` and `uv.lock`
- The `.env` file should be in the project root (not in the backend folder)
- PDFs can be stored using Hugging Face Xet storage or uploaded via the Space UI
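
When hybrid search is enabled, LangChain's `EnsembleRetriever` merges the BM25 and semantic result lists using weighted Reciprocal Rank Fusion. A self-contained sketch of that fusion step (function and variable names are illustrative, not taken from the project):

```python
def rrf(rankings: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Weighted Reciprocal Rank Fusion over several ranked document lists.

    Each retriever contributes weight / (k + rank) per document; documents
    appearing high in either list (or in both) float to the top.
    """
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "doc_b" ranks highly in both lists, so it wins with equal 0.5 weights:
merged = rrf([["doc_a", "doc_b"], ["doc_b", "doc_c"]], [0.5, 0.5])
```

The `HYBRID_BM25_WEIGHT` and `HYBRID_SEMANTIC_WEIGHT` variables play the role of the `weights` list here.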

## Troubleshooting

### Common Issues

- **HF_TOKEN Error**: Make sure `HF_TOKEN` is set in your `.env` file (local) or Space secrets (deployment) when using Qwen model
- **OpenAI API Key Error**: Make sure `OPENAI_API_KEY` is set in your `.env` file (local) or Space secrets (deployment) for document processing
- **No documents found**: Ensure documents are in the `documents/` folder with supported extensions (PDF, TXT, DOCX, DOC)
- **Frontend can't connect**: Check that the backend is running on port 8000
- **Build fails on Spaces**: Ensure `frontend/build/` exists (run `npm run build`), check Dockerfile, verify dependencies in `pyproject.toml`
- **RAG system not initialized**: Check Space logs, ensure `processed_documents.json` exists and is not ignored by `.dockerignore`
- **Hybrid search not working**: Ensure `rank-bm25` is installed (`uv sync` should handle this) and `USE_HYBRID_SEARCH=true` is set

## License

MIT