Commit 2f34b04
1 Parent(s): 5d4e100
Improve RAG init & docs new5
- .dockerignore +1 -1
- .gitignore +2 -2
- Dockerfile +13 -1
- README_HF_SPACES.md +4 -3
- backend/main.py +7 -1
.dockerignore
CHANGED
@@ -10,7 +10,7 @@ README.md
 documents/*.pdf
 documents/*.docx
 documents/*.txt
-
+
 
 
 
.gitignore
CHANGED
@@ -49,8 +49,8 @@ yarn-error.log*
 Thumbs.db
 
 # Documents (optional - you may want to track these)
-
-documents1/
+documents/
+# documents1/
 
 # documents/*.docx
 # documents/*.txt
Dockerfile
CHANGED
@@ -21,9 +21,21 @@ RUN uv pip install --system .
 COPY backend/ ./backend/
 
 # Copy processed documents and vector data
-
+# Ensure documents and vectorstore folders exist (even if empty)
+RUN mkdir -p documents
+
+# Copy vectorstore folder if it exists in the build context
+# Note: If vectorstore/ doesn't exist in repo, ensure an empty vectorstore/ folder exists
+# (create it with: mkdir -p vectorstore && touch vectorstore/.gitkeep)
+# The vectorstore will be populated at runtime from processed_documents.json if not pre-built
+COPY vectorstore/ ./vectorstore/
+
+# Copy processed documents JSON
 COPY processed_documents.json ./processed_documents.json
 
+# Optionally copy documents if they exist in the repo
+# COPY documents/ ./documents/
+
 # Copy built frontend bundle
 COPY frontend/build/ ./frontend/build/
 
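The comments in this hunk warn that `COPY vectorstore/ ./vectorstore/` needs the folder to exist in the build context. As a minimal sketch of a pre-build check, assuming it is run from the repository root: the helper name `missing_copy_sources` and the `COPY_SOURCES` tuple are hypothetical and not part of this commit, and the sketch does not read `.dockerignore`, so a path excluded there would still be reported as present.

```python
from pathlib import Path

# Sources named by the COPY instructions visible in the hunk above.
# Hypothetical constant for illustration; keep it in sync with the Dockerfile.
COPY_SOURCES = ("backend", "vectorstore", "processed_documents.json", "frontend/build")


def missing_copy_sources(context: Path, sources=COPY_SOURCES) -> list[str]:
    """Return the COPY sources that do not exist under the build context."""
    return [src for src in sources if not (context / src).exists()]
```

Calling `missing_copy_sources(Path("."))` before `docker build` returns an empty list when every source is present. Git does not track empty directories, which is why the Dockerfile comment suggests adding `vectorstore/.gitkeep`.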
README_HF_SPACES.md
CHANGED
@@ -37,7 +37,8 @@ This guide will help you deploy the Law Document RAG application to Hugging Face
 - `backend/` - Backend code
 - `frontend/build/` - Built React app (always run `npm run build` before pushing)
 - `processed_documents.json` - Optional bundled data so the Space can answer immediately (make sure it is **not** ignored in `.dockerignore`; the backend now initializes at import time and expects this file if no PDFs are present)
--
+- `vectorstore/` - Optional pre-built vectorstore folder (if it exists in your repo, it will be included in the Docker image; otherwise it will be created at runtime from `processed_documents.json`. To ensure the folder exists even if empty, create it with: `mkdir -p vectorstore && touch vectorstore/.gitkeep`)
+- *(Optional)* `documents/` — large PDFs should be uploaded via the Space UI or an HF dataset (git pushes can’t include big binaries). The `documents/` folder will be created automatically if it doesn't exist.
 
 ### 3. Set Up Environment Variables
 
@@ -100,9 +101,9 @@ https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space
 ## Important Notes
 
 1. **API Endpoints**: The frontend is configured to use `/api` prefix for backend calls. This is handled by the `app.py` file.
-2. **Documents**: Upload PDFs via the Space UI or keep them in external storage; do not push
+2. **Documents Folder**: The `documents/` folder is automatically created if it doesn't exist. Upload PDFs via the Space UI or keep them in external storage; do not push large PDFs through git.
 3. **Processed Data**: `processed_documents.json` can be bundled with the repo. Because the backend now tries to bootstrap from this file at import/startup, make sure it reflects the same content you expect the Space to serve (and keep it under version control if you rely on it).
-4. **Vectorstore**: The
+4. **Vectorstore**: The `vectorstore/` folder is now included in the Docker image if it exists in your repo. If you have a pre-built vectorstore, include it in your repository and it will be copied to the Docker image. If the vectorstore folder doesn't exist in your repo, ensure an empty folder exists (create with `mkdir -p vectorstore && touch vectorstore/.gitkeep`) or the Docker build may fail. The vectorstore will be created at runtime from `processed_documents.json` if not pre-built.
 5. **Port**: Hugging Face Spaces uses port 7860 by default, which is configured in `app.py`.
 
 ## Troubleshooting
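The README notes describe a fallback order: use a pre-built `vectorstore/` if one was shipped, otherwise build from `processed_documents.json`. The sketch below shows one way that decision could be expressed. The function name `choose_bootstrap_source` and its return values are assumptions for illustration, and the actual logic in `backend/main.py` is not shown in this diff and may differ.

```python
from pathlib import Path


def choose_bootstrap_source(base: Path) -> str:
    """Pick a startup data source under `base` (hypothetical helper).

    Returns "vectorstore", "processed_json", or "none".
    """
    vectorstore = base / "vectorstore"
    processed_json = base / "processed_documents.json"

    # A folder that holds only the .gitkeep placeholder is treated as
    # "not pre-built", matching the empty-folder workaround in the README.
    has_prebuilt = vectorstore.is_dir() and any(
        entry.name != ".gitkeep" for entry in vectorstore.iterdir()
    )
    if has_prebuilt:
        return "vectorstore"
    if processed_json.is_file():
        return "processed_json"
    return "none"
```

Treating a `.gitkeep`-only folder as empty matters here because the Dockerfile always copies `vectorstore/`, so the folder's mere existence in the image says nothing about whether an index was actually built.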
backend/main.py
CHANGED
@@ -51,8 +51,14 @@ def initialize_rag_system():
     try:
         rag_ready = False
         print("[RAG Init] Starting initialization (import-time)")
-
+
+        # Ensure documents folder exists
         docs_folder = Path("documents")
+        if not docs_folder.exists():
+            docs_folder.mkdir(parents=True, exist_ok=True)
+            print("[RAG Init] Created documents folder")
+
+        rag_system = RAGSystem()
         processed_json = Path("processed_documents.json")
 
         print(f"[RAG Init] processed_documents.json exists? {processed_json.exists()}")
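In the added lines, `mkdir(parents=True, exist_ok=True)` is already safe to call when the folder exists, so the `exists()` check mainly decides whether the log line is printed. A minimal sketch of the same idea as a reusable helper follows; the name `ensure_folder` is hypothetical and does not appear in this commit.

```python
from pathlib import Path


def ensure_folder(path: Path) -> bool:
    """Create `path` if needed; return True only if it was newly created."""
    existed = path.is_dir()
    # exist_ok=True makes this call idempotent for existing directories.
    # It still raises FileExistsError if `path` is an existing regular file.
    path.mkdir(parents=True, exist_ok=True)
    return not existed
```

A caller could then write `if ensure_folder(Path("documents")): print("[RAG Init] Created documents folder")`, which keeps the log message while removing the separate existence check.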