Spaces: AldawsariNLP/Saudi-Law-AI-Assistant


AldawsariNLP committed on Nov 18, 2025

Commit 2f34b04 (1 parent: 5d4e100)

Improve RAG init & docs new5


Files changed (5)

.dockerignore +1 -1
.gitignore +2 -2
Dockerfile +13 -1
README_HF_SPACES.md +4 -3
backend/main.py +7 -1

.dockerignore CHANGED

@@ -10,7 +10,7 @@ README.md
 documents/*.pdf
 documents/*.docx
 documents/*.txt
-vectorstore

 documents/*.pdf
 documents/*.docx
 documents/*.txt

.gitignore CHANGED

@@ -49,8 +49,8 @@ yarn-error.log*
 Thumbs.db
 # Documents (optional - you may want to track these)
-# documents/
-documents1/
 # documents/*.docx
 # documents/*.txt

 Thumbs.db
 # Documents (optional - you may want to track these)
+documents/
+# documents1/
 # documents/*.docx
 # documents/*.txt

Dockerfile CHANGED

@@ -21,9 +21,21 @@ RUN uv pip install --system .
 COPY backend/ ./backend/
 # Copy processed documents and vector data
-COPY documents/ ./documents/
 COPY processed_documents.json ./processed_documents.json
 # Copy built frontend bundle
 COPY frontend/build/ ./frontend/build/

 COPY backend/ ./backend/
 # Copy processed documents and vector data
+# Ensure documents and vectorstore folders exist (even if empty)
+RUN mkdir -p documents
+# Copy vectorstore folder if it exists in the build context
+# Note: If vectorstore/ doesn't exist in repo, ensure an empty vectorstore/ folder exists
+# (create it with: mkdir -p vectorstore && touch vectorstore/.gitkeep)
+# The vectorstore will be populated at runtime from processed_documents.json if not pre-built
+COPY vectorstore/ ./vectorstore/
+# Copy processed documents JSON
 COPY processed_documents.json ./processed_documents.json
+# Optionally copy documents if they exist in the repo
+# COPY documents/ ./documents/
 # Copy built frontend bundle
 COPY frontend/build/ ./frontend/build/
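
The new `COPY vectorstore/ ./vectorstore/` line fails the build when `vectorstore/` is absent from the build context, which is why the comments ask for a `.gitkeep` placeholder. A minimal pre-build check could look like the sketch below. The function name `prepare_build_context` and the list of required paths are assumptions drawn from the `COPY` lines in this hunk, not part of the commit.

```python
from pathlib import Path
from typing import List


def prepare_build_context(root: Path) -> List[str]:
    """Create the placeholder folder the Dockerfile copies and report missing inputs.

    Hypothetical helper: returns a list of problems; an empty list means every
    path named by the COPY instructions in this hunk is present under root.
    """
    problems: List[str] = []

    # COPY vectorstore/ needs the source folder to exist, so create it and
    # add a .gitkeep so git tracks the otherwise empty folder.
    vectorstore = root / "vectorstore"
    vectorstore.mkdir(parents=True, exist_ok=True)
    (vectorstore / ".gitkeep").touch(exist_ok=True)

    # These paths are copied unconditionally and cannot be created empty.
    for required in ("processed_documents.json", "backend", "frontend/build"):
        if not (root / required).exists():
            problems.append(f"missing from build context: {required}")

    return problems
```

Calling `prepare_build_context(Path("."))` from the repository root before `docker build` would create the placeholder and list anything else the image expects. Note that `.dockerignore` entries still apply, so a path that exists on disk but is ignored will also break its `COPY`.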

README_HF_SPACES.md CHANGED

@@ -37,7 +37,8 @@ This guide will help you deploy the Law Document RAG application to Hugging Face
    - `backend/` - Backend code
    - `frontend/build/` - Built React app (always run `npm run build` before pushing)
    - `processed_documents.json` - Optional bundled data so the Space can answer immediately (make sure it is **not** ignored in `.dockerignore`; the backend now initializes at import time and expects this file if no PDFs are present)
-   - *(Optional)* `documents/` — large PDFs should be uploaded via the Space UI or an HF dataset (git pushes can’t include big binaries)
 ### 3. Set Up Environment Variables
@@ -100,9 +101,9 @@ https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space
 ## Important Notes
 1. **API Endpoints**: The frontend is configured to use `/api` prefix for backend calls. This is handled by the `app.py` file.
-2. **Documents**: Upload PDFs via the Space UI or keep them in external storage; do not push them through git.
 3. **Processed Data**: `processed_documents.json` can be bundled with the repo. Because the backend now tries to bootstrap from this file at import/startup, make sure it reflects the same content you expect the Space to serve (and keep it under version control if you rely on it).
-4. **Vectorstore**: The FAISS vectorstore will be created and stored in the Space's persistent storage.
 5. **Port**: Hugging Face Spaces uses port 7860 by default, which is configured in `app.py`.
 ## Troubleshooting

    - `backend/` - Backend code
    - `frontend/build/` - Built React app (always run `npm run build` before pushing)
    - `processed_documents.json` - Optional bundled data so the Space can answer immediately (make sure it is **not** ignored in `.dockerignore`; the backend now initializes at import time and expects this file if no PDFs are present)
+   - `vectorstore/` - Optional pre-built vectorstore folder (if it exists in your repo, it will be included in the Docker image; otherwise it will be created at runtime from `processed_documents.json`. To ensure the folder exists even if empty, create it with: `mkdir -p vectorstore && touch vectorstore/.gitkeep`)
+   - *(Optional)* `documents/` — large PDFs should be uploaded via the Space UI or an HF dataset (git pushes can’t include big binaries). The `documents/` folder will be created automatically if it doesn't exist.
 ### 3. Set Up Environment Variables
 ## Important Notes
 1. **API Endpoints**: The frontend is configured to use `/api` prefix for backend calls. This is handled by the `app.py` file.
+2. **Documents Folder**: The `documents/` folder is automatically created if it doesn't exist. Upload PDFs via the Space UI or keep them in external storage; do not push large PDFs through git.
 3. **Processed Data**: `processed_documents.json` can be bundled with the repo. Because the backend now tries to bootstrap from this file at import/startup, make sure it reflects the same content you expect the Space to serve (and keep it under version control if you rely on it).
+4. **Vectorstore**: The `vectorstore/` folder is now included in the Docker image if it exists in your repo. If you have a pre-built vectorstore, include it in your repository and it will be copied to the Docker image. If the vectorstore folder doesn't exist in your repo, ensure an empty folder exists (create with `mkdir -p vectorstore && touch vectorstore/.gitkeep`) or the Docker build may fail. The vectorstore will be created at runtime from `processed_documents.json` if not pre-built.
 5. **Port**: Hugging Face Spaces uses port 7860 by default, which is configured in `app.py`.
 ## Troubleshooting
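
The updated notes describe a fallback order: use a pre-built `vectorstore/` if one was shipped, otherwise build from `processed_documents.json`. The sketch below illustrates that decision only. `choose_vectorstore_source` and its return labels are hypothetical, and the real `RAGSystem` logic is not shown in this commit, so treat this as one plausible reading of the README rather than the backend's actual behavior.

```python
from pathlib import Path


def choose_vectorstore_source(vectorstore_dir: Path, processed_json: Path) -> str:
    """Return 'prebuilt', 'rebuild', or 'empty' (hypothetical labels)."""
    # A folder holding only the .gitkeep placeholder counts as not pre-built.
    if vectorstore_dir.is_dir():
        has_index = any(
            p.is_file() and p.name != ".gitkeep" for p in vectorstore_dir.iterdir()
        )
        if has_index:
            return "prebuilt"

    # No shipped index: fall back to the bundled processed data, if any.
    if processed_json.is_file():
        return "rebuild"

    # Nothing to serve until documents are uploaded.
    return "empty"
```

Treating a `.gitkeep`-only folder as empty matters here, because the placeholder exists purely to keep the Docker `COPY` from failing.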

backend/main.py CHANGED

@@ -51,8 +51,14 @@ def initialize_rag_system():
     try:
         rag_ready = False
         print("[RAG Init] Starting initialization (import-time)")
-        rag_system = RAGSystem()
         docs_folder = Path("documents")
         processed_json = Path("processed_documents.json")
         print(f"[RAG Init] processed_documents.json exists? {processed_json.exists()}")

     try:
         rag_ready = False
         print("[RAG Init] Starting initialization (import-time)")
+        # Ensure documents folder exists
         docs_folder = Path("documents")
+        if not docs_folder.exists():
+            docs_folder.mkdir(parents=True, exist_ok=True)
+            print("[RAG Init] Created documents folder")
+        rag_system = RAGSystem()
         processed_json = Path("processed_documents.json")
         print(f"[RAG Init] processed_documents.json exists? {processed_json.exists()}")
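
In the hunk above, `mkdir(parents=True, exist_ok=True)` is already a no-op when the folder exists, so the surrounding `exists()` check only serves to decide whether to print the log line. A compact equivalent, written as a hypothetical helper named `ensure_folder` (not part of the commit), might be:

```python
from pathlib import Path


def ensure_folder(path: Path) -> bool:
    """Create path if needed; return True only when this call created it."""
    existed = path.is_dir()
    # exist_ok=True makes repeated calls safe; it still raises FileExistsError
    # if the path exists but is a regular file rather than a directory.
    path.mkdir(parents=True, exist_ok=True)
    return not existed
```

With this shape, the caller can write `if ensure_folder(Path("documents")): print("[RAG Init] Created documents folder")` and keep the same log output as the committed code.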