AldawsariNLP committed on
Commit 2f34b04 · 1 Parent(s): 5d4e100

Improve RAG init & docs new5

Files changed (5)
  1. .dockerignore +1 -1
  2. .gitignore +2 -2
  3. Dockerfile +13 -1
  4. README_HF_SPACES.md +4 -3
  5. backend/main.py +7 -1
.dockerignore CHANGED
@@ -10,7 +10,7 @@ README.md
  documents/*.pdf
  documents/*.docx
  documents/*.txt
- vectorstore
+
  
  
  
.gitignore CHANGED
@@ -49,8 +49,8 @@ yarn-error.log*
  Thumbs.db
  
  # Documents (optional - you may want to track these)
- # documents/
- documents1/
+ documents/
+ # documents1/
  
  # documents/*.docx
  # documents/*.txt
Dockerfile CHANGED
@@ -21,9 +21,21 @@ RUN uv pip install --system .
  COPY backend/ ./backend/
  
  # Copy processed documents and vector data
- COPY documents/ ./documents/
+ # Ensure documents and vectorstore folders exist (even if empty)
+ RUN mkdir -p documents
+
+ # Copy vectorstore folder if it exists in the build context
+ # Note: If vectorstore/ doesn't exist in repo, ensure an empty vectorstore/ folder exists
+ # (create it with: mkdir -p vectorstore && touch vectorstore/.gitkeep)
+ # The vectorstore will be populated at runtime from processed_documents.json if not pre-built
+ COPY vectorstore/ ./vectorstore/
+
+ # Copy processed documents JSON
  COPY processed_documents.json ./processed_documents.json
  
+ # Optionally copy documents if they exist in the repo
+ # COPY documents/ ./documents/
+
  # Copy built frontend bundle
  COPY frontend/build/ ./frontend/build/
  
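Docker's `COPY` instruction fails when its source path is missing from the build context, so the `COPY vectorstore/ ./vectorstore/` line above needs the folder to exist before the image is built. The following is a minimal sketch of a pre-build check, not part of this commit. The function name `prepare_build_context` and the `REQUIRED_PATHS` list are hypothetical; the paths themselves are taken from the `COPY` lines in the Dockerfile.

```python
from pathlib import Path
import tempfile

# Paths the Dockerfile copies from the build context (from its COPY lines).
REQUIRED_PATHS = ["backend", "vectorstore", "processed_documents.json", "frontend/build"]


def prepare_build_context(root: Path) -> list[str]:
    """Create a vectorstore/ placeholder if needed and return any missing paths.

    Hypothetical helper: it mirrors the
    `mkdir -p vectorstore && touch vectorstore/.gitkeep` advice in the Dockerfile comments.
    """
    vectorstore = root / "vectorstore"
    vectorstore.mkdir(parents=True, exist_ok=True)
    if not any(vectorstore.iterdir()):
        # Git does not track empty directories, so add a placeholder file.
        (vectorstore / ".gitkeep").touch()
    # Report everything else the COPY lines would fail on.
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    # Demo against an empty temporary directory standing in for the repo root.
    with tempfile.TemporaryDirectory() as tmp:
        print("Missing from build context:", prepare_build_context(Path(tmp)))
```

Running a check like this before `docker build` surfaces a missing `processed_documents.json` or `frontend/build/` early, rather than partway through the image build.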
README_HF_SPACES.md CHANGED
@@ -37,7 +37,8 @@ This guide will help you deploy the Law Document RAG application to Hugging Face
  - `backend/` - Backend code
  - `frontend/build/` - Built React app (always run `npm run build` before pushing)
  - `processed_documents.json` - Optional bundled data so the Space can answer immediately (make sure it is **not** ignored in `.dockerignore`; the backend now initializes at import time and expects this file if no PDFs are present)
- - *(Optional)* `documents/` large PDFs should be uploaded via the Space UI or an HF dataset (git pushes can’t include big binaries)
+ - `vectorstore/` - Optional pre-built vectorstore folder (if it exists in your repo, it will be included in the Docker image; otherwise it will be created at runtime from `processed_documents.json`. To ensure the folder exists even if empty, create it with: `mkdir -p vectorstore && touch vectorstore/.gitkeep`)
+ - *(Optional)* `documents/` — large PDFs should be uploaded via the Space UI or an HF dataset (git pushes can’t include big binaries). The `documents/` folder will be created automatically if it doesn't exist.
  
  ### 3. Set Up Environment Variables
  
@@ -100,9 +101,9 @@ https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space
  ## Important Notes
  
  1. **API Endpoints**: The frontend is configured to use `/api` prefix for backend calls. This is handled by the `app.py` file.
- 2. **Documents**: Upload PDFs via the Space UI or keep them in external storage; do not push them through git.
+ 2. **Documents Folder**: The `documents/` folder is automatically created if it doesn't exist. Upload PDFs via the Space UI or keep them in external storage; do not push large PDFs through git.
  3. **Processed Data**: `processed_documents.json` can be bundled with the repo. Because the backend now tries to bootstrap from this file at import/startup, make sure it reflects the same content you expect the Space to serve (and keep it under version control if you rely on it).
- 4. **Vectorstore**: The FAISS vectorstore will be created and stored in the Space's persistent storage.
+ 4. **Vectorstore**: The `vectorstore/` folder is now included in the Docker image if it exists in your repo. If you have a pre-built vectorstore, include it in your repository and it will be copied to the Docker image. If the vectorstore folder doesn't exist in your repo, ensure an empty folder exists (create with `mkdir -p vectorstore && touch vectorstore/.gitkeep`) or the Docker build may fail. The vectorstore will be created at runtime from `processed_documents.json` if not pre-built.
  5. **Port**: Hugging Face Spaces uses port 7860 by default, which is configured in `app.py`.
  
  ## Troubleshooting
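The README notes above describe a fallback: use a pre-built `vectorstore/` if one is bundled, otherwise build from `processed_documents.json`, with PDFs in `documents/` as another possible source. The sketch below illustrates one way such a decision could be expressed. The function `pick_bootstrap_source` is hypothetical and the order of preference is an assumption; the commit does not show how `RAGSystem` actually chooses.

```python
from pathlib import Path
import tempfile


def pick_bootstrap_source(root: Path) -> str:
    """Return the data source a startup routine could use (assumed order).

    Hypothetical helper for illustration only; the real backend may decide differently.
    """
    vectorstore = root / "vectorstore"
    # A folder that holds only the .gitkeep placeholder counts as "not pre-built".
    if vectorstore.is_dir() and any(p.name != ".gitkeep" for p in vectorstore.iterdir()):
        return "vectorstore"
    if (root / "processed_documents.json").is_file():
        return "processed_documents.json"
    documents = root / "documents"
    if documents.is_dir() and any(documents.glob("*.pdf")):
        return "documents"
    return "none"


if __name__ == "__main__":
    # Demo: a context that bundles only processed_documents.json and a placeholder folder.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "vectorstore").mkdir()
        (root / "vectorstore" / ".gitkeep").touch()
        (root / "processed_documents.json").write_text("[]")
        print("Bootstrap source:", pick_bootstrap_source(root))
```

Treating a `.gitkeep`-only folder as empty matters here, because the placeholder recommended in the README would otherwise be mistaken for a pre-built index.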
backend/main.py CHANGED
@@ -51,8 +51,14 @@ def initialize_rag_system():
      try:
          rag_ready = False
          print("[RAG Init] Starting initialization (import-time)")
-         rag_system = RAGSystem()
+
+         # Ensure documents folder exists
          docs_folder = Path("documents")
+         if not docs_folder.exists():
+             docs_folder.mkdir(parents=True, exist_ok=True)
+             print("[RAG Init] Created documents folder")
+
+         rag_system = RAGSystem()
          processed_json = Path("processed_documents.json")
  
          print(f"[RAG Init] processed_documents.json exists? {processed_json.exists()}")
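The folder check added in this hunk can be isolated into a small helper. This is a sketch under assumptions rather than code from the commit: the name `ensure_folder` is hypothetical. Because `Path.mkdir(parents=True, exist_ok=True)` is already safe to call on an existing directory, the separate `exists()` test serves only to decide whether to log that the folder was created.

```python
from pathlib import Path
import tempfile


def ensure_folder(path: Path) -> bool:
    """Create `path` if it is missing; return True only when it was created.

    Hypothetical helper. Note that mkdir still raises FileExistsError
    if `path` exists but is a regular file rather than a directory.
    """
    existed = path.exists()
    path.mkdir(parents=True, exist_ok=True)
    return not existed


if __name__ == "__main__":
    # Demo inside a temporary directory so no real folders are touched.
    with tempfile.TemporaryDirectory() as tmp:
        docs_folder = Path(tmp) / "documents"
        if ensure_folder(docs_folder):
            print("[RAG Init] Created documents folder")
```

Calling the helper before constructing `RAGSystem()`, as the reordered lines in the diff do, means any code that scans `documents/` during construction finds the folder in place.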