---
license: mit
language:
- en
pipeline_tag: text-classification
tags:
- faqs
- bitwise
- semantic-search
- knn
- sbert
- bit-vector
- binary-embedding
---

# AskBit FAQ Retriever

A fast, interpretable FAQ retriever using **bit vector encoding of SBERT sentence embeddings** combined with a **binary KNN classifier**.

This repository hosts a **model artifact** from the [AskBit](https://github.com/Shanvit7/askbit) project.

> 📚 This model was created as part of an educational journey exploring efficient semantic FAQ matching with bitwise vector representations and KNN classification.

* 🔢 Uses SBERT (`all-MiniLM-L6-v2`) to embed question-answer pairs as dense semantic vectors.
* 🧠 Converts dense embeddings into binarized bit vectors for fast similarity search.
* ⚡ Uses a K-Nearest Neighbors classifier with Hamming distance over bit vectors.
* 💡 Fully open source, efficient, and suitable for lightweight semantic FAQ retrieval.
* 🗂️ Model file: `model.pkl`
* 📄 Training data file: `faq.json`

---

## 📁 Files in This Repository

| File                | Description                                                  |
|---------------------|--------------------------------------------------------------|
| `model.pkl`         | Trained KNN classifier model over SBERT-based bit vectors.   |
| `faq.json`          | FAQ question-answer dataset used for training and evaluation.|
| `requirements.txt`  | Python dependencies to load and use the model.               |
| `README.md`         | Model usage instructions, background, and examples.          |

---

## 🧠 How It Works

### Semantic Bit Vector Encoding (`SbertBitEncoder`)

- Uses the **Sentence-BERT** model (`all-MiniLM-L6-v2`) to generate dense semantic embeddings of entire question-answer pairs.
- Embeddings capture **meaningful sentence-level semantics**, enabling effective retrieval beyond simple word overlap.
- Each dense embedding vector is **binarized** by thresholding (e.g., bits set to 1 if value > 0) to produce a compact, fixed-length bit vector.
- Both the FAQ entries and queries are encoded this way, so that semantically similar texts tend to map to bit vectors that are close in Hamming distance.

### Binary K-Nearest Neighbors Classifier (`FAQClassifier`)

- Implements a KNN classifier using **Hamming distance** as the distance metric on bit vectors.
- Learns to associate bit-encoded queries with their corresponding answers.
- Supports retrieving the best matching answer or top-k candidates with similarity scores.

---

## 🚀 Usage Example

```
import pickle

import numpy as np

# Load the trained model artifact. The packages in requirements.txt must be
# installed so that pickle can import the classes the model was saved with.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Bit vector input: a binarized SBERT embedding (e.g., a 384-bit vector).
# Placeholder only: replace it with the encoder output for your real query.
query_vec = np.zeros(384, dtype=int)  # Must match training bit vector format

# Predict (get best matching answer)
answer = model.predict(query_vec)
print("Predicted answer:", answer)
```

> ⚠️ Important: Encode new queries with the same SBERT bit-vector encoder that was used at training time. A different model or threshold produces bit vectors that are not comparable with the stored ones.

---

## 📦 Dependencies

Install dependencies with:

```
pip install -r requirements.txt
```

Main dependencies:

- `sentence-transformers`
- `scikit-learn`
- `numpy`
- `yake`
- `spacy` (for optional text preprocessing)

---

## 📚 Related Project

This model is part of the [AskBit project on GitHub](https://github.com/Shanvit7/askbit):

- ✅ Full source code with CLI and training scripts
- ✅ Debug and inspect bit vectors and retrieval results
- ✅ Lightweight, interpretable semantic FAQ search

---

## 📜 License

MIT License: free to use, modify, or contribute.

---

## 🤝 Contributing

This model is intended for learning and experimentation. Feel free to fork, improve, or build upon it!

> Model trained and shared by [@Shanvit](https://huggingface.co/Shanvit)
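## 🧪 Sketch: Binarization and Hamming Lookup

The two steps described under "How It Works" (thresholding a dense embedding into bits, then ranking FAQ entries by Hamming distance) can be illustrated with a minimal, dependency-free sketch. It is not the AskBit implementation: the names `binarize`, `hamming`, and `top_k` are hypothetical, the 8-dimensional vectors and answers are toy stand-ins for real 384-dimensional SBERT output, and packing bits into a Python `int` is just one convenient representation.

```python
from typing import List, Sequence, Tuple


def binarize(embedding: Sequence[float], threshold: float = 0.0) -> int:
    """Pack a dense vector into an int: bit i is 1 when embedding[i] > threshold."""
    bits = 0
    for i, value in enumerate(embedding):
        if value > threshold:
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two packed vectors differ."""
    return bin(a ^ b).count("1")


def top_k(query: int, index: List[Tuple[int, str]], k: int = 1) -> List[Tuple[int, str]]:
    """Return the k (distance, answer) pairs with the smallest Hamming distance."""
    scored = [(hamming(query, bits), answer) for bits, answer in index]
    return sorted(scored, key=lambda pair: pair[0])[:k]


# Toy "embeddings" (assumption: real ones would come from all-MiniLM-L6-v2).
faq_embeddings = [
    ([0.3, -0.1, 0.8, -0.5, 0.2, -0.7, 0.1, -0.2], "Reset it from the login page."),
    ([-0.4, 0.6, -0.2, 0.9, -0.3, 0.5, -0.6, 0.4], "Refunds take five business days."),
]
index = [(binarize(vec), answer) for vec, answer in faq_embeddings]

# Encode the query with the same threshold, then rank the stored entries.
query_bits = binarize([0.2, -0.3, 0.5, -0.1, 0.4, -0.2, -0.1, -0.6])
results = top_k(query_bits, index, k=2)
print(results)
```

Here the query differs from the first entry in one bit and from the second in seven, so the first answer ranks highest. XOR followed by a bit count is what makes Hamming distance cheap compared with floating-point cosine similarity.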