Kabyle (Taqbaylit) Tesseract OCR Model

By Bouaziz Ait Driss

Overview

A Tesseract OCR model for the Kabyle (Taqbaylit) language, with support for its special characters (ɣ, ɛ, ḍ, ṭ, ḥ, ṛ, ṣ, ẓ, ǧ, č).

Quick Start

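# Install: copy the model into Tesseract's tessdata directory
# (the path below matches a Tesseract 4 install on Linux and may differ on your system)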
cp kab.traineddata /usr/share/tesseract-ocr/4.00/tessdata/

# Use
tesseract image.png output -l kab
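
The same call can be driven from Python. The following is a minimal sketch, not part of the model's own tooling: the helper names `build_tesseract_command` and `run_ocr` are illustrative, and the optional `tessdata_dir` argument assumes you want to point Tesseract at a specific tessdata directory through its `--tessdata-dir` option.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_tesseract_command(image: str, output_base: str, lang: str = "kab",
                            tessdata_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for a Tesseract CLI call (illustrative helper)."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Optional: tell Tesseract where kab.traineddata lives.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image: str, output_base: str, **kwargs) -> str:
    """Run Tesseract and return the recognized text.

    Requires the tesseract executable on PATH; Tesseract writes its
    plain-text result to <output_base>.txt.
    """
    subprocess.run(build_tesseract_command(image, output_base, **kwargs), check=True)
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

Calling `run_ocr("image.png", "output")` mirrors the shell command above, provided `kab.traineddata` is installed where Tesseract can find it.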

Try it Online

https://huggingface.co/spaces/AitBAD/kab-ocr-tanti

Initial Training Details

  • Base model: eng (English)
  • Training data: ~5,000 files
  • Wordlist: 5,000+ Kabyle words
  • Iteration count: ~10,000

Initial Performance

Performance measured with the dedicated Python script test_cer.py:

  • Character Error Rate (CER): 30.56%
  • Word Error Rate (WER): 72.94%
  • Tested on average quality scanned documents
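
For readers unfamiliar with these metrics, here is a minimal sketch of how CER and WER are commonly computed from an edit distance. It is not the test_cer.py script, whose contents are not shown here; the function names are illustrative, and the NFC normalization step is an assumption made so that precomposed and decomposed forms of characters such as ḍ compare as equal.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text: str) -> str:
    # Assumption: compare texts in NFC so "d" + combining dot equals "ḍ".
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref = _normalize(reference).split()
    hyp = _normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Both functions return a fraction, so multiply by 100 to get a percentage like the figures reported above. A single wrong character makes its whole word count as an error, which is why WER is typically much higher than CER.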

Fine-Tuning Training Details

  • Base model: kab (Kabyle)
  • Training data: 5,554 files
  • Wordlist: 29,840 Kabyle words
  • Iteration count: ~26,000
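
The exact training command is not given here. As a rough sketch, settings like the ones listed above could map onto Tesstrain's Makefile variables (`MODEL_NAME`, `START_MODEL`, `TESSDATA`, `MAX_ITERATIONS`, `WORDLIST_FILE`) as follows. The helper name, the paths, and the specific values in the example calls are assumptions for illustration only.

```python
from typing import List


def build_tesstrain_command(model_name: str, start_model: str, tessdata: str,
                            max_iterations: int, wordlist_file: str) -> List[str]:
    """Build a `make training` invocation for Tesstrain (illustrative helper)."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",          # name of the model being trained
        f"START_MODEL={start_model}",        # model to continue training from
        f"TESSDATA={tessdata}",              # directory holding START_MODEL.traineddata
        f"MAX_ITERATIONS={max_iterations}",  # stop after this many iterations
        f"WORDLIST_FILE={wordlist_file}",    # word list used for the dictionary
    ]


# Hypothetical example values: an initial run starting from eng,
# then a fine-tuning run starting from the resulting kab model.
initial = build_tesstrain_command("kab", "eng", "/path/to/tessdata", 10000, "kab.wordlist")
fine_tune = build_tesstrain_command("kab", "kab", "/path/to/tessdata", 26000, "kab.wordlist")
print(" ".join(initial))
print(" ".join(fine_tune))
```

The lists can be passed to `subprocess.run(..., cwd=<tesstrain checkout>)` once the ground-truth files are in place.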

Optimal Performance from Fine-Tuning

Training ended at the iteration 26,000 checkpoint with a BCER of 2.9%.

Performance measured with the dedicated Python script test_cer.py:

  • Character Error Rate (CER): 5.08%
  • Word Error Rate (WER): 15.28%
  • Tested on average quality scanned documents

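Note that when pytesseract is used inside a conda environment, it loads the model from the tessdata directory of the Tesseract installed in that conda environment, not from the Tesseract installed under Windows. The model file must therefore be copied into the conda environment's tessdata directory.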
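
To check which copy of the model will be picked up, a small lookup like the one below can help. It is a sketch under stated assumptions: the helper names are illustrative, `TESSDATA_PREFIX` is treated as pointing directly at the tessdata directory, and the two conda layouts (`share/tessdata` and `Library/share/tessdata`) are guesses that should be verified against your own environment.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def candidate_tessdata_dirs(env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """List directories where a traineddata file might live (assumed layouts)."""
    env = os.environ if env is None else env
    dirs: List[Path] = []
    if env.get("TESSDATA_PREFIX"):
        dirs.append(Path(env["TESSDATA_PREFIX"]))
    if env.get("CONDA_PREFIX"):
        prefix = Path(env["CONDA_PREFIX"])
        dirs.append(prefix / "share" / "tessdata")              # assumed Unix-style layout
        dirs.append(prefix / "Library" / "share" / "tessdata")  # assumed Windows layout
    return dirs


def find_model(lang: str = "kab",
               env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the first <lang>.traineddata found, or None if it is missing."""
    for directory in candidate_tessdata_dirs(env):
        model = directory / f"{lang}.traineddata"
        if model.is_file():
            return model
    return None
```

If `find_model()` returns `None` inside the activated conda environment, the model has probably been installed only for the system-wide Tesseract. With pytesseract, an explicit directory can also be supplied through the `config` argument as a `--tessdata-dir` option.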

Data Quality Guidelines

  • Good quality scans (300+ DPI, clear text): ~98% accuracy
  • Average quality scans (150-300 DPI): ~84% word accuracy
  • Poor quality scans (<150 DPI, skewed, faded): May require manual review
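
The tiers above can be turned into a simple pre-check before running OCR. This is only a sketch: the function name is illustrative, it looks at resolution alone (not skew or fading), and treating exactly 300 DPI as "good" is an assumption, since the listed ranges overlap at that value.

```python
def scan_quality_tier(dpi: float) -> str:
    """Classify a scan by resolution, following the guideline tiers."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    if dpi >= 300:
        return "good"
    if dpi >= 150:
        return "average"
    return "poor"  # likely to need manual review


for dpi in (600, 200, 96):
    print(dpi, scan_quality_tier(dpi))
```

Scans that fall into the "poor" tier are candidates for rescanning at a higher resolution or for manual review of the OCR output.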

Citation

If you use this model, please cite: Bouaziz Ait Driss. (2025). Kabyle (Taqbaylit) Tesseract OCR Model. Hugging Face. https://huggingface.co/AitBAD/kab-Taqbaylit-Tesseract-ocr

Known Limitations

  • Numbers: Limited training data
  • May miss some older, less commonly used characters, such as "Г" (equivalent to "ɣ") and "ţ" (equivalent to "tt")
  • Performance degrades with poor scan quality
  • Best results on printed text (not handwritten)

Acknowledgments

  • Based on Tesseract OCR
  • Trained using Tesstrain
  • Trained under WSL on Windows 11