I've been working through the first two lessons of
[the fastai course](https://course.fast.ai/). For lesson one I trained a model
to recognise my cat, Mr Blupus. For lesson two the emphasis is on getting those
models out into the world as some kind of demo or application.
[Gradio](https://gradio.app) and
[Huggingface Spaces](https://huggingface.co/spaces) make it super easy to get a
prototype of your model on the internet.

This MVP app runs two models to mimic the experience of what a final deployed
version of the project might look like.
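For context, the skeleton of a Gradio prototype along these lines might look like the following sketch. The function name, labels, and placeholder logic are illustrative assumptions, not the app's actual code:

```python
def build_demo():
    # Gradio is imported inside the function so the sketch can be read
    # without the dependency installed.
    import gradio as gr

    def analyse(pdf_file, extract_redactions):
        # Placeholder for the real pipeline: classify pages, then
        # (optionally) run object detection on the redacted ones.
        return f"Received {pdf_file.name}, extract={extract_redactions}"

    demo = gr.Interface(
        fn=analyse,
        inputs=[
            gr.File(label="PDF to analyse"),
            gr.Checkbox(label="Analyse and extract redacted images"),
        ],
        outputs=gr.Textbox(label="Result"),
        title="Redaction detector (sketch)",
    )
    return demo

# build_demo().launch()  # uncomment to serve locally
```

Wrapping construction in a function keeps the sketch importable without Gradio installed; `launch()` is what actually serves the app on Spaces.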
- The first model (a classification model trained with fastai, available on the
  Huggingface Hub
  [here](https://huggingface.co/strickvl/redaction-classifier-fastai) and
  testable as a standalone demo
  [here](https://huggingface.co/spaces/strickvl/fastai_redaction_classifier))
  determines which pages of the PDF are redacted. I've written about how I
  trained this model [here](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).
- The second model (an object detection model trained using
  [IceVision](https://airctic.com/), itself built partly on top of fastai)
  detects which parts of the image are redacted. This is a model I've been
  working on for a while, and I described my process in a series of blog posts
  (see below).
This MVP app does several things:

- It extracts any pages it considers to contain redactions and displays that
  subset as an [image carousel](https://gradio.app/docs/#o_carousel). It also
  displays some text alerting you to which specific pages were redacted.
- If you tick the "Analyse and extract redacted images" checkbox, it will:
  - pass the pages it considered redacted through the object detection model
  - calculate what proportion of the total area of the image was redacted, as
    well as what proportion of the actual content (i.e. excluding margins etc.
    where there is no content)
  - create a downloadable PDF that contains only the redacted pages, with an
    overlay of the redactions it was able to identify along with a confidence
    score for each detection.
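The proportion calculation can be sketched in plain Python. Assuming the detected redactions are axis-aligned, non-overlapping boxes (an assumption; overlapping boxes would need a union-area computation), the two proportions are:

```python
def redaction_proportions(boxes, page_size, content_box):
    """Proportion of the page, and of the content region, covered by redactions.

    boxes: list of (x1, y1, x2, y2) redaction rectangles, assumed non-overlapping.
    page_size: (width, height) of the page image.
    content_box: (x1, y1, x2, y2) bounding box of the actual content,
        i.e. the page minus empty margins.
    """
    redacted = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    page_area = page_size[0] * page_size[1]
    cx1, cy1, cx2, cy2 = content_box
    content_area = (cx2 - cx1) * (cy2 - cy1)
    return redacted / page_area, redacted / content_area

# Example: one 100x100 redaction on a 200x200 page whose content
# occupies the left half (100x200).
print(redaction_proportions([(0, 0, 100, 100)], (200, 200), (0, 0, 100, 200)))
# → (0.25, 0.5)
```

The second number is always at least as large as the first, since the content region is a subset of the page.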
## The Dataset

I downloaded a few thousand publicly available FOIA documents from a government
website. I split the PDFs up into individual `.jpg` files and then used
[Prodigy](https://prodi.gy/) to annotate the data. (This process was described
in
[a blogpost written last
year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)

For the object detection model the process was quite a bit more involved, and I
direct you to the series of articles referenced below in the 'Further Reading' section.
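The PDF-to-`.jpg` step can be done with a library such as `pdf2image` (one option among several; the post doesn't specify which tool was used, and it requires the poppler utilities to be installed):

```python
from pathlib import Path


def pdf_to_jpgs(pdf_path, out_dir, dpi=150):
    # pdf2image is imported here so the sketch reads without the dependency;
    # it wraps poppler's pdftoppm under the hood.
    from pdf2image import convert_from_path

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    paths = []
    for i, page in enumerate(pages):
        dest = out / f"{Path(pdf_path).stem}_page_{i:03}.jpg"
        page.save(dest, "JPEG")
        paths.append(dest)
    return paths
```

Naming files with the source stem plus a zero-padded page index keeps pages sortable and traceable back to their original document during annotation.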
## Training the models

I trained the classification model with fastai's flexible `vision_learner`,
fine-tuning `resnet18`, which was both smaller than `resnet34` (no surprises
there) and less prone to early overfitting. I trained the model for 10 epochs.
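A minimal fastai training sketch along these lines might look as follows; the data layout, transforms, and function name are assumptions, not the post's actual code:

```python
def train_classifier(data_path, epochs=10):
    # fastai is imported inside the function so the sketch can be read
    # without the dependency installed.
    from fastai.vision.all import (
        ImageDataLoaders, Resize, error_rate, resnet18, vision_learner,
    )

    # Assumes images are arranged one folder per class,
    # e.g. data_path/redacted and data_path/unredacted.
    dls = ImageDataLoaders.from_folder(
        data_path, valid_pct=0.2, item_tfms=Resize(224)
    )
    learn = vision_learner(dls, resnet18, metrics=error_rate)
    learn.fine_tune(epochs)  # one frozen epoch, then `epochs` unfrozen epochs
    return learn
```

`fine_tune` is the standard fastai transfer-learning recipe: it first trains only the new head with the pretrained body frozen, then unfreezes and trains the whole network.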
The object detection model was trained with IceVision, using VFNet as the
model and `resnet50` as the backbone. I trained it for 50 epochs and
reached 89% accuracy on the validation data.
## Further Reading

This initial dataset spurred an ongoing interest in the domain, and I've since
been working on the problem of object detection, i.e. identifying exactly which
parts of the image contain redactions.

Some of the key blog posts I've written about this project:

- How to annotate data for an object detection problem with Prodigy
  ([link](https://mlops.systems/redactionmodel/computervision/datalabelling/2021/11/29/prodigy-object-detection-training.html))
- How to create synthetic images to supplement a small dataset
  ([link](https://mlops.systems/redactionmodel/computervision/python/tools/2022/02/10/synthetic-image-data.html))
- How to use error analysis and visual tools like FiftyOne to improve model
  performance
  ([link](https://mlops.systems/redactionmodel/computervision/tools/debugging/jupyter/2022/03/12/fiftyone-computervision.html))
- Creating more synthetic data focused on the tasks my model finds hard
  ([link](https://mlops.systems/tools/redactionmodel/computervision/2022/04/06/synthetic-data-results.html))
- Data validation for object detection / computer vision (a three-part series:
  [part 1](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/19/data-validation-great-expectations-part-1.html),
  [part 2](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/26/data-validation-great-expectations-part-2.html),
  [part 3](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/28/data-validation-great-expectations-part-3.html))