Getting started with IceVision
IceVision is an object-detection framework that connects to different libraries/frameworks such as fastai, PyTorch Lightning, and PyTorch, with more to come.
- Features a Unified Data API with out-of-the-box support for common annotation formats (COCO, VOC, etc.)
- The IceData repo hosts community-maintained parsers and custom datasets
- Provides flexible model implementations with pluggable backbones
- Helps researchers reproduce, replicate, and go beyond published models
- Enables practitioners to get moving with object detection technology quickly
This tutorial walks you through the different steps of training and using a model.
If you are using Google Colab, the GPU runtime should be enabled, but if you experience problems when training your model, you may want to check this under Runtime -> Change runtime type -> Hardware accelerator -> GPU.
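To confirm that PyTorch can actually see a GPU, a one-line check suffices (Colab ships with torch preinstalled):

import torch
torch.cuda.is_available()  # should return True on a GPU runtime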
Install icevision and icedata
!pip install icevision[all]
!pip install icedata
Import the package
from icevision.all import *
import icedata
IceVision provides handy methods to load a dataset, parse annotations, and more.
In the example below, we work with the PETS dataset to detect cats and dogs in images and identify their species. Loading the PETS dataset takes a single line of code.
data_dir = icedata.pets.load_data()
data_dir
The parser is one of the most important concepts in IceVision. It allows us to work with any annotation format.
The basic job of the parser is to convert a custom format into something the library can understand. You might still need to create a custom parser for your own dataset. Fear not! Creating parsers is easy. After you've finished this tutorial, check the custom parser documentation to learn how.
IceVision already provides a parser for the Pets Dataset:
class_map = icedata.pets.class_map()
class_map
parser = icedata.pets.parser(data_dir, class_map)
Parse the data
parse() the dataset using the data splitter. This returns 2 lists of records: one for training and another for validation. Behind the scenes, we shuffle the data and proceed with an 80% train / 20% valid split.
train_records, valid_records = parser.parse()
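As a quick sanity check, you can inspect how many records ended up in each split:

len(train_records), len(valid_records)  # record counts for the 80%/20% split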
What's a record?
A record is a dictionary that contains all parsed fields defined by the parser being used. No matter what format the annotation has, a record has a common structure that can be connected to different DL frameworks (fastai, PyTorch Lightning, etc.).
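To get a feel for that common structure, you can simply display a record; the exact fields depend on the parser, but for object detection they typically include the image filepath, labels, and bounding boxes:

train_records[0]  # shows the parsed fields of the first training record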
Visualize the training data
We can show one of the records (image + box + label). This helps to understand what is in the dataset and check that the boxes and labels make sense.
We can also display the label instead of its identifier by providing the class_map.
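For example, using the show_record helper (a minimal sketch; show_records, used below for several records, is the plural variant):

show_record(train_records[0], class_map=class_map)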
Of course, we often want to see several images with their corresponding boxes and labels.
records = train_records[:6]
show_records(records, ncols=3, class_map=class_map)
IceVision supports the widely used albumentations library out-of-the-box.
It is possible to integrate other transform libraries. You just need to inherit and override all abstract methods of the Transform class. We plan to add more transform libraries in future versions, in response to community feedback.
It is typical to use different transformations for the training and validation datasets. The valid_tfms apply to the validation set. These are minimal: just resizing the image and normalizing it. The train_tfms typically do data augmentations such as zoom, crop, lighting adjustments, horizontal flips, and so on. These help to reduce the required training set size, reduce overfitting, and produce a more robust model. IceVision makes this easy: all of the bounding boxes are adjusted if needed. For example, zooming in will make the bounding boxes larger, and crops will not cut any bounding boxes.
The presize parameter helps to improve the resulting image quality. See the fastai book for more details.
The A.Normalize function applies a set of default normalizations that have been refined over the years on the ImageNet dataset.
presize = 512
size = 384
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(size), tfms.A.Normalize()])
train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=size, presize=presize), tfms.A.Normalize()])
The Dataset class combines the records and transforms. To create a Dataset, we just need to pass the parsed records from the previous step along with the transforms.
train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)
What does the Dataset class do?
- Prepares the record: for example, the record initially contains just a filename that points to the image; it's at this stage that we open the image.
- Applies the pipeline of transforms to the record prepared in the previous step
Transforms are applied lazily, meaning they are only applied when we grab (get) an item.
This means that, if you have augmentation (random) transforms, each time you get the same item from the dataset you will get a slightly different version of it.
Because we normalized our images with imagenet_stats, we need to denormalize them when displaying transformed images. The show_sample function receives an optional argument called denormalize_fn that can be passed; in our case, we pass denormalize_imagenet.
Displaying the same image with different transforms
samples = [train_ds[0] for _ in range(6)]
show_samples(samples, ncols=3, class_map=class_map, denormalize_fn=denormalize_imagenet)
In this tutorial, we are learning to predict bounding boxes and classes, but not performing image segmentation. We will use the faster_rcnn model.
To create the model, we need to specify how many classes our dataset has. This is the length of the class_map. Note that the class_map includes a value for "background" with index 0, which is added behind the scenes by default.
model = faster_rcnn.model(num_classes=len(class_map))
Each model has its own dataloader (a PyTorch DataLoader) that can be customized: the dataloaders for the RCNN models have a custom collate function.
train_dl = faster_rcnn.train_dl(train_ds, batch_size=16, num_workers=4, shuffle=True)
valid_dl = faster_rcnn.valid_dl(valid_ds, batch_size=16, num_workers=4, shuffle=False)
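If you are curious about what that collate function produces, you can pull a single batch out of the dataloader with standard PyTorch (the exact batch structure is model-specific):

batch = next(iter(train_dl))  # one collated training batch in the format faster_rcnn expects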
You could also plug them into a new DL framework using your own custom code.
Metrics are essential for tracking the model's progress as it trains.
Here we are going to be using the well-established COCOMetric, which reports on the mean average precision of the predictions.
metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]
Training with fastai
Creating a Learner object
Creating a fastai-compatible Learner using the fastai interface.
learn = faster_rcnn.fastai.learner(dls=[train_dl, valid_dl], model=model, metrics=metrics)
Training the RCNN model using fastai
The fine_tune method is useful when you have a pre-trained model, which we are using. It does an initial epoch where it freezes everything except its final layers. It then carries on for the indicated number of epochs using a differential learning rate to train the whole model. It adjusts the learning rate both across the layers of the model and across the epochs. This can give excellent results with reduced training time.
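For example (the epoch count and base learning rate below are illustrative choices, not prescriptions):

learn.fine_tune(10, 1e-4)  # 1 frozen epoch, then 10 epochs with the whole model unfrozen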
As of September 2020, if everything is working, the model might require around 3 minutes per epoch on a free Google Colab server.
Training with PyTorch Lightning
Creating a PyTorch Lightning (PL) model class
It inherits from faster_rcnn.lightning.ModelAdapter and implements configure_optimizers, the method PL expects.
class LightModel(faster_rcnn.lightning.ModelAdapter):
    def configure_optimizers(self):
        return SGD(self.parameters(), lr=1e-4)
light_model = LightModel(model, metrics=metrics)
Training the RCNN model using PL
trainer = pl.Trainer(max_epochs=10, gpus=1)
trainer.fit(light_model, train_dl, valid_dl)
To quickly visualize the results of the model on a specific dataset, use show_results:
faster_rcnn.show_results(model, valid_ds, class_map=class_map)
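If you'd like to keep the weights you just trained, standard PyTorch serialization works (a minimal sketch; the filename is arbitrary):

torch.save(model.state_dict(), "pets_faster_rcnn.pth")  # persist the trained weights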
Load a model
Training the model with fine_tune twice led to the following results:
Using our Trained Weights
If you don't want to train the model, you can use our trained weights, which are publicly available. You can download them with:
weights_url = "https://github.com/airctic/model_zoo/releases/download/m3/pets_faster_resnetfpn50.zip"
state_dict = torch.hub.load_state_dict_from_url(weights_url, map_location=torch.device("cpu"))
Inference is typically done on the CPU, which is why we pass map_location=torch.device("cpu") when loading the state dict.
Let's recreate the model and load the downloaded weights:
model = faster_rcnn.model(num_classes=len(class_map))
model.load_state_dict(state_dict)
model.cuda()
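IceVision's prediction helpers may switch the model to evaluation mode for you, but doing it explicitly is standard PyTorch practice before inference:

model.eval()  # disable dropout and freeze batch-norm statistics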
The first step for prediction is to have some images; let's grab some random ones from the validation dataset.
Predict all images at once
If you don't have too many images, you can get predictions with a single forward pass. In case your images don't fit in memory simultaneously, you should predict in batches; the next section shows how to do that.
For demonstration purposes, let's download a single image from the internet and see how our model performs on it.
IMAGE_URL = "https://petcaramelo.com/wp-content/uploads/2018/06/beagle-cachorro.jpg"
IMG_PATH = "tmp.jpg"
download_url(IMAGE_URL, IMG_PATH)
img = open_img(IMG_PATH)
show_img(img)
Try other images! Change IMAGE_URL to point to another image you found on the internet. Just be sure to pick one of the breeds from class_map, or else the model might get confused.
Whenever you have images in memory (numpy arrays), you can use Dataset.from_images.
We're going to use the same transforms we used on the validation dataset.
infer_ds = Dataset.from_images([img], valid_tfms)
For any model, the prediction steps are always the same: first call build_infer_batch and then predict. For faster_rcnn we also have detection_threshold, which specifies how confident the model should be to output a bounding box.
batch, samples = faster_rcnn.build_infer_batch(infer_ds)
preds = faster_rcnn.predict(model=model, batch=batch)
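To keep only higher-confidence boxes, the threshold can be raised when predicting (0.8 below is purely illustrative):

preds = faster_rcnn.predict(model=model, batch=batch, detection_threshold=0.8)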
For displaying the predictions, we first need to grab our image from samples. We do this instead of using the original images because transforms may have been applied to the image (in fact, in this case, a resize was used).
imgs = [sample["img"] for sample in samples]
Now we just need to call show_preds to show the image with its corresponding predictions (boxes + labels).
show_preds(imgs=imgs, preds=preds, class_map=class_map, show=True)
Predicting a batch of images
Instead of predicting a whole list of images at once, we can process a small batch at a time. This option is more memory efficient: we use an inference dataloader, infer_dl.
If we had a test dataset, we would make our predictions using the batch technique mentioned above. As an illustrative example, we will predict all images belonging to the validation dataset using the following approach:
infer_dl = faster_rcnn.infer_dl(valid_ds, batch_size=16)
samples, preds = faster_rcnn.predict_dl(model=model, infer_dl=infer_dl)
Same as before, we grab our images from samples:
imgs = [sample["img"] for sample in samples]
Let's show the first 6 predictions:
show_preds(imgs=imgs[:6], preds=preds[:6], ncols=3, class_map=class_map, show=True)
If you need any assistance, feel free to join our forum.