Getting started with IceVision

Why IceVision?

  • IceVision is an Object-Detection Framework that connects to different libraries/frameworks such as fastai, Pytorch Lightning, and Pytorch with more to come.

  • Features a Unified Data API with out-of-the-box support for common annotation formats (COCO, VOC, etc.)

  • The IceData repo hosts community maintained parsers and custom datasets

  • Provides flexible model implementations with pluggable backbones

  • Helps researchers reproduce, replicate, and go beyond published models

  • Enables practitioners to get moving with object detection technology quickly


This tutorial walks you through the different steps of training and using a model.

IceVision is a framework-agnostic library. To demonstrate this, we will train and use our model with both the fastai and pytorch-lightning libraries.

If you are using Google Colab, the GPU runtime should already be enabled. If you experience problems when training your model, check the setting under Runtime -> Change runtime type -> Hardware accelerator dropdown -> GPU

Install icevision and icedata

!pip install icevision[all]
!pip install icedata

Import the package

from icevision.all import *
import icedata


IceVision provides handy methods to load a dataset, parse annotations, and more.

In the example below, we work with the PETS dataset to detect cats and dogs in images and identify their breed. Loading the PETS dataset takes one line of code.

data_dir = icedata.pets.load_data()


The Parser is one of the most important concepts in IceVision. It allows us to work with any annotation format.

The basic job of the parser is to convert a custom format to something the library can understand. You might still need to create a custom parser for your own dataset. Fear not! Creating parsers is easy. After you've finished this tutorial, check the custom parser documentation to learn how.

IceVision already provides a parser for the Pets Dataset

class_map = icedata.pets.class_map()
parser = icedata.pets.parser(data_dir, class_map)
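To make the idea of "converting a custom format" concrete, here is a minimal sketch that does not use IceVision's parser classes. The CSV layout, the `parse_csv_annotations` helper, and the record keys (`filepath`, `labels`, `bboxes`) are all hypothetical and chosen only for illustration; a real IceVision parser defines its own fields.

```python
import csv
import io

# Hypothetical annotation format: one CSV row per bounding box.
RAW_ANNOTATIONS = """filename,label,xmin,ymin,xmax,ymax
cat_01.jpg,cat,10,20,110,220
cat_01.jpg,dog,50,60,150,160
dog_07.jpg,dog,5,5,95,85
"""


def parse_csv_annotations(text, class_map):
    """Group CSV rows by image and convert label names to integer ids."""
    records = {}
    for row in csv.DictReader(io.StringIO(text)):
        # One record per image; all boxes for that image are collected in it.
        record = records.setdefault(
            row["filename"],
            {"filepath": row["filename"], "labels": [], "bboxes": []},
        )
        record["labels"].append(class_map[row["label"]])
        record["bboxes"].append(
            tuple(int(row[key]) for key in ("xmin", "ymin", "xmax", "ymax"))
        )
    return list(records.values())


# Toy class map (illustrative only, not the PETS class map).
toy_class_map = {"background": 0, "cat": 1, "dog": 2}
toy_records = parse_csv_annotations(RAW_ANNOTATIONS, toy_class_map)
print(toy_records[0])
```

Whatever the input format looks like, the output has the same shape: one entry per image with its labels and boxes.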

Parse the data

Next we parse() the dataset using the data splitter. This returns 2 lists of records: one for training and another for validation. Behind the scenes we shuffle the data and proceed with an 80% train / 20% valid split.

train_records, valid_records = parser.parse()
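The shuffle-then-split behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not IceVision's splitter; the `random_split` function and its `seed` parameter are assumptions made for this example.

```python
import random


def random_split(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of the items, then cut it into train and valid parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


toy_train, toy_valid = random_split(range(10))
print(len(toy_train), len(toy_valid))
```

Fixing the seed keeps the same images in the validation set between runs, which makes results comparable.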

What's a record?

A record is a dictionary that contains all parsed fields defined by the parser used. No matter what format the annotation has, a record has a common structure that can be connected to different DL frameworks (fastai, Pytorch-Lightning, etc.)

Visualize the training data

We can show one of the records (image + box + label). This helps to understand what is in the dataset and check that the boxes and labels make sense.



We can also display the label instead of its identifier by providing the class_map.

show_record(train_records[1], class_map=class_map)


Of course, we often want to see several images with their corresponding boxes and labels.

records = train_records[:6]
show_records(records, ncols=3, class_map=class_map)



Data transformations are an essential part of the training pipeline. There are many transformation libraries available including: albumentations, solt, and torchvision.

IceVision supports the widely used albumentations library out-of-the-box.

It is possible to integrate other transform libraries. You just need to inherit from the Transform class and override all of its abstract methods. We plan to add more in future versions in response to community feedback.

It is typical to use different transformations for the training and validation datasets. The valid_tfms apply to the validation set. These are minimal: just resizing the image and normalising it. The train_tfms typically do data augmentations such as zoom, crop, lighting adjustments, horizontal flips, and so on. These help to reduce the required training set size, reduce overfitting, and produce a more robust model. IceVision makes this easy: all of the bounding boxes are adjusted if needed. For example, zooming in will make the bounding boxes larger. Crops will not cut any bounding boxes.

The presize parameter helps to improve the resulting image quality. See the Fast AI Book for more details.

The A.Normalize function applies a set of default normalizations that have been refined over the years on the Imagenet dataset.

presize = 512
size = 384
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(size), tfms.A.Normalize()])
train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=size, presize=presize), tfms.A.Normalize()])
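To illustrate what "the bounding boxes are adjusted" means, here is a small sketch of the geometry for two simple cases, a horizontal flip and a uniform zoom. The helper names are hypothetical and boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples in pixels; the transform library does this bookkeeping for you.

```python
def hflip_bbox(bbox, image_width):
    """Mirror a box around the vertical centre line of the image."""
    xmin, ymin, xmax, ymax = bbox
    # The old right edge becomes the new left edge, and vice versa.
    return (image_width - xmax, ymin, image_width - xmin, ymax)


def scale_bbox(bbox, factor):
    """Scale a box by the same factor as the image (a uniform zoom)."""
    return tuple(coord * factor for coord in bbox)


box = (10, 20, 110, 220)
flipped = hflip_bbox(box, image_width=400)  # (290, 20, 390, 220)
zoomed = scale_bbox(box, 2)                 # box is twice as wide and tall
print(flipped, zoomed)
```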


The Dataset class combines the records and transforms.

To create a Dataset, we just need to pass the parsed records from the previous step along with the transforms.

train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)

What does the Dataset class do?

  • Prepares the record: For example, in the record we just have a filename that points to the image, it's at this stage that we open the image.
  • Applies the pipeline of transforms to the record prepared in the previous step

Lazy transforms

Transforms are applied lazily, meaning they are only applied when we grab (get) an item.
This means that, if you have augmentation (random) transforms, each time you get the same item from the dataset you will get a slightly different version of it.
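A minimal sketch of lazy application, assuming a toy `LazyDataset` class and a made-up `random_brightness` augmentation (neither is part of IceVision): the transform runs inside `__getitem__`, so every access can yield a different result while the stored record stays untouched.

```python
import random

rng = random.Random(0)  # seeded so the example is repeatable


class LazyDataset:
    """Toy dataset that applies its transform only when an item is requested."""

    def __init__(self, records, transform):
        self.records = records
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        # The transform runs here, on access, not in __init__.
        return self.transform(self.records[index])


def random_brightness(record):
    # Hypothetical augmentation: attach a random brightness factor to a copy.
    return {**record, "brightness": rng.uniform(0.8, 1.2)}


lazy_ds = LazyDataset([{"filepath": "cat_01.jpg"}], random_brightness)
first, second = lazy_ds[0], lazy_ds[0]
print(first["brightness"], second["brightness"])
```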


Because we normalized our images with imagenet_stats, we need to denormalize them when displaying transformed images.
The show_sample function accepts an optional argument called denormalize_fn for this purpose. In our case, we use denormalize_imagenet.
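As a rough illustration of what normalizing and denormalizing do to a single pixel, the sketch below uses the commonly quoted ImageNet channel means and standard deviations. The helper functions are hypothetical and operate on one RGB tuple; the library functions work on whole image arrays.

```python
# Commonly used ImageNet statistics (per RGB channel, for values in [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def denormalize_pixel(normalized):
    """Invert normalize_pixel and round back to 8-bit integer values."""
    return tuple(
        round((value * std + mean) * 255)
        for value, mean, std in zip(normalized, IMAGENET_MEAN, IMAGENET_STD)
    )


pixel = (124, 116, 104)
restored = denormalize_pixel(normalize_pixel(pixel))
print(restored)
```

Without the inverse step, a plotting function would receive values centred around zero and the image would look washed out or mostly black.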

Displaying the same image with different transforms

samples = [train_ds[3] for _ in range(6)]
show_samples(samples, ncols=3, class_map=class_map)



In this tutorial, we are learning to predict bounding boxes and classes, but not performing image segmentation. We will use the FasterRCNN model.

To create the model, we need to specify how many classes our dataset has. This is the length of the class_map. Note that the class_map includes a value for "background" with index 0, which is added behind the scenes by default.

model = faster_rcnn.model(num_classes=len(class_map))


Each model has its own dataloader (a pytorch DataLoader) that could be customized: the dataloaders for the RCNN models have a custom collate function.

train_dl = faster_rcnn.train_dl(train_ds, batch_size=16, num_workers=4, shuffle=True)
valid_dl = faster_rcnn.valid_dl(valid_ds, batch_size=16, num_workers=4, shuffle=False)
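One reason detection models need a custom collate function is that each image can have a different number of boxes, so targets cannot simply be stacked into one array. The sketch below shows the idea with plain Python lists; `detection_collate` and the sample keys are hypothetical and this is not IceVision's actual collate function.

```python
def detection_collate(samples):
    """Keep per-image targets in lists instead of stacking them."""
    images = [sample["img"] for sample in samples]
    targets = [
        {"boxes": sample["bboxes"], "labels": sample["labels"]}
        for sample in samples
    ]
    return images, targets


toy_batch = [
    {"img": "image-0", "bboxes": [(0, 0, 10, 10)], "labels": [1]},
    {"img": "image-1", "bboxes": [(5, 5, 20, 20), (1, 1, 4, 4)], "labels": [2, 1]},
]
images, targets = detection_collate(toy_batch)
print(len(images), [len(t["boxes"]) for t in targets])
```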


IceVision is framework-agnostic, meaning it can be plugged into multiple DL frameworks such as fastai and pytorch-lightning.

You could also plug it into a new DL framework using your own custom code.


Metrics are essential for tracking the model progress as it's training.
Here we are going to be using the well established COCOMetric, which reports on the mean average precision of the predictions.

metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]
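COCO-style mean average precision decides whether a predicted box matches a ground-truth box by their intersection over union (IoU), evaluated at thresholds from 0.50 to 0.95. The sketch below shows only the IoU building block, with a hypothetical `iou` helper and boxes given as `(xmin, ymin, xmax, ymax)`; the full metric is computed by COCOMetric.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union else 0.0


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(overlap)
```

In this example the two boxes share half of their area, which gives an IoU of one third, so the prediction would not count as a match at the 0.50 threshold.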

Training with fastai

Creating a Learner object

Creating a fastai compatible Learner using the fastai interface.

learn = faster_rcnn.fastai.learner(dls=[train_dl, valid_dl], model=model, metrics=metrics)

Training the RCNN model using fastai fine_tune() method

The fastai fine_tune method is useful when you have a pre-trained model, which we are using. It does an initial epoch where it freezes everything except its final layers. It then carries on for the indicated number of epochs using a differential learning rate to train the whole model. It adjusts the learning rate both across the layers of the model as well as across the epochs. This can give excellent results with reduced training time.

In September 2020, if everything is working, the model might require around 3 minutes per epoch on a free Google Colab server.

learn.fine_tune(10, 1e-4)
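To give a feel for the "differential learning rate" used in the unfrozen phase, the sketch below spreads learning rates geometrically across layer groups, with the earliest layers getting the smallest rate. The halving of the base rate and the factor of 100 reflect my understanding of fastai's defaults and should be treated as assumptions, as should the choice of three layer groups; the `discriminative_lrs` helper is hypothetical.

```python
def discriminative_lrs(lr_min, lr_max, n_groups):
    """Geometrically spaced learning rates, smallest for the earliest layers."""
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]


base_lr = 1e-4               # the value passed to fine_tune above
unfrozen_lr = base_lr / 2    # assumption: rate is halved after the frozen epoch
lrs = discriminative_lrs(unfrozen_lr / 100, unfrozen_lr, n_groups=3)
print(lrs)
```

The intuition is that early layers hold general features from pre-training and should change slowly, while the final layers are specific to the new task and can move faster.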

Training with Pytorch-Lightning

Creating a Pytorch-Lightning (PL) model class

It inherits from the faster_rcnn.lightning.ModelAdapter class and implements the PL method configure_optimizers.

class LightModel(faster_rcnn.lightning.ModelAdapter):
    def configure_optimizers(self):
        return SGD(self.parameters(), lr=1e-4)

Note: If you are familiar with Lightning, you will notice that we've been able to skip some of the boilerplate. This is because the IceVision ModelAdapter takes care of it behind the scenes. For example, it defines training_step and validation_step. The adapter also supports working with Metrics. If you need custom functionality, you can override or re-implement those methods.

light_model = LightModel(model, metrics=metrics)

Training the RCNN model using PL method

trainer = pl.Trainer(max_epochs=10, gpus=1)
trainer.fit(light_model, train_dl, valid_dl)

Visualize results

To quickly visualize the results of the model on a specific dataset use show_results:

faster_rcnn.show_results(model, valid_ds, class_map=class_map)



Load a model

Training the model with fastai using fine_tune twice led to the following results:
train_loss: 0.06772
valid_loss: 0.074435

Using our Trained Weights

If you don't want to train the model, you can use our trained weights, which we have made publicly available. You can download them with torch.hub:

weights_url = ""
state_dict = torch.hub.load_state_dict_from_url(weights_url, map_location=torch.device("cpu"))


Inference is typically done on the CPU, which is why we set the map_location parameter to cpu when loading the state dict.

Let's recreate the model and load the downloaded weights:

model = faster_rcnn.model(num_classes=len(class_map))
model.load_state_dict(state_dict)

The first step for prediction is to have some images, let's grab some random ones from the validation dataset:

Predict all images at once

If you don't have too many images, you can get predictions with a single forward pass.

If your images don't fit in memory simultaneously, you should predict in batches. The next section shows how to do that.

For demonstration purposes, let's download a single image from the internet and see how our model performs on it.

IMG_PATH = "tmp.jpg"

download_url(IMAGE_URL, IMG_PATH)
img = open_img(IMG_PATH)


Try other images!

Change IMAGE_URL to point to another image you found on the internet.
Just be sure to take one of the breeds from class_map, or else the model might get confused.

Whenever you have images in memory (numpy arrays) you can use Dataset.from_images.

We're going to use the same transforms we used on the validation dataset.

infer_ds = Dataset.from_images([img], valid_tfms)

For any model, the prediction steps are always the same: first call build_infer_batch and then predict.

For faster_rcnn we have detection_threshold, which specifies how confident the model should be to output a bounding box.

batch, samples = faster_rcnn.build_infer_batch(infer_ds)
preds = faster_rcnn.predict(model=model, batch=batch)
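The effect of a detection threshold can be sketched as a simple filter over scored detections. The prediction layout used here (a dict with parallel `scores`, `labels`, and `bboxes` lists) and the `filter_by_score` helper are hypothetical; the actual structure returned by predict may differ.

```python
def filter_by_score(pred, detection_threshold=0.5):
    """Keep only the detections whose confidence meets the threshold."""
    keep = [
        index
        for index, score in enumerate(pred["scores"])
        if score >= detection_threshold
    ]
    # Apply the same selection to every parallel list in the prediction.
    return {key: [values[index] for index in keep] for key, values in pred.items()}


toy_pred = {
    "scores": [0.95, 0.40, 0.72],
    "labels": [3, 7, 3],
    "bboxes": [(10, 10, 50, 50), (0, 0, 5, 5), (60, 20, 90, 80)],
}
confident = filter_by_score(toy_pred, detection_threshold=0.5)
print(confident["labels"])
```

Raising the threshold trades recall for precision: fewer boxes are shown, but those that remain are more likely to be correct.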

For displaying the predictions, we first need to grab our image from samples. We do this instead of using the original images because transforms may have been applied to the image (in fact, in this case, a resize was used).

imgs = [sample["img"] for sample in samples]

Now we just need to call show_preds, to show the image with its corresponding predictions (boxes + labels).

show_preds(imgs=imgs, preds=preds, class_map=class_map, show=True)


Predicting a batch of images

Instead of predicting a whole list of images at once, we can process a small batch at a time. This option is more memory efficient. We use infer_dl for this.

If we had a test dataset, we would make our predictions using the batch technique described above. As an illustrative example, we will predict all images belonging to the validation dataset using the following approach:

infer_dl = faster_rcnn.infer_dl(valid_ds, batch_size=16)
samples, preds = faster_rcnn.predict_dl(model=model, infer_dl=infer_dl)
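Conceptually, batched prediction walks over the dataset in fixed-size chunks and concatenates the per-batch results, so only one batch needs to be in memory at a time. The sketch below shows that loop with hypothetical `batched` and `predict_in_batches` helpers and a stand-in `fake_predict` function in place of a real model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(items, batch_size, predict_fn):
    """Run predict_fn on one batch at a time and collect all predictions."""
    all_preds = []
    for batch in batched(items, batch_size):
        all_preds.extend(predict_fn(batch))
    return all_preds


def fake_predict(batch):
    # Stand-in for a model: returns one dummy "prediction" per input.
    return [f"pred-for-{item}" for item in batch]


toy_items = list(range(5))
toy_preds = predict_in_batches(toy_items, batch_size=2, predict_fn=fake_predict)
print(toy_preds)
```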

Same as before, we grab our images from samples.

imgs = [sample["img"] for sample in samples]

Let's show the first 6 predictions:



Happy Learning!

If you need any assistance, feel free to join our forum.