Skip to content

Open In Colab

Getting Started with Object Detection using IceVision


IceVision is a Framework for object detection and deep learning that makes it easier to prepare data, train an object detection model, and use that model for inference.

The IceVision Framework provides a layer across multiple deep learning engines, libraries, models, and data sets.

It enables you to work with multiple training engines, including fastai, and pytorch-lightning.

It enables you to work with some of the best deep learning libraries including mmdetection, Ross Wightman's efficientdet implementation and model library, torchvision, and ultralytics Yolo.

It enables you to select from many possible models and backbones from these libraries.

IceVision lets you switch between them with ease. This means that you can pick the engine, library, model, and data format that work for you now and easily change them in the future. You can experiment with with them to see which ones meet your requirements.

In this tutorial, you will learn how to
1. Install IceVision. This will include the IceData package that provides easy access to several sample datasets, as well as the engines and libraries that IceVision works with.
2. Download and prepare a dataset to work with.
3. Select an object detection library, model, and backbone.
4. Instantiate the model, and then train it with both the fastai and pytorch lightning engines.
5. And finally, use the model to identify objects in images.

The notebook is set up so that you can easily select different libraries, models, and backbones to try.

Install IceVision and IceData

The following downloads and runs a short shell script. The script installs IceVision, IceData, the MMDetection library, and Yolo v5 as well as the fastai and pytorch lightning engines.



All of the IceVision components can be easily imported with a single line.

from icevision.all import *

Download and prepare a dataset

Now we can start by downloading the Fridge Objects dataset. This tiny dataset contains 134 images of 4 classes: - can, - carton, - milk bottle, - water bottle.

IceVision provides methods to load a dataset, parse annotation files, and more.

For more information about how the fridge dataset as well as its corresponding parser, check out the fridge folder in icedata.

# Download the dataset
url = ""
dest_dir = "fridge"
data_dir = icedata.load_data(url, dest_dir)

Parse the dataset

The parser loads the annotation file and parses them returning a list of training and validation records. The parser has an extensible autofix capability that identifies common errors in annotation files, reports, and often corrects them automatically.

The parsers support multiple formats (including VOC and COCO). You can also extend the parser for additional formats if needed.

The record is a key concept in IceVision, it holds the information about an image and its annotations. It is extensible and can support other object formats and types of annotations.

# Create the parser
parser = parsers.VOCBBoxParser(annotations_dir=data_dir / "odFridgeObjects/annotations", images_dir=data_dir / "odFridgeObjects/images")
# Parse annotations to create records
train_records, valid_records = parser.parse()
<ClassMap: {'background': 0}>

Creating datasets with agumentations and transforms

Data augmentations are essential for robust training and results on many datasets and deep learning tasks. IceVision ships with the Albumentations library for defining and executing transformations, but can be extended to use others.

For this tutorial, we apply the Albumentation's default aug_tfms to the training set. aug_tfms randomly applies broadly useful transformations including rotation, cropping, horizintal flips, and more. See the Albumentations documentation to learn how to customize each transformation more fully.

The validation set is only resized (with padding).

We then create Datasets for both. The dataset applies the transforms to the annotations (such as bounding boxes) and images in the data records.

# Transforms
# size is set to 384 because EfficientDet requires its inputs to be divisible by 128
image_size = 384
train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=image_size, presize=512), tfms.A.Normalize()])
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(image_size), tfms.A.Normalize()])
# Datasets
train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)

Understanding the transforms

The Dataset transforms are only applied when we grab (get) an item. Several of the default aug_tfms have a random element to them. For example, one might perform a rotation with probability 0.5 where the angle of rotation is randomly selected between +45 and -45 degrees.

This means that the learner sees a slightly different version of an image each time it is accessed. This effectively increases the size of the dataset and improves learning.

We can look at result of getting the 0th image from the dataset a few times and see the differences. Each time you run the next cell, you will see different results due to the random element in applying transformations.

# Show an element of the train_ds with augmentation transformations applied
samples = [train_ds[0] for _ in range(3)]
show_samples(samples, ncols=3)


Select a library, model, and backbone

In order to create a model, we need to: * Choose one of the libraries supported by IceVision * Choose one of the models supported by the library * Choose one of the backbones corresponding to a chosen model

You can access any supported models by following the IceVision unified API, use code completion to explore the available models for each library.

Creating a model

Selections only take two simple lines of code. For example, to try the mmdet library using the retinanet model and the resnet50_fpn_1x backbone could be specified by:

model_type = models.mmdet.retinanet
backbone = model_type.backbones.resnet50_fpn_1x(pretrained=True)
As pretrained models are used by default, we typically leave this out of the backbone creation step.

We've selected a few of the many options below. You can easily pick which option you want to try by setting the value of selection. This shows you how easy it is to try new libraries, models, and backbones.

# Just change the value of selection to try another model

selection = 0

extra_args = {}

if selection == 0:
  model_type = models.mmdet.retinanet
  backbone = model_type.backbones.resnet50_fpn_1x

elif selection == 1:
  # The Retinanet model is also implemented in the torchvision library
  model_type = models.torchvision.retinanet
  backbone = model_type.backbones.resnet50_fpn

elif selection == 2:
  model_type = models.ross.efficientdet
  backbone = model_type.backbones.tf_lite0
  # The efficientdet model requires an img_size parameter
  extra_args['img_size'] = image_size

elif selection == 3:
  model_type = models.ultralytics.yolov5
  backbone = model_type.backbones.small
  # The yolov5 model requires an img_size parameter
  extra_args['img_size'] = image_size

model_type, backbone, extra_args

Now it is just a one-liner to instantiate the model. If you want to try another option, just edit the line at the top of the previous cell.

# Instantiate the mdoel
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(parser.class_map), **extra_args) 

Data Loader

The Data Loader is specific to a model_type. The job of the data loader is to get items from a dataset and batch them up in the specific format required by each model. This is why creating the data loaders is separated from creating the datasets.

We can take a look at the first batch of items from the valid_dl. Remember that the valid_tfms only resized (with padding) and normalized records, so different images, for example, are not returned each time. This is important to provide consistent validation during training.

# Data Loaders
train_dl = model_type.train_dl(train_ds, batch_size=8, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=8, num_workers=4, shuffle=False)
# show batch
model_type.show_batch(first(valid_dl), ncols=4)



The fastai and pytorch lightning engines collect metrics to track progress during training. IceVision provides metric classes that work across the engines and libraries.

The same metrics can be used for both fastai and pytorch lightning.

metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]


IceVision is an agnostic framework meaning it can be plugged into other DL learning engines such as fastai2, and pytorch-lightning.

Training using fastai

learn = model_type.fastai.learner(dls=[train_dl, valid_dl], model=model, metrics=metrics)

# For Sparse-RCNN, use lower `end_lr`
# learn.lr_find(end_lr=0.005)
SuggestedLRs(lr_min=6.918309954926372e-05, lr_steep=9.120108734350652e-05)


learn.fine_tune(20, 1e-4, freeze_epochs=1)
epoch train_loss valid_loss COCOMetric time
0 1.308208 0.935791 0.228113 00:07
epoch train_loss valid_loss COCOMetric time
0 0.852521 0.686144 0.273189 00:06
1 0.751516 0.582417 0.311040 00:06
2 0.673795 0.476592 0.455760 00:06
3 0.600898 0.381470 0.599883 00:06
4 0.537530 0.306719 0.771026 00:06
5 0.490598 0.261897 0.830975 00:06
6 0.445553 0.231296 0.832984 00:06
7 0.401092 0.231449 0.837599 00:05
8 0.368480 0.208900 0.860189 00:06
9 0.340383 0.218216 0.830745 00:06
10 0.316760 0.197272 0.856959 00:05
11 0.297015 0.166001 0.872045 00:05
12 0.278996 0.174514 0.862729 00:06
13 0.262735 0.176559 0.862040 00:05
14 0.248073 0.155698 0.875354 00:06
15 0.238265 0.159138 0.870347 00:06
16 0.228922 0.151915 0.869607 00:05
17 0.224587 0.150929 0.876862 00:05
18 0.220448 0.150267 0.872779 00:05
19 0.215704 0.150309 0.872779 00:05

Training using Pytorch Lightning

class LightModel(model_type.lightning.ModelAdapter):
    def configure_optimizers(self):
        return SGD(self.parameters(), lr=1e-4)

light_model = LightModel(model, metrics=metrics)
trainer = pl.Trainer(max_epochs=20, gpus=1), train_dl, valid_dl)

Using the model - inference and showing results

The first step in reviewing the model is to show results from the validation dataset. This is easy to do with the show_results function.

model_type.show_results(model, valid_ds, detection_threshold=.5)



Sometimes you want to have more control than show_results provides. You can construct an inference dataloader using infer_dl from any IceVision dataset and pass this to predict_dl and use show_preds to look at the predictions.

A prediction is returned as a dict with keys: scores, labels, bboxes, and possibly masks.

Prediction functions that take a detection_threshold argument will only return the predictions whose score is above the threshold.

Prediction functions that take a keep_images argument will only return the (tensor representation of the) image when it is True. In interactive environments, such as a notebook, it is helpful to see the image with bounding boxes and labels applied. In a deployment context, however, it is typically more useful (and efficient) to return the bounding boxes by themselves.

NOTE: For a more detailed look at inference check out the inference tutorial

infer_dl = model_type.infer_dl(valid_ds, batch_size=4, shuffle=False)
preds = model_type.predict_from_dl(model, infer_dl, keep_images=True)


Happy Learning!

If you need any assistance, feel free to join our forum.