
Image Annotation For Computer Vision — The Ultimate Guide With Data Samples

Merlin Peter
May 18, 2021

Computer vision is widely used across industries that employ AI today. AR/VR, surveillance, medical diagnostics, autonomous driving, retail, e-commerce, construction, and many niche sectors, including government, have explored computer vision over the last decade. ML progress in these fields can largely be attributed to larger data volumes and more accurately labeled datasets.

In supervised machine learning, labeled data forms an important part of training ML models for desired outcomes. The process of creating training data for ML models is known as data labeling. Training data comes in different forms for computer vision applications. Today, we will take a deep dive into image annotations.

What is an image annotation? 

In machine learning and deep learning, annotated images are the most common type of data used for training computer vision models. Image annotations help ML algorithms interpret or see the world around them as humans do. Essentially, images are enriched with information: text tags, identification and classification of objects, metadata, instances, and so on. This is the process of annotating images.

All parameters that are important for the model’s outcomes are predetermined by ML engineers and images are annotated with the required information to train the ML models. For example, if you are training a model to differentiate between cats and dogs, the training data will have to accurately represent many types of cats and dogs, including features that differentiate the two. The model will learn to recognise cats and dogs after processing large amounts of images with annotated information about cats and dogs.

Image annotation is a labor-intensive process because humans have to annotate these images accurately to help build training datasets for the ML algorithms. 

Image annotation functions for computer vision

While there are many different types of image annotations, each serves a particular function in shaping an ML algorithm’s output. There are four primary types of image annotations used to train computer vision models.

We’ll explain the four primary annotations using the following example: 

Image Annotations For Computer Vision


Image Classification

Image classification aims at classifying images into broad categories based on their contents. The goal is simply to identify which objects and other properties exist in an image. It’s mostly used in supervised learning to help algorithms identify the presence of similar objects in images across an entire dataset, with each image carrying just one label. It is especially useful for abstract information like scene detection, weather, or time of day. In the above example, the annotator assigns only one tag, ‘Animals’, to the entire image. This is also sometimes referred to as image tagging.
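As a rough illustration, image-level labels can be stored as one record per image. The file names, the label set, and the helper names below are hypothetical and only meant to show the shape of the data, not how any particular tool stores it.

```python
from collections import Counter

# Hypothetical label set and file names, for illustration only.
ALLOWED_LABELS = {"Animals", "Vehicles", "Landscape"}

annotations = [
    {"image": "img_001.jpg", "label": "Animals"},
    {"image": "img_002.jpg", "label": "Landscape"},
    {"image": "img_003.jpg", "label": "Animals"},
]


def invalid_records(records, allowed):
    """Return the records whose single label is not in the allowed set."""
    return [r for r in records if r["label"] not in allowed]


def class_distribution(records):
    """Count how many images carry each label."""
    return Counter(r["label"] for r in records)


print(invalid_records(annotations, ALLOWED_LABELS))
print(class_distribution(annotations))
```

Checking the class distribution early is a cheap way to spot a dataset that is heavily skewed towards one category.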

Object Detection/Identification 

Individual objects of interest in an image are tagged with specific tags to help the algorithms interpret information with higher levels of granularity. The goal here is to go one step further to find the position/location and number of individual objects in the image. The bounding box is the most basic annotation used to identify objects in an image. For more complex use cases other annotations like cuboids, polygons, lines, splines, and landmarks are also used to identify objects and their additional attributes based on the nature of the problem statement. In the example above, the annotator has used bounding boxes to identify the dog and the cat in the image. 
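To make the difference from classification concrete, here is a minimal sketch of what a detection annotation for one image might look like. The record layout, the file name, and the pixel coordinates are assumptions made for illustration; real tools and datasets each define their own formats.

```python
from collections import Counter

# Hypothetical annotation for one image. Boxes are assumed to be
# (x_min, y_min, x_max, y_max) in pixels.
annotation = {
    "image": "pets.jpg",
    "objects": [
        {"label": "Dog", "box": (40, 60, 220, 300)},
        {"label": "Cat", "box": (250, 120, 400, 310)},
    ],
}


def count_objects(record):
    """Number of annotated objects per class in one image."""
    return Counter(obj["label"] for obj in record["objects"])


def box_centre(box):
    """Centre point of a box, a simple stand-in for the object's location."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


print(count_objects(annotation))
for obj in annotation["objects"]:
    print(obj["label"], box_centre(obj["box"]))
```

Unlike the single image-level tag, each object here carries its own label and position, so both the number and the location of objects can be read from the record.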


Segmentation

A more advanced application of image annotation is segmentation. To put it simply, pixel-wise labeling of an image is known as segmentation. The goal here is to recognize and understand what an image contains at the pixel level. Every pixel in an image is assigned to exactly one class, as opposed to object detection, where the bounding boxes of objects can overlap.

Semantic segmentation helps specify the shape, size, and form of the objects in addition to their location and presence. It’s used in cases that require more specificity, where a model needs to know definitively whether or not an image contains the object of interest and also what isn’t an object of interest. In the above example, all the green pixels indicate the Dog and the pink pixels indicate the Cat.
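A pixel-wise annotation of this kind can be pictured as a grid with one class id per pixel. The tiny 4x4 mask and the class ids below are made up for illustration.

```python
from collections import Counter

# Hypothetical class ids; real projects define their own.
CLASS_NAMES = {0: "Background", 1: "Dog", 2: "Cat"}

# A toy 4x4 mask: each entry is the class id of one pixel.
mask = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
]


def pixel_counts(mask, names):
    """Count how many pixels belong to each class."""
    counts = Counter(value for row in mask for value in row)
    return {names[class_id]: counts[class_id] for class_id in sorted(counts)}


print(pixel_counts(mask, CLASS_NAMES))
```

Note that the mask says which pixels are Dog or Cat, but not how many dogs or cats there are: two dogs standing side by side would share the same class id.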

Instance segmentation tracks the presence, location, number, size, and shape of objects in an image. The goal here is to understand the image more accurately with every pixel. In the above example, the pixels related to the dog are tagged as ‘Dog 1’ and those that belong to the cat are tagged as ‘Cat 1’. If there were more dogs and cats in the image, then they would be segmented in similar colours and tagged consecutively. 
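One way to picture an instance-level annotation is a grid of instance ids plus a lookup from id to tag. The sketch below uses a made-up 3x5 mask and derives each instance's size and location from it; the layout and names are assumptions for illustration.

```python
# Toy instance mask: 0 means "no object", other values are instance ids.
instance_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 2, 2],
]
# Hypothetical mapping from instance id to the annotator's tag.
instance_labels = {1: "Dog 1", 2: "Cat 1"}


def describe_instances(mask, labels):
    """Pixel area and enclosing box (x_min, y_min, x_max, y_max) per instance."""
    stats = {}
    for y, row in enumerate(mask):
        for x, instance_id in enumerate(row):
            if instance_id == 0:
                continue
            entry = stats.setdefault(
                labels[instance_id], {"area": 0, "box": [x, y, x, y]}
            )
            entry["area"] += 1
            box = entry["box"]
            box[0], box[1] = min(box[0], x), min(box[1], y)
            box[2], box[3] = max(box[2], x), max(box[3], y)
    return stats


for name, info in describe_instances(instance_mask, instance_labels).items():
    print(name, info)
```

Because every object has its own id, presence, number, size, and location all fall out of the same mask.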

Panoptic segmentation blends both semantic and instance segmentation. The background and objects are semantically segmented, and the objects also have instances. This provides granular information for certain ML algorithms.

Take a look at the different types of segmentation in the example below for an autonomous driving use case. 

b) Segmented classes have no instances and are tagged as Car, Building, Road, Sky

c) Segmented classes have instances and are tagged as Car 1, Car 2, Car 3

d) Combination of segmented classes with instances Car 1, Car 2, Car 3, and non-instanced classes like Sky, Road, etc.

Types of Segmentation
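The combined case in (d) can be sketched by pairing a class mask with an instance mask. The toy masks, class ids, and function names below are assumptions for illustration; datasets encode this combination in different ways.

```python
# Hypothetical class ids; only "Car" is treated as an instanced class here.
CLASS_NAMES = {0: "Sky", 1: "Car", 2: "Road"}
INSTANCED_CLASSES = {1}

semantic = [
    [0, 0, 0, 0],
    [1, 1, 2, 1],
    [2, 2, 2, 2],
]
instances = [
    [0, 0, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 0, 0],
]


def combine(semantic, instances, instanced_classes):
    """Pair each pixel's class id with an instance id (0 for non-instanced classes)."""
    combined = []
    for class_row, instance_row in zip(semantic, instances):
        combined.append([
            (cls, inst if cls in instanced_classes else 0)
            for cls, inst in zip(class_row, instance_row)
        ])
    return combined


def segment_names(combined, names):
    """Readable tags such as 'Car 1' for instances and 'Sky' for background classes."""
    found = {pair for row in combined for pair in row}
    return sorted(
        f"{names[cls]} {inst}" if inst else names[cls] for cls, inst in found
    )


print(segment_names(combine(semantic, instances, INSTANCED_CLASSES), CLASS_NAMES))
```

Every pixel ends up with a class, and pixels of instanced classes additionally say which object they belong to.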

Object Tracking

This is very similar to object identification, but the objects are tracked across frames in a given dataset. The goal is to label and plot an object’s movement across multiple frames of video. Features like interpolation make it very easy to label a large number of frames. With this feature, all the objects in 5-6 frames are automatically labeled and the annotator can go over the frames and adjust the annotations where movements in the object are observed. 

Object Tracking Across Frames
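The interpolation idea can be sketched as follows: the annotator labels a box on two keyframes, and the frames in between are filled in. This sketch assumes simple linear interpolation and (x_min, y_min, x_max, y_max) boxes; actual annotation tools may use other schemes.

```python
def interpolate_boxes(start_frame, start_box, end_frame, end_box):
    """Linearly interpolate boxes for the frames strictly between two keyframes."""
    span = end_frame - start_frame
    if span <= 0:
        raise ValueError("end_frame must come after start_frame")
    boxes = {}
    for frame in range(start_frame + 1, end_frame):
        t = (frame - start_frame) / span  # 0 at the first keyframe, 1 at the last
        boxes[frame] = tuple(
            a + t * (b - a) for a, b in zip(start_box, end_box)
        )
    return boxes


# The object moves 20 pixels to the right between frame 0 and frame 4.
for frame, box in interpolate_boxes(0, (0, 0, 10, 10), 4, (20, 0, 30, 10)).items():
    print(frame, box)
```

The annotator then only needs to correct the frames where the object's actual motion departs from the straight-line guess.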

Image Transcription

Transcription is used to annotate text in images or video when there is multimodal information (i.e., images and text) in the data. It is a very rare use case in CV and falls mostly under the NLP domain.

Types of image annotations with data samples from popular use cases across industries

There’s no one-size-fits-all approach when it comes to image annotations. All industries and use cases require different types of annotations depending on the nature and complexity of the problem statement. 

For simplicity, we’ll walk through the different types of image annotations along with examples from different use cases and provide accurate samples for each. 

2D Bounding Boxes 

Bounding boxes are rectangular boxes commonly used for object detection and localisation and, in a few cases, classification. Annotators are tasked with drawing these boxes manually around objects of interest. The box should sit as close to every edge of the object as possible and be labeled correctly with classes and attributes (if any) for an accurate representation of the object.
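One common way to quantify how closely a drawn box fits is intersection over union (IoU) against a reference box. The sketch below assumes boxes in (x_min, y_min, x_max, y_max) form and a handful of made-up outline points; it is an illustration, not the quality check of any particular platform.

```python
def tight_box(outline_points):
    """Smallest axis-aligned box containing every (x, y) point of the outline."""
    xs = [x for x, _ in outline_points]
    ys = [y for _, y in outline_points]
    return (min(xs), min(ys), max(xs), max(ys))


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


reference = tight_box([(0, 4), (10, 5), (5, 0), (6, 10)])  # (0, 0, 10, 10)
loose_box = (0, 0, 20, 10)  # a box drawn with too much slack on one side
print(reference, iou(reference, loose_box))
```

An IoU of 1.0 means the two boxes coincide; the looser the drawn box, the lower the score.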

Sample 1: Drone Imagery Object Detection

2D labeled images for training smart surveillance drones and robots to identify a variety of objects like cars, buildings, trees.

Drone Imagery Object Detection - 2D Bounding Box

Sample 2: Object Localisation For AV

2D bounding boxes to train autonomous driving perception models for classes like pedestrians, traffic/road signs, lane obstacles, etc.

Object Detection & Localisation For AV - 2D Bounding Box

3D Bounding Boxes or Cuboids 

2D bounding boxes only provide information about the width and height of an object, whereas 3D cuboids label the width, height, and approximate depth of an object for a more accurate 3D representation. They allow the machine to distinguish features like the volume and position of an object in three-dimensional space. Human annotators are tasked with drawing a box encapsulating the object of interest and placing anchor points at each of the object’s edges. If one of the object’s edges is out of view or blocked by another object in the image, the annotator approximates where the edge would be based on the size and height of the object and the angle of the image.
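A cuboid can be described compactly and expanded into its eight corner points when needed. The sketch below assumes one common parameterisation (a centre, a length/width/height, and a rotation about the vertical axis); this is an assumption for illustration, and annotation formats differ.

```python
import math


def cuboid_corners(centre, size, yaw):
    """Eight corners of a cuboid given its centre, (length, width, height)
    and a rotation in radians about the vertical (z) axis."""
    cx, cy, cz = centre
    length, width, height = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-width / 2, width / 2):
            for dz in (-height / 2, height / 2):
                # Rotate the horizontal offset, then shift to the centre.
                x = cx + dx * cos_y - dy * sin_y
                y = cy + dx * sin_y + dy * cos_y
                corners.append((x, y, cz + dz))
    return corners


def cuboid_volume(size):
    """Volume from (length, width, height)."""
    length, width, height = size
    return length * width * height


corners = cuboid_corners((0.0, 0.0, 0.0), (4.0, 2.0, 2.0), 0.0)
print(len(corners), cuboid_volume((4.0, 2.0, 2.0)))
```

Volume and position are exactly the extra information a flat 2D box cannot express.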

Sample 1: Object Depth Perception

3D bounding boxes to train robots in high-traffic manufacturing units to detect objects or human movement using 3D cuboids for better navigation/low-risk remodels.

Object Depth Perception - 3D Bounding Box or Cuboids

Sample 2: Spatial Cognition

3D bounding boxes to estimate the spatial distribution of objects to build 3D simulated worlds from 2D data for building AR/VR applications.

Spatial Cognition - 3D Bounding Box or Cuboids


Polygons

Objects come in many sizes and shapes, and it is not always practical to label them using 2D or 3D bounding boxes. Therefore, in cases where the object’s shape and form are irregular, polygon annotations are used instead. Polygon annotations usually require a high level of precision from the labeler. The annotators are required to draw lines by accurately placing dots around the outer edge of the object they want to annotate.
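To see why polygons suit irregular shapes, it helps to compare a polygon's area with the area of the box that would enclose it. The sketch below uses the standard shoelace formula; the triangular example shape is made up.

```python
def polygon_area(vertices):
    """Area of a simple polygon from its ordered (x, y) vertices (shoelace formula)."""
    total = 0.0
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def enclosing_box_area(vertices):
    """Area of the axis-aligned box that would contain the polygon."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


triangle = [(0, 0), (4, 0), (0, 3)]
fill_ratio = polygon_area(triangle) / enclosing_box_area(triangle)
print(polygon_area(triangle), enclosing_box_area(triangle), fill_ratio)
```

Here a bounding box around the triangle would be half background, which is the kind of noise a polygon annotation avoids.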

Sample 1: Object Detection

Accurate detection of masks on humans traveling in public transport for better surveillance models.

Object Detection - Polygons

Sample 2: Object Shape Detection

Polygons to train models to understand underwater object shapes to support marine research.

Object Shape Detection - Polygons

Lines & Splines 

While lines and splines are used for a variety of purposes, they’re mainly used to train machines to recognise lanes and boundaries, especially in the autonomous driving industry. The annotators simply draw lines along the boundaries of areas that the machine must recognise. 
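A spline lets the annotator place a few control points and get a smooth curve between them. The sketch below uses a Catmull-Rom spline as one possible choice; this is an assumption for illustration, and annotation tools may use other curve types.

```python
def catmull_rom_point(p0, p1, p2, p3, t):
    """Point on the Catmull-Rom segment between p1 and p2 for t in [0, 1]."""
    def coordinate(a, b, c, d):
        return 0.5 * (
            2 * b
            + (-a + c) * t
            + (2 * a - 5 * b + 4 * c - d) * t ** 2
            + (-a + 3 * b - 3 * c + d) * t ** 3
        )
    return tuple(coordinate(a, b, c, d) for a, b, c, d in zip(p0, p1, p2, p3))


def sample_spline(points, samples_per_segment=4):
    """Sample a smooth curve running from the second to the second-last control point."""
    curve = []
    for i in range(1, len(points) - 2):
        for step in range(samples_per_segment):
            t = step / samples_per_segment
            curve.append(
                catmull_rom_point(points[i - 1], points[i], points[i + 1], points[i + 2], t)
            )
    curve.append(tuple(points[-2]))
    return curve


# Hypothetical control points along a curving lane boundary.
lane = [(0, 0), (10, 0), (20, 5), (30, 15), (40, 30)]
for point in sample_spline(lane, samples_per_segment=2):
    print(point)
```

In this formulation the first and last control points only shape the curve; the sampled line passes through the points in between.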

Sample 1: Lane Detection

Accurate lane detection in daytime, city scenes and high-traffic areas to help autonomous vehicles understand drivable areas.

Lane Detection - Polylines

Landmarks or Key-points

Key-point and landmark annotation is commonly used to detect small objects and shape variations by creating dots across the image. This is useful for detecting facial features, facial expressions, emotions, human body parts, and poses for applications such as AR/VR, sports analytics, facial recognition, etc. 
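Key-points are usually stored in a fixed order so that a model always knows which dot is which. The sketch below follows the (x, y, visibility) triplet convention used by the COCO keypoint format, where 0 means not labeled, 1 means labeled but not visible, and 2 means labeled and visible. The landmark names and coordinates are hypothetical.

```python
# Hypothetical landmark order for a simple face annotation.
KEYPOINT_NAMES = ["left_eye", "right_eye", "nose", "mouth_left", "mouth_right"]


def to_flat_keypoints(points, names):
    """Flatten named points into [x1, y1, v1, x2, y2, v2, ...] in a fixed order."""
    flat = []
    for name in names:
        if name in points:
            x, y, visible = points[name]
            flat += [x, y, 2 if visible else 1]
        else:
            flat += [0, 0, 0]  # landmark was not labeled at all
    return flat


def count_labeled(flat):
    """Number of landmarks that were labeled, visible or not."""
    return sum(1 for visibility in flat[2::3] if visibility > 0)


face = {
    "left_eye": (30, 40, True),
    "right_eye": (70, 40, True),
    "nose": (50, 60, False),  # position estimated because it is hidden
}
flat = to_flat_keypoints(face, KEYPOINT_NAMES)
print(flat, count_labeled(flat))
```

Keeping a visibility flag lets annotators mark an estimated position for a hidden landmark without claiming it was actually seen.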

Sample 1: Facial Feature Detection

Plotting points accurately to detect facial features, expressions, and emotions for AR/VR applications.

Facial Feature Detection - Landmarks & Keypoints

Sample 2: Driver Drowsiness Detection [ADAS]

Plotting driver movement points accurately to design advanced driver assistance systems.

Driver Drowsiness Detection [ADAS] - Landmarks & Keypoints


Segmentation

Segmentation is used for the granular understanding of images in a variety of industries, and it is especially popular in the autonomous driving industry, as self-driving cars require a deep understanding of their surroundings. The annotator is given the task of separating an image into multiple sections and assigning every pixel in each segment the class label of what it represents. Consider these examples of the different types of segmentation that have already been discussed above:

Semantic Segmentation: Geospatial Application

Full-pixel, non-instanced segmentation is used for training perception models to identify objects of interest from faraway cameras.

Geospatial Application - Semantic Segmentation

Instance Segmentation: Autonomous Driving

Full-pixel instance segmentation is used when information of every pixel is critical and may influence the accuracy of the perception model.

Autonomous Driving - Instance Segmentation

Panoptic Segmentation: Autonomous Driving

Panoptic segmentation is used to individually segment objects of the same class by assigning unique instance IDs to each object of interest, while background classes remain non-instanced.

Autonomous Driving - Panoptic Segmentation

Essentials for creating accurate image annotations 

  1. Collected Raw Data 
  2. Trained Annotators or Workforce
  3. Data Annotation Platform 

Annotation processes begin with collecting or sourcing data that is compatible with the problem statement at hand. It is followed by a number of data preparation tasks like aggregation, augmentation, cleaning, and the most important of all — data labeling. 

While annotators don’t need formal qualifications to build training datasets, they must be adequately trained on the annotation requirements. They must also be adept at using the data annotation platform to keep throughput high for fast-paced ML projects.

This brings us to one of the most important pieces of the puzzle: the data annotation platform. The platform must be easy to use and compatible with different annotation types and complex use cases. The annotators must be able to adapt easily to the platform.

GT Studio: The Data annotation platform that makes complex labeling very simple

GT Studio is a scalable, web-based data labeling platform designed to empower ML teams. The platform is completely free for a team of 5 users so that ML teams can create labeled data faster to test their ML initiatives. 

Most of the open-source tools available in the market don’t make the cut for complex computer vision use cases. From agriculture to autonomous driving, bounding boxes to segmentation, we’ve built annotation tools and have perfected workflows and quality check tools for practically any CV/AV use case you can think of.

With GT Studio, you can leverage:

Try GT Studio for free today... 

Anyone can sign up and start using GT Studio. We have an excellent team to support you through your journey while exploring our platform. 

Feel free to reach us at