
Video Annotations For Deep Learning — Popular Applications and Examples

Merlin Peter
June 3, 2021

Just like image annotations, video annotations also help machines recognise objects within their surroundings through computer vision. They are very similar to image annotations, the only difference being that the objects are annotated across a series of frames to help determine the motion and movement more accurately for deep learning applications. 

Video annotation has widespread applications across industries, the most common being autonomous vehicles, human activity and pose tracking for sports analytics, and facial emotion recognition.

In this blog, we’ll take a deep dive into video annotations: the different types of annotations, use cases, features that simplify annotating frames, and a free labeling platform for video annotations.

What Is Video Annotation?

Video annotation is the process of enriching video data by adding useful information in the form of tags across every frame. These tags could be bounding boxes, polylines, or even meta tags for classification. Video annotations help machines interpret or see the world around them more accurately.  Video annotation is used to train algorithms for a variety of tasks, from simple classification to the tracking of objects across multiple frames.

All parameters that are important for the model’s outcomes are predetermined by ML engineers and the frames are annotated with the required information to train the ML models. For example, if you are training a model to identify the movement of pedestrians on a crossing, then the algorithm is fed with data that spans across multiple frames which accurately record pedestrian movements on the crossing. 

Types of Video Annotations

There are simple annotations like 2D bounding boxes, 3D cuboids, landmarks, polylines, and polygons that are commonly used to annotate frames in a video. We’ll explain each type along with examples below:

2D Bounding Boxes 

Bounding boxes are rectangular boxes commonly used for object detection and localisation and, in a few cases, classification. Annotators draw these boxes manually around objects of interest in motion across multiple frames. Each box should sit as close to every edge of the object as possible and be labeled correctly with its class and attributes (if any), so that it accurately represents the object and its movement in each frame.

2D Bounding Box Annotation Across Frames
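To make the idea of one box following one object across frames more concrete, here is a minimal Python sketch. The `Box2D` and `Track2D` records, their field names, and the `(x_min, y_min, x_max, y_max)` pixel convention are assumptions made for illustration, not the export format of any particular labeling tool.

```python
from dataclasses import dataclass, field


@dataclass
class Box2D:
    """One bounding box on one frame, in pixel coordinates (assumed convention)."""
    frame: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def is_valid(self, width: int, height: int) -> bool:
        # A usable box has positive area and lies inside the frame.
        return (0 <= self.x_min < self.x_max <= width
                and 0 <= self.y_min < self.y_max <= height)


@dataclass
class Track2D:
    """All boxes for a single object, plus its class label and attributes."""
    object_id: int
    label: str
    attributes: dict = field(default_factory=dict)
    boxes: list = field(default_factory=list)

    def centers(self):
        # Box centres in frame order give a rough picture of the object's motion.
        return [((b.x_min + b.x_max) / 2, (b.y_min + b.y_max) / 2)
                for b in sorted(self.boxes, key=lambda b: b.frame)]


track = Track2D(object_id=1, label="pedestrian", attributes={"occluded": False})
track.boxes.append(Box2D(frame=0, x_min=100, y_min=50, x_max=140, y_max=150))
track.boxes.append(Box2D(frame=1, x_min=104, y_min=50, x_max=144, y_max=150))
print(track.centers())
```

Keeping the class label and attributes on the track rather than on each box is one possible design choice; it avoids repeating the same label on every frame.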

3D Bounding Boxes or Cuboids

3D cuboids label the length, width, and approximate depth of an object in motion for a more accurate 3D representation of the object and how it interacts with its surroundings. They allow the machine to distinguish features like the volume and position of an object in three-dimensional space, along with its movements.

Human annotators are tasked with drawing a box encapsulating the object of interest and placing anchor points at each of the object's edges. If one of the object’s edges is out of view or blocked by another object in the frame, the annotator approximates where the edge would be based on the size and height of the object and the angle of the frame.

3D Bounding Box Annotations Across Frames
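As a rough illustration of how a cuboid relates to its anchor points, the sketch below derives the eight corners from a centre, a size, and a heading angle. Storing a cuboid as centre, size, and yaw is an assumption (a common convention, not something this article specifies), and the function name `cuboid_corners` is hypothetical. The third dimension is called height here.

```python
import math


def cuboid_corners(center, size, yaw):
    """Return the 8 corners of a cuboid rotated by `yaw` radians about the z axis.

    center: (x, y, z) of the cuboid centre
    size:   (length, width, height)
    """
    cx, cy, cz = center
    length, width, height = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dz in (-height / 2, height / 2):
        for dx, dy in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
            # Offset in the object's own frame, then rotate into the scene frame.
            lx, ly = dx * length / 2, dy * width / 2
            corners.append((cx + lx * cos_y - ly * sin_y,
                            cy + lx * sin_y + ly * cos_y,
                            cz + dz))
    return corners


corners = cuboid_corners(center=(10.0, 5.0, 1.0), size=(4.0, 2.0, 1.5), yaw=0.0)
print(len(corners), corners[0])
```

Because the corners are computed from the full size, a corner that is hidden behind another object still gets a position, which mirrors how an annotator approximates an occluded edge.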


Polygons

Polygons are commonly used when 2D or 3D bounding boxes are not sufficient to accurately represent an object in motion or its shape, for instance when the object’s shape and form are irregular. Polygon annotations usually require a high level of precision from the labeller: annotators draw the outline by accurately placing dots around the outer edge of the object they want to annotate.

Polygon Annotations Across Frames


Key Points and Landmarks

Key-point and landmark annotation is commonly used to detect small objects and shape variations by creating dots across the image and lines connecting these dots to form a skeleton of the object of interest across each frame. This is useful for detecting facial features, expressions, emotions, human body parts, and poses for applications such as AR/VR, sports analytics, facial recognition, etc.

Landmarks Across Frames
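The dots-and-lines idea can be sketched in a few lines of Python. The keypoint names and the `SKELETON` edge list below are hypothetical examples for an upper body, not a standard schema.

```python
# Hypothetical skeleton edges for an upper body: pairs of keypoint names.
SKELETON = [("head", "neck"), ("neck", "left_shoulder"), ("neck", "right_shoulder")]


def skeleton_segments(frame_points):
    """Turn one frame's {name: (x, y)} points into drawable line segments.

    Edges with a missing (e.g. occluded) endpoint are skipped.
    """
    return [(frame_points[a], frame_points[b])
            for a, b in SKELETON
            if a in frame_points and b in frame_points]


# Two frames of the same person; the right shoulder is not labelled in frame 1.
frames = {
    0: {"head": (50, 20), "neck": (50, 40),
        "left_shoulder": (35, 45), "right_shoulder": (65, 45)},
    1: {"head": (52, 20), "neck": (52, 40), "left_shoulder": (37, 45)},
}
for index, points in frames.items():
    print(index, skeleton_segments(points))
```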

Lines & Splines 

While lines and splines are used for a variety of purposes, they’re mainly used to train machines to recognise lanes and boundaries, especially in the autonomous driving industry. The annotators simply draw lines along the boundaries of areas that the machine must recognise across frames. 

Popular Applications of Video Annotations

Video annotations have widespread applications in deep learning across industries. Below, we’ll look at a few common use cases that we come across:

2D and 3D Object Detection [Frame-by-Frame Annotation]

The most common purpose of video annotation is to capture the object of interest frame-by-frame, making it recognizable to machines.


Accurate object detection and tracking across frames in an interior set up. Tracked classes include human motions, plants, shelves, chairs, etc.


Accurate detection of suspicious activities in public airports for better surveillance models.


Accurate object detection and tracking across frames in 3D point cloud scenarios. All objects of interest are assigned unique object IDs.

2D and 3D Object Localisation 

Another application of video annotation is in localising the objects in the video. When there are multiple objects visible in a video, localisation locates the main object in an image, meaning the object that is mostly visible or focused in the frame.


Tracking buyer behaviour via 2D annotations. 

2D AV 

Object localisation via 2D bounding boxes to detect movement across frames.

2D and 3D Object Tracking

Another important use of video annotation is to help visual perception AI models to detect and recognise varied categories of objects across frames. 


Accurate object detection and tracking across frames in retail scenarios. Tracked classes include shoppers, store assistants, stationary store objects, price tags, etc.

2D Bounding Boxes For AV 

Enhanced visual intelligence for computer vision models with ML-assisted tools to enable object detection and tracking across multiple frames in a sequence.

3D Bounding Boxes for AV

Accurate object detection using cuboids to detect moving vehicles on a busy road across all frames.

Human Pose-Point Estimation

Another significant application of video annotation is training computer vision models to track human activities and estimate human poses. This is mainly used in sports to track the actions athletes perform during competitions and events. It is also increasingly used in ADAS and DMS applications to improve road safety.

Driver Drowsiness Detection for ADAS

Plotting driver movement points accurately to design advanced driver assistance systems.

Sports Analytics 

Pose point estimation is used to detect human figures and estimate human poses in 2D images and videos.

Frames Classification 

Video frame classification aims at classifying frames into broad categories based on their contents. The goal is simply to identify which objects and other properties exist in a frame. It is especially useful for abstract information like scene detection, weather, or time of day. In the example below, the annotator assigns tags like weather, landscape, etc.

AV Frames Classification

Accurate video frames classification for AV perception models. Time stamps are tagged for attributes like time of day, weather, environment, object/border occlusion, etc.
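One simple way to turn per-frame tags into time-stamped attributes is to merge runs of consecutive frames that share a tag. The sketch below assumes a fixed frame rate and one tag per frame; the function name `tag_segments` and the sample tags are illustrative only.

```python
from itertools import groupby


def tag_segments(frame_tags, fps):
    """Collapse per-frame tags into (start_s, end_s, tag) segments."""
    segments = []
    index = 0
    for tag, run in groupby(frame_tags):
        length = len(list(run))
        # Convert frame indices to seconds using the assumed constant frame rate.
        segments.append((index / fps, (index + length) / fps, tag))
        index += length
    return segments


weather = ["sunny"] * 4 + ["rainy"] * 2
print(tag_segments(weather, fps=2))
```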

Sensor Fusion Object Detection and Tracking

In sensor fusion object detection and tracking, the videos are usually recorded with multiple camera and lidar sensors. We tag, detect, and track objects of interest in both the 2D and 3D scenes and assign each object the same accurate tracking ID in both, across all frames in a video.

Autonomous Vehicles

Accurate object detection and tracking across frames in 2D and 3D scenarios. All objects of interest are assigned unique tracking ids and are linked in both scenes.
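A small consistency check shows what linking by tracking ID looks like in practice. The record layout `(sensor, frame, tracking_id, label)` and the sensor names are assumptions for this sketch. An object can legitimately be visible to only one sensor, so the result is a list to review rather than a list of errors.

```python
from collections import defaultdict

# Hypothetical annotation records: (sensor, frame, tracking_id, label)
annotations = [
    ("camera_front", 0, 7, "car"),
    ("lidar_top", 0, 7, "car"),
    ("camera_front", 1, 7, "car"),
    ("lidar_top", 1, 7, "car"),
    ("lidar_top", 1, 9, "pedestrian"),
]


def unlinked_ids(records, sensors):
    """Return (frame, tracking_id) pairs that are not annotated in every sensor."""
    seen = defaultdict(set)
    for sensor, frame, tracking_id, _label in records:
        seen[(frame, tracking_id)].add(sensor)
    return sorted(key for key, found in seen.items() if found != set(sensors))


print(unlinked_ids(annotations, ["camera_front", "lidar_top"]))
```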

Free Video Annotation Platform — GT Studio 

GT Studio is a scalable, web-based data labeling platform designed to empower ML teams. Most of the open-source tools available in the market don’t make the cut for complex computer vision use cases. From agriculture to autonomous driving, bounding boxes to segmentation, we’ve built annotation tools and have perfected workflows and quality check tools for any CV/AV use case you can practically think of. 

With GT Studio, you can leverage:

Interpolations In GT Studio For Faster Video Labeling 

Video annotation is a labor-intensive process because humans have to annotate each frame accurately to help build training datasets for the ML algorithms. With video, the number of frames can quickly add up to millions, which is why automation becomes indispensable for video data labeling.

Interpolation is a well-known feature that greatly simplifies the process of creating video annotations. We use interpolation methods to label objects in a sequence. With the interpolation feature, the annotator only needs to label an object on every second to fifth frame of a sequence, and the labels on the frames in between are filled in automatically, instead of the annotator labeling the same object on every frame. This drastically reduces the time taken to label videos and sensor fusion sequences.
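The sketch below shows the general idea using plain linear interpolation between manually labelled keyframes. This is an assumption chosen for simplicity; the article does not describe which interpolation method GT Studio uses, and the function name `interpolate_boxes` is hypothetical.

```python
def interpolate_boxes(keyframes):
    """Fill in boxes between manually labelled keyframes by linear interpolation.

    keyframes: {frame_index: (x_min, y_min, x_max, y_max)}
    Returns a dict covering every frame from the first to the last keyframe.
    """
    if not keyframes:
        return {}
    frames = sorted(keyframes)
    result = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = keyframes[start], keyframes[end]
        for frame in range(start, end):
            # t runs from 0 at the first keyframe towards 1 at the next one.
            t = (frame - start) / (end - start)
            result[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    result[frames[-1]] = tuple(float(v) for v in keyframes[frames[-1]])
    return result


# The annotator labels frames 0 and 4 only; frames 1 to 3 are filled in.
keyframes = {0: (100, 50, 140, 150), 4: (120, 50, 160, 150)}
boxes = interpolate_boxes(keyframes)
print(boxes[2])
```

In this example only two of the five frames are labelled by hand. Linear interpolation works best when the object moves smoothly between keyframes; abrupt changes in direction or size would need an extra keyframe.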

Compare the time taken to label videos manually vs using the interpolation feature in GT Studio.

Manual Labeling

Annotators Manually Draw Cuboids For Each Frame (Number of Cars Annotated = 3)


Interpolation Automatically Detects Objects In Multiple Frames Increasing The Number of Cars Annotated From 3 to 6

Try GT Studio for free today... 

Anyone can sign up and start using GT Studio. We have an excellent team to support you through your journey while exploring our platform. 

Feel free to reach us at