Data Labeling for
Machine Learning

Will this guide help you?

This guide will help you if you want to:

  • Understand labeling concepts for real-world application
  • Label loads of data without a quality compromise
  • Explore quality assurance for your data labeling operations
  • Learn about data security in data labeling
  • Comply with regulations like GDPR

Irrespective of your intent, this guide will help you gain a better understanding of data labeling concepts and overcome the challenges involved in the process. If you are an ML engineer, this guide will come in handy for your next ML initiative.

Data labeling in machine learning

It is a well-established fact that successful AI systems are a function of high-quality datasets. And so it’s no surprise that an estimated 80% of the time spent on ML initiatives is dedicated to data preparation activities such as data identification/collection, data cleaning, data aggregation, data augmentation, and, most importantly, data labeling.

Take a look at a typical machine learning project lifecycle:

Data labeling takes up a significant portion of an ML engineer's time in the entire project lifecycle. When you’re building a machine learning model, training, validation, and testing datasets become an integral part of the data preparation and processing stage. Data labeling or data annotation is the process of creating these datasets for the successful implementation of AI.
It is important to ensure that the data labeling process does not interfere with the end goal and so we will look at strategies to make data labeling easier later in this guide.
But first, let’s start with the basics.

What is data labeling or data annotation?

Data labeling and data annotation are commonly used interchangeably to describe the process of creating ground truth datasets for training, validating, and testing machine learning algorithms.

Supervised machine learning models use large amounts of annotated/labeled data to understand and process data for desired outcomes.

Data labeling includes activities like tagging, annotating, classifying, moderating, transcribing, or processing data for machine learning algorithms.

Common terminology used in data labeling

Training Datasets

Training data is the dataset you use to train your algorithm or model so it can accurately predict the outcome. Training data is tagged, classified, and enriched with annotations, meta tags, transcriptions, etc. to train the algorithms to recognize repetitive patterns in the data.

Validation Datasets

Validation data is used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.

Testing Datasets

Test data is used to measure the accuracy and efficiency of the algorithm used to train the machine — to see how well it can predict new answers based on its training.

You can learn more about the process of building good training, validation, and testing datasets on the Playment blog.

Human In The Loop

Human in the loop refers to the collaboration between human and machine intelligence to create machine learning models. In human-in-the-loop processes, people are involved in various stages to ensure the ML algorithms are producing desired outcomes.

Although we are seeing an increase in automation efforts for data labeling, the process still heavily relies on human-in-the-loop configurations.

Read more about AI and Human-in-the-loop debates on the Playment blog.

Ground Truth Datasets

Ground truth datasets are used to check the results of machine learning algorithms for accuracy against real-world scenarios. In essence, ground truth represents the absolute truth for machine learning models.

Steps involved in the data labeling process

Here’s a data labeling reality check!

Most teams starting out in ML tend to underestimate the amount of work that goes into data labeling. Data labeling operations are complex and involve a lot more than just hiring annotators and creating data annotations. Here’s an overview of the activities that are often overlooked.
[Image: What data labeling appears to be vs. what it involves]

The steps involved in data labeling will help you understand the intricacies of the process so you can tackle the activity with zero hassle.

[Image: Cuboid annotation tool]

1. Data preparation

a)   Defining labeling requirements based on your problem statement

  • The first step is to identify the type of annotations required for your ML model. For example, some models may require a combination of 2D, 3D, and linked datasets.
  • Next, you must identify the objects/features to be labeled in your dataset. For example, cars, people, road signs, etc., for an autonomous driving use case.
  • Consider all the conditions/scenarios you want your model to identify. For example, snowy scenes or night scenes for autonomous vehicles.
  • Decide the size of the dataset you will need to train your model. There’s no magic number, but the general rule of thumb is that your model should train on data at least an order of magnitude more than the trainable parameters.
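
As a rough illustration of this rule of thumb, here is a minimal Python sketch (assuming PyTorch) that estimates a lower bound on dataset size from a model's trainable parameter count; the factor of 10 and the example model are illustrative assumptions only.

```python
# Minimal sketch (PyTorch assumed): estimate a lower bound on dataset size
# from the "order of magnitude more than trainable parameters" rule of thumb.
import torch.nn as nn

def recommended_dataset_size(model: nn.Module, factor: int = 10) -> int:
    """Return a rough lower bound on the number of labeled examples."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return factor * trainable

# Hypothetical example: a small classifier head with 512*10 + 10 = 5,130 parameters
head = nn.Linear(512, 10)
print(recommended_dataset_size(head))  # -> 51300 (rule-of-thumb estimate only)
```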

b)   Data collection and data selection

  • If datasets are not readily available, then you might want to start collecting relevant data for your machine learning model.
  • Once the data is collected, you can use any of the data splitting methods compatible with your problem statement to create training, validation, and testing datasets.
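
As a minimal illustration of the splitting step, here is a sketch using scikit-learn's train_test_split; the 80/10/10 ratio is an assumption rather than a recommendation, and stratified or time-based splits may suit your data better.

```python
# Minimal sketch of a random train/validation/test split using scikit-learn.
# The 80/10/10 ratio is only an example; pick splits that suit your problem.
from sklearn.model_selection import train_test_split

def split_dataset(samples, labels, seed=42):
    # First carve out 20% of the data, then split that half-and-half
    # into validation and test sets (80/10/10 overall).
    X_train, X_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.2, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```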
Read more about prepping for building good training datasets on the Playment Blog.

2. Setting up labeling pipeline

a)   Define labeling guidelines for your dataset

  • Create simple and objective labeling instructions so annotators can understand the labeling requirements.

b)  Choose annotation tools

  • Your annotation tool must support the annotation types required for your model, integrate easily into your data pipeline, and be easy for annotators to use. You can refer to effective annotation tool strategies in the next section of this guide.

c)   Configure annotation tools

  • Once you have an annotation tool, you have to set up annotation projects with class and attribute configurations, workflows, data visualization [images, 3D point clouds, etc.], and quality parameters to be met by annotators.
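
To make the configuration step concrete, here is a hypothetical project configuration expressed as a Python dictionary; the field names and values are illustrative assumptions, not the schema of any particular annotation tool.

```python
# Hypothetical example of an annotation project configuration: class and
# attribute definitions, workflow stages, and quality thresholds.
project_config = {
    "data_type": "image",                      # or "3d_point_cloud", "video"
    "annotation_type": "bounding_box",
    "classes": [
        {"name": "car",        "attributes": ["occluded", "truncated"]},
        {"name": "pedestrian", "attributes": ["occluded"]},
        {"name": "road_sign",  "attributes": ["sign_type"]},
    ],
    "workflow": ["annotate", "review", "quality_check"],
    "quality": {"min_iou": 0.90, "max_missed_objects_per_frame": 0},
}
```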

d)   Hire skilled workforce

  • The next step is to hire annotators or outsource your annotation projects because an ML engineer is better off building models rather than labeling large datasets. You can read more about effective workforce strategies in the next section of this guide.

e)   Train workforce

  • Training annotators with relevant training material is important because it directly affects the quality of annotations and your model performances. Selecting annotators based on proficiency is a must if you want to build high-quality datasets.
At Playment, we have developed a secure annotator training mechanism as follows:
High-Threshold Annotator Training and Qualification
We’ve created a secure, highly effective annotator training and qualification program to help annotators execute unique and complex annotation requirements for our global client base.
  • Data/Annotation Requirement Analysis: Our expert project managers conduct a thorough study of your data and requirements to fill any gaps in the project.
  • Gold Standard Task Creation: Labeling tasks are first performed by expert in-house annotators, and high benchmark scores are set for qualifying other annotators.
  • Customised Training Model: Based on the complexities of the project, the managers then create customised training materials to bring annotators up to speed.
  • Accuracy-Based Qualifications: Solved tasks are reviewed by checkers and SMEs for consensus. Annotators scoring above 90% are selected for the project (a minimal scoring sketch follows below).
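
Here is a minimal sketch of the accuracy-based qualification idea; `is_correct` is a placeholder for whatever comparison suits your annotation type (for example, an IoU threshold for boxes), and the 90% cutoff follows the description above.

```python
# Minimal sketch of accuracy-based annotator qualification against
# gold-standard tasks. "is_correct" is a placeholder comparison function.
def qualification_score(annotator_answers, gold_answers, is_correct) -> float:
    correct = sum(
        1 for pred, gold in zip(annotator_answers, gold_answers) if is_correct(pred, gold)
    )
    return correct / len(gold_answers)

def qualifies(annotator_answers, gold_answers, is_correct, threshold=0.90) -> bool:
    # Annotators scoring above the threshold are admitted to the project.
    return qualification_score(annotator_answers, gold_answers, is_correct) >= threshold
```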

f)   Identifying quality parameters

  • There’s no one-size-fits-all approach when it comes to data labeling quality parameters. Defining the right quality metrics for different annotations at the beginning of the process is very important to accurately measure the quality of a particular dataset in quantifiable terms.
You can refer to our quality assurance page to get an in-depth understanding of quality frameworks for different types of annotations.

3.  Labeling process + project management

a)   Creating annotations as per guidelines

  • The next step is actually getting down to manually labeling your datasets using annotation tools. Automation is a huge advantage here for speeding up throughput and improving the accuracy of your datasets.

b)  Checking/correcting annotations

  • Streamlining both manual and automated checks during the labeling process helps improve the accuracy of the annotations.

c)   Sampling for quality check

  • Manually checking all the parameters of all the annotations present in a dataset when you are dealing with millions of annotations becomes prohibitively expensive and time-consuming. Therefore, creating a statistically significant random sample that adequately represents the dataset will simplify the QC process.
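
One common way to size such a sample is Cochran's formula with a finite population correction; the sketch below assumes a 95% confidence level and a 5% margin of error, both of which you should adjust to your own quality requirements.

```python
# Minimal sketch of QC sample sizing: Cochran's formula with a finite
# population correction. z=1.96 corresponds to roughly 95% confidence.
import math

def qc_sample_size(population: int, confidence_z: float = 1.96,
                   margin_of_error: float = 0.05, p: float = 0.5) -> int:
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return math.ceil(n)

print(qc_sample_size(1_000_000))  # ~385 annotations to review for 1M annotations
```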

d)   Quantification of quality

  • Different types of annotations require different quantification metrics based on predetermined parameters set up during the labeling project setup stage.
You can refer to our quality assurance page to get an in-depth understanding of the quality process and protocols we follow to ensure high quality outputs.

e)   Project Management

  • Handling edge cases: Policies and guidelines must be updated to support edge cases in the dataset for more accurate results.
  • Workforce and task management: This includes keeping tabs on annotator productivity and accuracy, monitoring project progress, and removing blockers for faster throughput.

4.  Data transformation

The next step involves converting data into the formats necessary for model ingestion. This concludes the labeling process. The datasets are then used to train and test the model. Based on the results, the entire labeling process is repeated or revamped to include additions and upgrades.
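
As an example of this transformation step, here is a minimal sketch that writes annotations into a COCO-style detection JSON; the internal `images`/`annotations` layout is a hypothetical assumption, while the output field names follow the public COCO format.

```python
# Minimal sketch: transform internal annotations into a COCO-style JSON file.
import json

def to_coco(images, annotations, categories, out_path="labels_coco.json"):
    coco = {
        "images": [{"id": i["id"], "file_name": i["file_name"],
                    "width": i["width"], "height": i["height"]} for i in images],
        "annotations": [{"id": idx, "image_id": a["image_id"],
                         "category_id": a["category_id"],
                         "bbox": a["bbox"],            # [x, y, width, height]
                         "area": a["bbox"][2] * a["bbox"][3],
                         "iscrowd": 0} for idx, a in enumerate(annotations)],
        "categories": [{"id": c["id"], "name": c["name"]} for c in categories],
    }
    with open(out_path, "w") as f:
        json.dump(coco, f)
```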

Five main components of data labeling — a handy checklist to evaluate data labeling partners

[Image: Cuboid annotation tool]

1. Annotation Tools

What a paintbrush is to an artist, annotation tools are to any data labeling process. They are an essential prerequisite. Annotations help give meaning to raw, unstructured data. You need to either build these tools yourself or buy them from a third party. Based on where you are in your data labeling journey, you might have different requirements.

How to choose data labeling tools?

a)   Support for required annotation types

Your annotation tool must support the annotation types your model needs, for example 2D, 3D, or linked annotations, along with the class and attribute configurations required for your project.

b)   Ease of integration

As your data labeling process grows, you will develop integrations with annotation tools in various parts of the pipeline. Choosing a tool that uses modern web technologies and provides clean integrations will make your life much easier. It might sound like the obvious thing to do, but we've seen hard drives shipped across the Atlantic just to get data labeled (no judgments made).

c) Ease of use for annotators

Annotators spend a huge part of their working day on annotation tools. Hence, the importance of the tools' user experience can't be emphasized enough. A tool where making an annotation is a quick and snappy process helps you build your datasets in less time and with fewer people.

Levels of Automation

  • Level 0: No Automation
    The annotator manually performs all the labeling tasks.
  • Level 1: Tool Assistance
    Human annotators are assisted by tools such as superpixel annotation, which labels a group of pixels at a time, and interpolation models, which speed up labeling without intensive manual effort on every single frame or annotation.
  • Level 2: Partly-automated Labeling
    Human annotators are assisted by machine learning models that auto-detect objects of interest, while humans only verify or edit the AI-detected objects (see the sketch after this list).
  • Level 3: Highly-automated Labeling
    The majority of the data is pre-labeled by machine learning algorithms. The annotators only perform QC and suggest edits.
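
Here is a minimal sketch of the Level 2 loop described above: a detection model pre-labels each frame and humans only verify or edit the proposals. `detector` and `human_review` are hypothetical placeholders for your own model and review tooling.

```python
# Minimal sketch of partly-automated (Level 2) labeling: model proposes,
# humans verify or edit. Placeholder callables stand in for real components.
def partly_automated_labeling(frames, detector, human_review, confidence_threshold=0.5):
    labeled = []
    for frame in frames:
        # Keep only reasonably confident machine-generated proposals.
        proposals = [d for d in detector(frame) if d["score"] >= confidence_threshold]
        # Humans verify, correct, or reject the proposals for this frame.
        labeled.append(human_review(frame, proposals))
    return labeled
```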

2.  Workforce structure

As your model consumes more data and its accuracy improves, you'll discover new edge cases and your model will have to learn new features. The need for labeled data can grow pretty fast and managing such a process yourself can soon become overwhelming.

When should I scale and hire a data labeling service?

If your most expensive resources like data scientists or engineers are spending 60-70% of their precious time on datasets for training, you’re ready to consider scaling with a data labeling service. Increases in data labeling volume, whether they happen over weeks or months, will become increasingly difficult to manage in-house. They also drain the time and focus of some of your most expensive human resources i.e. data scientists and machine learning engineers. If your data scientist is labeling or wrangling data, you’re paying up to $90 an hour. It’s better to free up such a high-value resource for more strategic and analytical work that will extract business value from your data.

5 steps to scaling data labeling functions

Here's how you can scale your data labeling functions more effectively:

a)   Workforce strength

Determine the required workforce strength based on your data volume.

b)   Workforce elasticity

Allocate and schedule the workforce based on the frequency of labeling.

c)   Workforce hiring

Get annotators on board after working through points (a) and (b).

d)   Workforce quality

Measure annotator productivity in terms of speed and accuracy. Being data-driven can help you optimize your operations better.

e)   Enabling feedback mechanism

Streamline the feedback and review processes with the data labeling teams.

3.  Quality assurance engine

The quality of the labels is the most critical piece when it comes to data labeling. It's a function of the annotation tools, ambiguity in annotation guidelines, the expertise of the workforce, quality assurance workflows, and sometimes the type or diversity of the data itself. Optimizing for the highest quality with available resources is a continuous effort, just like running a high-grade assembly line.

How is quality measured in data labeling?

The first step in assuring quality is measuring it. If you think of data labeling as an assembly line process, quality needs to be measured at each step in the assembly line. This can be done using various methods:

a)   Test questions

The annotator's output can be compared against a curated set of test questions with known, correct annotations. For most geometric annotations required in computer vision, any two annotations can be compared using an IoU score.
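
For reference, here is a minimal IoU sketch for two axis-aligned 2D boxes given as (x_min, y_min, x_max, y_max); real tools compute similar overlap scores for polygons, masks, and cuboids as well.

```python
# Minimal sketch of IoU (intersection over union) between two axis-aligned
# boxes, each given as (x_min, y_min, x_max, y_max).
def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```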

b)   Heuristic checks

Statistical analysis of annotations can flag outliers that creep in due to human misjudgment. For example, when labeling pedestrians in a point cloud, you can't have a pedestrian who is 10 feet tall.
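
A heuristic check of this kind can be as simple as the sketch below, which flags pedestrian cuboids with implausible heights; the bounds and field names are illustrative assumptions.

```python
# Minimal sketch of a heuristic check: flag pedestrian cuboids whose height
# falls outside a plausible human range (bounds are illustrative only).
def flag_implausible_pedestrians(cuboids, min_height_m=0.5, max_height_m=2.3):
    return [c for c in cuboids
            if c["label"] == "pedestrian"
            and not (min_height_m <= c["height_m"] <= max_height_m)]
```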

c)   Quality check (QC) for data samples

A subset of annotations is sampled and can be carefully reviewed by expert annotators. Based on the number of true positives, false positives, and false negatives identified by the reviewers, metrics like precision and recall can be calculated.
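
These metrics reduce to a small calculation once reviewers have counted true positives, false positives, and false negatives, as in the sketch below.

```python
# Minimal sketch of precision and recall from reviewer counts of
# true positives (TP), false positives (FP), and false negatives (FN).
def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall(tp=95, fp=5, fn=10))  # -> (0.95, ~0.905)
```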

Measuring quality will help you find out if the labeled data is good enough to be fed into your models. It'll also help you catch the type of errors that are commonly being made so that you can provide relevant feedback to your annotators. This feedback loop trains the annotators in a way that's not so different from the models that you are training.
Learn more about creating a successful quality framework for your data pipeline in our quality guides.

4.  Labeling cost

Budgeting for labeling projects can get complex because of high variance: no two projects are quite alike. Slight variations in data type, annotation type, number of classes, quality parameters, speed or volume of data, and ease of automation can influence pricing drastically.

4 Critical Price Considerations For Data Labeling

a)   Project Duration

Is it a one-time project or a long-term recurring project?

b)   Quality, Turn-Around Time, Cost

Rank these in order of importance for your project, as at least one might have to be compromised.

c)   Pricing Model

Evaluate if paying per hour or paying per annotation works better for you.
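
A rough back-of-the-envelope comparison can help here; all numbers in the sketch below are hypothetical and should be replaced with the quotes you actually receive.

```python
# Rough comparison of per-hour vs per-annotation pricing (hypothetical numbers).
def per_hour_cost(total_annotations, annotations_per_hour, hourly_rate):
    return (total_annotations / annotations_per_hour) * hourly_rate

def per_annotation_cost(total_annotations, price_per_annotation):
    return total_annotations * price_per_annotation

print(per_hour_cost(100_000, annotations_per_hour=200, hourly_rate=8))   # 4000.0
print(per_annotation_cost(100_000, price_per_annotation=0.05))           # 5000.0
```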

d)   Internal Costs

Which parts of the data labeling process do you want to carry out internally, and which do you want to outsource? Anything you do internally also incurs significant costs.
Calculate a realistic labeling budget using Playment’s Labeling Cost Calculator.

5. Security

If data security is a factor in your machine learning process, your data labeling service must have a facility where the work can be done securely, as well as the right training, policies, and processes in place.

3 Aspects of Data Security

Most importantly, your data labeling service must respect data the way you and your organization do. They should also have a documented data security approach in all three of these areas:

a)   People and Workforce

This could include background checks for annotators and may require you to sign a non-disclosure agreement (NDA) or similar document outlining your data security requirements. The workforce could be managed or measured for compliance. It may also include customized annotator training on security protocols related to your data.

b)   Technology and Network

Annotators may be required to turn off devices they bring into the workplace, such as mobile phones or tablets. Download or storage features may be disabled on devices annotators use to label data. There's likely to be significantly enhanced network security via restricted IP ranges, whitelisting select IPs, etc.

c)   Workplace and Physical Security

Annotators may sit in a space that blocks others from viewing their work. They may work in a secure location, with badged access that allows only authorized personnel to enter the building or room where data is being labeled. Video monitoring may be used to enhance physical security for the building and the room where work is done.

GDPR Compliance Requirements

To comply with GDPR, data collected in the EU can be sent outside the EU only if all personally identifiable information is removed. For visual data, this can be done by blurring out identifiers like faces, vehicle number plates, etc. Thankfully, this isn't as tedious as it sounds since there exist solutions that can automatically take care of anonymizing such data.
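
As a minimal sketch of this kind of anonymization (assuming OpenCV), the snippet below blurs detected face regions in an image; production pipelines typically use stronger detectors and also handle identifiers such as vehicle number plates.

```python
# Minimal sketch (OpenCV assumed): blur detected faces in an image to remove
# personally identifiable information before sending data for labeling.
import cv2

def blur_faces(image_path: str, out_path: str) -> None:
    image = cv2.imread(image_path)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        # Replace each detected face region with a heavily blurred version.
        image[y:y + h, x:x + w] = cv2.GaussianBlur(image[y:y + h, x:x + w], (51, 51), 0)
    cv2.imwrite(out_path, image)
```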

Security concerns shouldn’t stop you from using a data labeling service that will free up you and your team to focus on the most innovative and strategic part of machine learning: model training, tuning, and algorithm development.

Noteworthy strategies to consider for a seamless labeling process

Important considerations for your ML project mostly pertain to the data annotation/labeling tools or platform and the annotation workforce.

Build vs Buy
Annotation Tools

Often, when teams consider building tools for internal labeling operations, they don’t factor in long-term maintenance costs and the corresponding scaling expenses required to keep the tool viable in the future. It takes heavy cost and time investments to build a platform that meets the evolving needs of a business.
There are other significant risks to building in-house tools, such as opportunity cost, lack of collaboration interfaces, accumulation of technical debt, quality compromises, and lack of scalability.

You can read more about risks and calculate the cost of building vs buying for free to make an informed choice about your strategy.

Hire vs Outsource
Annotation Workforce

Companies typically use one of the following resources for data labeling:

Employees - They are on your payroll, either full-time or part-time. Their job description may not include data labeling.

Managed teams - You use vetted, trained, and actively managed data labelers [e.g., Playment’s managed workforce]

Contractors - They are temporary or freelance workers.

Crowdsourcing - A third-party platform to access large numbers of workers at once [Playment also provides a crowdsourced network if customers require it.]

There’s no one-size-fits-all solution, but we can offer a tried-and-tested framework that you can take inspiration from to build your own labeling stack.

Here’s a ready-to-use decision framework for your data labeling strategy

Read an in-depth article to understand the framework.

Some critical questions to ask your labeling partner

Image Annotation Tools
  • Does the tool support all popular data formats and annotation types?
  • Does the tool have any features to view annotations and share feedback on the labeling quality?
  • Does the tooling support analytics to quantify the data quality of individual annotations and overall datasets?
  • Do you have any automation features to accelerate labeling tasks?
Quality Assurance
  • How is labeling quality measured: with test questions, heuristic checks, QC on sampled data, or a combination?
  • Which quality metrics (e.g., IoU, precision, recall) are reported, and at what granularity?
  • How are common errors fed back to annotators and into guideline updates?
Scaling Workforce
  • How quickly can the workforce scale up or down as data volumes change?
  • How are annotators trained, qualified, and measured for speed and accuracy?
  • Who handles project management, edge cases, and blockers as the project progresses?
Pricing
  • Is pricing per hour or per annotation, and which works better for my project?
  • How do data type, annotation type, number of classes, quality requirements, and volume affect pricing?
  • Does pricing differ for one-time versus long-term recurring projects?

Playment: The one-stop solution for data labeling

Get started for free with GT Studio

GT Studio is a scalable, web-based data labeling platform designed to empower ML teams. The platform is completely free for a team of 5 users so that ML teams can create labeled data faster to test their ML initiatives.
You can log on to GT Studio and effortlessly label data using mature image/video annotation tools and manage labeling pipelines and workforce with our project management software, all in a single interface. The platform is user-friendly, secure, and easily scalable.

Try GT Studio for free here.

Fully managed labeling solution

Our fully managed labeling solution takes data labeling activities off ML teams' plates. We offer ML-assisted labeling tools, auto-scaling workforce options, and end-to-end project management for all unique ML use cases. Our dedicated project managers pre-plan all the essential requirements for labeling project success.

Our platform also accommodates quick ramp-ups, and we have efficient change management and project scaling protocols in place to scale with your changing annotation requirements.

Contact our ground truth expert for a free pilot setup.