We are already quite far along on the road to fully-automated AI systems. Machine learning models are evolving at a rapid pace, handling many practical applications more accurately than ever, and well-functioning ML models rely on training datasets to learn and perform many different real-world actions. In our everyday lives, voice-activated personal assistants like Siri and Alexa help us in so many ways. Behavioural predictive models from Amazon, Netflix, Spotify, YouTube, and other platforms give us recommendations we didn’t know we needed. Uber and Lyft give us near-perfect ETAs during our rides. Tesla and Waymo vehicles have more than a few wow features that have changed mobility systems as we know them. And countless applications work behind the scenes in fields like marketing, robotics, finance, industrial manufacturing, healthcare, e-commerce, retail, insurance, agriculture, and many more. If there’s one thing common across all these successful AI implementations, it is a good training dataset.
It is well documented that machine learning outcomes largely depend on the training datasets used to train the models. And so, good training datasets have become a prerequisite for building production-ready AI systems. But building a good training dataset is far from easy, and is a very important piece of the AI development cycle.
Dataset preparation is, on most days, a DIY project. So here’s everything you need to know about building a good training dataset.
Before we dive into the specifics, what is a training dataset?
To put it simply, training data is the initial data that is used to develop a machine learning model. Training data helps the model create and refine its rules to support desired outcomes. Training data requires some human intervention — for instance, processing data in formats that machine learning models understand.
How is training data used in machine learning?
Training data is commonly used in supervised learning methods of building machine learning models. Here, the human in the loop chooses the data features to be used for the model. Training data is labeled, that is, annotated or enriched to teach the machine how to recognise the outcomes the model is designed to detect.
Training datasets can include text (words and numbers), images, video, or audio. And they are available in many formats, such as CSV, PDF, HTML, or JSON. When labeled accurately, training data can serve as ground truth data for developing an evolving, performant machine learning model.
Training data is used not only to train a model but also to retrain ML models throughout the AI development lifecycle. As real-world conditions evolve, the initial training dataset may not accurately represent ground truth, requiring you to update your training data to reflect those changes and retrain your model.
Steps to build a good training dataset.
Well, there are many different factors that make up the perfect training dataset for a particular machine learning model. Here, we’ve curated the most common steps you can follow to create a framework for curating a training dataset for your next ML initiative.
Step 1: Accurately articulating your problem statement
As a precursor to building a good training dataset, it is important to articulate your problem statement precisely. Knowing what you want to predict will help you decide which data may be more valuable to collect and prepare for your ML model. Conduct data exploration while you’re formulating the problem statement and categorise model requirements into classification, clustering, regression, and ranking so you are clear about what your data must reflect.
Step 2: Checking dataset availability for your problem statement
The bigger organisations have an undue advantage of having large swathes of data available for their ML initiatives. They’ve been hoarding data for a long time now and some even require AWS Snowmobiles [large 45-foot long trucks] to migrate or transport exabyte-scale datasets in and out of the cloud.
But for those entering the landscape now, firstly, there are tons of open-source resources to initiate ML execution. There’s so much data available for early training sessions, and many companies [like Google] are willing to give it away for free.
Secondly, there’s always data that you can collect and label the right way to suit the requirements of your problem statement. Data collection, labeling and handling become very critical for companies that are just starting out with their ML endeavours. And it’s always better to be cautious about this process to avoid affecting your model performance at a later stage.
Step 3: Deciding on the size of the dataset [tentatively]
Firstly, it’s often impossible to accurately determine the size of data for any evolving ML model, as they are constantly trained and retrained for real-world application. But it is important that you have a good sense of the size at the initiation of the development process. Secondly, datasets come in all shapes and sizes, and gauging whether one is right for your ML model will depend entirely on your problem statement.
The size of a dataset is often responsible for poor performances in ML projects. Supervised machine learning models are data-hungry, and their performance relies heavily on the size of training data available. As a rough rule of thumb, your model should train on data at least an order of magnitude more than the trainable parameters. Simple models on large data sets generally beat fancy models on smaller data sets.
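The rule of thumb above can be made concrete with a quick back-of-the-envelope calculation. This is a minimal sketch, with illustrative numbers and a hypothetical linear (softmax) classifier, not figures from any real project:

```python
# Sanity-checking the "order of magnitude more samples than trainable
# parameters" rule of thumb for a simple linear classifier.

def trainable_params_linear(n_features: int, n_classes: int) -> int:
    """Weights plus biases of a linear (softmax) classifier."""
    return n_features * n_classes + n_classes

def min_training_samples(n_params: int, factor: int = 10) -> int:
    """Rule of thumb: at least 10x more samples than parameters."""
    return n_params * factor

params = trainable_params_linear(n_features=100, n_classes=5)
print(params, min_training_samples(params))  # 505 5050
```

For larger models the same arithmetic applies, which is why deep networks with millions of parameters tend to need very large training sets.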
Step 4: Evaluate the quality of the dataset
Having a lot of data amounts to nothing if the quality is bad. It is always better to focus on these important parameters in the beginning stages of an ML project.
Data diversity
Diverse training datasets minimise bias while your model predicts its outcomes. For instance, if you are training a model to recognise cats, don’t just use images of domestic cats. To get more accurate outcomes, include a wide variety of cat images with different attributes: sitting cats, running cats, standing cats, sleeping cats, etc.
Data adequacy and imbalance
Firstly, is your data sufficient for the task you are looking to solve? For example, if you’ve been selling electronics in the US and now plan on branching into Europe, can you use the same data to predict stock and demand?
Secondly, is your data imbalanced? Imagine that you’re trying to filter out suppliers that are unreliable based on a number of attributes like location, size, rating, etc. If your labeled dataset has 1,500 entries labeled as reliable and only 30 that you consider unreliable, the model won’t have enough samples to learn about the unreliable ones.
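A quick way to quantify this is to count the labels and compute an imbalance ratio. The sketch below uses made-up labels mirroring the supplier example above:

```python
from collections import Counter

# Hypothetical labels mirroring the supplier example: 1,500 entries
# labeled "reliable" vs only 30 labeled "unreliable".
labels = ["reliable"] * 1500 + ["unreliable"] * 30

counts = Counter(labels)
majority = max(counts.values())
minority = min(counts.values())
imbalance_ratio = majority / minority  # 50.0 here

print(counts, imbalance_ratio)
```

A ratio this large usually calls for resampling, class weights, or collecting more minority-class examples before training.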
Step 5: Evaluate data reliability
Reliability refers to the degree to which you can trust your data. A model trained on a reliable dataset is more likely to yield useful predictions than a model trained on unreliable data. You can measure reliability by determining:
- Frequency of human errors: If the dataset is labelled by humans, there are bound to be some errors. How frequent are those errors, and how can you correct them? At Playment, we rely on an analytics-based approach to identify and minimise error rates by running the data through multiple human loops till it reaches sufficient accuracy.
- Noisy data features: Some amount of noise is okay. But data that has too many noisy features may affect the outcome of your models.
- Data filters: Is the dataset properly filtered for your problem? For example, should your dataset include search queries from bots? If you’re building a spam-detection system, then likely the answer is yes; but if you’re trying to improve search results for humans, then no.
- Quantity of omitted values: There are ways to deal with omitted values, but you must first determine their quantity to understand the impact on training dataset quality.
- Duplicate data: For instance, the same data records can be duplicated because of a server error, a storage crash, or a cyberattack. Evaluate how such events impacted your data.
- Label accuracy: Wrong data labels and attributes account for huge gaps in model performance. It is important to maintain high precision and recall rates for labeled data.
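Several of the checks above can be automated. Here is a minimal data-quality audit sketch over a list of records; the field names (`supplier`, `rating`) and the 0.0–5.0 rating range are illustrative assumptions:

```python
# A minimal audit counting missing values, exact duplicates, and
# out-of-range values in a list of record dicts.

def audit(records, rating_range=(0.0, 5.0)):
    issues = {"missing": 0, "duplicates": 0, "out_of_range": 0}
    seen = set()
    lo, hi = rating_range
    for r in records:
        if any(v is None for v in r.values()):
            issues["missing"] += 1
        key = tuple(sorted(r.items()))       # record fingerprint
        if key in seen:
            issues["duplicates"] += 1
        seen.add(key)
        rating = r.get("rating")
        if rating is not None and not (lo <= rating <= hi):
            issues["out_of_range"] += 1
    return issues

records = [
    {"supplier": "A", "rating": 4.5},
    {"supplier": "A", "rating": 4.5},   # duplicate record
    {"supplier": "B", "rating": None},  # missing value
    {"supplier": "C", "rating": 5.5},   # outside 0.0-5.0
]
print(audit(records))  # {'missing': 1, 'duplicates': 1, 'out_of_range': 1}
```

In practice you would run checks like these continuously as new data is collected, not just once before training.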
Step 6: Maintain data format consistency
Data format is the file format you’re using, and converting a dataset into the file format that fits your machine learning system best isn’t much of a problem. You must also consider the format consistency of the records themselves. If data is aggregated from different sources, or if your dataset has been manually updated by different people, it’s worth making sure that all variables within a given attribute are consistently written. These may be date formats, sums of money (4.03 or $4.03, or even 4 dollars 3 cents), addresses, etc. The input format should be the same across the entire dataset. Another aspect of data consistency is the numeric range of an attribute: if you’ve set it at 0.0 to 5.0, ensure that there are no 5.5s in the training set.
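Normalising inconsistent records usually comes down to small, targeted conversion functions. The sketch below handles only the money formats mentioned above; real pipelines would need patterns for every variant that appears in the data:

```python
import re

def normalise_money(value: str) -> float:
    """Convert "4.03", "$4.03", or "4 dollars 3 cents" to a float."""
    value = value.strip().lower()
    m = re.match(r"^(\d+)\s*dollars?\s*(\d+)\s*cents?$", value)
    if m:
        return int(m.group(1)) + int(m.group(2)) / 100
    return float(value.lstrip("$"))

print([normalise_money(v) for v in ["4.03", "$4.03", "4 dollars 3 cents"]])
```

Running such normalisers once at ingestion time keeps every downstream consumer of the dataset working with a single canonical format.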
Step 7: Data cleaning -- dealing with missing data
Missing values can tangibly reduce prediction accuracy. In terms of machine learning, assumed or approximated values are “more right” for an algorithm than just missing ones. If the number of missing values in your data is large (above 5%), that’s an area that will require significant effort. There’s no perfect solution to dealing with missing data. The process will depend on certain ‘success’ criteria for different datasets and even for different applications, such as recognition, segmentation, prediction, and classification. The solutions also vary depending on the kind of problem — time-series analysis, ML, regression, etc.
You can either substitute missing values with dummy values, e.g., n/a for categorical or 0 for numerical values [null value replacement], or substitute missing numerical values with mean, median, or mode figures [mean/median/mode value replacement]. You could also try model-based imputation (regression, k-nearest neighbours, etc.), interpolation/extrapolation, forward filling/backward filling, multiple imputation, or deleting the whole record.
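Two of the simplest strategies, mean replacement and forward filling, can be sketched in a few lines over a numeric column where `None` marks a missing value:

```python
# Mean imputation and forward filling for a numeric column with
# missing values represented as None.

def mean_impute(values):
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def forward_fill(values):
    filled, last = [], None
    for v in values:
        last = v if v is not None else last  # carry last seen value
        filled.append(last)
    return filled

column = [10.0, None, 14.0, None, 16.0]
print(mean_impute(column))   # gaps replaced by the mean of 10, 14, 16
print(forward_fill(column))  # [10.0, 10.0, 14.0, 14.0, 16.0]
```

Which strategy is appropriate depends on the data: forward filling suits ordered time-series columns, while mean or median replacement is a common default for unordered numeric attributes.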
Step 8: Reducing data for particular tasks
It’s tempting to include as much data as possible. But if you’re building training datasets for particular tasks, it is important that you understand that more strategic data will give better outcomes than a large dataset with too much unrelated data.
Different approaches to reducing data include:
- Attribute sampling: Once you know what the target attribute (the value you want to predict) is, you can make a fair assumption about the critical values that help you achieve better outcomes. Domain expertise plays a greater role in such cases. For example, not all data scientists will know that asthma can be a contributor to pneumonia complications. That’s why it’s important to have domain experts alongside data scientists when building performant ML models.
- Record sampling: This approach implies that you simply remove records (objects) with missing, erroneous, or less representative values to make prediction more accurate. The technique can also be used in the later stages when you need a model prototype to understand whether a chosen machine learning method yields expected results and estimate ROI of your ML initiative.
- Data aggregation: You can also reduce data by aggregating it into broader records: divide the entire attribute data into groups and record an aggregate figure for each group. This will help reduce data size and computing time without tangible prediction losses.
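Data aggregation can be sketched as bucketing a numeric attribute and keeping only per-group counts. The attribute (order value) and the bucket width of 100 below are illustrative assumptions:

```python
from collections import Counter

def bucket(value, width=100):
    """Map a value to the lower edge of its group, e.g. 130 -> 100."""
    return (value // width) * width

# Hypothetical order values aggregated into 100-wide groups.
orders = [30, 75, 130, 180, 250, 260, 270]
grouped = Counter(bucket(v) for v in orders)
print(dict(grouped))  # {0: 2, 100: 2, 200: 3}
```

The seven individual records collapse into three group counts, which is the size reduction the aggregation step describes.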
Training data vs Testing data vs Validation data — Common data splitting approaches
Data splitting is commonly used in machine learning to divide data into training, testing, and validation sets. This allows you to tune the model hyperparameters and also estimate generalisation performance.
It is important to differentiate between training, validation, and testing data. All are integral to improving and validating machine learning models. Training data is used to teach an algorithm to recognise patterns in a dataset, validation data is used to validate algorithms and parameters, and testing data is used to assess the model’s accuracy.
To put it more accurately, training data is the dataset you use to train your algorithm or model so it can accurately predict the outcome. Validation data is used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. Test data is used to measure the accuracy and efficiency of the algorithm used to train the machine, to see how well it can predict new answers based on its training.
Now that you know what these datasets do, you might be looking for recommendations on how to split your dataset. This mainly depends on two things. First, the total number of samples in your data and second, on the actual model you are training.
Some models need substantial data for training, so you can optimise for a larger training set. Models with very few hyperparameters are easy to validate and tune, so a smaller validation set will work. But if your model has many hyperparameters, it will require a larger validation set as well (also consider cross-validation). And if the model has no hyperparameters, or ones that cannot be easily tuned, you may be able to do without a validation set altogether.
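A common starting point is a 70/15/15 split after a reproducible shuffle. The ratios below are illustrative defaults, not a recommendation for every project:

```python
import random

def split_dataset(samples, train=0.7, val=0.15, seed=42):
    """Shuffle reproducibly, then cut into train/validation/test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # seeded, so splits are repeatable
    n = len(samples)
    n_train = int(n * train)
    n_val = int(n * val)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 700 150 150
```

Fixing the seed matters: if the split changes between runs, validation scores are no longer comparable across experiments.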
Dealing with lack of data? Here are some simple ways to get high-quality training data, customised for your ML use case.
As we already discussed, new data collection is one way of dealing with lack of data availability. Strategic data partnerships are another option to reduce the burden of collating high-quality training datasets. You can also purchase training data that is labeled for the data features you determine are relevant to the machine learning model you are developing. Or you can use your own data and label it yourself using an in-house team, crowdsourced annotators, or a data labeling service. There are also many open-source tools that allow you to label your own data.
GT-Studio is Playment’s web-based data annotation platform that’s designed to help ML teams build their own datasets and manage their data pipelines for multiple projects seamlessly. It includes a combination of high-precision labeling tools, workflow builder, and project management software. The platform is also free for a team of five.
Additionally, we offer fully-managed labeling services for larger and more complex ML datasets. From project setup to exporting accurately labeled data, we take care of the entire process. What’s more, we offer free pilot services. Unlock free data labeling services by contacting us at email@example.com or set up a demo with an expert.