Data labeling, also referred to as data annotation, is required for a variety of use cases, including computer vision, natural language processing, and speech recognition. The goal of data labeling is to provide data that is "marked up, or annotated, to show the target, which is the answer you want your machine learning model to predict." For example, in computer vision for autonomous vehicles, labeled data might include tagged street signs, pedestrians, or other vehicles. While unsupervised machine learning models (e.g., anomaly detection models) do not rely on annotated data, supervised or semi-supervised "human-in-the-loop" (HITL) models are used in a variety of commercial applications, ranging from autonomous vehicles to facial recognition.
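As an illustrative sketch (the field names and values below are invented for this example, not taken from any particular labeling tool), a single labeled image for such a computer-vision task might be represented as:

```python
# Hypothetical annotation record for one driving-scene image.
labeled_example = {
    "image": "frames/000123.jpg",
    "annotations": [
        # Each box is (x_min, y_min, x_max, y_max) in pixels, paired with
        # the target class the supervised model should learn to predict.
        {"bbox": (412, 88, 460, 140), "label": "street_sign"},
        {"bbox": (105, 200, 180, 390), "label": "pedestrian"},
        {"bbox": (520, 210, 790, 360), "label": "vehicle"},
    ],
}

# An unlabeled image, as used by unsupervised methods, would carry only
# the "image" field; the "label" values above are the prediction targets.
```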
The process of data labeling can be completed as a time-intensive manual process, or it can be automated to varying degrees using software. Data labeling as a service arises out of the need for companies in a variety of industries to develop large sets of training data for artificial intelligence or machine learning models. In 2019, The Economist referred to “tagged” or labeled data as the “feedstock” for machine learning algorithms. Data labeling can take up to 25% of the total time required to complete a machine learning project.
Over time, data labeling services have had to raise their standards to avoid "garbage in, garbage out" data. According to Henry Jia, general manager of Testin's data service, "In 2015 and 2016, AI companies could build a fine AI prototype solution based on open-sourced datasets or some publicly available data on the Internet to get funding. But if they really want to implement algorithms in real-world scenarios, they have to push the envelope of data quality."
According to Cognilytica, an industry research firm covering machine learning and other cognitive technologies, the market for data labeling was $1.5 billion in 2019 and is expected to grow to $3.5 billion by the end of 2024, with much of that growth driven by domain-specific data labeling tasks.
Uber acquired Mighty AI on June 25, 2019, in an effort to improve its self-driving algorithms. Scale AI's customers include many other self-driving and general transport companies, including Waymo, Lyft, Zoox, Cruise, and the Toyota Research Institute. Waymo, Argo AI, and Lyft have also open-sourced their self-driving datasets. A "high-quality" vehicle dataset includes the following (a sketch of one such per-frame record appears after the list):
- Pixel-wise semantic annotation
- 3D semantic annotation
- Pixel-wise object instance annotation
- Fine-grained road segmentation
- Moving object trajectory
- High-precision GPS/IMU information, etc.
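For illustration only, a single frame in such a dataset might be stored as a record along the following lines; the field names and shapes are assumptions made for this sketch and do not correspond to the schema of any specific open self-driving dataset:

```python
import numpy as np

H, W = 1080, 1920  # image height and width in pixels (assumed)

frame = {
    # Pixel-wise semantic annotation: one class ID per pixel.
    "semantic_mask": np.zeros((H, W), dtype=np.uint8),
    # Pixel-wise object instance annotation: one instance ID per pixel.
    "instance_mask": np.zeros((H, W), dtype=np.uint16),
    # 3D semantic annotation: lidar points with a class label per point.
    "points_3d": np.zeros((0, 3), dtype=np.float32),
    "point_labels": np.zeros((0,), dtype=np.uint8),
    # Fine-grained road segmentation: lane/road-surface classes per pixel.
    "road_mask": np.zeros((H, W), dtype=np.uint8),
    # Moving object trajectories: track ID -> list of (t, x, y) samples.
    "trajectories": {17: [(0.0, 12.4, 3.1), (0.1, 12.9, 3.2)]},
    # High-precision GPS/IMU readings for the ego vehicle.
    "ego_pose": {"lat": 37.7749, "lon": -122.4194,
                 "imu_accel": (0.01, -0.02, 9.81)},
}
```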
On January 29, 2019, IBM announced the release of a dataset containing millions of face images intended to be representative of the real world. IBM sourced the images in partnership with Flickr.
Data labeling for natural language processing (NLP) is often used to perform sentiment analysis, for example in customer-service or marketing applications.
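A minimal sketch of this workflow, assuming invented customer-service texts and labels and an off-the-shelf scikit-learn classifier, might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented customer-service messages, each annotated with a sentiment label.
labeled_tickets = [
    ("The replacement part arrived quickly, thank you!", "positive"),
    ("I've been on hold for an hour and still no answer.", "negative"),
    ("Order received, no issues so far.", "neutral"),
]
texts, labels = zip(*labeled_tickets)

# A standard supervised pipeline learns to predict the label from the text.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["Great support, problem solved."]))
```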
Other applications of labeled data include:
- Precision Agriculture (computer vision application)
- Micromobility (computer vision application)
More recently, the use of synthetic data has supplemented the data labeling process. Synthetic data is “generated through computer programs, instead of being composed through the documentation of real-world events”.
Synthetically generated datasets can also be used to train machine learning models, particularly in computer vision. Synthetic data may augment real datasets to cover areas of the data distribution that are underrepresented, helping to alleviate dataset bias, and it may also be useful when real data is impossible or prohibitively difficult to acquire due to privacy or legal issues. Synthetic data has been used, in the form of driving simulations, to train Google’s Waymo, and Facebook was reported to use synthetic data to train algorithms to detect bullying language.
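A toy sketch of why synthetic data can reduce labeling effort: because the generator knows the ground truth by construction, each example arrives pre-labeled. The rendering here is a trivial NumPy stand-in for the simulators or game engines used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_example(h=128, w=128):
    """Render a blank scene with one bright square; the bounding box and
    class are known by construction, so no human annotation is required."""
    img = np.zeros((h, w), dtype=np.float32)
    size = int(rng.integers(10, 30))
    y = int(rng.integers(0, h - size))
    x = int(rng.integers(0, w - size))
    img[y:y + size, x:x + size] = 1.0
    return img, {"bbox": (x, y, x + size, y + size), "label": "square"}

# Synthetic examples can be mixed with real labeled data to cover
# under-represented parts of the distribution (e.g., rare scenes).
synthetic_set = [synthetic_example() for _ in range(1000)]
```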
A market has also emerged, adjacent to the data labeling market, that aims to ensure proper oversight of models and to reduce bias in large datasets. This is part of the Ethical AI movement, which encourages the proactive embedding of diversity and inclusion principles into the AI lifecycle and aims to ensure the transparency of AI systems.