Data labeling, also referred to as data annotation, is required for a variety of use cases, including computer vision, natural language processing, and speech recognition. The goal of data labeling is to provide data that is "marked up, or annotated, to show the target, which is the answer you want your machine learning model to predict." For example, in computer vision for autonomous vehicles, labeled data might include tagged street signs, pedestrians, or other vehicles. While unsupervised machine learning models (e.g., anomaly detection models) do not rely on annotated data, supervised or semi-supervised "human-in-the-loop" (HITL) models are used in a variety of commercial applications, ranging from autonomous vehicles to facial recognition.
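As an illustrative sketch (the field names and values below are invented for this example, not taken from any particular labeling tool), a single labeled image for such a computer-vision task might be represented as:

```python
# Hypothetical annotation record for one driving-scene image.
labeled_example = {
    "image": "frames/000123.jpg",
    "annotations": [
        # Each box is (x_min, y_min, x_max, y_max) in pixels, paired with
        # the target class the supervised model should learn to predict.
        {"bbox": (412, 88, 460, 140), "label": "street_sign"},
        {"bbox": (105, 200, 180, 390), "label": "pedestrian"},
        {"bbox": (520, 210, 790, 360), "label": "vehicle"},
    ],
}

# An unlabeled image, as used by unsupervised methods, would carry only
# the "image" field; the "label" values above are the prediction targets.
```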
The process of data labeling can be completed as a time-intensive manual process, or it can be automated to varying degrees using software. Data labeling as a service arises out of the need for companies in a variety of industries to develop large sets of training data for artificial intelligence or machine learning models. In 2019, The Economist referred to “tagged” or labeled data as the “feedstock” for machine learning algorithms. Data labeling can take up to 25% of the total time required to complete a machine learning project.
Over time, data labeling services have had to raise their standards to avoid "garbage in, garbage out" data. According to Henry Jia, general manager of Testin's data service, "In 2015 and 2016, AI companies could build a fine AI prototype solution based on open-sourced datasets or some publicly available data on the Internet to get funding. But if they really want to implement algorithms in real-world scenarios, they have to push the envelope of data quality."
According to Cognilytica, an industry research firm covering machine learning and other cognitive technologies, the market for data labeling was $1.5 billion in 2019 and is expected to grow to $3.5 billion by the end of 2024, with much of that growth driven by domain-specific data labeling tasks.
Uber acquired Mighty AI on June 25, 2019, in an effort to improve its self-driving algorithms. Scale AI's customers include many other self-driving and general transport companies, including Waymo, Lyft, Zoox, Cruise, and the Toyota Research Institute. Waymo, Argo AI, and Lyft have also open-sourced their self-driving datasets. A "high-quality" vehicle dataset includes the following (a sketch of one such per-frame record appears after the list):
- Pixel-wise semantic annotation
- 3D semantic annotation
- Pixel-wise object instance annotation
- Fine-grained road segmentation
- Moving object trajectory
- High-precision GPS/IMU information, etc.
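For illustration only, a single frame in such a dataset might be stored as a record along the following lines; the field names and shapes are assumptions made for this sketch and do not correspond to the schema of any specific open self-driving dataset:

```python
import numpy as np

H, W = 1080, 1920  # image height and width in pixels (assumed)

frame = {
    # Pixel-wise semantic annotation: one class ID per pixel.
    "semantic_mask": np.zeros((H, W), dtype=np.uint8),
    # Pixel-wise object instance annotation: one instance ID per pixel.
    "instance_mask": np.zeros((H, W), dtype=np.uint16),
    # 3D semantic annotation: lidar points with a class label per point.
    "points_3d": np.zeros((0, 3), dtype=np.float32),
    "point_labels": np.zeros((0,), dtype=np.uint8),
    # Fine-grained road segmentation: lane/road-surface classes per pixel.
    "road_mask": np.zeros((H, W), dtype=np.uint8),
    # Moving object trajectories: track ID -> list of (t, x, y) samples.
    "trajectories": {17: [(0.0, 12.4, 3.1), (0.1, 12.9, 3.2)]},
    # High-precision GPS/IMU readings for the ego vehicle.
    "ego_pose": {"lat": 37.7749, "lon": -122.4194,
                 "imu_accel": (0.01, -0.02, 9.81)},
}
```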
On January 29, 2019, IBM announced the release of a dataset containing millions of face images intended to be representative of the real world. IBM sourced the images in partnership with Flickr.
Data labeling for natural language processing (NLP) is often used to perform sentiment analysis, for example in customer-service or marketing applications.
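A minimal sketch of this workflow, assuming invented customer-service texts and labels and an off-the-shelf scikit-learn classifier, might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented customer-service messages, each annotated with a sentiment label.
labeled_tickets = [
    ("The replacement part arrived quickly, thank you!", "positive"),
    ("I've been on hold for an hour and still no answer.", "negative"),
    ("Order received, no issues so far.", "neutral"),
]
texts, labels = zip(*labeled_tickets)

# A standard supervised pipeline learns to predict the label from the text.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["Great support, problem solved."]))
```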
Other applications of labeled data include:
- Precision Agriculture (computer vision application)
- Micromobility (computer vision application)
More recently, the use of synthetic data has supplemented the data labeling process. Synthetic data is “generated through computer programs, instead of being composed through the documentation of real-world events”.
Synthetically generated datasets can also be used to train machine learning models, particularly in computer vision. Synthetic data may augment real datasets to cover areas of the data distribution that are underrepresented, helping to alleviate dataset bias, and it may also be useful when real data is impossible or prohibitively difficult to acquire due to privacy or legal issues. Synthetic data has been used, in the form of driving simulations, to train Google’s Waymo, and Facebook was reported to use synthetic data to train algorithms to detect bullying language.
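A toy sketch of why synthetic data can reduce labeling effort: because the generator knows the ground truth by construction, each example arrives pre-labeled. The rendering here is a trivial NumPy stand-in for the simulators or game engines used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_example(h=128, w=128):
    """Render a blank scene with one bright square; the bounding box and
    class are known by construction, so no human annotation is required."""
    img = np.zeros((h, w), dtype=np.float32)
    size = int(rng.integers(10, 30))
    y = int(rng.integers(0, h - size))
    x = int(rng.integers(0, w - size))
    img[y:y + size, x:x + size] = 1.0
    return img, {"bbox": (x, y, x + size, y + size), "label": "square"}

# Synthetic examples can be mixed with real labeled data to cover
# under-represented parts of the distribution (e.g., rare scenes).
synthetic_set = [synthetic_example() for _ in range(1000)]
```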
A market has also emerged, adjacent to the data labeling market, that aims to ensure proper oversight of models and to reduce bias in large datasets. This is part of the Ethical AI movement, which encourages the proactive embedding of diversity and inclusion principles into the AI lifecycle and aims to ensure the transparency of AI systems.