Natural language processing (NLP) is a branch of computer science and artificial intelligence (AI) concerned with enabling computers to understand spoken words and text in much the same way humans do. Ideally, NLP technology lets machines not only understand text or voice data but also respond with text or speech of their own. It draws on several areas (computational linguistics, statistical models, machine learning, and deep learning) to bridge human and machine communication.
NLP works by breaking language down into shorter, simpler pieces called tokens. Tokens are the units, such as words and punctuation marks, that are strung together to form sentences. NLP technology then attempts to understand the relationships between these tokens, using higher-level features including the following:
- Content Categorization: provides a linguistic document summary that includes content alerts, duplicate detection, search, and indexing
- Topic Discovery and Modeling: interprets the themes and meanings of text groups and applies advanced analytics to text
- Contextual Extraction: automatically pulls structured data from text-based sources
- Sentiment Analysis: identifies opinion-based language stored in large amounts of text
- Text-to-Speech and Speech-to-Text Conversion: converts written text into spoken audio and spoken language, such as voice commands, into written text
- Document Summarization: condenses large amounts of text by automatically creating a synopsis
- Machine Translation: automatically translates text or speech from one language to another
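The tokenization step described before the list above can be illustrated with a minimal sketch. It assumes a simple regular-expression rule (words, with contractions kept together, plus individual punctuation marks); the function name `tokenize` is hypothetical, and production NLP libraries use considerably more elaborate rules.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (illustrative rule only)."""
    # \w+(?:'\w+)* keeps contractions such as "doesn't" together;
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

tokens = tokenize("NLP breaks language into tokens, doesn't it?")
print(tokens)
# ['NLP', 'breaks', 'language', 'into', 'tokens', ',', "doesn't", 'it', '?']
```

Each higher-level feature in the list would then operate on a token sequence like this one rather than on the raw string.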
NLP is used across a wide variety of industries. Prominent examples include email filters, smart assistants, search results, predictive text, language translation, digital phone calls, data analysis, and text analytics.
NLP is also important to machine learning-based data labeling. Data labeling is the process of annotating or marking up data so that machine learning programs can recognize it. In the context of NLP, data labeling helps a computer assign meaning to spoken words or text, and NLP techniques can in turn automate the labeling of text data through a variety of methods. A common one is Named Entity Recognition, in which a computer is taught to identify certain words or phrases and assign each a category, such as detecting that "Eric" is a person or that "California" is a location.
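As a rough illustration of the "Eric" and "California" example, the sketch below labels entities with a hand-made lookup table (a gazetteer). The table `GAZETTEER` and the function `label_entities` are hypothetical names introduced here; real Named Entity Recognition systems typically rely on trained statistical or neural models rather than a fixed word list.

```python
import re

# Hypothetical gazetteer: a hand-made lookup table, for illustration only.
GAZETTEER = {
    "Eric": "PERSON",
    "California": "LOCATION",
}

def label_entities(text):
    """Return (token, label) pairs for every token found in the gazetteer."""
    tokens = re.findall(r"\w+", text)
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

print(label_entities("Eric moved to California last year."))
# [('Eric', 'PERSON'), ('California', 'LOCATION')]
```

The output pairs are the kind of annotation a labeling pipeline would attach to the text for downstream training.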
Another method NLP can provide for data labeling is sentiment analysis, which identifies the tone of a sentence. A common approach is to teach the computer to classify tone on a binary scale, as either positive or negative, although more nuanced classifiers have also been used. This method can also be applied to whole documents, a task appropriately called Document Labeling.
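The binary positive/negative tone labeling described above can be sketched with a small word-list approach. The word sets and the function `label_sentiment` are assumptions made for this example; practical classifiers are usually trained on labeled data instead of relying on fixed lists.

```python
import re

# Hypothetical word lists, chosen only for this illustration.
POSITIVE = {"good", "great", "excellent", "love", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}

def label_sentiment(text):
    """Label text 'positive' or 'negative' by counting listed words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    # Ties (score == 0) fall back to 'negative'; this is an arbitrary choice
    # forced by the binary scale.
    return "positive" if score > 0 else "negative"

print(label_sentiment("The support team was great and helpful."))  # positive
print(label_sentiment("Terrible, slow service."))                  # negative
```

Applying the same function to the full text of a document, rather than a single sentence, gives a crude document-level label.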
Other, more advanced tasks in NLP data labeling include the following:
- Coreference Resolution: finding all references to a specific entity in a text
- Dependency Parsing: teaching a computer to examine the dependencies between words in a sentence in order to analyze its grammatical structure
- Syntax Trees: also known as parse trees, these are tree structures that represent the syntax of a sentence
These methods help machines break down the structure of a sentence and navigate ambiguities in human language. They can also be combined with one another to highlight individual words for document labels.
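To make the idea of dependency annotations concrete, the following sketch stores a hand-written dependency parse of one sentence and prints it as an indented tree. The parse is annotated by hand here, not produced by a parser; the `Token` class and `render` function are hypothetical, and the relation names (`det`, `nsubj`, `obj`, `root`) follow the Universal Dependencies convention as an assumption.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    head: int  # index of the governing token, or -1 for the root
    rel: str   # dependency relation to the head

# Hand-annotated parse of "The cat chased the mouse".
SENTENCE = [
    Token("The", 1, "det"),
    Token("cat", 2, "nsubj"),
    Token("chased", -1, "root"),
    Token("the", 4, "det"),
    Token("mouse", 2, "obj"),
]

def render(tokens, head=-1, depth=0):
    """Return indented lines showing each token beneath its head."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.head == head:
            lines.append("  " * depth + f"{tok.text} ({tok.rel})")
            lines.extend(render(tokens, i, depth + 1))
    return lines

print("\n".join(render(SENTENCE)))
# chased (root)
#   cat (nsubj)
#     The (det)
#   mouse (obj)
#     the (det)
```

Storing one head index per token is a compact way to record the labeled structure, and the tree view is recovered by following those indices.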