Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP). QA systems enable users to retrieve exact answers for questions posed in natural language, using either a pre-structured database or a collection of natural language documents.
QA systems can be considered an advanced form of information retrieval that makes it possible to retrieve answers using natural language queries. With an increasing demand for systems that deliver short, precise, question-specific answers, QA is a growing area of research worldwide.
QA system architecture is typically broken down into three modules:
- Question processing
- Document processing
- Answer processing
Question processing receives the input from the user (question in natural language) for analysis (obtaining preliminary information), classification, and reformulation.
Question classification breaks down the type of question to better understand the context for the answer. There are two main approaches to question classification: manual and automatic.
Manual classification applies hand-made rules for identifying expected answer types. While these rules can be accurate, they are time-consuming to write and difficult to extend. Some manual approaches improve answer detection by breaking down the question type into the following categories:
- What questions
- Why questions
- Who questions
- How questions
- Where questions
In contrast, automatic classification can be extended to new question types with acceptable accuracy.
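The manual, rule-based approach can be illustrated with a minimal sketch. The answer-type labels (`PERSON`, `LOCATION`, and so on) and the function name `classify_question` are assumptions made for this example, not a standard taxonomy; a real system would need many more rules.

```python
import re

# Hypothetical hand-made rules: each pairs a leading question word with an
# expected answer type. The labels are illustrative assumptions.
RULES = [
    (re.compile(r"^\s*who\b", re.IGNORECASE), "PERSON"),
    (re.compile(r"^\s*where\b", re.IGNORECASE), "LOCATION"),
    (re.compile(r"^\s*why\b", re.IGNORECASE), "REASON"),
    (re.compile(r"^\s*how\b", re.IGNORECASE), "MANNER"),
    (re.compile(r"^\s*what\b", re.IGNORECASE), "DEFINITION_OR_ENTITY"),
]


def classify_question(question: str) -> str:
    """Return the expected answer type of the first rule that matches."""
    for pattern, answer_type in RULES:
        if pattern.search(question):
            return answer_type
    # No rule applies: this is the extensibility problem of manual rules.
    return "UNKNOWN"
```

A question such as "Is it raining?" falls through to `UNKNOWN`, which shows why hand-written rules are hard to extend: every new question form needs a new rule.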
Reformulation converts the question into a vector representation based on a model pre-trained with examples of question and answer pairs. The main types of answer provided by QA systems include the following:
- Factoid: a simple fact
- List: a set of entities that satisfy the criteria defined in the question
- Definition: a short passage summarizing the meaning of the subject or object of the question
- Complex question: an answer that draws on the context of the question and usually requires merging several retrieved passages using a range of techniques
Document processing takes the reformulated question as its input and uses an internal information retrieval (IR) system to find the documents closest to that input. A set of paragraphs, chosen according to the focus of the question, is extracted and sorted by similarity and relevance to the question.
The document processing module includes three main tasks:
- Retrieve a set of relevant documents from the IR system
- Filter the documents and reduce them to a concise set of paragraphs
- Order and rank the documents by similarity and relevance to the question
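The three tasks above can be sketched as follows. This is a minimal illustration, not a production retriever: it assumes the IR system has already returned `documents` as plain strings with paragraphs separated by blank lines, and it uses cosine similarity over raw term counts as a stand-in for whatever similarity measure a real system would use. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_paragraphs(question: str, documents: list[str], top_k: int = 3):
    """Split documents into paragraphs, drop unrelated ones, rank the rest."""
    q_vec = Counter(tokenize(question))
    # Reduce the retrieved documents to individual paragraphs.
    paragraphs = [p.strip() for doc in documents
                  for p in doc.split("\n\n") if p.strip()]
    scored = [(cosine_similarity(q_vec, Counter(tokenize(p))), p)
              for p in paragraphs]
    # Filter: keep only paragraphs that share at least one term.
    scored = [(s, p) for s, p in scored if s > 0]
    # Rank: highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

For the question "What is the capital of France?", a paragraph stating "Paris is the capital of France." shares more terms with the question than one about Germany, so it ranks first, while a paragraph with no shared terms is filtered out.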
The answer processing module applies extraction techniques to the output of the document processing module to present an answer to the question. Although it returns a simple answer, producing that answer may require merging and summarizing information from different sources, as well as dealing with uncertainty or contradiction.
Answer processing can be broken down into three major tasks:
- Identify statements/answers within the concise set of documents.
- Extract the relevant output by selecting appropriate phrases and words that answer the question.
- Validate the answer obtained in the previous step using evaluation metrics defined during the design of the QA system.
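A minimal sketch of these three tasks follows. It is an assumption-laden illustration: candidate statements are found with a naive sentence split, extraction picks the sentence sharing the most content words with the question, and validation is reduced to a hypothetical overlap threshold (`min_overlap`) standing in for the evaluation metrics a real system would define.

```python
import re

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "in",
             "what", "who", "where", "why", "how"}


def content_terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def extract_answer(question: str, paragraphs: list[str],
                   min_overlap: float = 0.5):
    """Return the best-matching sentence, or None if it fails validation."""
    q_terms = content_terms(question)
    best_sentence, best_score = None, 0.0
    for paragraph in paragraphs:
        # 1. Identify candidate statements (naive split on end punctuation).
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if not sentence:
                continue
            # 2. Extract: score by the fraction of question terms covered.
            overlap = len(q_terms & content_terms(sentence))
            score = overlap / len(q_terms) if q_terms else 0.0
            if score > best_score:
                best_sentence, best_score = sentence, score
    # 3. Validate: reject answers below the assumed threshold.
    return best_sentence if best_score >= min_overlap else None
```

Returning `None` when no candidate passes the threshold is one simple way to handle uncertainty: the system declines to answer rather than returning a weakly supported sentence.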
Web-based question answering systems use search engines to retrieve webpages that potentially contain answers to the questions, then filter and rank the recovered passages. Data available on the web is semi-structured, heterogeneous, and distributed.
NLP QA systems use linguistic intuitions and machine learning methods to extract answers from retrieved passages.
Knowledge-based QA systems find answers in structured data sources (knowledge bases) instead of unstructured text. Standard database queries are used in place of word-based searches. These systems make use of structured data, such as an ontology. An ontology is a conceptual representation of the concepts within a specific domain and the relationships between them.
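Querying structured data rather than searching text can be illustrated with a toy knowledge base of (subject, relation, object) triples. The triples, the relation names, and the `query` helper are all hypothetical, chosen only for this sketch; real systems typically use a dedicated store and query language.

```python
# A toy knowledge base: each fact is a (subject, relation, object) triple.
TRIPLES = {
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "located_in", "Europe"),
}


def query(subject=None, relation=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (relation is None or t[1] == relation)
        and (obj is None or t[2] == obj)
    )


# "What is the capital of France?" becomes a structured pattern match.
answers = [s for s, _, _ in query(relation="capital_of", obj="France")]
```

The question is answered by matching a pattern against stored facts, so the result does not depend on how the fact happens to be worded in any document.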
High-performance QA systems use multiple types of resources. A hybrid approach uses a combination of web-based, NLP, and knowledge-based QA.
A range of techniques, algorithms, frameworks, and tools are utilized in QA systems:
- Deep neural network
- Graph-based
- Lemmatization
- Latent Semantic Analysis (LSA)
- Multi-document summarization
- Naive Bayes
- Named entity recognition
- Parser
- Part-of-speech (POS) Tagging
- Relation finding (Similarity Distance)
- Shallow syntactical
- Stemming
- Support vector machine
- Text chunking
- Tokenization
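Two of the simpler techniques in this list, tokenization and stemming, can be sketched in a few lines. The suffix-stripping rule below is a deliberately crude assumption made for illustration; it is not the Porter stemmer or any other published algorithm, and it will mishandle many English words.

```python
import re

# Suffixes tried in order; an illustrative list, not a real stemmer's rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("Who answered the questions?")
stems = [crude_stem(t) for t in tokens]
```

Reducing "answered" and "questions" to "answer" and "question" lets a question match passages that use a different inflection of the same word.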
Training a QA system requires large datasets. There are many publicly available text and graph-based datasets that have been generated through crowd-sourcing or manual annotation.
NLP Question Answering Datasets
There are many methods for evaluating the performance of QA systems. Metrics are based on comparing the actual answer with the answer the system predicts. The four possible outcomes form a 2 x 2 contingency table:
- True positive—fragment correctly selected
- False negative—fragment incorrectly not selected
- False positive—fragment incorrectly selected
- True negative—fragment correctly not selected
Basic evaluation metrics (precision, recall, and F1) can be calculated from the counts of these outcomes.
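As a worked example, the standard definitions are precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of the two. The function name and the sample counts below are illustrative assumptions; true negatives do not enter any of the three formulas.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from contingency-table counts."""
    # Precision: share of selected fragments that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of correct fragments that were selected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

With these counts, precision is 8/10, recall is 8/12, and F1 is 8/11 (about 0.73), which sits between the two but closer to the lower value, as a harmonic mean does.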
With the amount of information available online, there has been a rise in the use of automated answering systems that can accurately extract information. These systems have a range of applications:
- Customer support
- Education
- Search engines
- Data analytics