Industry attributes
Voice translation, also referred to as speech-to-speech translation, is software that translates human speech from one language into another. Once thought to be the stuff of science fiction (such as the Babel Fish from The Hitchhiker's Guide to the Galaxy or the Universal Translator from the Star Trek universe), modern-day voice translation technology relies on AI, neural processing, natural language understanding, speech recognition, and text-to-speech conversion. As of 2021, several companies offer voice translation technology in various formats, most commonly apps, hand-held devices, or in-ear or over-the-ear headphones.
Voice translation specifically refers to translating an individual's speech into a language other than the one spoken. The goal is to help speakers of different languages communicate while preserving the personality and tone of the individual's voice in the resulting translation.
The first voice translation system was assembled in 1991 by Alexander Waibel, a German professor of computer science, who had proposed the idea of artificial speech translation at MIT in 1978. Waibel's system had a 500-word vocabulary and took several minutes to process speech input on several large computing stations.
The concept of speech translation was first introduced to the world academic stage through a NEC Corporation proof-of-concept demonstration at the 1983 ITU Telecom World. ATR, Carnegie Mellon University, and Siemens later conducted a three-pronged connected speech translation experiment in 1993, and multiple governments went on to fund their own research projects, including Germany's Verbmobil, the EU's Nespole! and TC-STAR, and the US TransTac and GALE projects.
The rapid growth of internet use across the Asia-Pacific region currently acts as a primary driver for voice translation technology, with Google stating that while 50% of the internet's content is in English, only 20% of the world's population speaks the language. The Asian Speech Translation Advanced Research (A-STAR) consortium was founded in 2006, with founding members Japan, China, Korea, Thailand, Indonesia, and Vietnam holding their first meeting at the Advanced Telecommunications Research Institute (ATR) on November 14, 2006.
A-STAR grew by two new members before its memorandum of understanding (MoU) concluded in 2008. The Centre for Development of Advanced Computing (C-DAC) began participating in A-STAR in 2007, and by July 2009 the first Asian network-based speech-to-speech (S2S) translation system was launched. Middle Eastern and European countries later joined the consortium, and the name was changed to Universal Speech Translation Advanced Research (U-STAR). The organization currently comprises thirty-three institutes from twenty-six countries or regions.
On October 14, 2010, U-STAR's recommendations for functional network-based speech-to-speech translation (S2ST) requirements and S2ST architecture requirements were first approved by the ITU Telecommunication Standardization Sector (ITU-T), paving the way for standardized communication protocols.
The organization first released its 'VoiceTra4U-M' application, capable of translating twenty-three languages for up to five users, on the iOS App Store on July 17, 2012. The app was released on Android in August 2016.
Voice translation systems prior to Google's Translatotron relied on network-based speech-to-speech translation (S2ST). In these systems, voice translation is accomplished through modules distributed across a worldwide network, which recognize speech, translate the recognized text into other languages, and then synthesize the translation into speech (a minimal sketch of this pipeline follows the list below). Network-based S2ST technologies are characterized by multiple components:
- User-facing clients for speech input and output
- Servers that recognize and transcribe speech, automatically translate text from the source language into the target language, and synthesize speech from text
- Communication protocols that connect user clients to servers
- Conversion markup language, which is used to describe data for exchange between modality conversion modules
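The pipeline can be made concrete with a short sketch. The client below chains hypothetical ASR, MT, and TTS servers: the endpoint URLs, field names, and payload shapes are invented for illustration and do not correspond to any real U-STAR or commercial API.

```python
# A hypothetical client for a network-based S2ST service. The three server
# endpoints (ASR, MT, TTS) and their request/response formats are invented
# for illustration; a real deployment would follow the service's own protocol.
import requests

ASR_URL = "https://example.com/asr"  # speech recognition server (hypothetical)
MT_URL = "https://example.com/mt"    # machine translation server (hypothetical)
TTS_URL = "https://example.com/tts"  # speech synthesis server (hypothetical)

def translate_speech(audio_bytes: bytes, source: str, target: str) -> bytes:
    # 1. Recognize and transcribe the input speech.
    transcript = requests.post(
        ASR_URL, files={"audio": audio_bytes}, data={"lang": source}
    ).json()["text"]
    # 2. Translate the transcript from the source to the target language.
    translated = requests.post(
        MT_URL, json={"text": transcript, "source": source, "target": target}
    ).json()["text"]
    # 3. Synthesize the translated text back into speech.
    return requests.post(TTS_URL, json={"text": translated, "lang": target}).content

# Usage: with open("hello_en.wav", "rb") as f:
#     spanish_audio = translate_speech(f.read(), "en", "es")
```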
Google Translatotron is Google's newest translation system, which began development in 2016 as an improvement over its previous speech-to-text and text-to-speech services. The system aims to streamline the translation process and overcome common hurdles, including translation speed, compounding errors between recognition and translation, the ability to retain the original speaker's voice after translation, and the detection of words that do not need to be translated.
The AI translation tool uses neural machine translation, in contrast to earlier statistical machine translation, in a sequence-to-sequence model that performs direct speech-to-speech translation without relying on an intermediate text representation.
Unlike Google's previous translation system, Translatotron translates sentences as a whole instead of piece by piece, uses broader context clues to decode and provide relevant translations, and rearranges text to produce more human-sounding responses. The model also works directly on sound waves to produce the other language, mimicking the input voice in the output language.
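To make the direct, text-free approach concrete, the toy model below maps a source-language spectrogram straight to a target-language spectrogram. It is a deliberately simplified sketch, not Translatotron's actual architecture, which additionally uses attention, a speaker encoder, and auxiliary phoneme decoders.

```python
# A minimal spectrogram-to-spectrogram sequence-to-sequence model in PyTorch,
# illustrating the idea behind direct S2ST: no intermediate text anywhere.
import torch
import torch.nn as nn

class DirectS2ST(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # Encoder reads source-language spectrogram frames.
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True,
                               bidirectional=True)
        # Decoder emits target-language spectrogram frames directly.
        self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, src_mels: torch.Tensor) -> torch.Tensor:
        enc_out, _ = self.encoder(src_mels)  # (batch, frames, 2*hidden)
        dec_out, _ = self.decoder(enc_out)   # (batch, frames, hidden)
        return self.out(dec_out)             # predicted target spectrogram

model = DirectS2ST()
src = torch.randn(1, 120, 80)  # 120 frames of an 80-bin mel spectrogram
tgt = model(src)               # same-length toy output; real models use
                               # attention to handle differing durations
print(tgt.shape)               # torch.Size([1, 120, 80])
```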
As of 2021, a variety of two-way voice translators are available for consumers to purchase. Almost all rely on AI and require an internet connection.
Retail Voice Translation Devices
As voice translation can act as an intermediary between two people holding a conversation in different languages, the technology has been extended to customer support services. Depending on the technology, an implementation can offer users a text-based translation of speech, while others can offer voice-to-voice translation to another member of the chat. Most of the technology works on speech-to-text translation, as this allows systems to use a combination of machine learning and artificial intelligence to identify the language, understand the context of a sentence or word, and offer a more realistic translation of what is being said. While text-based translation can work in real time and revise itself based on context, speech-to-speech translation can take longer because, like a human interpreter, it may need to wait for enough context.
Using a real-time translation service in a customer support scenario can connect chat or support representatives with a larger, more global pool of potential customers and reduce the number of representatives needed, since any representative can support any region regardless of native language. It can also allow representatives to work in a preferred language rather than a region-specific one. The ability to offer live translation can be beneficial for businesses like travel agencies, educational institutions, online health services, and online shopping; such services can earn the trust of customers and encourage return use.
For example, in some customer service portals, when a customer initiates a conversation, a representative can receive the question or request for assistance in the representative's default language while the portal continues to display the original-language text. If a customer initiates a conversation in Spanish and the representative's default language is English, the message can be shown in the original Spanish alongside the translated English.
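A short sketch of this dual display follows. The translate() helper is a placeholder standing in for whatever machine translation API the portal uses; the canned dictionary exists only so the example runs.

```python
# A sketch of a support portal showing a customer message in both the
# original language and the representative's default language. translate()
# is a placeholder; a real portal would call a machine translation service.
CANNED = {("es", "en", "¿Dónde está mi pedido?"): "Where is my order?"}

def translate(text: str, source: str, target: str) -> str:
    # Placeholder MT call: returns a canned translation for the demo.
    return CANNED.get((source, target, text), text)

def present_message(text: str, customer_lang: str, rep_lang: str) -> dict:
    """Return both the original text and its translation for display."""
    if customer_lang == rep_lang:
        return {"original": text, "translated": text}
    return {"original": text,
            "translated": translate(text, customer_lang, rep_lang)}

print(present_message("¿Dónde está mi pedido?", "es", "en"))
# {'original': '¿Dónde está mi pedido?', 'translated': 'Where is my order?'}
```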
With customer service and support moving toward automation and chatbots, especially for handling routine inquiries, the need for chatbots capable of real-time translation or multilingual support has become important. As of June 2016, 26 percent of internet users used English as a browsing or messaging language, while the majority of chatbots on the market are developed in English. For multinational organizations, a single chatbot capable of real-time translation and language-specific answers offers more efficient use of both chatbots and human agents, and reduces the number of chatbots an organization needs to train.
These chatbots can be rule-based, built on machine learning, or a hybrid of both. All of them, however, can use artificial intelligence and natural language processing (NLP) to interpret a customer's request and engage customers with follow-up questions, providing a more human-like interaction. NLP is also applied to live language translation, helping chatbots parse a user's question, interpret the intent, and provide as accurate a response as possible. The more interactions chatbots handle, the smarter these systems get; in multilingual environments or multinational organizations, chatbots therefore have a greater opportunity to improve.
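The interpret-then-follow-up loop can be sketched as below. The keyword rules are a deliberately simple stand-in for a trained NLP intent classifier, and the intents and phrases are invented for illustration.

```python
# A toy interpret-and-follow-up loop. Real chatbots use trained NLP models
# for intent classification; these keyword rules only show the control flow.
INTENTS = {
    "refund": ["refund", "money back", "reembolso"],
    "shipping": ["where is", "track", "delivery"],
}

def classify_intent(message: str) -> str:
    text = message.lower()
    for intent, keywords in INTENTS.items():
        if any(k in text for k in keywords):
            return intent
    return "unknown"

def respond(message: str) -> str:
    intent = classify_intent(message)
    if intent == "unknown":
        # Engage with a follow-up question instead of guessing.
        return "Could you tell me a bit more about what you need help with?"
    return f"I can help with your {intent} request."

print(respond("Where is my package?"))  # -> "I can help with your shipping request."
```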
For a chatbot to determine which language to use in a customer conversation, a few detection methods can minimize friction for the customer (a combined sketch follows this list):
- IP based, in which the chatbot can determine the location and native language of a visitor based on the device's IP address
- Customer selection, in which a customer can choose a preferred language before an interaction begins
- Browser settings, in which a chatbot can detect the language settings of the visitor's web browser and default to the same language
- HTML language attribute, which uses HTML to detect the language of the content and tailor the chatbot's response to the speaker's native language
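The four methods naturally compose into a fallback chain, sketched below. The inputs are hypothetical stand-ins: a real deployment would read the Accept-Language header, the page's HTML lang attribute, and an IP-geolocation service.

```python
# A sketch of a language-detection fallback chain combining the four methods
# above, ordered from most to least reliable.
from typing import Optional

def detect_language(user_choice: Optional[str],
                    browser_accept_language: Optional[str],
                    html_lang: Optional[str],
                    ip_country_lang: Optional[str]) -> str:
    # 1. Explicit customer selection wins (least ambiguity).
    if user_choice:
        return user_choice
    # 2. Browser settings: take the first tag from Accept-Language.
    if browser_accept_language:
        return browser_accept_language.split(",")[0].split("-")[0]
    # 3. HTML language attribute of the page hosting the chatbot.
    if html_lang:
        return html_lang.split("-")[0]
    # 4. IP-based guess of the visitor's native language.
    if ip_country_lang:
        return ip_country_lang
    return "en"  # final default

print(detect_language(None, "es-MX,es;q=0.9,en;q=0.8", "en", "es"))  # -> "es"
```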
Chatbots typically use neural machine translation (NMT) engines to provide instant responses and adapt to the languages appearing in a conversation, whereas AI-powered chatbots that rely on human-translated responses must be explicitly trained to respond in each language. That training, however, gives organizations a chance to fine-tune a chatbot's responses and preferred response style regardless of language.
For text contained in images or photos, users can translate the text using a phone's camera; this is especially useful for translating signs, travel materials, or notes on the go. Other software, which can be included in a customer service chat portal, lets a user bring an image or picture into the portal or web browser; services with automatic language detection can then translate the embedded text, while otherwise the user may need to select the source language for the embedded text to be translated properly.
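One way such a feature can be built is OCR followed by machine translation, sketched below. The sketch assumes the Tesseract OCR engine is installed along with the pytesseract and Pillow packages and the relevant language data; translate() is again a placeholder for a machine translation API call.

```python
# A sketch of translating text embedded in an image: extract the text with
# OCR, then pass it through machine translation.
from PIL import Image
import pytesseract

def translate(text: str, source: str, target: str) -> str:
    return text  # placeholder: a real implementation calls an MT service

def translate_image(path: str, source_lang: str = "spa",
                    target_lang: str = "en") -> str:
    # Extract embedded text from the image with OCR (requires Tesseract's
    # language data for source_lang, e.g. "spa" for Spanish).
    extracted = pytesseract.image_to_string(Image.open(path), lang=source_lang)
    # Translate the recognized text into the target language.
    return translate(extracted, source_lang, target_lang)

# Usage: print(translate_image("street_sign.jpg"))
```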