Robust NLP for voice assistants
Karthik Raghunathan – How to understand your users despite your Automatic Speech Recognition (ASR)’s bad hearing.
The MindMeld Conversational AI Platform has been used by developers to build text-based chatbots as well as voice assistants. While text-based chatbots certainly have their place and utility in today’s world, voice interfaces are a lot more intuitive and natural when they work well.
It’s been encouraging to see the general population become more comfortable with voice assistants in recent years. An early 2020 survey by Voicebot found that more than a third of US households now have a voice-enabled smart speaker.
Another survey found that 35% of the US population are regular voice assistant users.
These numbers are expected to grow even faster as users start preferring touch-free interfaces. This presents a great opportunity for developers of voice user interfaces everywhere. However, anyone who’s worked on one of these systems knows that it’s no easy feat to build a production-quality voice assistant that delights users.
Several active research areas in natural language processing explore more complex and deeper neural network architectures for conversational natural language understanding, natural language generation, and dialog state tracking. But all of that great work can still get undermined by the simple fact that voice assistants often suffer from bad hearing. In real life, even simple voice commands get easily misunderstood because the assistant didn’t hear you clearly.
In more technical terms, this means that the accuracy of your Automatic Speech Recognition (ASR) system has a huge impact on the overall quality of your voice assistant. This ends up being the Achilles’ Heel for most voice assistants, and if you want to see a significant improvement in user experience, focusing your efforts here will give you the most bang for your buck.
Challenges with speech recognition
Modern voice assistants are built using a complex pipeline of AI technology. At a high level, three steps are common to all voice user interfaces:
- First, we use Automatic Speech Recognition to convert the user’s speech to text. Since building your own ASR requires prohibitively high amounts of data and resources, it’s common for developers to use an off-the-shelf cloud service like Google Cloud Speech-to-Text, Azure Speech to Text, or Amazon Transcribe.
- We then use Natural Language Processing to understand the transcribed text, take any appropriate actions, and formulate a text response. This can be accomplished with a platform like MindMeld that encompasses functionality for natural language understanding, dialogue management, and natural language generation.
- Lastly, we use Text To Speech to synthesize human-like speech for the generated text response to be “spoken” back to the user. This is commonly done using cloud services like Google Cloud Text-to-Speech, Azure Text-to-Speech, or Amazon Polly.
Since ASR is the first component in this pipeline, errors introduced at this step cascade to downstream components, causing them to make errors as well. You can use all the transformers you want in your NLP system, but if the input is garbage, you’ll still get garbage out.
Over the last five years, headlines like the following may have led one to believe that ASR is an already solved problem:
The system’s word error rate is reported to be 5.9 percent, which Microsoft says is “about equal” to professional transcriptionists asked to work on speech taken from the same Switchboard corpus of conversations.
Google CEO Sundar Pichai today announced that the company’s speech recognition technology has now achieved a 4.9 percent word error rate.
While we’ve undoubtedly made large strides in speech recognition accuracy over the last decade, it’s far from being a solved problem in the real world. In many of our production applications, we see word error rates (the metric by which ASR quality is measured) that are far higher than the ~5% numbers reported on well-studied academic datasets. Off-the-shelf ASR services like those from Microsoft, Google, or Amazon still make many mistakes on proper nouns and domain-specific terminology. When deployed in the real world, these errors are further exacerbated by users with diverse accents and by non-ideal acoustic environments.
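To make the word error rate metric concrete, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the ASR hypothesis, divided by the number of words in the reference. The function name and the lowercasing and whitespace tokenization are our own simplifications for illustration; real evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
wer = word_error_rate("start the meeting", "shark the meeting")
print(f"WER: {wer:.2f}")
```

Note that a single wrong word in a short command already yields a WER of about 33%, which is part of why short voice commands are so sensitive to ASR mistakes.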
Examples of ASR mistranscriptions in Webex Assistant
Below are a few examples of ASR mistranscriptions we’ve seen in Webex Assistant, our MindMeld-powered voice assistant for enterprise collaboration.
As you can see, the ASR often confuses proper nouns with common English words (e.g., Prakash’s vs. precautious or Mahojwal vs. my jaw). On other occasions, it mistakes one named entity for another (e.g., Kiran vs. Corrine or Didi vs. Stevie). There are also cases where it fuses named entities with surrounding words (e.g., Merriweather instead of me with Heather). Any of these mistakes would lead the assistant to take an unsatisfactory action since the primary entity of interest has been lost in the ASR output.
Clearly, we need to overcome these kinds of errors to understand the user correctly. But before we look at potential solutions, it’s worth emphasizing two things.
First, we’ll assume that the ASR we’re using is an off-the-shelf black box system that we can’t modify and have to use as is. This is a reasonable assumption because most popular cloud ASR services provide very little room for customization. However, we will assume that the ASR provides a ranked list of alternate hypotheses and not just its most confident transcript. This is something that all major cloud ASR services can do today.
Note that the techniques covered below are useful even if you have the luxury of using your own highly customized domain-specific ASR models. That’s because no ASR is ever going to be perfect, and having robustness mechanisms built into your NLP pipeline is always a good idea. The assumption about an off-the-shelf black box ASR is more to restrict the scope of the discussion here to the most common scenario that developers find themselves in.
Second, when talking about the NLP stack for a voice assistant, different implementations might involve different steps as part of the full pipeline. In this post, we’ll only focus on the three main steps common to all modern conversational AI platforms: intent classification, entity recognition, and entity resolution.
Next, we’ll look at three different techniques we’ve used in MindMeld applications to make our NLP pipeline more resilient to ASR errors.
1. ASR n-best reranking
The first technique, called n-best rescoring or reranking, applies application-specific domain knowledge to bias and possibly correct the ASR output.
While this description doesn’t do justice to all the complexities of a modern ASR system, at a conceptual level, it’s still useful to think of an ASR as having three separate stages:
First, the feature extractor extracts some useful audio features from the input speech signal. The acoustic model then maps those extracted features to phonemes representing the distinct sounds in the language. Finally, the language model takes that sequence of phonemes and transforms it into a sequence of words, thereby forming a full sentence. Like other probabilistic systems, ASR systems can output not just their best guess but also an n-best list of ranked alternate hypotheses.
The language model (LM) has a huge impact on how the audio finally gets transcribed. The LM is essentially a statistical model that predicts the most likely word to follow a given sequence of words. Equivalently, it can be used to score any arbitrary sequence of words and provide a probability measure for that word sequence.
The key thing to note here is that the LM used by an off-the-shelf cloud ASR service is a generic domain-agnostic model that may work well for web searches or general dictation tasks, but may not be best suited for recognizing the kind of language your users might use when conversing with your assistant. This is why these ASR systems often mistranscribe a domain-specific named entity as some other popular term on the web, or simply as a common English phrase. Unfortunately, in most cases, we cannot change or customize the LM used by a black-box ASR service. Therefore, we train our own separate domain-aware LM and use it to pick the best candidate from the different hypotheses in the ASR’s n-best list.
To train our in-domain language model, we need a large corpus of sentences that reflects the kinds of things our users would say to our voice assistant. Luckily, we should already have a dataset of this kind that we use to train our intent and entity detection models in our NLP pipeline. That same data (with some augmentation, if needed) can be repurposed for training the LM. There are many free and open-source language modeling toolkits available, and depending on your corpus size, you can either pick a traditional n-gram-based model or a neural net-based one. In our experience, n-gram LMs trained using the KenLM or SRILM toolkits worked well in practice.
Once we have a trained in-domain LM, we can use it to rescore and rerank the ASR n-best list such that candidates with language patterns similar to those found in our training data are ranked higher. The post-reranking top candidate is treated as the corrected ASR output and used for further downstream processing by our NLP pipeline.
The above figure shows this technique in action in Webex Assistant. The original ASR output was trying marijuana’s emr, but after n-best reranking, the corrected output is join maria joana’s pmr, which seems more likely as something a user would say to our voice assistant. The ASR’s LM would have preferred a different top hypothesis originally because trying marijuana is a very popular n-gram on the web, and EMR, which stands for “electronic medical record” is a more popular term in general than PMR (“personal meeting room”), which only makes sense in an online meeting scenario. But our in-domain LM can pick the right candidate because it would assign higher probabilities to words like join, PMR, and possibly even Maria Joana if we had that name in our training data.
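The reranking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `BigramLM` class below is a toy add-one-smoothed bigram model that we wrote as a stand-in for a real n-gram LM trained with a toolkit like KenLM or SRILM, and the tiny training corpus and n-best list are made up for the example.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram LM (a stand-in for a real toolkit)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # Reserve one extra slot for words never seen in training.
        self.vocab_size = len(self.unigrams) + 1

    def score(self, sentence):
        """Length-normalized log-probability of the sentence."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / (len(tokens) - 1)


def rerank(nbest, lm):
    """Sort ASR hypotheses so the most in-domain ones come first."""
    return sorted(nbest, key=lm.score, reverse=True)


# Hypothetical in-domain training queries.
corpus = [
    "join the meeting",
    "join maria joana's pmr",
    "join my pmr",
    "start the meeting",
    "call maria joana",
]
# Hypothetical ASR n-best list, in the ASR's original order.
nbest = [
    "trying marijuana's emr",
    "join maria joana's pmr",
    "join marijuana's pmr",
]

reranked = rerank(nbest, BigramLM(corpus))
print(reranked[0])
```

We normalize the score by length so that shorter hypotheses are not automatically preferred. In practice, you may also want to interpolate the in-domain LM score with the ASR’s own confidence score rather than discard the latter entirely.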
The advantage of this approach is that it isn’t directed at improving any one specific downstream task; the entire NLP pipeline benefits from getting to deal with a much cleaner input. This improves accuracy for intent and entity classification as well as entity resolution.
The disadvantage is that this approach adds yet another model to your overall pipeline that you now have to optimize and maintain in production. There’s also a small latency cost to introducing this additional processing step between your ASR and NLP. Even if you can make all those logistics work, the approach has an inherent limitation: it cannot make any novel corrections and can only choose from the n-best hypotheses provided by the ASR. So there’s a good chance that you’ll need other robustness mechanisms further down the NLP pipeline.
2. Training NLP models with noisy data
The next technique is a really simple one. NLP models are usually trained using clean data, i.e., user query examples that do not have any errors. The idea behind this technique is to spice up our labeled data with some noise so that the training data more closely resembles what the NLP models will encounter at run time. We do this by augmenting our training datasets with queries that contain commonly observed ASR errors.
Training data for intent and entity models augmented with queries containing common ASR errors (in blue)
Let’s again take the example of Webex Assistant. The intent classification training data for our assistant might have query examples like join meeting, join the meeting, start the meeting, and other similar expressions labeled as the join_meeting intent. Now, if the production application logs show that join the meeting often gets mistranscribed as shine the meeting, or start the meeting often gets confused as shark the meeting, we label those erroneous transcripts as join_meeting as well and add them to our intent classification training data.
We follow a similar approach with our entity recognition model, where we add mistranscriptions like cool tim turtle or video call with dennis toy to our training data and mark the misrecognized entity text (tim turtle, dennis toy, etc.) with the person_name entity label.
If executed correctly, this approach works out really well in practice and improves the real-world accuracy of both the intent classification and entity recognition models. One could argue that you shouldn’t pollute your training data this way, and your model should learn to generalize without resorting to these kinds of tricks. There’s some merit to that argument. You should definitely start with just clean data and experiment with different features and models to see how far you can get. For example, using character-level features like character n-grams or embeddings can make your intent classifier more robust to minor errors like join vs. joint, and a well-trained entity recognizer should be able to recognize benny would as a name (in call benny would now) by relying on the surrounding context words even if the word would is mistranscribed. But there will always be ASR errors that our NLP models won’t be able to handle, and data augmentation of this kind is an effective way to help the model learn better.
Of course, you need to be careful not to go overboard with this approach. If you were to throw in every single way in which an ASR mistranscribes your user queries, that would probably confuse the model more than it would help it. So what we do is only add examples with ASR errors that are really common in our logs. We also only include near-misses where the transcription is slightly off, and don’t include cases where the ASR output has been garbled beyond recognition. Lastly, you need to ensure that you don’t provide conflicting evidence to your NLP models in this process. For instance, the ASR may sometimes misrecognize start the meeting as stop the meeting, but you shouldn’t label stop the meeting as an example for the join_meeting intent. That would introduce a confusion between the join_meeting intent and the end_meeting intent where that example should rightfully belong.
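The selection rules above (keep only common errors, keep only near-misses, and avoid conflicting labels) can be sketched as a simple filter over your logs. This is a minimal sketch and not how MindMeld does it: the function name, the log format of `(asr_transcript, intended_query, intent)` tuples, and both thresholds are assumptions made for illustration, and we use Python’s `difflib.SequenceMatcher` as a rough proxy for "slightly off".

```python
from collections import Counter
from difflib import SequenceMatcher


def select_noisy_examples(log_pairs, clean_data, min_count=5, min_similarity=0.7):
    """Pick ASR mistranscriptions that are safe to add as training data.

    log_pairs:  iterable of (asr_transcript, intended_query, intent) tuples
    clean_data: dict mapping each clean training query to its intent
    """
    selected = {}
    for (asr_text, intended, intent), count in Counter(log_pairs).items():
        if asr_text == intended:
            continue  # correctly transcribed, nothing to add
        if count < min_count:
            continue  # rule 1: keep only errors that are common in the logs
        similarity = SequenceMatcher(None, asr_text, intended).ratio()
        if similarity < min_similarity:
            continue  # rule 2: keep near-misses, drop garbled transcripts
        if clean_data.get(asr_text, intent) != intent:
            continue  # rule 3: never contradict an existing clean label
        selected[asr_text] = intent
    return selected


# Hypothetical clean training data and production logs.
clean_data = {
    "join the meeting": "join_meeting",
    "start the meeting": "join_meeting",
    "stop the meeting": "end_meeting",
}
log_pairs = (
    [("shine the meeting", "join the meeting", "join_meeting")] * 6
    + [("shark the meeting", "start the meeting", "join_meeting")] * 5
    + [("stop the meeting", "start the meeting", "join_meeting")] * 8
    + [("shack the meeting", "start the meeting", "join_meeting")] * 1
    + [("china beating", "join the meeting", "join_meeting")] * 7
)

augmentation = select_noisy_examples(log_pairs, clean_data)
print(augmentation)
```

Here, stop the meeting is rejected despite being frequent because it already belongs to the end_meeting intent, the rare and the garbled transcripts are dropped, and only the two common near-misses survive.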
This technique was mainly about improving our intent and entity detection models. But we’ll now turn our focus to entity resolution.
3. ASR-robust entity resolution
Entity resolution, or entity linking, is the task of mapping a detected entity in the user query to a canonical entry in a knowledge base.
In the above example, the person name entity sheryl is resolved to a concrete entity Sheryl Lee who’s a specific employee in the company directory. It’s this resolution step that allows us to correctly fulfill the user’s intent because we now know the right employee to initiate the video call with.
Entity resolution is often modeled as an information retrieval problem. For instance, you can create a knowledge base by using a full-text search engine like Elasticsearch to index all the canonical entities relevant to your application. Then at runtime, you can execute a search query against this knowledge base with the detected entity text and get back a ranked list of matching results.
To improve the search accuracy, and thereby the entity resolution accuracy, there are several features we can experiment with.
We can encourage partial or fuzzy matching by using features like normalized tokens, character n-grams, word n-grams, and edge n-grams. We can also do simple semantic matching by using a mapping of domain-specific entity synonyms or aliases. Textual similarity features like these are useful for any kind of conversational application regardless of the input modality. But next, we’ll specifically look at additional features that make the entity resolver for a voice assistant more robust to ASR errors.
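As a rough illustration of the textual similarity features mentioned above, the sketch below ranks canonical entities by character trigram overlap and applies an optional alias mapping. It is a deliberately simplified, in-memory stand-in for what a search engine like Elasticsearch would do with proper analyzers and scoring. The function names, the Jaccard scoring, the sample directory, and the alias are all assumptions for the example.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams over normalized, space-padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Overlap between two feature sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve(entity_text, canonical_names, aliases=None, top_k=3):
    """Return the top_k (score, canonical_name) matches for the entity."""
    query = " ".join(entity_text.lower().split())
    # Simple semantic matching: expand a known alias before searching.
    query = (aliases or {}).get(query, query)
    query_grams = char_ngrams(query)
    scored = [(jaccard(query_grams, char_ngrams(name)), name)
              for name in canonical_names]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical company directory and alias mapping.
directory = ["Sheryl Lee", "Cheryl Long", "Sherlock Jones"]
aliases = {"sherry": "sheryl lee"}

print(resolve("sheryl", directory))
print(resolve("Sherry", directory, aliases))
```

Character n-grams give partial credit to Cheryl Long for the shared "eryl" substring, which is exactly the kind of fuzzy matching that exact token matching would miss.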
First, we introduce phonetic similarity because textual similarity alone isn’t enough to deal with ASR errors. For example, when Kiran Prakash’s gets mistranscribed as Corrine precautious, relying purely on text similarity might not help us make the correct match because, at a textual level, these phrases are pretty far apart from each other. But since they sound similar, they should be fairly close in the phonetic space.
One way to encode text into a phonetic representation is by using the double metaphone algorithm. It’s a rule-based algorithm that maps a given word to a phonetic code such that similar sounding words have similar encodings. For words with multiple pronunciations, it provides a primary and a secondary code encoding the two most popular ways to pronounce the word. For example, the name Smith has the double metaphone codes SM0 and XMT, whereas the name Schmidt is represented by the codes XMT and SMT. The similar representations indicate that these two names are phonetically very close.
A more recent approach is to use a machine-learned grapheme-to-phoneme model that generates a sequence of phonemes for a given piece of text. Using this method, Smith is represented by the phoneme sequence S M IH1 TH, whereas Schmidt is represented as SH M IH1 T. Similar sounding words have similar phoneme sequences, and the detailed representations also make it easier to compute the phonetic similarity between words at a more granular level.
In our experiments, we found that these two methods often complement each other. Hence, we use phonetic features derived from both to improve our search.
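To show how phoneme sequences make phonetic similarity computable, here is a small sketch that compares words by the edit distance between their phoneme sequences. The `PHONEMES` dictionary is a hypothetical hard-coded lookup standing in for a trained grapheme-to-phoneme model. The entries for Smith and Schmidt are the sequences quoted above, and the entry for Jones is added by us purely for contrast. Treating every phoneme substitution as equally costly is also a simplification.

```python
# Hypothetical lookup standing in for a grapheme-to-phoneme model.
PHONEMES = {
    "smith": ["S", "M", "IH1", "TH"],
    "schmidt": ["SH", "M", "IH1", "T"],
    "jones": ["JH", "OW1", "N", "Z"],  # illustrative entry added for contrast
}


def edit_distance(a, b):
    """Levenshtein distance between two sequences of phonemes."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (x != y)))     # substitution
        prev = curr
    return prev[-1]


def phonetic_similarity(word_a, word_b):
    """1.0 for identical phoneme sequences, 0.0 for fully different ones."""
    p_a, p_b = PHONEMES[word_a.lower()], PHONEMES[word_b.lower()]
    return 1.0 - edit_distance(p_a, p_b) / max(len(p_a), len(p_b))


print(phonetic_similarity("Smith", "Schmidt"))
print(phonetic_similarity("Smith", "Jones"))
```

A natural refinement, which we leave out here, is to charge less for substituting acoustically close phonemes (such as S and SH) than for unrelated ones.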
Leveraging the ASR n-best list
One other technique that helps us significantly improve our search recall is leveraging the entire n-best list of hypotheses from the ASR, rather than just its top transcript. We run entity recognition on all the hypotheses and send all of the detected entities in our search query to the knowledge base.
On many occasions, the correct entity might even be present a little deeper in the n-best list, like in the above example where the correct name Sheetal was part of the ASR’s third-best guess. Even when that is not the case, pooling the various text and phonetic features across all the hypotheses has the effect of upweighting features which have more consistent evidence throughout the n-best list and downweighting outliers, thereby resulting in a much better overall match.
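The pooling effect described above can be sketched as follows. Each feature (here, a character trigram) is weighted by the number of hypotheses in which it appears, so evidence that recurs across the n-best list counts for more than a one-off outlier. The function names, the scoring formula, the detected entities, and the directory are all hypothetical. A real system would pool phonetic features as well and issue the pooled query to the search engine.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Set of character n-grams over normalized, space-padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def pool_features(entity_texts):
    """Count, for each n-gram, how many hypotheses contain it."""
    pooled = Counter()
    for text in entity_texts:
        pooled.update(char_ngrams(text))
    return pooled


def score_candidate(candidate, pooled, num_hypotheses):
    """Weighted overlap; reduces to Jaccard for a single hypothesis."""
    grams = char_ngrams(candidate)
    overlap = sum(pooled[g] for g in grams)
    union_size = len(grams | set(pooled))
    return overlap / (num_hypotheses * union_size)


def resolve_from_nbest(entity_texts, canonical_names):
    """Rank canonical names using features pooled over the n-best list."""
    pooled = pool_features(entity_texts)
    scored = [(score_candidate(name, pooled, len(entity_texts)), name)
              for name in canonical_names]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical entities detected in the ASR's top three hypotheses.
detected = ["she tall", "cheetah", "sheetal"]
directory = ["Sheetal", "Shelly", "Chester"]

ranking = resolve_from_nbest(detected, directory)
print(ranking[0][1])
```

In this example, trigrams like "hee" and "eta" appear in two of the three hypotheses and therefore carry twice the weight of trigrams that only occur once.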
The last thing we’ll discuss is using personalization features to improve entity resolution. User-based personalization is something that search engines use to better cater their search results to each user. Similar techniques can help us resolve entities more accurately by leveraging prior information about the user, such as which entities a particular user is more likely to talk about. This is useful for any kind of conversational application, but can especially have a huge impact for voice assistants where there is a larger potential for confusion due to similar-sounding words and ASR errors.
Personalization features tend to be application-specific and depend on the use case at hand. For example, for Webex Assistant, a major use case is being able to call other people in your company. Assuming that, in general, you are more likely to call someone you are more familiar with, we can devise a personalization score, which is essentially a measure of a user’s familiarity with others in the company. In other words, for every user, we compute a familiarity score between that user and everyone else in the company directory. This familiarity score considers factors like how far apart the two people are in the company’s organizational hierarchy and how frequently they interact with each other via calls or online meetings.
We can then leverage this additional personalization score during ranking to help us disambiguate among similar-sounding names in the ASR hypotheses, and pick the right one.
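Here is one way such a personalized ranking could look in code. To be clear, this is a hypothetical sketch and not the formula used in Webex Assistant: the `Candidate` fields, the shape of the `familiarity` function, and every weight and constant below are assumptions chosen only to illustrate the idea of blending a match score with a familiarity score.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    match_score: float   # text/phonetic score from the resolver, in [0, 1]
    org_distance: int    # hops between the user and this person in the org chart
    interactions: int    # recent calls or meetings shared with the user


def familiarity(candidate):
    """Hypothetical familiarity score in [0, 1]."""
    closeness = 1.0 / (1 + candidate.org_distance)
    # Saturating function so that very frequent contacts don't dominate.
    frequency = 1.0 - math.exp(-candidate.interactions / 10.0)
    return 0.5 * closeness + 0.5 * frequency


def personalized_rank(candidates, weight=0.3):
    """Blend the resolver's match score with the familiarity score."""
    def final_score(c):
        return (1 - weight) * c.match_score + weight * familiarity(c)
    return sorted(candidates, key=final_score, reverse=True)


# Hypothetical candidates for an ambiguous, similar-sounding name.
candidates = [
    Candidate("Corrine", match_score=0.65, org_distance=6, interactions=0),
    Candidate("Kiran", match_score=0.60, org_distance=1, interactions=20),
]

ranked = personalized_rank(candidates)
print(ranked[0].name)
```

Even though Corrine has a slightly higher match score, Kiran ranks first because the user works closely with Kiran and interacts with them often. The `weight` parameter controls how much personalization is allowed to override the textual and phonetic evidence, and would need to be tuned on real data.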
This was just one example for a specific use case, but you can envision similar personalization features for different applications. For a food ordering assistant, you could have a list of restaurants or dishes that a particular user has favorited or ordered a lot recently. For a music discovery app, you can use a list of artists and albums that a particular user likes and listens to more often. And so on.
ASR robustness features in MindMeld
You can employ one or all of the above techniques when building a MindMeld-powered voice assistant:
- We don’t have native support for building in-domain language models and using them for reranking n-best ASR hypotheses. But you can try this on your own by leveraging the LM toolkits mentioned above and include it as a preprocessing step before calling the MindMeld NLP pipeline. However, we would recommend starting with the other two techniques first since those can be achieved to an extent within MindMeld itself. Furthermore, they may reduce the need for having a separate n-best reranking step at the beginning.
- Training the NLP models with noisy data merely involves adding query examples with ASR errors to your training data files and then using MindMeld to build your NLP models as usual. Just heed the warnings about not adding too much noise or confusability to your models.
- There’s some out-of-the-box support for ASR-robust entity resolution in MindMeld, as described in our user guide. You can improve upon this by implementing personalized ranking techniques that are tailored to your specific application. For more details, read our 2019 EMNLP paper on entity resolution for noisy ASR transcripts.
It’s worth emphasizing that anyone who aspires to build a production-quality voice assistant must invest heavily in making their NLP models robust to ASR errors. This can often be the difference between an unusable product and one with a good user experience. MindMeld-powered assistants are extensively used in enterprise environments where tolerance for misunderstanding of voice commands is far lower than in a consumer setting. Robustness to ASR errors is always top-of-mind for us, and we’ll continue to share updates as we make more progress on this front.
About the author
Karthik Raghunathan is the Director of Machine Learning for Webex Intelligence, which is the team responsible for building machine learning-driven intelligent experiences across all of Cisco’s collaboration products. Karthik used to be the Director of Research at MindMeld, a leading AI company that powered conversational interfaces for some of the world’s largest retailers, media companies, government agencies, and automotive manufacturers. MindMeld was acquired by Cisco in May 2017. Karthik has more than 10 years of combined experience working at reputed academic and industry research labs on the problems of speech, natural language processing, and information retrieval. Prior to joining MindMeld, he was a Senior Scientist in the Microsoft AI & Research Group, where he worked on conversational interfaces such as the Cortana digital assistant and voice search on Bing and Xbox. Karthik holds an MS in Computer Science with Distinction in Research in Natural Language Processing from Stanford University. He was co-advised by professors Daniel Jurafsky and Christopher Manning, and his graduate research focused on the problems of Coreference Resolution, Spoken Dialogue Systems, and Statistical Machine Translation. Karthik is a co-inventor on two US patents and has publications in leading AI conferences such as EMNLP, SIGIR, and AAAI.