Skit Tech

Speech LLMs for Conversations

2024-05-09T00:00:00+00:00

With LLMs, making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards, as products use vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas where the next set of quality improvements will come from.

Earlier we discussed how spoken conversations are richer than pure text and how the gap would not be bridged by LLMs working purely on transcriptions. In one of our recent experiments, we built an efficient multi-modal LLM that takes speech directly to provide a better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging.

Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech, some micro personalizations can happen, like the usage of gendered pronouns¹. You also get a lower impact of transcription errors and, in general, better responses to non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though this is not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems.

The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward.

¹ Of course, concerns around paralinguistic prediction accuracy are extremely important when taking something like this to production.

Improving consumer verification using confidence calibration and thresholding

2024-01-09T00:00:00+00:00

Over the past year, our team’s focus has shifted to building robust and scalable voicebots for US companies. In particular, we are homing in on the use case of facilitating the collection of borrowed funds. Given the stringent compliance standards for user verification in the US, our voicebot must excel in this aspect, leaving no room for error. Our top priority is to avoid any inadvertent verification of false users, which could potentially lead to the exposure of sensitive debt-related information. On a biased dataset, curated to tackle this problem, false consumer verifications are estimated to occur in ~0.67% of samples.

Technical Analysis

We wanted to do a deep dive into the problem from an ML standpoint. We noticed that we needed to be more confident in our predictions. But what do we mean by confidence in our predictions? It is defined as the ability of the model to provide an accurate probability of correctness for any of its predictions. For example, if our SLU model predicts that the intent is _confirm_ with a probability of 0.3, then the prediction has a 30% chance of being correct, provided the model has been calibrated properly.

This plot shows how under-confident we are on lower probabilities. Naturally, classes with low probabilities should be classified as wrong class predictions but our uncalibrated model is unable to do so.

Metrics and Diagrams

We realized that model miscalibration and rejecting low-confidence predictions are the major problems that we need to solve. But how do we quantify model calibration?

Expected and Max Calibration Error

Expected Calibration Error (ECE) measures the disparity between a model’s confidence and its accuracy. It can be computed directly using a closed-form formula or approximated by dividing predictions into bins based on their confidence scores. Within each bin, the absolute difference between the average confidence and the accuracy is calculated, and ECE is then obtained as a weighted average of these bin-wise differences, with weights proportional to the bin sizes.

Maximum Calibration Error (MCE) is similar to ECE but is equal to the maximum difference between average confidence and accuracy across bins.

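\[ECE = \sum_{i=1}^{M} \frac{|Bin_{i}|}{N} \cdot |a_{i} - c_{i}|\]
\[MCE = \max_{i \in \{1,\ldots,M\}} |a_{i} - c_{i}|\]

Here \(M\) is the number of bins, \(N\) is the total number of predictions, and \(a_{i}\) and \(c_{i}\) are the accuracy and the average confidence of bin \(i\).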

The lower the ECE and MCE values, the better calibrated the model is.
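The binned approximation can be sketched in a few lines. This is a minimal illustration, assuming equal-width confidence bins and a list of per-prediction correctness flags; the function name `calibration_errors` and the toy numbers are hypothetical and not taken from our pipeline.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += len(members) / n * gap   # weight by bin size
        mce = max(mce, gap)             # worst bin
    return ece, mce


# Toy data: model confidence of the predicted intent, and whether it was right.
confidences = [0.95, 0.9, 0.85, 0.65, 0.55, 0.3]
correct = [True, True, False, True, False, False]
ece, mce = calibration_errors(confidences, correct, n_bins=5)
print(f"ECE={ece:.3f} MCE={mce:.3f}")
```

The per-bin `(avg_conf, accuracy)` pairs computed inside the loop are the same points that a reliability diagram plots.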

Reliability Diagrams

Reliability diagrams depict accuracy on the y-axis and average confidence scores on the x-axis. A line plot is formed using the accuracy and the confidence points. A diagonal line through the origin indicates a perfectly calibrated model where confidence is equal to accuracy for each bin.

Reliability diagram of our deployed model. The dashed line represents perfect calibration. The blue bars are the actual bins, showing the accuracy of the predictions that fall into each bin.

Adopted Solution

Upon careful analysis of our data and model, we have found evidence suggesting that calibrating our SLU (Spoken Language Understanding) models could effectively mitigate false positives and enhance the overall confidence of our model predictions. Subsequently, we delved into the existing literature to explore potential solutions to address this issue.

We came across multiple solutions that tackle model miscalibration, such as ensemble-based calibration [5], mixup [3], and Bayesian Neural Networks [4]. We narrowed it down to one solution that was easy to integrate and did not require us to re-train our models: Temperature Scaling.

Temperature Scaling [1]

Temperature scaling involves tuning a softmax temperature value to minimize Negative Log-likelihood loss on a held-out validation set. This value is then used to soften the output of the softmax layer.

\[\text{Softmax}(y)_i = \frac{\exp(y_i)}{\sum_j \exp(y_j)}\]
\[\text{Temp-Softmax}(y)_i = \frac{\exp(y_i/T)}{\sum_j \exp(y_j/T)}\]

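The intuition behind temperature scaling is that a temperature \(T > 1\) softens over-confident probability scores so that they better reflect how often the prediction is correct. Since every logit is divided by the same \(T\), the predicted class does not change; only the confidence does.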

After tuning a temperature value on our validation dataset, we noticed that our model was better calibrated (with lower ECE and MCE values as well) and was able to better classify low-confidence predictions.
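To make the tuning step concrete, here is a minimal sketch that fits a single temperature by grid search over the negative log-likelihood of held-out logits. The grid search, the function names, and the toy logits are assumptions for illustration; Guo et al. [1] optimize the same objective with a gradient-based optimizer.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature=1.0 is the plain softmax."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


# Toy validation logits: the last example is confidently wrong.
val_logits = [[4.0, 0.0, 0.0], [3.5, 0.5, 0.0], [0.0, 4.0, 0.5], [4.0, 0.0, 0.5]]
val_labels = [0, 0, 1, 1]
T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", round(T, 2))
```

Because the model itself is untouched, the fitted value can be applied at inference time by dividing the logits by `T` before the softmax.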

Post calibration, we notice that we are able to better classify low-confidence scores under wrong prediction bins.

Old Reliability Graph (on Validation Set)

New Reliability Graph

Thresholding

Keeping our current objective in mind, where we reject low-confidence predictions, we realized that thresholding individual intents on a temperature-scaled model could work quite well for us. We wanted an increase in our precision numbers without hurting recall for our intents. This is because we want to maximize our confidence in the current prediction without inadvertently increasing our false negatives, i.e. we still want to be accurate while predicting our positive class (which could be any intent here). We plotted precision-recall curves with thresholds on the x-axis and precision/recall on the y-axis. We have default recipes to generate threshold values that maximize precision without affecting recall. A data scientist or ML engineer can then have a look at these plots and accordingly decide which threshold values to go with if they feel that

A precision-recall curve for the `_confirm_` intent
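One way such a recipe could look is sketched below: for a single intent, sweep the candidate thresholds and keep the one with the highest precision whose recall stays within a small tolerance of the no-threshold recall. The function names, the `max_recall_drop` parameter, and the toy scores are hypothetical; this is an illustration of the idea, not our production recipe.

```python
def precision_recall(scores, is_positive, threshold):
    """Precision and recall for one intent when scores below threshold are rejected."""
    tp = sum(1 for s, y in zip(scores, is_positive) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, is_positive) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, is_positive) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(scores, is_positive, max_recall_drop=0.01):
    """Return (threshold, precision, recall) maximizing precision under a recall budget."""
    base_precision, base_recall = precision_recall(scores, is_positive, 0.0)
    best = (0.0, base_precision, base_recall)
    for threshold in sorted(set(scores)):
        p, r = precision_recall(scores, is_positive, threshold)
        if r >= base_recall - max_recall_drop and p > best[1]:
            best = (threshold, p, r)
    return best


# Calibrated scores for one intent, and whether that intent was the true label.
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.4, 0.3, 0.2]
is_positive = [True, True, False, True, True, False, False, False]
threshold, precision, recall = pick_threshold(scores, is_positive)
print(threshold, precision, recall)
```

In this toy data the low-scoring negatives are rejected while every true positive is kept, which is the "precision up, recall unchanged" behaviour the curves are used to look for.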

Results

After performing the above experiments on our datasets, we noticed a bump of 10 points in macro-precision without affecting the macro-recall (\(\pm\) 1 point in recall). Our model became more robust to mis-classifications and was able to reject low-confidence predictions thereby increasing the confidence in the model to handle sensitive compliance-related turns. Our product metrics also improved by a margin - after implementing the above solution we were able to bring down false consumer verification from ~0.67% to ~0.18% on the dataset.

Caveats

Temperature Scaling does not work well when the dataset distributions differ, i.e. whenever there’s data drift between the production data and the validation data, the calibration of the model is not accurate. If there is data drift, the above methodology needs to be performed again to keep the calibration valid. Thresholding also reduces recall when we push further for precision gains.

Citations

[1] Guo, Chuan, et al. “On calibration of modern neural networks.” International Conference on Machine Learning. PMLR, 2017.
[2] https://towardsdatascience.com/confidence-calibration-for-deep-networks-why-and-how-e2cd4fe4a086
[3] Zhang, Hongyi, et al. “mixup: Beyond empirical risk minimization.” arXiv preprint arXiv:1710.09412 (2017).
[4] Neal, Radford M. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media, 2012.
[5] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. “Simple and scalable predictive uncertainty estimation using deep ensembles.” Advances in neural information processing systems 30 (2017).

Speech-First Conversational AI Revisited

2023-05-11T00:00:00+00:00

Around last year, we shared our views on how nuances of spoken conversations make voicebots different than chatbots. With the recent advancements in conversational technology, thanks to Large Language Models (LLMs), we wanted to revisit the implications on what we call Speech-First Conversational AI. This post is one of many such reexaminations.

We will try quoting the older blog post wherever possible, but if you haven’t read the older post, you are encouraged to do so here before going any further.

What’s Changed?

In one line, the problem of having believable and reasonable conversations is solved with the current generation of LLMs. You would still get factual problems and minor niggles, but I could ask my grandmother to sit and chat with an LLM based text bot without breaking her mental model of how human conversations could happen, at all.

Internally, we use the phrase “text conversations are solved” to describe the impact of LLMs on our technology. But how does this reflect in spoken conversations? Will they also be solved? Sooner rather than later, sure. But there are some details to look into that go beyond raw textual models to do this well.

Beyond the statement of “text conversations are solved”, there are more upgrades that make us excited about their implications for spoken conversations. The most important is the hugely improved capability to model any behavior that can be meaningfully translated into text. For example, if you want to do speech backchannel modeling right now, you might get very far by connecting a perception system with an LLM rather than building something else altogether. This pattern is part of the promises of AGI, and knowing that we are getting there gradually is very stimulating.

Spoken Dialog

Let’s revisit the points that make spoken conversations different from textual ones, as described in the post last year. In the main, all the points are still relevant, but the complexities involved in solutions are different now. As a Speech AI company, this is helping us get better answers to the question of how we should go about more natural interactions between humans and machines.

1. Signal

Speech isn’t merely a redundant modality, but adds valuable extra information. Different styles of uttering the same utterance can drastically change the meaning, something that’s used a lot in human-human conversations.

This is still relevant and important. While speech recognition systems have started to become better in transcribing content, robust consumption of non-lexical content is still a problem to solve for doing spoken conversations.

One of our fresh research works (releasing soon) involves utilizing prosodic information along with lexical information to improve language understanding, and the gain we got is still significant.

2. Noise

Speech recognition systems have come a long way since 2022. WER performance, even on noisy audio, is extremely good, and one can trust ASRs a lot more for downstream consumption than one could last year.

More non-speech markers, timing information, etc. are accessible easily and accurately which could be clubbed with LLMs directly to simplify modeling behaviors like disfluencies.

3. Interaction Behavior

We don’t take turns in a half-duplex manner while talking. Even then, most dialog management systems are designed like sequential turn-taking state machines where party A says something, then hands over control to party B, then takes back after B is done. The way we take turns in true spoken conversations is more full-duplex and that’s where a lot of interesting conversational phenomena happen.

While conversing, we freely barge in, attempt corrections, and show other backchannel behaviors. When the other party also starts doing the same and utilizing these, both parties can have much more effective and grounded conversations.

Simulacrums of full-duplex systems, like Google Duplex, have already hinted at why this is important. While the product impact of full-duplex conversations has been elusive because of the technology’s brittleness, with LLMs and better speech models, the practical viability of this is pretty high now.

A natural thread of work is modeling conversations speech to speech, which is already happening in the research community. But even before perfecting that, we can get significantly better at spoken interactions with currently available technologies and some clever engineering.

4. Runtime Performance

In chat conversations, model latencies and their variance over a sample don’t impact user experience a lot. Humans look at chat differently, and latency, even in seconds, doesn’t change the user experience as much as it does in voice.

This makes it important for the voice stack to run much faster to avoid any violation of implicit contracts of spoken conversations.

This is something where heavy LLMs don’t do well natively. Large, high-quality models often require a GPU and optimization to meet speech latency requirements efficiently.

5. Personalization and Adaptation

With all the extra added richness in the signals, the potential of personalization and adaptation goes up. A human talking to another human does many micro-adaptations including the choice of words (common with text conversations) and the acoustics of their voices based on the ongoing conversation.

Sometimes these adaptations get ossified and form sub-languages that need different approaches for designing conversations. In our experience, people talking to voice bots talk in a different sub-language, a relatively understudied phenomenon.

As LLMs reduce the complexity and effort needed to model and design behaviors, we should get more product-level work on this in both textual and speech conversations. You might already see a bunch of AI talking heads, personas, etc. with the promise of adapting to you. Something like this was possible earlier but with much more effort than now.

6. Response Generation

With LLMs, the quality of response content is extremely high and natural. And if it is not, you can always make it so by providing a few examples. Specifically for speech, LLMs are a good substrate for modelling inputs to speech synthesis. Instead of hand-tuning SSML, we can now let an LLM model high-level markers to guide the right generation of spoken responses at the right time.

Additionally, similar to speech recognition, speech synthesis has got huge upgrades since last year. Systems like Bark provide a glimpse of the high quality of utterances along with the higher-order control that could be driven by an LLM.

7. Development

This stays the same as before. Audio datasets still have more information than text and the maintenance burden is higher.

There is, though, a general reduction of complexity on the language understanding side because one model handles many problems together, which reduces annotation, system maintenance, and other related efforts.

What’s Next?

With higher-order emergent behaviors coming in LLMs, there is a general lifting up of the problems that we solve in ML. All this has led to an unlocking of sorts, where everyone is rethinking the limits of automation. For a product like ours, goal-oriented voicebots, we expect the reduction in modeling complexity to increase the extent of automation, even for dialogs that were considered the forte of human agents.

Technologically, the time is ripe to achieve great strides towards truly natural spoken conversations with machines. Something that was always undercut because of the, rightfully present, friction between user experience and technological limitations. Note that the definition of natural here still hangs on the evolving dynamics of human machine interactions, but we will see a phase transition for sure.

Incorporating context to improve SLU

2022-08-04T00:00:00+00:00

Introduction

In task-oriented dialogue systems, spoken language understanding (SLU) refers to the task of parsing natural language utterances into semantic frames. The problem of contextual SLU mainly focuses on effectively incorporating dialogue information. Current SLU systems that work in tandem with ASR (voice bots) only take the ASR transcription as input to predict the intent. As such, the amount of information these transcripts carry is quite limited.

Why context?

The bot prompts are a treasure trove of contextual information. This information can be used to build a better intent classification model. A few examples with the bot prompts and their intents are shown below:

Bot Prompt	User Response	Intent
Hi! I am Divya, your Hathway virtual assistant. Are you looking for a new Hathway Broadband connection?	yes	confirm_new_connection
Okay!. To start with, please tell me if you are next to your Hathway device?	I am	confirm_near_device
For how many people do you want to book the table for?	seven	number_guests

If we observe the first example here, the transcription just consists of ‘yes’, but if we include the bot prompt, we enrich the input to our SLU, thereby increasing the overall intent classification performance. The context of a Hathway Broadband connection enriches the meaning of the utterance ‘yes’.

Using Bot prompts as context

We curated a private dataset across clients in which we collected user utterances along with all the bot prompts. We then concatenate the bot prompts with the user utterances. After retraining our intent classification model on some clients, we observed a performance jump of 20-30% in the intent-F1 scores.
This probably happened because our user transcriptions are not rich enough in information. Closing that information deficit with the bot prompts gave the model a richer input and better performance. Another probable reason for such a huge jump is that transcriptions generated by voice bots are less accurate than the input to a chat bot, where a user types their response. As a result, the gain from contextual information is much larger for voice bots than for, say, a chat bot. However, this was not the case with all our datasets. We observed that datasets with a large number of classes did not perform on par with datasets with fewer classes. One probable reason is that these datasets are highly granular with respect to intents, and as such there wasn’t any significant bump in performance. We also observed that small-talk intents such as confirm, deny, etc. had a massive jump in performance compared to other types of intents.
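As an illustration of the concatenation step, the sketch below joins the last bot prompt and the user transcript into one classifier input. The `[SEP]` separator and the function name `build_slu_input` are assumptions for the example; the actual separator depends on the tokenizer and model in use.

```python
def build_slu_input(bot_prompt, user_utterance, separator=" [SEP] "):
    """Concatenate the bot prompt and the user transcript into one model input."""
    return bot_prompt.strip() + separator + user_utterance.strip()


example = build_slu_input(
    "For how many people do you want to book the table for?",
    "seven",
)
print(example)
# For how many people do you want to book the table for? [SEP] seven
```

With the prompt attached, a bare "seven" or "yes" carries enough context for the classifier to separate intents such as number_guests from other numeric answers.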

Some probable approaches from literature

After observing an improvement in our models by just using a single bot prompt, we decided to delve a bit further and found many approaches that can be utilized for our use case. While doing the literature review, we observed that encoding the contextual prompt along with the user prompt gives the best performance amongst all the methods. The current approach of concatenating bot prompts with the user prompt acts as a natural baseline for our subsequent experiments in this direction. We discuss the encoding-based approaches from the literature below:

Encoding dialogue History [1, 2]

We can encode [1] the complete dialogue history as shown below. Let us assume that the dialogue is a sequence \(D_{t} = \{u_{1}, u_{2}, \ldots, u_{t}\}\) of bot and user utterances, and at every time step \(t\) we are trying to output the classification for the user utterance \(u_{t}\), given \(D_{t}\). We then divide the model into two components: the context encoder, which acts on \(D_{t}\) to produce a vector representation of the dialogue context denoted by \(h_{t} = H(D_{t})\), and the tagger, which takes this context encoding \(h_{t}\) and the current utterance \(u_{t}\) as input and produces the intent output.

Context Encoder Architecture

The baseline context encoder just encodes the previous bot prompt \(u_{t-1}\) with a single bidirectional RNN (BiRNN) layer with Gated Recurrent Unit (GRU) cells. The final state of the context encoder GRU is used as the dialogue context, \(h_{t} = BiGRU(u_{t-1})\). For memory networks, we encode all the dialogue context utterances \(\{u_{1}, u_{2}, \ldots, u_{t}\}\) into memory vectors denoted by \(\{m_{1}, m_{2}, \ldots, m_{t}\}\) using a BiGRU encoder. To add temporal context to the dialogue history utterances, we append special positional tokens to each utterance: \(m_{k} = BiGRU_{m}(u_{k}), \: 0 \le k \le t-1\). The current utterance is also encoded using a BiGRU and is denoted by \(c\). Let \(M\) be the matrix whose \(i\)th row is given by \(m_{i}\). A cosine similarity is obtained between each memory vector \(m_{i}\) and the context vector \(c\). The softmax of this similarity is used as an attention distribution over the memory \(M\), and an attention-weighted sum of \(M\) is used to produce the dialogue context vector \(h_{t}\): \(a = softmax(Mc)\), \(h_{t} = a^{T}M\).
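The attention step over the memory can be sketched without any deep learning framework. This is a minimal illustration, assuming the memory vectors and the current-utterance vector have already been produced by the BiGRU encoders; it follows the formula as written, using inner products for the scores, and normalizing the vectors beforehand would give the cosine-similarity variant mentioned in the text. The function names and toy vectors are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(memory, current):
    """Return (attention weights a, context vector h) for memory rows M and vector c."""
    # Score each memory vector against the current utterance encoding.
    scores = [sum(m_d * c_d for m_d, c_d in zip(m, current)) for m in memory]
    weights = softmax(scores)
    # Attention-weighted sum of the memory rows.
    dim = len(current)
    context = [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
    return weights, context


# Toy encodings: three past utterances and the current utterance.
memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
current = [1.0, 0.2]
weights, context = attend(memory, current)
print(weights, context)
```

The resulting context vector is what the tagger receives alongside the current utterance.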

Tagger Architecture

A stacked BiRNN tagger is then used to model intent classification.

Results

This approach was benchmarked on multi-turn dialogue sessions, with intent classification evaluated specifically on the task of reserving tables at a restaurant. The intent F1 score is 0.890 with the memory network as the contextual encoder and 0.865 when encoding just the last prompt.

Another approach [2] is to have a different encoding mechanism for bot and user utterances. This approach uses a system act encoder to obtain a vector representation \(a^{t}\) of all system dialogue acts \(A^{t}\). An utterance encoder is then used to generate the user utterance encoding \(u^{t}\) by processing the user utterance token embeddings \(x^{t}\). We then have a dialogue encoder that summarizes the content of the dialogue using \(a^{t}\), \(u^{t}\), and its previous hidden state \(s^{t-1}\) to generate the dialogue context vector \(o^{t}\), and also updates the hidden state. The dialogue context vector is then used for intent classification. Both the encoders use a hierarchical RNN that processes a single utterance at a time.

System Act Encoder

The system act encoder encodes the set of dialogue acts \(A^{t}\) at turn \(t\) into a vector \(a^{t}\) invariant to the order in which they appear.

Utterance Encoder

The utterance encoder takes in the list of user utterance tokens as input. Let \(x^{t}\) denote the utterance token embeddings, which are encoded using a bi-directional GRU: \(u^{t}, u^{t}_{o} = BRNN_{GRU}(x^{t})\). We get the embedding representation \(u^{t}\) of the user utterance and \(u^{t}_{o}\) is the concatenation of the final states and the intermediate outputs of the forward and backward RNNs respectively.

Dialogue Encoder

The dialogue encoder incrementally generates the embedded representation of the dialogue context at every turn. As shown in the figure below, it takes in \(a^{t} \oplus u^{t}\) and its previous state \(s^{t-1}\) as inputs and outputs the updated state \(s^{t}\) and the encoded representation of the dialogue context \(o^{t}\).

The above encoded feature is then projected to the number of intent classes using a linear layer: \(p_{i}^{t} = softmax(W_{i} \cdot o^{t} + b_{i})\).

Results

The dialogues are obtained from a simulated dialogues dataset. The dataset has dialogues from the restaurant and movie domains with a total of 3 intents. The baseline for this approach uses no context and gets an overall intent accuracy of 84.76%, whereas using the previous dialogue encoder output (\(o^{t-1}\)) and the current system act encoding (\(a^{t}\)) gets 99.54%.

Probable approaches

Another approach that has not been discussed in the literature is using a time decay function to decay the effect of older bot prompts. This would help in focusing more on the recent prompts and reduce the effect of older prompts.
We can also experiment by fusing different modalities (text and speech) with utterance and dialogue. The emotion from the speech modality could help in infusing much better context into the input for the intent classification.
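Since the time-decay idea is untested, the following is only a speculative sketch of what it could look like: exponentially decaying weights over prompt encodings ordered from oldest to newest, combined into one context vector. The exponential form, the `decay_rate` parameter, and the function names are all assumptions.

```python
import math


def decay_weights(num_prompts, decay_rate=0.5):
    """Normalized weights for prompts ordered oldest -> newest (newest has age 0)."""
    ages = range(num_prompts - 1, -1, -1)
    raw = [math.exp(-decay_rate * age) for age in ages]
    total = sum(raw)
    return [w / total for w in raw]


def decayed_context(prompt_vectors, decay_rate=0.5):
    """Weighted sum of prompt encodings, favouring the most recent prompts."""
    weights = decay_weights(len(prompt_vectors), decay_rate)
    dim = len(prompt_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, prompt_vectors))
            for d in range(dim)]


weights = decay_weights(3)
print([round(w, 3) for w in weights])
```

Setting `decay_rate` to 0 recovers a plain average over all prompts, which gives a natural baseline to compare against.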

Conclusion

The above approaches and experiments show that context for SLU predictions can prove to be extremely useful for improving intent F1 scores. These above approaches are also not computationally expensive and can be easily deployed at scale for various use-cases.

References

[1] Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2017. Sequential Dialogue Context Modeling for Spoken Language Understanding. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 103–114, Saarbrücken, Germany. Association for Computational Linguistics.
[2] Gupta, R., Rastogi, A., & Hakkani-Tür, D.Z. (2018). An Efficient Approach to Encoding Context for Spoken Language Understanding. ArXiv, abs/1807.00267.

Investigating Label Noise in intent classification datasets and fixing it

2022-06-19T00:00:00+00:00

Introduction

Label noise has been a consistent problem even in the most widely used open source datasets. Several papers have come up with various deep learning techniques to make models more robust to label noise present in their train sets. Even so, identifying label noise in your dataset and investigating its cause is an important process to further understand model behaviour and prevent label noise in future datasets.

In this blog, we discuss why we decided to fix label noise in our datasets, followed by some statistical cleaning methods we tested to narrow down regions within the dataset where label noise could be present.

Why fix label noise?

Test sets should be clean to serve as a benchmark for future decisions. To measure the impact of noisy train sets, we plot a graph of model performance versus % label noise. To conduct this experiment, we retagged an old dataset from one of our clients and thoroughly reviewed it to identify and fix mislabelled examples. The total number of mislabelled examples was 13% of the whole dataset (7591 instances). We flipped the gold labels into their noisy counterparts, trained a model on the newly formed dataset and plotted the results.

Impact of train set label noise on our model performance

In the above graph, we observe that at 0% label noise, the model performance is around 73.8% F1, and at ~13% label noise, the model performance drops to 70.8% F1.

Different cleaning methods to fix the label noise

By measuring the reduction in cleaning effort from the baseline, we can assess the efficacy of a cleaning method. We plotted label noise recall vs % of samples retagged (or annotator effort). Relating these metrics to the previous impact graph allows us to reach interesting conclusions such as: clean y% of the dataset using a method M, and you will get some x% bump in model performance.

Random Sampling

Here we sample some fixed number of instances and get them retagged. This serves as our baseline for other methods. The label noise we capture will be around 13% of each partial sample, and hence the recall will equal the fraction of the whole dataset that the partial sample covers. On average, the plot will look similar to y=x, like this one for our dataset:

Biased Sampling

This requires intermittent involvement from ops (tagging after every sampling iteration)

In this method, we first randomly retag x% of samples. Then we identify the major tag confusions (as shown in the Dataset section), pick the top 5 noisy tags, and increase the weights associated with these tags in the sampling function. Then we sample again, pick the top 5 tags, and repeat.

We see an improvement over the baseline. We can capture around 60% of the total label noise by just tagging around 32% of the total dataset by this heuristic.

Datamaps

This paper introduces datamaps, a tool to diagnose training sets during the training process itself. The authors introduce two metrics, confidence and variability, to understand training dynamics. They further plot each instance on a confidence vs variability graph and create hard-to-learn, ambiguous and easy regions. These regions correspond to how easy it is for the model to learn the particular instance. They also observe that the hard-to-learn regions corresponded to instances that had label noise.

Confidence - This is defined as the mean model probability of the true label across epochs.
Variability - This measures the spread of model probability across epochs, using the standard deviation.

The intuition is that instances with consistently lower confidence scores throughout the training process are hard for a model to learn. This could be because the model is not capable of learning the target label or that the target label was incorrect.

We leverage the training artefacts from the paper to define a label score for each sample - as the Euclidean distance between (0,0) and (confidence, variability). Following the hypothesis of hard-to-learn regions, we expect noisy samples to have a lower label score.
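The label score can be computed from the per-epoch probability that the model assigns to the given label. The sketch below is a minimal illustration under that assumption; the function name `label_score` and the toy probabilities are hypothetical, and the population standard deviation is used for variability.

```python
import math
import statistics


def label_score(true_label_probs):
    """Distance from (0, 0) to (confidence, variability) for one sample.

    true_label_probs: probability assigned to the sample's label at each epoch.
    """
    confidence = statistics.mean(true_label_probs)
    variability = statistics.pstdev(true_label_probs)  # population std. dev.
    return math.hypot(confidence, variability)


easy_sample = [0.90, 0.95, 0.97]   # consistently confident in the given label
hard_sample = [0.05, 0.10, 0.08]   # consistently unconvinced by the given label
print(label_score(easy_sample), label_score(hard_sample))
```

Samples whose given label the model never believes end up close to the origin, so they get the lowest scores and are reviewed first.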

Threshold on label-score
Fixing a threshold on the label score means that all samples that score below it are considered label noise, and those that score above are considered clean. Assuming we do human retagging of all samples predicted as label noise, fixing a threshold essentially fixes both the % of samples retagged and (given the clean tags) the label noise recall. Varying the threshold, we get a plot for our dataset:

Looking at our metrics, we see an improvement in the partial recleaning process. Let’s read the above plots. Say we fix the threshold at 0.43, which means we would be retagging around 28% of our dataset. This corresponds to a label noise recall of 60%, giving us a resulting dataset with 5.2% label noise, down from 13% (= 0.40 × 13%).
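The thresholding step and the residual-noise arithmetic can be written out as a short sketch. The function names and the toy label scores are hypothetical; the 13% and 60% figures are the ones quoted above.

```python
def flag_for_retagging(label_scores, threshold):
    """Indices of samples whose label score falls below the threshold."""
    return [i for i, score in enumerate(label_scores) if score < threshold]


def residual_noise(initial_noise_pct, label_noise_recall):
    """Label noise (in %) left after fixing the captured share of noisy labels."""
    return (1.0 - label_noise_recall) * initial_noise_pct


flagged = flag_for_retagging([0.2, 0.9, 0.41, 0.6], threshold=0.43)
print(flagged)  # [0, 2]

remaining = residual_noise(13.0, 0.60)
print(round(remaining, 1))  # 5.2
```

This assumes every flagged sample that is truly noisy gets corrected by the annotators, so recall translates directly into removed noise.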
n-consecutive correct instances
Here, we use the ordering of the label scores. Our added assumption here is that the ordering within the regions is also useful. Based on this, we sort our samples by label score in ascending order. This means the noisy samples should be nearer to the top, and we base our heuristic on this. We start human retagging from the top of the sorted list of samples and stop once we see N consecutive clean samples. Varying N, we get a plot for our dataset:

Again, we see an improvement in the partial recleaning process. Let’s read the above plots. Say we fix N at ~38, which means we would be retagging around 35% of our dataset. This corresponds to a label noise recall of 76%, which means we would capture and clean 76% of the label noise, giving us a resulting dataset with 3.1% label noise (= (1 - 0.76) × 13%).
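The stopping rule can be sketched as a small loop. This is an illustration only: `is_clean` stands in for the human retagging step (a hypothetical callback that reports whether the annotator agreed with the existing label), and the sample ids and scores are made up.

```python
def retag_until_clean_streak(samples, is_clean, n):
    """Review samples in ascending label-score order until n clean ones in a row.

    samples: list of (sample_id, label_score) pairs.
    is_clean: callable sample_id -> bool, standing in for human review.
    Returns (reviewed_ids, noisy_ids).
    """
    ordered = sorted(samples, key=lambda item: item[1])
    reviewed, noisy, streak = [], [], 0
    for sample_id, _score in ordered:
        reviewed.append(sample_id)
        if is_clean(sample_id):
            streak += 1
            if streak >= n:
                break  # enough consecutive clean samples, stop retagging
        else:
            noisy.append(sample_id)
            streak = 0  # a noisy sample resets the streak
    return reviewed, noisy


samples = [("a", 0.1), ("e", 0.7), ("c", 0.3), ("b", 0.2),
           ("d", 0.6), ("f", 0.8), ("g", 0.9)]
truly_noisy = {"a", "c"}
reviewed, noisy = retag_until_clean_streak(
    samples, lambda sid: sid not in truly_noisy, n=2)
print(reviewed, noisy)
```

A larger N means more annotator effort but a lower chance of stopping before the remaining noisy samples are reached.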

Cleanlab

Cleanlab is a label noise prediction tool that takes in predicted probabilities. We evaluated the accuracy of this tool as well, although we won’t be able to capture all the noisy labels with it. Since cleanlab depends on the model's output probabilities, it can’t be used to correct train sets.

Confident Learning

The high-level idea of Confident Learning: when the predicted probability of an example is greater than a per-class threshold, we confidently count that example as actually belonging to that threshold’s class. The threshold for each class is the average predicted probability of examples in that class.

Confident Learning estimates a joint distribution between noisy observed labels and the true latent labels. It assumes that the predicted probabilities are out-of-sample holdout probabilities (eg. K-fold cross validation). If this isn’t the case then overfitting may occur. Their algorithm also assumes that class-conditional label noise transitions are data independent.
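The counting step can be sketched in plain NumPy as below. This is our own illustration of the idea, not cleanlab's API: the function name is hypothetical, it assumes every class appears at least once among the given labels, and when several classes clear their thresholds it picks the most probable one.

```python
import numpy as np

def confident_joint(pred_probs, given_labels):
    """Per-class thresholds and counts[given label, confidently predicted label]."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    given_labels = np.asarray(given_labels)
    n_classes = pred_probs.shape[1]
    # Threshold for class k: mean predicted probability of k over examples labelled k.
    thresholds = np.array(
        [pred_probs[given_labels == k, k].mean() for k in range(n_classes)]
    )
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for probs, given in zip(pred_probs, given_labels):
        above = np.flatnonzero(probs >= thresholds)
        if above.size == 0:
            continue  # not confident about any class, so the example is not counted
        likely_true = above[np.argmax(probs[above])]
        counts[given, likely_true] += 1
    return thresholds, counts

thresholds, counts = confident_joint(
    pred_probs=[[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.05, 0.95]],
    given_labels=[0, 0, 1, 1, 0],
)
```

Off-diagonal entries of `counts` are the candidates for label noise: here the last example is labelled 0 but confidently counted as class 1. The probabilities should come from held-out predictions, as noted above.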

Metrics using a model trained on noisy labels.

Tested on a separate test set

Results are slightly better when the model is trained on clean data

We expect cleanlab to perform even better once our model test accuracies improve. Cleanlab won’t be very useful if the model is performing poorly even on a clean dataset.

Minimizing tagging errors at source

To understand why our datasets had noisy labels, we conducted several review sessions with our annotators after they retagged datasets across multiple clients. We further classified each mislabelled example into a list of possible reasons as shown below. Here, gold tag refers to the ground truth tag. Each instance was tagged X times (X is the number of annotators) and the most frequent tag was chosen as the correct tag.

Since our intent classifiers were not multi-label, we wanted to capture the total % of multiple intent scenarios.

We observed that the label noise patterns for each of our clients were quite different, which made the problem of generalizing label noise prediction even more difficult.

Conclusion

To conclude, we quantified how using datamaps helps in reducing effort taken to clean our existing train sets. We also correlated this reduced cleaning effort with the expected improvement in model performance with the help of some plots.

Theory of Mind and Implications for Conversational AI

2022-05-19T00:00:00+00:00

When a diplomat says yes, he means ‘perhaps’;
When he says perhaps, he means ‘no’;
When he says no, he is not a diplomat.

—Voltaire (Quoted, in Spanish, in Escandell 1993.)

Introduction

Consider this example: You’re out in the street in a crowded area. A stranger walks up to you and asks for directions in your local language, L. You respond, and you notice from the stranger’s facial expressions that they seem confused and do not understand what you said. Now you’re confused as well and try to clarify your instructions, but the stranger later reveals that they aren’t very fluent in the language L. Hence you ask whether they understand a globally-used language E, the stranger confirms, and the conversation continues.

Let’s break down what occurred here.

The stranger asked a question in a local language L.
You now have a belief that the stranger speaks the local language L.
Due to your belief, you respond in the same language L, expecting the stranger to understand the information you’re trying to convey. This is a false belief, as it is later revealed.
You look for verbal/non-verbal cues from the stranger that they understood.
However, the stranger denies your expectation by showing an absence of such cues and instead shows cues of confusion.
You attempt to further elaborate your instructions taking these cues into account, however the stranger still seems confused.
The conversation at this point feels “awkward” since your expectations of the conversation were being denied multiple times.
The stranger reveals that they aren’t very fluent in L.
This confirms that your belief that the stranger understood L was false. This brings a sense of comfort since now you understand why your expectations were being denied.
You now correct for your false belief and ask whether the stranger understands a globally-used language E.
The stranger confirms.
And the conversation continues.

This mechanism of having expectations from the other participant is a basis for successful conversation. If we lacked such an ability, the emergence of mutually accepted meanings of words, and language itself would be impossible. This also applies to non-verbal communication, such as body language and facial expressions, and to some degree, is observed in many animal species.

We now dive deeper into this aspect of communication and formalize why both human-human and human-machine conversations break down and/or lead to frustration of the participants.

Theory of Mind

Whenever we converse, we take into account what we expect the other person to understand through our words as well as their possible responses. The ability to conceive such “theories” of other participants’ mental states is termed Theory of Mind (ToM).

Having ToM requires the agent to acknowledge the fact that others (including the agent itself) can believe in things which are not true. These beliefs are called false beliefs. An agent possessing ToM can identify its own as well as others’ false beliefs and take actions to confirm and hence correct these false beliefs.

What makes a conversation human-like?

Proposition:

A dialogue is human-like if both agents participating have some degree of Theory of Mind.

Theory of Mind is not limited to the content of the speech (such as the words spoken), but also addresses the mannerism of speech (prosody), facial and other non-verbal cues, etc. It is easy to see that if any one of the agents lacks ToM or has a poor ability, the conversation becomes uncomfortable and frustrating.

However, Theory of Mind is an acquired skill: human expertise in ToM matures over the lifespan [2], and it also depends on the amount of socialization the person partakes in. This makes quantifying the degree of expertise in ToM difficult; hence quantifying the degree of human-likeness is also difficult, in addition to being subjective.

Testing the Presence of Theory of Mind

Testing whether or not an agent is capable of modeling the mental states of others is important for many reasons; one such application is diagnosing mental disorders. Such tests are called false-belief tasks. These tests check whether the agent can model others’ false beliefs and/or confirm and correct its own false beliefs. We will discuss two popular false-belief tasks: the “Sally-Anne” and “Smarties” tasks.

Sally-Anne Task

The participating agent is told the following scenario:

Sally and Anne are inside a room.
Sally has a basket with one marble inside it.
Anne has an empty box with her.
Sally leaves the room without her basket.
Anne takes the marble out of Sally’s basket and puts it in her own box.
Sally comes back inside the room.

Now, the participant is asked, “Where will Sally look for her marble?”. If the participant replies with “the basket”, this means that the participant is able to model the mental state of a fictional character Sally, and that she doesn’t know that Anne took her marble. Children below the age of 3-4 answer with “the box”, however, older children answer with “the basket”. Some children with mental disabilities such as Down syndrome and Autism are unable to pass this test.

Smarties Task

Smarties is a popular brand of candies.

The participant is presented with a box labelled “Smarties”.
The participant is asked “what is in the box?”.
The participant replies with “candies”.
The box is opened and it is revealed that the box actually contains pencils.
The participant is asked “What would someone else think is inside the box?”.
The participant passes the test if they respond with “candies”.

Theory of Mind is an acquired skill and is not innate, i.e., we aren’t born with the ability to model others’ mental states. A study [1] shows that children first pass false-belief tasks at around 3-4 years of age, around the same time as children first learn to tell lies, suggesting that learning to lie is a precursor to possessing ToM. This does make sense, as lying would only help if the other participant is capable of having false beliefs. Language and communication are also acquired skills.

Theory of Mind: Relevance to Conversational AI

Having ToM allows for certain mechanisms that would not be possible otherwise. Some are listed below:

The ability of the agent to recognize its own errors in perceiving (mis-hearing), i.e., discover its own false beliefs and ask for clarifications. This also leads to a higher order reasoning capability of the agent.
The ability of the agent to dynamically model its counterpart throughout the conversation and adjust its own behaviour in order to maximize the success of the dialog. Dynamic response and prosody generation, turn-taking, barge-in handling, etc. are such examples.

Do Machines have a Theory of Mind?

One of the important goals of AI is to blend into the lives of humans and solve problems with humans in the loop. Achieving this requires modeling the humans and other machines around the agent, similar to how we humans do.

Some studies [3, 4] have shown that specially designed Multi-Agent Reinforcement Learning algorithms pass the Sally-Anne False-belief task. However, False-belief tasks have not been designed for/tested against chat/voice bots. In this section, we test multiple Language Models (LM) against the Sally-Anne and Smarties tasks, and check whether they pass the tests or not.

Methodology

All of the experiments were done using the Hugging Face Hub’s inference interface. These experiments can easily be rerun; however, getting the same results is not guaranteed since the inference is non-deterministic. The tasks are widely used and appear on Wikipedia and in scientific papers on which some or all of the LMs may have been trained, hence these tests are not conclusive.

Sally-Anne Task

Input Text: Sally and Anne are inside a room. Sally has a basket with one marble inside it. Anne has an empty box with her. Sally leaves the room without her basket. Anne takes the marble out of Sally's basket and puts it in her own box. Sally comes back inside the room. Sally will look for her marble in

If the LMs continue with her basket, the basket or anything similar, the LM passes the test, else it doesn’t.

Smarties Task

Input Text: Sally is presented with a box labeled "Candies". Sally is asked, "what is in the box?". Sally replies with "candies". The box is opened and is revealed that the box actually contains pencils. Sally is asked, "What would someone else think is in the box?". Sally answers

If the LMs continue with candies or anything similar, the LM passes the test, else it doesn’t.
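The pass/fail rule for both tasks can be sketched as a small check on the start of the model's continuation. This is only an approximation of the "or anything similar" judgement, which needs a human in practice. Here `generate` is a hypothetical callable wrapping whichever text-generation backend is used, and the keyword list and `max_words` cut-off are our own assumptions.

```python
from typing import Callable, Sequence

SALLY_ANNE_PROMPT = (
    "Sally and Anne are inside a room. Sally has a basket with one marble "
    "inside it. Anne has an empty box with her. Sally leaves the room without "
    "her basket. Anne takes the marble out of Sally's basket and puts it in "
    "her own box. Sally comes back inside the room. Sally will look for her "
    "marble in"
)

def passes_false_belief(
    generate: Callable[[str], str],
    prompt: str,
    accepted: Sequence[str],
    max_words: int = 5,
) -> bool:
    """True if an accepted answer appears in the first few generated words."""
    continuation = generate(prompt)
    # Only the opening words count, so a later mention does not pass the test.
    opening = " ".join(continuation.lower().split()[:max_words])
    return any(answer in opening for answer in accepted)

# Stub generators standing in for real language models.
passing_model = lambda prompt: " her basket, but the marble is gone."
failing_model = lambda prompt: " Anne's box."
result = passes_false_belief(passing_model, SALLY_ANNE_PROMPT, ["basket"])
```

The Smarties task uses the same check with its own prompt and `["candies", "candy"]` as accepted answers. Since sampling is non-deterministic, a single run per model is only indicative.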

Results

Language Model	# Params	Sally-Anne	Smarties
DistilGPT2	82M	Fail	Fail
GPT-Neo-125M	125M	Fail	Fail
GPT-Neo-1.3B	1.3B	Fail	Fail
GPT-2	1.5B	Pass	Pass
GPT-Neo-2.7B	2.7B	Pass	Pass
GPT-J-6B	6B	Pass	Pass

The largest three of the models pass both the tests. This suggests that scale might help LMs achieve some basic reasoning capabilities. This result is not surprising, since larger LMs usually do better in reasoning benchmarks.

P.S. The most entertaining response award goes to DistilGPT2 for "I don't give a fig about the box" for the Sally-Anne task. This is not made up, I swear!

Implications for Goal-Oriented Conversational AI

Open-domain chat has one important goal: engagement with the user. The user engages with the bot if the bot is entertaining the user. For this statement to hold true, the bot should respond appropriately, which in turn requires modeling the user, i.e., having a Theory of Mind. The degree of engagement can be seen as a measure of the degree of ToM of the bot.

Testing ToM is straightforward for open-domain (chit-chat) bots. However, this is tricky for goal-oriented bots, as they are designed to handle dialog under a specific domain. A false-belief task defined on one domain may be out-of-domain for another domain.

Open-domain dialog is a strict generalization of goal-oriented dialog. However, goal-oriented dialog may have goals which are defined differently from engagement. In many call-center settings, call resolution is the most important goal. However, when voice bots are used in place of human agents in call centers, a new and different behaviour of users arises: call drop. Users simply drop from the call if they:

get frustrated (due to mishearing, poor reasoning capabilities etc.).
think the bot is incapable of answering their queries, even if the bot is capable. This is a false-belief of the user, and the bot is unable to correct the user’s false-belief. Call drops occur in a major chunk of the calls (40-50%).

Most bots in the industry are designed in a way that assumes the user trusts the bot and has infinite patience. The bot’s behaviour is apparently designed to optimize for resolving the user's queries, but not to inspire trust in the user that the bot is capable of resolving them.

There are two possible ways to “solve” this problem:

Explicit: Design the product in a way that inspires trust. Come up with the best possible responses for all possible combinations of dialog history and user states.
Implicit: Design the product in a top-down fashion rather than bottom-up. Many believe that optimizing components with their local objective (word error rate for ASR, F1 scores for intent classifiers, etc.) would lead to a higher resolution rate. In biological systems, the higher-order function (survival) dictates the lower-order function (communication, language). Learning to communicate better cannot ensure survival on its own. However, learning to survive may lead to a better ability to communicate. In other words, optimize ML models against the global objective (resolution rate) in addition to the local objective. This will force the bot to behave in a way that inspires trust from the user and effectively learn to have a theory of mind of the users.

The first method is the industry standard, and it doesn’t seem to be working well. The second method has the clear advantage of being data-driven and scalable.

References

[1] Astington, J.W., & Edward, M.J. (2010). The Development of Theory of Mind in Early Childhood.

[2] Demetriou, A., Mouyi, A., & Spanoudis, G. (2010). “The development of mental processing”, Nesselroade, J. R. (2010). “Methods in the study of life-span human development: Issues and answers.” In W. F. Overton (Ed.), Biology, cognition and methods across the life-span. Volume 1 of the Handbook of life-span development (pp. 36–55), Editor-in-chief: R. M. Lerner. Hoboken, New Jersey: Wiley.

[3] Rabinowitz, N.C., Perbet, F., Song, H.F., Zhang, C., Eslami, S.M., & Botvinick, M.M. (2018). Machine Theory of Mind. ICML.

[4] Nguyen, T.N., & González, C. (2020). Cognitive Machine Theory of Mind. CogSci.

End of Utterance Detection

2022-04-24T00:00:00+00:00

This blog post is based on the work done by Anirudh Thatipelli as an ML research fellow at Skit.ai

End Of Utterance Detection - When does a speaker stop speaking?

End-of-utterance detection is the problem of detecting when a user has stopped speaking in a conversation.

In the above image, there are four turns in total that are time-aligned. The system initiates the conversation by speaking first (“How may I help you?”), then the user (“I want to go to Miami.”), then the system again (“Miami?”) and finally the user (“Yes.”).

The speaker who utters the first unilateral sound both initiates the conversation and gains possession of the floor. Having gained possession, a speaker maintains it until the first unilateral sounds by another speaker, at which time the latter gains possession of the floor.

Motivation

Despite many advances, the performance of spoken dialogue systems remains unsatisfactory. For example, turn-taking is a fundamental aspect of natural human conversation that helps to decide which participant has the floor in a conversation and who can speak next. Humans use many multimodal cues like prosodic features, gaze, etc. to determine who has the floor in a particular conversation. The interaction is very smooth, with very few gaps and overlaps between participants’ speech, making its modeling difficult. Currently, dialogue systems use a silence threshold to determine whether they should start speaking. This approach is too simplistic and can lead to many issues. The system can interrupt the user mid-utterance, known as cut-in. Or it can wait too long, which increases latency and leads to sluggish responses and possible misrecognition.

As speech-dialogue systems become more ubiquitous, it is essential to design dialogue systems that can predict end of utterance and predict turns.

A dialogue system designer should also consider the trade-offs between cut-ins and latency. For Skit, an effective turn-taking system will improve customer service and decrease call-drop rate. Building turn-taking capabilities into our product will make it more natural and improve the conversations with customers.

Previous approaches to solve the problem

One of the earliest models of conversation was proposed by Harvey Sacks et al., who divided a conversation into two kinds of units: Turn-constructional units (TCU) and Transition-relevant places (TRP).

Turn-constructional units are utterances from one speaker during which other participants assume the role of listeners. Each TCU is followed by a TRP, where a turn-shift can occur by the following rules:

The current speaker may select a next speaker (other-select), using for example gaze or an address term. In the case of dyadic conversation, this may default to the other speaker.

If the current speaker does not select a next speaker, then any participant can self-select. The first to start gains the turn.

If no other party self-selects, the current speaker may continue.

To identify these TCUs and TRPs, researchers segment the speech into Inter-Pausal Units (IPUs), which are stretches of audio from one speaker without any silence exceeding a stipulated amount (say, 200 ms). A voice activity detector (VAD) can detect these IPUs. Hence, a turn can be considered as a sequence of IPUs from a speaker that are not interrupted by IPUs from another speaker.
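A minimal sketch of IPU segmentation, assuming the VAD already gives one speech/non-speech decision per fixed-length frame for a single speaker. The function name, the frame length and the 200 ms default are illustrative choices.

```python
def inter_pausal_units(vad_frames, frame_ms=10, max_pause_ms=200):
    """Group speech frames into IPUs, returned as (start_ms, end_ms) pairs.

    Silences of at most max_pause_ms are bridged; longer ones close the IPU.
    """
    ipus = []
    start = None             # start time of the IPU being built
    last_speech_end = None   # end time of the most recent speech frame
    for i, is_speech in enumerate(vad_frames):
        t = i * frame_ms
        if not is_speech:
            continue
        if start is None:
            start = t
        elif t - last_speech_end > max_pause_ms:
            ipus.append((start, last_speech_end))
            start = t
        last_speech_end = t + frame_ms
    if start is not None:
        ipus.append((start, last_speech_end))
    return ipus

# 100 ms frames: a 200 ms pause is bridged, a 300 ms pause splits the speech.
frames = [1, 1, 0, 0, 1, 0, 0, 0, 1]
ipus = inter_pausal_units(frames, frame_ms=100, max_pause_ms=200)
```

A turn is then a run of consecutive IPUs from one speaker with no IPU from the other speaker in between.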

To identify TRPs (turn-yielding cues) and non-TRPs (turn-holding cues), many cues such as syntactic completion, prosody and non-verbal cues like eye contact have been investigated. However, it is very complicated to directly detect such cues from the data. This problem is compounded by the absence of facial cues in our data. The end-of-utterance task can also be defined as the detection of TRPs, i.e. when the user’s turn is yielded and the system can start to speak. There is a multitude of work in this regard, which can be divided into three types:

Silence-based models. The end of the user’s utterance is detected using a VAD. A silence duration threshold is used to determine when to take the turn. As discussed above, this is too simplistic and can lead to misrecognitions.
IPU-based models. Potential turn-taking points (IPUs) are detected using a VAD. Turn-taking cues in the user’s speech are processed to determine whether the turn is yielded or not (potentially also considering the length of the pause).
Continuous models. The user’s speech is processed continuously to find suitable places to take the turn, but also for identifying backchannel relevant places (BRP), or for making projections.

We will go through each of the approaches in the following sections:

Silence-based models

As mentioned above, existing architectures use a fixed silence duration detection threshold to determine if the speech has ended. VAD utilizes energy and spectral features to distinguish between noise and speech in the audio. Two types of parameters are taken into consideration while designing these kinds of models.

After the system has yielded the turn, it awaits a user response, allowing for a certain silence (a gap). If this silence exceeds the no-input-timeout threshold (such as 5 s), the system should continue speaking, for example by repeating the last question.
Once the user has started to speak, the end-silence-timeout (such as 700ms) marks the end of the turn. As the figure shows, this allows for brief pauses (shorter than the end-silence-timeout) within the user’s speech.
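The two timeouts can be sketched as a small loop over per-frame VAD decisions, starting from the moment the system yields the turn. The function name and return values are hypothetical, and the defaults simply reuse the example values of 5 s and 700 ms.

```python
def silence_based_endpoint(
    vad_frames, frame_ms=10, no_input_timeout_ms=5000, end_silence_timeout_ms=700
):
    """Return (event, time_ms): "no_input", "end_of_turn", or "listening"."""
    user_has_spoken = False
    silence_ms = 0
    t = 0
    for is_speech in vad_frames:
        t += frame_ms
        if is_speech:
            user_has_spoken = True
            silence_ms = 0  # any speech resets the silence counter
            continue
        silence_ms += frame_ms
        if not user_has_spoken and silence_ms >= no_input_timeout_ms:
            return ("no_input", t)      # system should speak again
        if user_has_spoken and silence_ms >= end_silence_timeout_ms:
            return ("end_of_turn", t)   # user is assumed to be done
    return ("listening", t)

# 100 ms frames: short gap, speech, a 200 ms pause, more speech, then 700 ms of silence.
frames = [0] * 3 + [1] * 2 + [0] * 2 + [1] + [0] * 7
event = silence_based_endpoint(frames, frame_ms=100)
```

The 200 ms pause inside the user's speech is tolerated, while the trailing 700 ms of silence ends the turn. Every response is therefore delayed by at least the end-silence timeout.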

These simplistic models break down when the user takes too long to respond or when the system interrupts the user’s speech.

Tuning the threshold for different domains is extremely difficult and user satisfaction will be affected.

IPU-based models

These systems are built on an assumption that the system should not start to speak while the user is speaking. Turn-taking cues at the end of pauses are used to determine whether a turn has ended. These approaches run the gamut from hand-crafted rule-based semantic parsers to machine-learning and reinforcement learning models.

Sato et al.’s work feeds over 100 different kinds of features, such as syntactic, semantic, final word, and prosody, to decision trees to model when to take a turn. Albeit simplistic, their model achieved an accuracy of 83.9%, compared to the baseline of 76.2%. However, this approach can misclassify the IPU as a pause and uses a fixed threshold of 750 ms for pauses. To overcome this limitation, Ferrer et al. condition a decision-tree classifier continuously on the length of the pause after the IPU and classify on the prosodic features and n-grams of the words. Raux and Eskenazi cluster silences based on dialogue features and set a single threshold for each cluster, reducing the overall latency by over 50% on the Let’s Go dataset.

Another shortcoming of the above approaches is that they are trained on human-computer dialogue corpora, whereas we want to learn a model for human-human dialogues. Transferring models from human-human to human-computer systems is not feasible. So some authors, like Raux and Eskenazi, and Meena et al., use bootstrapping. First, a simpler model of turn-taking is implemented in a system and interactions are recorded. Then, the data is manually annotated with suitable TRPs and used to train a machine learning model such as an LSTM. Another approach is a Wizard-of-Oz setup, where a hidden operator controls the system and makes the turn-taking decisions, as used in Maier et al.

Some previous approaches utilize reinforcement learning as well. For example, Jonsdottir et al. train two agents to talk to each other, picking up prosodic cues and developing turn-taking skills. Khouzaimi et al. train a dialogue management model intending to minimize the dialogue duration and maximize the task completion ratio. But these approaches are trained in simulated environments and it is unclear if they transfer to real users.

Continuous models

Continuous models process the utterances in an incremental manner. These modules process the input frame-by-frame and pass their results to subsequent modules. This enables the system to make continuous TRP predictions and to project turn completions and backchannels. Unlike previous approaches, the processing starts before the input is complete. The processing time is improved, and the output becomes more natural. There is no need to train the model for end-of-turn detection. It also enables a deeper understanding of utterances and allows the system to project backchannels and even interrupt the user.

One of the first works in incremental processing was Skantze and Schlangen on the task of number dictation. A benefit of incremental models is revision, as shown by Skantze and Hjalmarsson. For example, the word “four” might be amended with more speech, resulting in a revision to the word “forty”.

Another work by Skantze doesn’t train the model for end-of-turn detection. The audio from the speakers is processed frame-by-frame (20 frames per second) and fed to an LSTM. The LSTM predicts the speech activity for the two speakers for each frame in a future 3 s window. The model outperforms human judges in this task. In an extension to this work, Roddy et al. propose a new LSTM architecture where the acoustic and linguistic features get processed in separate LSTM systems with different timescales.

Datasets

Most of the aforementioned works evaluate their performance on dialogue based datasets like:

that have a limited purpose and may not generalize well to our problem.

Conclusion

While significant work has been done in end-of-utterance detection, most of these models have shortcomings. Firstly, most are trained on dialogue-based datasets only without accounting for speech-level features. Secondly, these datasets are well-curated with less noise in the background which is not the case for our datasets. To account for noise and model audio and text jointly, we will need to retrain our models with new baselines.

References

TTS Enhancement

2022-03-09T00:00:00+00:00

Problem Statement

Text-To-Speech (TTS) systems at Skit, as well as TTS systems in general, have a tendency to mix some ambient noise into the speech they output. The aim of this research project was to remove that noise and quantify how well the noise has been removed using standard metrics.

Listen to the clean speech sample here for reference-

and the distorted sample-

Introduction

Speech enhancement can be done using traditional signal processing techniques or using deep learning techniques. In this project, we mainly focused on the signal processing aspects of noise reduction. Signal processing techniques can be further divided into three categories:

Spectral Subtractive algorithms

The main principle is as follows: assuming additive noise, one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum. The noise spectrum can be estimated and updated during periods when the signal is absent. The assumption made is that noise is stationary or a slowly varying process and that the noise spectrum does not change significantly between the updating periods. The enhanced signal is obtained by computing the IDFT of the estimated signal spectrum using the phase of the noisy signal.
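A minimal NumPy sketch of this principle, assuming a separate noise-only recording is available for the noise estimate. It is simplified to non-overlapping rectangular frames and magnitude subtraction with a floor at zero; a practical version would use overlapping windowed frames. The function names and frame length are our own.

```python
import numpy as np

def _frames(x, frame_len):
    """Cut a 1-D signal into non-overlapping frames, dropping the remainder."""
    x = np.asarray(x, dtype=float)
    n = len(x) // frame_len
    return x[: n * frame_len].reshape(n, frame_len)

def spectral_subtract(noisy, noise_only, frame_len=256):
    """Subtract an average noise magnitude spectrum, keeping the noisy phase."""
    # Noise estimate: mean magnitude spectrum over the noise-only frames.
    noise_mag = np.abs(np.fft.rfft(_frames(noise_only, frame_len), axis=1)).mean(axis=0)
    spec = np.fft.rfft(_frames(noisy, frame_len), axis=1)
    # Subtract and clip at zero so magnitudes never go negative.
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    # Recombine with the phase of the noisy signal and invert the DFT.
    enhanced = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=frame_len, axis=1)
    return enhanced.reshape(-1)

rng = np.random.default_rng(0)
t = np.arange(256 * 40)
clean = np.sin(2 * np.pi * 8 * t / 256)
noisy = clean + 0.5 * rng.standard_normal(t.size)
enhanced = spectral_subtract(noisy, 0.5 * rng.standard_normal(256 * 20))
```

The clipping at zero is what leaves isolated spectral peaks behind, which are heard as musical noise.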

Statistical Model based algorithms

Given a set of measurements that depend on an unknown parameter, we wish to find a nonlinear estimator of the parameter of interest. These measurements correspond to the set of DFT coefficients of the noisy signal and the parameters of interest are the set of DFT coefficients of the clean signal. Various techniques from estimation theory which include maximum-likelihood (ML) estimators and the Bayesian estimators like MMSE and MAP estimators are used for this purpose.

Subspace algorithms

These algorithms are based on the principle that the clean signal might be confined to a subspace of the noisy Euclidean space. Given a method for decomposing the vector space of the noisy signal into a direct sum of the subspace that is occupied by the clean signal and a subspace occupied by the noise signal, for example SVD, we could estimate the clean signal simply by nulling the component of the noisy vector residing in the noise subspace.

Contributions

Filters

We hypothesised that signal processing techniques would be suitable for the task and tested them out. We implemented some of the popular speech enhancement methods, suitably modified to tackle the problem at hand.

Wiener Filter

Block Diagram of Wiener Filter

The input signal w[n] goes through a linear and time-invariant system to produce an output signal x[n]. We are to design the system in such a way that the output signal, x[n], is as close to the desired signal, s[n], as possible. This can be done by computing the estimation error, e[n], and making it as small as possible. The optimal filter that minimizes the estimation error is called the Wiener filter.
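As a sketch under stated assumptions, a finite-length (FIR) version of this filter can be estimated by least squares when a clean reference s[n] is available, for example in a simulation. Minimising the summed squared error over the samples is the sample-based counterpart of minimising the mean-square estimation error. The function name and the filter order are illustrative.

```python
import numpy as np

def fir_wiener(w, s, order=16):
    """Least-squares FIR filter h so that x[n] = sum_k h[k] * w[n - k] tracks s[n]."""
    w = np.asarray(w, dtype=float)
    s = np.asarray(s, dtype=float)
    padded = np.concatenate([np.zeros(order - 1), w])
    # Column k holds the input delayed by k samples (zeros before the start).
    delayed = np.stack(
        [padded[order - 1 - k : len(padded) - k] for k in range(order)], axis=1
    )
    h, *_ = np.linalg.lstsq(delayed, s, rcond=None)
    return h, delayed @ h   # filter taps and the output x[n]

rng = np.random.default_rng(0)
n = np.arange(4000)
s = np.sin(2 * np.pi * n / 50)                # desired signal s[n]
w = s + 0.5 * rng.standard_normal(n.size)     # noisy input w[n]
h, x = fir_wiener(w, s)
error_before = np.mean((w - s) ** 2)
error_after = np.mean((x - s) ** 2)
```

In deployment the clean reference is not available, so the correlations that define the filter have to be estimated from the noisy signal and a noise estimate.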

MMSE and MMSE Log Filter

These fall under the umbrella of Bayesian estimation techniques. We saw above that the Wiener estimator can be derived by minimizing the error between a linear model of the clean spectrum and the true spectrum. The Wiener estimator is considered to be the optimal (in the mean-square-error sense) complex spectral estimator, but is not the optimal spectral magnitude estimator. Acknowledging the importance of the short-time spectral amplitude (STSA) on speech intelligibility and quality, several authors have proposed optimal methods for obtaining the spectral amplitudes from noisy observations. In particular, an estimator is sought that minimizes the mean-square error between the estimated and true magnitudes:

\[e = E\{ (\hat{X}_k - X_k)^2 \}\]

where \(\hat{X}_k\) is the estimated spectral magnitude at frequency \(\omega_k\) and \(X_k\) is the true magnitude of the clean signal.

The MMSE Log is an improvement upon the MMSE estimator. Although a metric based on the squared error of the magnitude spectra is mathematically tractable, it may not be subjectively meaningful. It has been suggested that a metric based on the squared error of the log-magnitude spectra may be more suitable for speech processing. So we minimize:

\[e = E\{ (\log \hat{X}_k - \log X_k)^2 \}\]

and we notice a significant improvement in the results compared to the original MMSE estimator.

Berouti’s Oversubtraction

This method consists of subtracting an overestimate of the noise power spectrum, while preventing the resultant spectral components from going below a preset minimum value (spectral floor).

\[|\hat X(\omega)|^2=\begin{cases} |Y(\omega)|^2 - \alpha |\hat D(\omega)|^2& \text{if } |Y(\omega)|^2 \geq (\alpha + \beta) |\hat D(\omega)|^2 \\ \beta |\hat D(\omega)|^2 & \text{else} \end{cases}\]

where \(\alpha (\geq 1)\) is the oversubtraction factor and \(0 \leq \beta \leq 1\) is the spectral floor parameter.

When we subtract the estimate of the noise spectrum from the noisy speech spectrum, there remain peaks in the spectrum. Some of those peaks are broadband (encompassing a wide range of frequencies) whereas others are narrow band, appearing as spikes in the spectrum. By oversubtracting the noise spectrum, that is, by using \(\alpha\), we can reduce the amplitude of the broadband peaks and, in some cases, eliminate them altogether. This by itself, however, is not sufficient because the deep valleys surrounding the peaks still remain in the spectrum. For that reason, spectral flooring is used to “fill in” the spectral valleys and possibly mask the remaining peaks by the neighbouring spectral components of comparable value. The valleys between peaks are no longer deep when \(\beta > 0\) compared to when \(\beta = 0\).

The parameter \(\beta\) controls the amount of remaining residual noise and the amount of perceived musical noise. If the spectral floor parameter \(\beta\) is too large, then the residual noise will be audible but the musical noise will not be perceptible. Conversely, if \(\beta\) is too small, the musical noise will become annoying but the residual noise will be markedly reduced.

The parameter \(\alpha\) affects the amount of speech spectral distortion caused by the subtraction. If \(\alpha\) is too large, then the resulting signal will be severely distorted to the point that intelligibility may suffer.

\[\alpha = \alpha_0 - \frac{3}{20} \textit{SNR} \quad \text{for } -5 \text{ dB} \leq \textit{SNR} \leq 20 \text{ dB}\]

where \(\alpha_0\) is the desired value of \(\alpha\) at 0 dB SNR and the \(\textit{SNR}\) is the short term SNR estimated at each frame.
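The subtraction rule and the SNR-dependent oversubtraction factor can be sketched per frame as below. The defaults alpha0 = 4 and beta = 0.01 are illustrative assumptions rather than tuned values, and holding alpha constant outside the -5 dB to 20 dB range is likewise an assumption of this sketch.

```python
import numpy as np

def berouti_alpha(snr_db, alpha0=4.0):
    """Oversubtraction factor: linear in SNR between -5 dB and 20 dB."""
    snr_db = np.clip(snr_db, -5.0, 20.0)   # hold alpha constant outside the range
    return alpha0 - 3.0 * snr_db / 20.0

def oversubtract(noisy_power, noise_power, alpha, beta=0.01):
    """Power spectral subtraction with oversubtraction and a spectral floor."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    subtracted = noisy_power - alpha * noise_power
    floor = beta * noise_power
    # Keep the subtracted value only where it stays above the spectral floor.
    return np.where(noisy_power >= (alpha + beta) * noise_power, subtracted, floor)

alpha = berouti_alpha(0.0)                          # alpha0 at 0 dB SNR
clean_power = oversubtract([10.0, 1.0], [1.0, 1.0], alpha)
```

The first bin is well above the noise and is reduced by alpha times the noise power. The second bin would go negative, so it is set to the floor instead.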

Kalman Filter

The Kalman filter is a general recursive state estimation technique which is modified to work on the speech denoising problem.

Intelligibility Metrics

Along with techniques for speech enhancement, it is important to quantify the degree of enhancement that our methods provide. For this, we tested several metrics, discussed below.

Perceptual Evaluation of Speech Quality (PESQ) is a full-reference algorithm that analyzes the speech signal sample-by-sample after temporally aligning corresponding excerpts of the reference and test signals. PESQ results essentially model the mean opinion score (MOS), which covers a scale from 1 (bad) to 5 (excellent).

Short-Time Objective Intelligibility (STOI) is an objective metric showing high correlation (\(\rho=0.95\)) with the intelligibility of both noisy and TF-weighted noisy speech.

Gross Pitch Error (GPE) is the proportion of frames, considered voiced by both pitch tracker and ground truth, for which the relative pitch error is higher than a certain threshold, which is usually set to 20%.

Voicing Decision Error (VDE) is the proportion of frames for which an incorrect voiced/unvoiced decision is made.

F0 Frame Error (FFE) is the proportion of frames for which an error (either according to the GPE or the VDE criterion) is made. FFE can be seen as a single measure for assessing the overall performance of a pitch tracker.
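The three pitch metrics above can be computed together from two frame-aligned F0 tracks. The sketch below assumes that unvoiced frames are marked with an F0 of 0 and that both tracks have the same number of frames; the function name and that convention are illustrative assumptions.

```python
def pitch_tracking_errors(ref_f0, est_f0, threshold=0.2):
    """Return (GPE, VDE, FFE) for two equal-length F0 tracks in Hz.

    Assumption: a value of 0 marks an unvoiced frame.
    """
    assert len(ref_f0) == len(est_f0) and len(ref_f0) > 0
    n_frames = len(ref_f0)
    both_voiced = 0
    gross_errors = 0
    voicing_errors = 0
    for ref, est in zip(ref_f0, est_f0):
        ref_voiced, est_voiced = ref > 0, est > 0
        if ref_voiced != est_voiced:
            voicing_errors += 1  # wrong voiced/unvoiced decision
        elif ref_voiced:
            both_voiced += 1
            if abs(est - ref) / ref > threshold:
                gross_errors += 1  # relative pitch error above threshold
    gpe = gross_errors / both_voiced if both_voiced else 0.0
    vde = voicing_errors / n_frames
    ffe = (gross_errors + voicing_errors) / n_frames
    return gpe, vde, ffe
```

Note that GPE is normalised by the number of frames that both tracks consider voiced, while VDE and FFE are normalised by the total number of frames.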

Mel Cepstral Distortion (MCD) is a measure of how different two sequences of mel cepstra are. It is used in assessing the quality of parametric speech synthesis systems, including statistical parametric speech synthesis systems, the idea being that the smaller the MCD between synthesized and natural mel cepstral sequences, the closer the synthetic speech is to reproducing natural speech.
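MCD is commonly computed per frame as \(\frac{10}{\ln 10}\sqrt{2\sum_{d}(c_d - \hat c_d)^2}\) and then averaged over frames. The sketch below follows that form. It assumes the two sequences are already time-aligned (for example with dynamic time warping) and skips the 0th coefficient, which usually carries energy; both choices are assumptions and may differ from the variant used in our experiments.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB between two aligned sequences of mel-cepstral vectors.

    Assumptions: frames are already time-aligned and coefficient 0 is skipped.
    """
    assert len(ref_frames) == len(syn_frames) and len(ref_frames) > 0
    const = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += const * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)
```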

Results

We apply our methods to two datasets: first the public NOIZEUS dataset, and then a dataset created with Skit's in-house TTS systems. The results on the NOIZEUS dataset are quite satisfactory: the Wiener filter and the Kalman filter perform best, each outperforming the other at different signal-to-noise ratios (SNR).

Effect of filters wrt MCD metric

Effect of filters wrt PESQ metric

Conclusion

However, they do not perform as well as we want on the TTS dataset. In fact, we observe that our models adversely affect the input speech. Various reasons can be attributed to this, the primary one being that denoising real-life data and denoising TTS output are quite different problems, since the two have different noise types. Real-life noise is either additive, and can be subtracted after noise estimation, or the noisy signal can be decomposed as a direct sum of a clean subspace and a pure noise subspace. The noise in TTS systems is much more subtle and cannot be modelled as simply additive: here the noise is generated along with the speech. Hence most traditional filters, although they work well for real-life noise separation, do not work well for this use case. This is where we planned to resort to deep learning models like the Facebook denoiser and SeGAN.

Code

You can find more information on the GitHub Repository.

References

Author: Ananyapam De, a final year student at IISER Kolkata, majoring in Statistics, while minoring in the Computational Sciences.

Turn Taking Dynamics in Voice Bots

2022-03-07T00:00:00+00:00

One of the challenges in building an interactive voice bot is accounting for turn-taking behaviour. Turn-taking is a difficult problem to get right, even for humans. In all our circles, we’d know of at least one person who likes to interrupt a lot and doesn’t have good turn-taking etiquette. Having a conversation with such a person can be quite irritating, as one feels one is not being heard or even getting a chance to finish one’s sentence.

Turn-taking is even more difficult in a multi-party setting. You might remember the last group call you had and just when you were about to take the turn, someone else jumped right in (because you waited for a tad bit too long) and you never got to speak. Turn-taking behaviour also differs culturally. In some cultures, interruptions and barge-ins are a lot more natural. There is also a difference in the inter-turn pause duration. These factors often lead to an unnatural conversation flow when speaking to a person from a different culture.

Note : Bots with explicit turn-taking signalling like wake-words are out of scope for this blog.

Natural Turn Taking Dynamics

Irrespective of these nuances, there are aspects of turn-taking behaviour that are globally present in natural human-human conversation, and ones that we would want to build into human-bot interaction as well.

Barge-ins: These are situations when one agent interrupts the other. They occur very commonly. For example, one naturally barges in when one feels the other person is making a mistake, or when one feels the need to add some essential information.
Full Duplex Conversations : A half duplex conversation is one where turns are taken alternately, like in a tennis match. In natural conversations, however, there are often instances when both people are saying something at the same time.
- backchannels : words and fillers like “okay”, “alright” or “hmm” provide a lot of context about the state of the other person (for example, attentiveness), especially when one is talking over the phone and visual cues are absent.
- corrections : at times, when a person is saying something, one might want to make a small correction. For example, suppose an announcement is being made: “for the next meeting, you are supposed to finish submissions by 12th December, at so and so time….”. When the speaker says 12th December, someone might correct them by saying 13th December. The speaker assimilates this information and often corrects themselves. So humans have the ability to hear and understand even while speaking, and are active listeners.

Fig 1: Full duplex vs half duplex conversations.

Minimal inter-turn pauses : if you’ve ever spoken with a voice assistant, one of the first things you notice is that it takes too long to start speaking after you are done, and the other way around. Human conversations have a much lower turn-taking latency. If this latency is near optimal, it also lends a feeling that the other person understands you, and the lasting impression is that of a conversation gone well. Humans have an average pause duration of about 200ms, as shown below, while bots have a much higher latency.

Fig 2: Turn Taking Pause duration as measured from the Switchboard corpus. Image is taken from [1].

Turn taking cues : often in natural conversations, people produce small vocal cues like filler words “umm” or “uhhh” to convey that they want to say something and take the turn.
Turn yielding cues : there are markers in conversations that signal that a person is done speaking. This is how we are able to tell apart a pause that happens while a person is thinking mid-utterance from one that happens when they are done speaking.

Turn-Taking Dynamics in Voice Bots

Below, we discuss different versions of turn-taking dynamics implemented in voice-bots each with more features and increasing levels of difficulty.

Version - 1.0

These are some characteristics of the bare-bones turn-taking behaviour that one would need in a voice bot deployment.

Initial patience : the time that the bot waits for the person to start speaking
Silence detection : if the bot detects silence for a certain duration after the person has started speaking, it assumes the person’s turn is over.
Max turn duration : it doesn’t make sense to just be listening (because of error compounding, loss of context, maybe one is hearing just noise), so usually voice bots have a maximum duration to which they listen to the user.
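The three parameters above can be combined into a small endpointing loop over per-frame speech/silence decisions. This is a sketch under stated assumptions, not a description of a production system: the function name, the default durations, the 20ms frame size, and measuring the maximum turn duration from the start of speech are all illustrative choices.

```python
def end_of_turn_frame(is_speech, frame_ms=20, initial_patience_ms=5000,
                      silence_ms=1000, max_turn_ms=15000):
    """Return (frame_index, reason) at which the bot stops listening.

    is_speech: iterable of booleans, one per audio frame.
    All default durations are hypothetical values.
    """
    started = False
    speech_start_ms = 0
    silent_run_ms = 0
    for i, speech in enumerate(is_speech):
        elapsed_ms = (i + 1) * frame_ms
        if not started:
            if speech:
                started = True
                speech_start_ms = elapsed_ms - frame_ms
            elif elapsed_ms >= initial_patience_ms:
                return i, "no_input"  # initial patience ran out
            continue
        silent_run_ms = 0 if speech else silent_run_ms + frame_ms
        if silent_run_ms >= silence_ms:
            return i, "end_of_turn"  # silence detection fired
        if elapsed_ms - speech_start_ms >= max_turn_ms:
            return i, "max_duration"  # stop listening to a very long turn
    return None, "still_listening"
```

For example, with 100ms frames and a 200ms silence threshold, the input `[False, True, True, False, False]` ends the turn at frame 4.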

Version - 2.0

This version adds robustness for real-life situations, makes the bot more human-like, and tries to reduce the latency between turns.

VAD instead of silence detection : Background noise, background speech and other signals often cause the bot to keep listening. Instead of silence detection, one could train a Voice-Activity Detection (VAD) system, to be robust to background events and to listen to the user only when they are speaking.
Variable thresholds for silence detection and max duration : In some states for example, when the bot is expecting a yes/no answer, it makes sense to use smaller thresholds. In general dynamic thresholding should be used.
For turn-switching, instead of a simple VAD, use an IPU based model discussed here. This uses a smaller VAD threshold + cues to predict the turn is over. One could start with some verbal cues for example phrase completion.
Adding backchannels as bot responses : So far we’ve only discussed aspects of perception, but backchannels are a very useful response feature. It makes the user feel that the bot is more attentive and is actively listening.
- One could also add filler words in the main channel, when the bot is taking too long to produce a response in cases of high latency. This would prevent the user from asking a question to verify if the bot is there or not. Without this, the user’s speech would lead to further increase in latency as it would be perceived as a case when the user wants to take the turn and say something useful.
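Two of the ideas in this version, per-state thresholds and filler words under high latency, can be sketched as simple lookups. The state names, threshold values and the 700ms filler budget below are hypothetical and only illustrate the shape of such a configuration.

```python
# Hypothetical defaults and per-state overrides (all values are assumptions).
DEFAULT_THRESHOLDS = {"silence_ms": 1000, "max_turn_ms": 15000}
STATE_THRESHOLDS = {
    "confirm_yes_no": {"silence_ms": 400, "max_turn_ms": 5000},
    "collect_address": {"silence_ms": 1500, "max_turn_ms": 30000},
}


def thresholds_for(state):
    """Return the thresholds for a dialog state, falling back to defaults."""
    return {**DEFAULT_THRESHOLDS, **STATE_THRESHOLDS.get(state, {})}


def should_play_filler(ms_since_user_stopped, response_ready, filler_after_ms=700):
    """Play a filler word when the response is late, so the user keeps the floor with the bot."""
    return (not response_ready) and ms_since_user_stopped >= filler_after_ms
```

A state that expects a yes/no answer gets a shorter silence threshold, while an open-ended state such as address collection tolerates longer pauses.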

Version - 3.0

There are no good baselines for these and working on improvements would constitute state of the art performance.

Multi-party situations : These are a lot more complex and require modelling multiple parties. An application could be when the bot is overseeing a human-machine interaction say, between a call centre agent and a human. Another common use is when during a typical 2 party interaction, someone interrupts the user. This requires the bot being aware that the user is speaking to someone else and then waiting.
Full - Duplex Conversations : Unlike in human-human conversations, a bot can listen attentively while it is saying something. This offers the possibility of redesigning interactions to leverage this feature.
Personalisation of Turn taking behaviour : This involves changing the parameters based on user characteristics. One could entrain one’s system to be more in line with the user’s behaviour. At times when the user is angry, it might involve changing the durations so that they feel they are being heard.

References :

Turn-taking in Conversational Systems and Human-Robot Interaction: A Review

Feature Disentanglement - I

2022-02-22T00:00:00+00:00

The main advantage of deep learning is the ability to learn from data in an end-to-end manner. The core of deep learning is representation: a deep learning model transforms the representation of the data at each layer into a condensed representation with reduced dimension. Deep learning models are often termed black-box models because these representations are difficult to interpret. Understanding these representations can give us insight into which features of the data are more important, and allows us to control the learning process. Recently there has been a lot of interest in representation learning and in controlling the learned representations, which gives an edge on multiple tasks such as controlled synthesis and learning better representations for specific downstream tasks.

Data Representation and Latent Code

An image \((x)\) from the MNIST dataset has 28x28 = 784 dimensions, which is a sparse representation of the image that can be visualized. But not all of these dimensions are required to represent the image. The content of the image can be represented in a condensed form using fewer dimensions, called the latent code. Although the actual image has 784 dimensions, \(x \in R^{784}\), one way of representing an MNIST image is with just an integer, i.e. \(z \in \{0, 1, 2, \ldots, 9\}\). This representation \(z\) reduces the dimension of representing the image \(x\) to 1, and captures which number is present in the image and the variability in the dataset. This is one example of a discrete latent code for the MNIST dataset. A continuous latent code will contain more information about the image, such as the style of the image, the position of the number, the size of the number in the image, etc.

Fig 1: Sample Images of MNIST from [1]

AutoEncoder

Autoencoder[2] models are popularly used to learn such latent code in an unsupervised manner by compressing the image to a fixed dimension code \(z\) and generating the image back using this latent code with an encoder-decoder model.

Fig 2: Autoencoder architecture from Autoencoders[2]

The encoder \(q_{\phi}(z \mid x)\) of the autoencoder compresses the image to a latent code \((z)\) of fixed dimension \((d)\), and the decoder \(p_{\theta}(x \mid z)\) is a conditional image generator. The dimension of \(z\) has to be such that the image can be completely reconstructed by the decoder from the latent code. Choosing the dimension of the latent code is a problem on its own[3].

A trained autoencoder will successfully encode images into a latent code \(z\), but there is no guarantee that the latent code can be easily inferred, i.e. we do not know where in the d-dimensional space the model encodes an image, so it is difficult to choose a latent code to generate an image during inference. In short, we have no idea how and where the encoder encodes the images, so we have no control over synthesis during inference. The following figure shows the latent code learned by an autoencoder across different training runs: the latent space keeps changing its range and quadrant, and is thus difficult to infer.

Fig 3: Latent code of MNIST images learned by an Auto Encoder [4]

Variational AutoEncoder(VAE)

Variational autoencoders (VAE) [6] solve this problem by forcing the latent code \(z\) to be close to a known prior distribution (Gaussian), which gives us control over the latent space. During inference, the latent code can be sampled from this known distribution for image generation. The following figure shows the latent code learned by a VAE across different training runs; the latent space stays centered at mean 0 across dimensions.
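The "closeness" to the prior is usually measured with a KL divergence, which has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard normal. The sketch below, in plain Python, shows that term together with sampling from the encoder output and from the prior. The function names and the mean/log-variance parameterisation are assumptions for illustration; a real model would compute this on tensors in a deep learning framework.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng=random):
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def sample_prior(dim, rng=random):
    """Draw a latent code from the standard normal prior, as done at inference."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

The KL term is zero only when the encoder output matches the prior exactly, which is what pulls the latent space towards mean 0 in the figure below.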

Fig 4: Latent code of MNIST images learned by a VAE [4]

VAE allows us to have control over the latent space and sample from the known prior distribution. But this again does not give us control over the generation of the image. Say you want to generate an image of the number ‘3’ or ‘7’: you cannot do that (at least not directly). This is where the term “disentanglement” comes into play.

Disentanglement

Feature disentanglement is isolating the sources of variation in observed data. There are many more factors/features of an MNIST image than the number itself, such as the location of the number in the image, the size of the number, the angle of the number, etc. These factors are independent of each other.

Feature disentanglement involves separating the underlying concepts of, say, “big one on the left” into size (big), number (one) and location (left). Our interest here is to see if we can isolate these factors in the latent code so that we can have control over the generation of the images. So we want the encoder to disentangle the representation into different factors, and then we generate the image with the desired factors, say “small seven at the top rotated 30 degrees”.

Fig 5: MNIST images generated by InfoGAN [5] with varied digit, thickness and rotation.

Beta-VAE

Beta-VAE is a variant of VAE which allows disentanglement of the learned latent code. Beta-VAE adds a hyperparameter \(\beta\) to the loss function, which modulates the learning constraint of the VAE.

Fig 6: Loss function of beta-VAE [7].

The first part of the loss function takes care of the reconstruction of the image; it is the second term that shapes the latent code of the VAE. The dimensions of a Gaussian with diagonal covariance are independent, so by making the prior distribution such a Gaussian, we force the dimensions of the latent code to be independent of each other. Increasing the weight of the second part of the loss therefore makes the latent code more disentangled and independent. But this also brings a tradeoff between disentanglement and the reconstruction capability of the VAE. Although Beta-VAE models are good at disentangling the features, their reconstruction ability is not the best.
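For reference, the beta-VAE objective is commonly written in the following form, using the encoder \(q_{\phi}(z \mid x)\) and decoder \(p_{\theta}(x \mid z)\) notation from the autoencoder section. This is the standard textbook form, written as a quantity to maximise, and is given here as an aid to reading rather than as a transcription of the figure:

\[\mathcal{L}_{\beta}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)] - \beta \operatorname{KL}\left(q_{\phi}(z \mid x) \| p(z)\right)\]

Setting \(\beta = 1\) recovers the usual VAE objective, while \(\beta > 1\) puts more weight on matching the prior \(p(z)\) at the cost of reconstruction quality.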

Fig 7: Samples generated by beta-VAE [7].

Beta-TCVAE

beta-TCVAE decomposes the loss function of the VAE into a reconstruction term plus three terms that come from the KL divergence[10] term: the index-code mutual information[8] between the data and the latent variable, the Total Correlation[9] of \(z\), and the dimension-wise KL divergence[10] of \(z\) (in that order in the following formula). This breaks the overall KL divergence of \(z\) into quantities that include dimension-wise terms, which focus on each dimension of the latent code \(z\). In this formulation, the beta hyperparameter is applied only to the Total Correlation term, which matters most for disentanglement, without affecting the reconstruction. So Beta-TCVAE has better reconstruction ability than Beta-VAE, with a similar disentanglement property.

\[\mathcal{L}_{\beta-\mathrm{TC}}:=\mathbb{E}_{q(z \mid n) p(n)}[\log p(n \mid z)]-\alpha I_{q}(z ; n)-\beta \operatorname{KL}\left(q(z) \| \prod_{j} q\left(z_{j}\right)\right)-\gamma \sum_{j} \operatorname{KL}\left(q\left(z_{j}\right) \| p\left(z_{j}\right)\right)\]

where \(\alpha = \gamma = 1\) and only \(\beta\) is varied as the hyperparameter.

Fig 8: Samples generated by beta-TCVAE [8].

In future posts, we will examine many new methods for feature disentanglement and how these methods can be applied to speech signals.