<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2024-07-01T08:12:27+00:00</updated><id>/feed.xml</id><title type="html">Skit Tech</title><subtitle>Speech Technology from Skit</subtitle><entry><title type="html">Speech LLMs for Conversations</title><link href="/speech-conversational-llms/" rel="alternate" type="text/html" title="Speech LLMs for Conversations" /><published>2024-05-09T00:00:00+00:00</published><updated>2024-05-09T00:00:00+00:00</updated><id>/speech-conversational-llms</id><content type="html" xml:base="/speech-conversational-llms/">&lt;p&gt;With LLMs, making conversational systems has become easier. You no longer need to
focus on the low-level details of categorizing semantics and designing
responses. Instead, you can concentrate on controlling high-level behaviors via
an LLM. This is the trend that we see most of the world moving towards as
products are using vendor combinations of ASR, LLM, and TTS with some dialog
management stitched in between. While this is going to be the norm soon, we want
to keep exploring areas from where the next set of quality improvements will
come.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/speech-first-conversational-ai-revisited/&quot;&gt;Earlier&lt;/a&gt; we discussed how spoken
conversations are richer than pure text and how the gap would not be bridged by
LLMs working purely on transcriptions. In one of our recent experiments we built
an efficient multi-modal LLM that takes speech directly to provide a better
conversational experience. For production usage, the constraint here is that
this should happen without losing the flexibility that you get in a text-only
LLM around writing prompts, making changes, evaluating, and debugging.&lt;/p&gt;

&lt;p&gt;Below is a conversation with our recent in-house Speech LLM based conversational
system. Notice that the extra information in speech enables some micro
personalizations, like the usage of gendered pronouns&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. You also get
a lower impact of transcription errors and, in general, better responses to
non-speech signals. With access to both speech and text domains, the model
allows for more fluent turn-taking, though not demonstrated in the current
conversation. In addition, our approach also reduces the combined model size
(&amp;lt;2B) for taking speech to response, leading to lower compute latency as
compared to larger systems.&lt;/p&gt;

&lt;style&gt;
.webvtt-player .media {
  display: unset;
}

.webvtt-player .container {
  width: unset;
}

.webvtt-player {
  font-family: sans-serif;
  font-size: 0.8em;
}
&lt;/style&gt;

&lt;div id=&quot;webvtt-player&quot; data-audio=&quot;../assets/audios/posts/speech-conversational-llms/audio.m4a&quot; data-transcript=&quot;../assets/audios/posts/speech-conversational-llms/transcript.vtt&quot; data-metadata=&quot;../assets/audios/posts/speech-conversational-llms/metadata.vtt&quot; /&gt;

&lt;script src=&quot;https://umd-mith.github.io/webvtt-player/webvtt-player.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;The model above doesn’t yet control speech synthesis beyond the textual markers
it can generate, but that’s something to be added soon (you might have noticed
erratic pitch shifts in the call above since TTS vendors don’t contextualize
based on past conversations). Stay tuned for more details on how we take this
and similar research areas forward.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Of course, concerns around the accuracy of paralinguistic predictions are
extremely important to address before taking something like this to production. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>Shangeth Rajaa</name></author><category term="Machine Learning" /><category term="llm" /><category term="speech" /><category term="conversations" /><summary type="html">With LLMs, making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come.</summary></entry><entry><title type="html">Improving consumer verification using confidence calibration and thresholding</title><link href="/confidence-calibration/" rel="alternate" type="text/html" title="Improving consumer verification using confidence calibration and thresholding" /><published>2024-01-09T00:00:00+00:00</published><updated>2024-01-09T00:00:00+00:00</updated><id>/confidence-calibration</id><content type="html" xml:base="/confidence-calibration/">&lt;p&gt;Over the past year, our team’s focus has shifted to building robust and scalable voicebots for US companies.
In particular, we are honing in on the use case of facilitating the collection of borrowed funds. Given the stringent 
compliance standards for user verification in the US, our voicebot must excel in this aspect, leaving no room for error. 
Our top priority is to avoid any inadvertent verification of false users, which could potentially lead to the exposure 
of sensitive debt-related information. On a biased dataset, curated to tackle this problem, false consumer verifications 
are estimated to occur in &lt;strong&gt;~0.67% of samples&lt;/strong&gt;.&lt;/p&gt;

&lt;h1 id=&quot;technical-analysis&quot;&gt;Technical Analysis&lt;/h1&gt;

&lt;p&gt;We wanted to do a deep dive into the problem from an ML standpoint. We noticed that we needed to be more confident in our 
predictions. But what do we mean by confidence in our predictions? It is defined as the ability of the model to provide 
an accurate probability of correctness for any of its predictions. For example, if our SLU model predicts that the intent 
is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_confirm_&lt;/code&gt; with a probability of 0.3, then the prediction has a 30% chance of being correct, provided the model has been calibrated 
properly.&lt;/p&gt;

&lt;figure&gt;
  &lt;img width=&quot;800&quot; alt=&quot;Can't See? Something went wrong!&quot; src=&quot;../assets/images/confidence-calibration-blog/slu-model.png&quot; /&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;../assets/images/confidence-calibration-blog/uncal-score-plot.png&quot; /&gt;
  &lt;figcaption style=&quot;font-style: italic; text-align: center;&quot;&gt;
    This plot shows how under-confident we are on lower probabilities. Naturally, classes with low probabilities should be 
    classified as wrong class predictions but our uncalibrated model is unable to do so.
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;metrics-and-diagrams&quot;&gt;Metrics and Diagrams&lt;/h2&gt;

&lt;p&gt;We realized that &lt;strong&gt;model miscalibration&lt;/strong&gt; and rejecting &lt;strong&gt;low-confidence predictions&lt;/strong&gt; are the two major problems that we need to solve. 
But how do we quantify model calibration?&lt;/p&gt;

&lt;h3 id=&quot;expected-and-max-calibration-error&quot;&gt;Expected and Max Calibration Error&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Expected Calibration Error (ECE)&lt;/strong&gt; measures the disparity between a model’s confidence and its accuracy. It can be computed directly 
using a closed-form formula or approximated by dividing predictions into bins based on their confidence scores. Within each bin, the 
absolute difference between the average confidence and the accuracy is calculated, and ECE is then the average of these bin-wise 
differences, weighted by bin size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maximum Calibration Error (MCE)&lt;/strong&gt; is similar to ECE but is equal to the maximum difference between average confidence and accuracy across bins.&lt;/p&gt;

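\[ECE = \sum_{i=1}^M \frac{|Bin_{i}|}{N} \cdot |a_{i} - c_{i}|\]

\[MCE = \max_{i \in \{1,\ldots,M\}} |a_{i} - c_{i}|\]

&lt;p&gt;Here \(M\) is the number of bins, \(N\) is the total number of predictions, \(|Bin_{i}|\) is the number of predictions in bin \(i\), and \(a_{i}\) and \(c_{i}\) are the accuracy and the average confidence of that bin. The lower the ECE and MCE values, the better calibrated the model is.&lt;/p&gt;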
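
&lt;p&gt;The binned approximation described above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;calibration_errors&lt;/code&gt;, the choice of 10 equal-width bins, and the toy inputs are all assumptions made for the example.&lt;/p&gt;

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins.

    confidences: top-class probabilities in [0, 1]
    correct: booleans, True where the prediction matched the label
    n_bins: number of equal-width bins (an assumed default)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # min() keeps a confidence of exactly 1.0 inside the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += len(members) / n * gap  # weight each gap by bin size
        mce = max(mce, gap)            # track the worst bin
    return ece, mce


# Toy data: six predictions with their confidence and correctness.
confidences = [0.95, 0.92, 0.85, 0.65, 0.55, 0.35]
correct = [True, True, False, True, False, False]
ece, mce = calibration_errors(confidences, correct)
print(f"ECE={ece:.3f} MCE={mce:.3f}")
```

&lt;p&gt;The per-bin accuracy and average confidence computed inside the loop are the same points that a reliability diagram plots.&lt;/p&gt;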

&lt;h3 id=&quot;reliability-diagrams&quot;&gt;Reliability Diagrams&lt;/h3&gt;

&lt;p&gt;Reliability diagrams depict accuracy on the y-axis and average confidence scores on the x-axis. A line plot is formed using the accuracy 
and the confidence points. A diagonal line through the origin indicates a perfectly calibrated model where confidence is equal to accuracy for each bin.&lt;/p&gt;

&lt;figure&gt;
  &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;../assets/images/confidence-calibration-blog/reliability-graph-deployed.png&quot; /&gt;
  &lt;figcaption style=&quot;font-style: italic; text-align: center;&quot;&gt;
    Reliability diagram of our deployed model. The dashed line represents perfect calibration. Each blue bar is a 
    confidence bin, showing the predictions that fall into it and their corresponding accuracy.
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h1 id=&quot;adopted-solution&quot;&gt;Adopted Solution&lt;/h1&gt;

&lt;p&gt;Upon careful analysis of our data and model, we have found evidence suggesting that calibrating our SLU (Spoken Language Understanding) 
models could effectively mitigate false positives and enhance the overall confidence of our model predictions. Subsequently, we delved 
into the existing literature to explore potential solutions to address this issue.&lt;/p&gt;

&lt;p&gt;We came across multiple solutions that tackle model miscalibration, such as ensemble-based calibration [5], mixup [3], and 
Bayesian Neural Networks [4]. We narrowed these down to one solution that was easy to integrate and did not require us to re-train our 
models: Temperature Scaling.&lt;/p&gt;

&lt;h2 id=&quot;temperature-scaling-1&quot;&gt;Temperature Scaling [1]&lt;/h2&gt;

&lt;p&gt;Temperature scaling involves tuning a softmax temperature value to minimize Negative Log-likelihood loss on a held-out validation set. 
This value is then used to &lt;em&gt;soften&lt;/em&gt; the output of the softmax layer.&lt;/p&gt;

\[\text{Softmax}(y)_i = \frac{\exp(y_i)}{\sum_j \exp(y_j)}\]

\[\text{Temp-Softmax}(y)_i = \frac{\exp(y_i/T)}{\sum_j \exp(y_j/T)}\]

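&lt;p&gt;The intuition behind temperature scaling is that dividing the logits by a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; value above 1 flattens the softmax output, which penalizes over-confident high probability scores and brings the reported confidence closer to the observed accuracy. The predicted class does not change, since scaling all logits by the same positive value preserves their order.&lt;/p&gt;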
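
&lt;p&gt;As a rough sketch of the tuning step, the snippet below picks the temperature that minimizes validation NLL. It uses a plain grid search for simplicity, whereas [1] tunes the temperature with a gradient-based optimizer. The helper names, the toy logits, and the candidate grid are assumptions made for illustration only.&lt;/p&gt;

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[label])
              for row, label in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


# Toy held-out set: four confident predictions, the last one is wrong.
val_logits = [[4.0, 0.0, 0.0], [5.0, 1.0, 0.0], [4.5, 0.5, 0.0], [0.0, 4.0, 0.0]]
val_labels = [0, 0, 0, 0]
candidate_grid = [0.5 + 0.1 * k for k in range(46)]  # 0.5 up to 5.0

best_t = fit_temperature(val_logits, val_labels, candidate_grid)
print("best temperature:", round(best_t, 1))
print("before:", softmax(val_logits[0]))
print("after: ", softmax(val_logits[0], best_t))
```

&lt;p&gt;Because only a single scalar is fitted on top of frozen logits, the underlying model does not need to be re-trained.&lt;/p&gt;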

&lt;p&gt;After tuning a temperature scaling value on our validation dataset, we noticed that our model was better calibrated (with lower ECE and MCE values as well) 
and was able to better classify low-confidence score predictions.&lt;/p&gt;

&lt;figure&gt;
  &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;../assets/images/confidence-calibration-blog/cal-score-plot.png&quot; /&gt;
  &lt;figcaption style=&quot;font-style: italic; text-align: center;&quot;&gt;
    Post calibration, we notice that we are able to better classify low-confidence scores under wrong prediction bins.
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure style=&quot;display: flex; flex-direction: row; justify-content: space-between;&quot;&gt;

  &lt;div style=&quot;width: 48%;&quot;&gt;
    &lt;img alt=&quot;Old Reliability Graph on Validation Set&quot; src=&quot;../assets/images/confidence-calibration-blog/old_reliability_graph.png&quot; /&gt;
    &lt;figcaption style=&quot;font-style: italic; text-align: center;&quot;&gt;
      Old Reliability Graph (on Validation Set)
    &lt;/figcaption&gt;
  &lt;/div&gt;

  &lt;div style=&quot;width: 48%;&quot;&gt;
    &lt;img alt=&quot;New Reliability Graph on Validation Set&quot; src=&quot;../assets/images/confidence-calibration-blog/new_reliability_graph.png&quot; /&gt;
    &lt;figcaption style=&quot;font-style: italic; text-align: center;&quot;&gt;
      New Reliability Graph (on Validation Set)
    &lt;/figcaption&gt;
  &lt;/div&gt;

&lt;/figure&gt;

&lt;h2 id=&quot;thresholding&quot;&gt;Thresholding&lt;/h2&gt;

&lt;p&gt;Keeping in mind our current objective of rejecting low-confidence predictions, we realized that thresholding individual intents 
on a temperature-scaled model could work quite well for us. We wanted to increase our precision numbers without hurting recall 
for our intents. This is because we want to maximize our confidence in the current prediction without inadvertently increasing our 
&lt;strong&gt;False Negatives&lt;/strong&gt;, i.e. we still want to be accurate while predicting our positive class (which could be any intent here). We plotted 
precision-recall curves with thresholds on the x-axis and precision/recall on the y-axis. We have default recipes to generate 
threshold values that maximize precision without affecting recall. A data scientist or an ML engineer can then 
look at these plots and decide which threshold values to go with.&lt;/p&gt;

&lt;figure&gt;
  &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;../assets/images/confidence-calibration-blog/precision-recall-curve.png&quot; /&gt;
  &lt;figcaption style=&quot;font-style: italic; text-align: center;&quot;&gt;
    A precision-recall curve for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_confirm_&lt;/code&gt; intent
  &lt;/figcaption&gt;
&lt;/figure&gt;
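
&lt;p&gt;One way such a recipe could look is sketched below: sweep candidate thresholds for a single intent and keep the one with the highest precision whose recall stays within a small tolerance of the unthresholded recall. The function names, the 0.01 tolerance, and the toy scores are hypothetical and are not our actual recipe.&lt;/p&gt;

```python
def precision_recall(scores, is_intent, threshold):
    """Precision and recall for one intent when scores below threshold are rejected.

    scores: calibrated probability assigned to the intent for each sample
    is_intent: True where the gold label is that intent
    """
    accepted = [truth for score, truth in zip(scores, is_intent) if score >= threshold]
    true_pos = sum(accepted)
    total_pos = sum(is_intent)
    precision = true_pos / len(accepted) if accepted else 1.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall


def pick_threshold(scores, is_intent, candidates, max_recall_drop=0.01):
    """Return (threshold, precision, recall) with the best precision at near-baseline recall."""
    base_precision, base_recall = precision_recall(scores, is_intent, 0.0)
    best = (0.0, base_precision, base_recall)
    for t in candidates:
        precision, recall = precision_recall(scores, is_intent, t)
        # Accept a threshold only if recall stays within the allowed drop.
        if recall >= base_recall - max_recall_drop and precision > best[1]:
            best = (t, precision, recall)
    return best


# Toy data for one intent: calibrated scores and whether the gold label matches.
scores = [0.95, 0.91, 0.83, 0.72, 0.45, 0.41, 0.33, 0.22]
is_intent = [True, True, True, True, False, False, False, False]
candidate_grid = [k / 10 for k in range(1, 10)]

print(pick_threshold(scores, is_intent, candidate_grid))
```

&lt;p&gt;In practice the tolerance is a product decision, and the plotted curves remain useful for checking how sharply recall falls just above the chosen threshold.&lt;/p&gt;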

&lt;h1 id=&quot;results&quot;&gt;Results&lt;/h1&gt;

&lt;p&gt;After performing the above experiments on our datasets, we noticed a bump of &lt;strong&gt;10 points&lt;/strong&gt; in macro-precision without affecting the 
macro-recall (\(\pm\) 1 point in recall). Our model became more robust to misclassifications and was able to reject low-confidence 
predictions, thereby increasing our confidence in the model’s ability to handle sensitive compliance-related turns. Our product metrics 
improved as well: after implementing the above solution, we were able to bring down false consumer verification from
&lt;strong&gt;~0.67% to ~0.18%&lt;/strong&gt; on the dataset.&lt;/p&gt;

&lt;h1 id=&quot;caveats&quot;&gt;Caveats&lt;/h1&gt;

&lt;p&gt;Temperature Scaling does not work well when the dataset distributions differ, i.e. whenever there’s data drift between the production data and the validation data, the calibration of the model is no longer accurate. If there’s data drift, the above methodology needs to be repeated to recalibrate the model. Thresholding also reduces recall when we push further for precision gains.&lt;/p&gt;

&lt;h1 id=&quot;citations&quot;&gt;Citations&lt;/h1&gt;

&lt;ol&gt;
  &lt;li&gt;Guo, Chuan, et al. “On calibration of modern neural networks.” &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;. PMLR, 2017.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://towardsdatascience.com/confidence-calibration-for-deep-networks-why-and-how-e2cd4fe4a086&quot;&gt;https://towardsdatascience.com/confidence-calibration-for-deep-networks-why-and-how-e2cd4fe4a086&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Zhang, Hongyi, et al. “mixup: Beyond empirical risk minimization.” &lt;em&gt;arXiv preprint arXiv:1710.09412&lt;/em&gt; (2017).&lt;/li&gt;
  &lt;li&gt;Neal, Radford M. &lt;em&gt;Bayesian learning for neural networks&lt;/em&gt;. Vol. 118. Springer Science &amp;amp; Business Media, 2012.&lt;/li&gt;
  &lt;li&gt;Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. “Simple and scalable predictive uncertainty estimation using deep ensembles.” &lt;em&gt;Advances in neural information processing systems&lt;/em&gt; 30 (2017).&lt;/li&gt;
&lt;/ol&gt;</content><author><name>Sanchit Ahuja</name></author><category term="Machine Learning" /><category term="slu" /><category term="compliance" /><category term="nlp" /><summary type="html">Over the past year, our team’s focus has shifted to building robust and scalable voicebots for US companies. In particular, we are honing in on the use case of facilitating the collection of borrowed funds. Given the stringent compliance standards for user verification in the US, our voicebot must excel in this aspect, leaving no room for error. Our top priority is to avoid any inadvertent verification of false users, which could potentially lead to the exposure of sensitive debt-related information. On a biased dataset, curated to tackle this problem, false consumer verifications are estimated to occur in ~0.67% of samples.</summary></entry><entry><title type="html">Speech-First Conversational AI Revisited</title><link href="/speech-first-conversational-ai-revisited/" rel="alternate" type="text/html" title="Speech-First Conversational AI Revisited" /><published>2023-05-11T00:00:00+00:00</published><updated>2023-05-11T00:00:00+00:00</updated><id>/speech-first-conversational-ai-revisited</id><content type="html" xml:base="/speech-first-conversational-ai-revisited/">&lt;p&gt;About a year ago, &lt;a href=&quot;https://tech.skit.ai/speech-first-conversational-ai/&quot;&gt;we shared our views&lt;/a&gt; on how nuances of spoken conversations make voicebots different from chatbots. With the recent advancements in conversational technology, thanks to Large Language Models (LLMs), we wanted to revisit the implications for what we call Speech-First Conversational AI. This post is one of many such reexaminations.&lt;/p&gt;

&lt;p&gt;We will try quoting the older blog post wherever possible, but if you haven’t read the older post, you are encouraged to &lt;a href=&quot;https://tech.skit.ai/speech-first-conversational-ai/&quot;&gt;do so here&lt;/a&gt; before going any further.&lt;/p&gt;

&lt;h2 id=&quot;whats-changed&quot;&gt;What’s Changed?&lt;/h2&gt;

&lt;p&gt;In one line, the problem of having believable and reasonable conversations is solved with the current generation of LLMs. You would still get factual problems and minor niggles, but I could ask my grandmother to sit and chat with an LLM based text bot without breaking her mental model of how human conversations could happen, at all.&lt;/p&gt;

&lt;p&gt;Internally, we use the phrase “text conversations are solved” to describe the impact of LLMs on our technology. But how does this reflect in spoken conversations? Will they also be solved? Sooner rather than later, sure. But doing this well requires looking into some details that go beyond raw textual models.&lt;/p&gt;

&lt;p&gt;Beyond the statement of “text conversations are solved”, there are more upgrades that make us excited about their implications for spoken conversations. The most important is the hugely improved capability to model any behavior that can be &lt;em&gt;meaningfully translated&lt;/em&gt; into text. For example, if you want to do speech backchannel modeling right now, you might get very far by connecting a perception system with an LLM rather than building something else altogether. This pattern is part of the promises of AGI, and knowing that we are getting there gradually is very stimulating.&lt;/p&gt;

&lt;h2 id=&quot;spoken-dialog&quot;&gt;Spoken Dialog&lt;/h2&gt;

&lt;p&gt;Let’s revisit the points that make spoken conversations different from textual ones, as described in &lt;a href=&quot;https://tech.skit.ai/speech-first-conversational-ai/&quot;&gt;the post last year&lt;/a&gt;. In the main, all the points are still relevant, but the complexities involved in solutions are different now. As a Speech AI company, this is helping us get better answers to the question of how we should go about more natural interactions between humans and machines.&lt;/p&gt;

&lt;h3 id=&quot;1-signal&quot;&gt;1. Signal&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Speech isn’t merely a redundant modality, but adds valuable extra information. Different styles of uttering the same utterance can drastically change the meaning, something that’s used a lot in human-human conversations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is still relevant and important. While speech recognition systems have started to become better at transcribing content, robust consumption of non-lexical content is still a problem to solve for doing &lt;em&gt;spoken conversations&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;One of our fresh research works (releasing soon) involves utilizing &lt;em&gt;prosodic&lt;/em&gt; information along with lexical information to improve language understanding, and the &lt;em&gt;gain we got is still significant&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;2-noise&quot;&gt;2. Noise&lt;/h3&gt;

&lt;p&gt;Speech recognition systems have come a long way since 2022. WER performance, even on noisy audio, is extremely good, and one can trust ASRs a lot more for downstream consumption than one could last year.&lt;/p&gt;

&lt;p&gt;More non-speech markers, timing information, etc. are now accessible easily and accurately, and could be combined with LLMs directly to simplify modeling behaviors like disfluencies.&lt;/p&gt;

&lt;h3 id=&quot;3-interaction-behavior&quot;&gt;3. Interaction Behavior&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;We don’t take turns in a half-duplex manner while talking. Even then, most dialog management systems are designed like sequential turn-taking state machines where party A says something, then hands over control to party B, then takes back after B is done. The way we take turns in true spoken conversations is more &lt;em&gt;full-duplex&lt;/em&gt; and that’s where a lot of interesting conversational phenomena happen.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;While conversing, we freely barge-in, attempt corrections, and show other backchannel behaviors. When the other party also starts doing the same and utilizing these both parties can have much more effective and grounded conversations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Simulacrums of full-duplex systems, such as Google Duplex, have already hinted at why this is important. While the product impact of full-duplex conversations has been elusive because of the technology’s brittleness, with LLMs and better speech models, the practical viability of this is pretty high now.&lt;/p&gt;

&lt;p&gt;A natural thread of work is modeling conversations speech to speech, which is already happening in the research community. But even before perfecting that, we can get significantly better at spoken interactions with currently available technologies and some clever engineering.&lt;/p&gt;

&lt;h3 id=&quot;4-runtime-performance&quot;&gt;4. Runtime Performance&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;In chat conversations, model latencies and the variance over a sample don’t impact user experience a lot. Humans look at chat differently and latency, even in seconds doesn’t change the user experience as much as in voice.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;This makes it important for the voice stack to run much faster to avoid any violation of implicit contracts of spoken conversations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is something where heavy LLMs don’t do well natively. Large, high-quality models often require a GPU and optimization to meet speech latency requirements efficiently.&lt;/p&gt;

&lt;h3 id=&quot;5-personalization-and-adaptation&quot;&gt;5. Personalization and Adaptation&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;With all the extra added richness in the signals, the potential of personalization and adaptation goes up. A human talking to another human does many micro-adaptations including the choice of words (common with text conversations) and the acoustics of their voices based on the ongoing conversation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Sometimes these adaptations get ossified and form &lt;em&gt;sub-languages&lt;/em&gt; that need different approaches for designing conversations. In our experience, people talking to voice bots talks in a different sub-language, a relatively understudied phenomenon.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As LLMs reduce the complexity and effort needed to model and design behaviors, we should get more product-level work on this in both textual and speech conversations. You might already see a bunch of AI talking heads, personas, etc. with the promise of adapting to &lt;em&gt;you&lt;/em&gt;. Something like this was possible earlier but with much more effort than now.&lt;/p&gt;

&lt;h3 id=&quot;6-response-generation&quot;&gt;6. Response Generation&lt;/h3&gt;

&lt;p&gt;With LLMs, the quality of response content is extremely high and natural. And if it is not, you can always make it so by providing a few examples. Specifically for speech, LLMs are a good substrate for modelling inputs to speech synthesis. Instead of hand-tuning SSMLs, we can now let an LLM model high-level markers to guide the right generation of spoken responses at the right time.&lt;/p&gt;

&lt;p&gt;Additionally, similar to speech recognition, speech synthesis has seen huge upgrades since last year. Systems like &lt;a href=&quot;https://github.com/suno-ai/bark&quot;&gt;bark&lt;/a&gt; provide a glimpse of the high quality of utterance along with the higher order control that could be driven by an LLM.&lt;/p&gt;

&lt;h3 id=&quot;7-development&quot;&gt;7. Development&lt;/h3&gt;

&lt;p&gt;This stays the same as before. Audio datasets still have more information than text and the maintenance burden is higher.&lt;/p&gt;

&lt;p&gt;That said, there is a general reduction of complexity on the language understanding side because one model handles many problems together, which reduces annotation, system maintenance, and other related efforts.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next?&lt;/h2&gt;

&lt;p&gt;With higher order emergent behaviors coming in LLMs, there is a general &lt;em&gt;lifting up&lt;/em&gt; of problems that we solve in ML. All this has led to an unlocking of a sort where everyone is rethinking the limits of automation. For a product like ours (goal-oriented voicebots), we expect the reduction in modeling complexity to increase the extent of automation, even for dialogs that were considered the forte of &lt;em&gt;human agents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Technologically, the time is ripe to make great strides towards truly natural spoken conversations with machines, something that was always undercut by the rightfully present friction between user experience and technological limitations. Note that the definition of &lt;em&gt;natural&lt;/em&gt; here still hangs on the evolving dynamics of human-machine interactions, but we will see a phase transition for sure.&lt;/p&gt;</content><author><name>Abhinav Tushar</name></author><category term="Machine Learning" /><category term="speech" /><category term="llm" /><summary type="html">About a year ago, we shared our views on how nuances of spoken conversations make voicebots different from chatbots. With the recent advancements in conversational technology, thanks to Large Language Models (LLMs), we wanted to revisit the implications for what we call Speech-First Conversational AI. This post is one of many such reexaminations.</summary></entry><entry><title type="html">Incorporating context to improve SLU</title><link href="/contextual-slu/" rel="alternate" type="text/html" title="Incorporating context to improve SLU" /><published>2022-08-04T00:00:00+00:00</published><updated>2022-08-04T00:00:00+00:00</updated><id>/contextual-slu</id><content type="html" xml:base="/contextual-slu/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In task-oriented dialogue systems, spoken language understanding (SLU) refers to the task of parsing natural language utterances into
semantic frames. The problem of contextual SLU mainly focuses on effectively incorporating dialogue information. Current SLU systems that work in tandem with
ASR (voice bots) only take the ASR transcription as input to predict the intent. As such, these transcripts carry very little information.&lt;/p&gt;

&lt;h2 id=&quot;why-context&quot;&gt;Why context?&lt;/h2&gt;
&lt;p&gt;The bot prompts are a treasure trove of contextual information. This information can be used to build a better intent classification model. A few examples with the bot prompts and their intents are shown below:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;&lt;strong&gt;Bot Prompt&lt;/strong&gt;&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;&lt;strong&gt;User Response&lt;/strong&gt;&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;&lt;strong&gt;Intent&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Hi! I am Divya, your Hathway virtual assistant. Are you looking for a new Hathway Broadband connection?&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;yes&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;em&gt;confirm_new_connection&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Okay!. To start with, please tell me if you are next to your Hathway device?&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;I am&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;em&gt;confirm_near_device&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;For how many people do you want to book the table for?&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;seven&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;em&gt;number_guests&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;If we observe the first example here, the transcription just consists of &lt;em&gt;‘yes’&lt;/em&gt;, but if we include the bot prompt, we enrich the input to our SLU, thereby increasing the overall intent classification performance. The mention of &lt;em&gt;hathway broadband connection&lt;/em&gt; in the prompt supplies the context that the utterance &lt;em&gt;yes&lt;/em&gt; lacks on its own.&lt;/p&gt;

&lt;h2 id=&quot;using-bot-prompts-as-context&quot;&gt;Using Bot prompts as context&lt;/h2&gt;
&lt;p&gt;We curated a private dataset from our clients in which we collect user utterances along with all the bot prompts. 
We then concatenate the bot prompts with the user utterances. After retraining our intent classification model for some clients,
we observed a performance jump of &lt;strong&gt;20-30%&lt;/strong&gt; in the intent-F1 scores. &lt;br /&gt; 
This probably happened because our user transcriptions on their own do not carry enough information. Closing
that information deficit with the bot prompts gave the model a richer input and led to better performance. Another probable reason for such a huge jump is that transcriptions generated by ASR in voice bots are less accurate than the text a user types to a chat bot. As a result, the gain from contextual information is much larger for voice bots than for, say, a chat bot. However, this was not the case with all our datasets. We observed that
datasets with a large number of classes did not improve as much as datasets with fewer classes. One probable reason is that these datasets have
very fine-grained intents, and as such there wasn’t any significant bump in performance. We also observed that small-talk intents such as confirm, deny etc. had a much larger jump in performance than other types of intents.&lt;/p&gt;
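
&lt;p&gt;The concatenation step itself is simple. Below is a minimal sketch of how a bot prompt and a user utterance could be joined into a single classifier input. The function name and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[SEP]&lt;/code&gt; separator are assumptions made for illustration; the separator should match whatever the chosen model’s tokenizer expects.&lt;/p&gt;

```python
def build_slu_input(bot_prompt, user_utterance, separator=" [SEP] "):
    """Join the previous bot prompt and the ASR transcript into one classifier input."""
    prompt = " ".join(bot_prompt.split())        # collapse stray whitespace
    utterance = " ".join(user_utterance.split())
    return f"{prompt}{separator}{utterance}"


example = build_slu_input(
    "Are you looking for a new Hathway Broadband connection?",
    "yes",
)
print(example)
```

&lt;p&gt;With this input, a bare &lt;em&gt;yes&lt;/em&gt; reaches the classifier together with the question it answers.&lt;/p&gt;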


&lt;h2 id=&quot;some-probable-approaches-from-literature&quot;&gt;Some probable approaches from literature&lt;/h2&gt;

&lt;p&gt;After observing an improvement in our models from just a single bot prompt, we decided to delve a bit further and found many 
approaches that can be used for our use-case. While reviewing the literature, we observed that encoding the contextual prompt along with the user prompt gives the best performance amongst all the methods. The current approach of concatenating bot prompts with the user prompt acts as a natural baseline for our subsequent experiments in this direction. We discuss the encoding-based approaches from the literature below:&lt;/p&gt;

&lt;h3 id=&quot;encoding-dialogue-history-1-2&quot;&gt;Encoding dialogue History [1, 2]&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;We can encode [1] the complete dialogue history as shown below. Let us assume that the dialogue is
a sequence \(D_{t} = \{u_{1}, u_{2}, \dots, u_{t}\}\) of bot and user utterances and that at every time 
step \(t\) we are trying to output the classification for the user utterance \(u_{t}\), given \(D_{t}\).
We then divide the model into 2 components: the context encoder, which acts on \(D_{t}\) to produce
a vector representation of the dialogue context denoted by \(h_{t} = H(D_{t})\), and the tagger, which takes
this context encoding \(h_{t}\) and the current utterance \(u_{t}\) as input and produces the intent output.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&quot;context-encoder-architecture&quot;&gt;&lt;strong&gt;Context Encoder Architecture&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The &lt;strong&gt;baseline context encoder&lt;/strong&gt; encodes only the previous bot prompt \(u_{t-1}\) with a single bidirectional RNN (BiRNN) layer of Gated Recurrent Units (GRU). The final state of the context encoder GRU is used as the dialogue context, \(h_{t} = BiGRU(u_{t-1})\).
For &lt;strong&gt;memory networks&lt;/strong&gt;, we encode all the dialogue context utterances \(\{u_{1}, u_{2}, \dots, u_{t}\}\) into memory vectors \(\{m_{1}, m_{2}, \dots, m_{t}\}\) using a BiGRU encoder. To add temporal context to the dialogue history, we append special positional tokens to each utterance, \(m_{k} = BiGRU_{m}(u_{k}) \: \: 0 \le k \le t-1\).
The current utterance is also encoded using a BiGRU and is denoted by \(c\). Let \(M\) be the matrix whose \(i\)th row is \(m_{i}\). A cosine similarity is obtained between each memory vector \(m_{i}\) and the context vector \(c\). The softmax of this similarity is used as an attention distribution over the memory \(M\), and an attention-weighted sum of \(M\) is used to produce the dialogue context vector \(h_{t}\).
      \(a = softmax(Mc)\)
      \(h_{t} = a^{T}M\)&lt;/p&gt;
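&lt;p&gt;The attention step over the memory can be sketched in plain Python. This is a minimal illustration that assumes the memory vectors and the current-utterance vector have already been produced by the BiGRU encoders; it uses the plain dot product from the formula for \(a\), and the toy numbers are made up.&lt;/p&gt;

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dialogue_context(memory, current):
    """Compute a = softmax(Mc) and h_t = a^T M for list-based vectors."""
    scores = [sum(m_j * c_j for m_j, c_j in zip(row, current)) for row in memory]
    attention = softmax(scores)
    dim = len(current)
    context = [
        sum(a * row[j] for a, row in zip(attention, memory)) for j in range(dim)
    ]
    return attention, context


# Toy memory: three encoded utterances of dimension 2 (made-up numbers).
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = [2.0, 0.0]
a, h = dialogue_context(M, c)
print(a, h)
```

&lt;p&gt;Memory rows that align with the current utterance receive more attention weight and therefore contribute more to the context vector.&lt;/p&gt;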

&lt;h4 id=&quot;tagger-architecture&quot;&gt;&lt;strong&gt;Tagger Architecture&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;A stacked BiRNN tagger is then used to model intent classification.&lt;/p&gt;

&lt;h4 id=&quot;results&quot;&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;This approach was benchmarked on multi-turn dialogue sessions; for intent classification, the task was reserving tables at a 
restaurant. The intent F1 score with the memory network as the context encoder is &lt;strong&gt;0.890&lt;/strong&gt;, and with just the last prompt encoded it is &lt;strong&gt;0.865&lt;/strong&gt;. &lt;br /&gt;
&lt;img src=&quot;../assets/images/contextual/encoder_context.png&quot; alt=&quot;drawing&quot; width=&quot;600&quot; /&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Another approach [2] is to use different encoding mechanisms for bot and user utterances. This approach uses a system act encoder to obtain a vector representation \(a^{t}\) of all system dialogue acts \(A^{t}\). An utterance encoder is then used
to generate the user utterance encoding \(u^{t}\) by processing the user utterance token embeddings \(x^{t}\).
A dialogue encoder then summarizes the content of the dialogue using \(a^{t}\), \(u^{t}\) and its previous
hidden state \(s^{t-1}\) to generate the dialogue context vector \(o^{t}\) and update the hidden state.
The dialogue context vector is then used for intent classification. Both the encoders use a hierarchical RNN that processes a single utterance at a time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&quot;system-act-encoder&quot;&gt;&lt;strong&gt;System Act Encoder&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The system act encoder encodes the set of dialogue acts \(A^{t}\) at turn \(t\) into a vector \(a^{t}\) invariant to the order in which they appear.&lt;/p&gt;

&lt;h4 id=&quot;utterance-encoder&quot;&gt;&lt;strong&gt;Utterance Encoder&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The utterance encoder takes in the list of user utterance tokens as input. Let \(x^{t}\) denote the utterance token embeddings, which are encoded using a bi-directional GRU.
\(u^{t}, u^{t}_{o} = BRNN_{GRU}(x^{t})\)
Here \(u^{t}\), the embedding representation of the user utterance, is the concatenation of the final states of the forward and backward RNNs, and \(u^{t}_{o}\) is the concatenation of their intermediate outputs.&lt;/p&gt;

&lt;h4 id=&quot;dialogue-encoder&quot;&gt;&lt;strong&gt;Dialogue Encoder&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The dialogue encoder incrementally generates the embedded representation of the dialogue context at every turn. As shown in the figure below, it takes in \(a^{t} \oplus u^{t}\) and its previous state \(s^{t-1}\) as inputs
and outputs the updated state \(s^{t}\) and the encoded representation of the dialogue context \(o^{t}\).&lt;/p&gt;

&lt;p&gt;The encoded dialogue context is then projected to the number of intent classes using a linear layer. 
    \(p_{i}^{t} = softmax(W_{i} \cdot o^{t} + b_{i})\)&lt;/p&gt;

&lt;h4 id=&quot;results-1&quot;&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The dialogues come from a simulated dialogues dataset covering the restaurant and movie domains, with a total of 3 intents. The baseline for this approach uses no context and reaches an overall intent accuracy of &lt;strong&gt;84.76%&lt;/strong&gt;, whereas using the previous dialogue encoder output (\(o^{t-1}\)) together with the current system act encoding (\(a^{t}\)) reaches &lt;strong&gt;99.54%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../assets/images/contextual/encoder_context_2.png&quot; alt=&quot;drawing&quot; width=&quot;600&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;probable-approaches&quot;&gt;Probable approaches&lt;/h2&gt;
&lt;ol&gt;
  &lt;li&gt;Another approach, which we have not seen discussed in the literature, is to use a time decay function to decay the effect of older bot prompts. This would help the model focus more
on the recent prompts and reduce the effect of older prompts.&lt;/li&gt;
  &lt;li&gt;We can also experiment by fusing different modalities (text and speech) with utterance and dialogue. The emotion from the speech modality could help in infusing much better context into the input for the intent classification.&lt;/li&gt;
&lt;/ol&gt;
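&lt;p&gt;One possible form of the time decay idea is sketched below. The exponential shape, the &lt;em&gt;half_life&lt;/em&gt; parameter and the function names are assumptions made for illustration; the post does not fix a particular decay function.&lt;/p&gt;

```python
import math


def decay_weights(num_prompts, half_life=2.0):
    """Weight for each prompt, oldest first; the newest prompt gets weight 1."""
    rate = math.log(2.0) / half_life
    ages = range(num_prompts - 1, -1, -1)   # age 0 is the most recent prompt
    return [math.exp(-rate * age) for age in ages]


def decayed_context(prompt_vectors, half_life=2.0):
    """Weighted average of prompt vectors (oldest first) using the decay weights."""
    weights = decay_weights(len(prompt_vectors), half_life)
    total = sum(weights)
    dim = len(prompt_vectors[0])
    return [
        sum(w * vec[j] for w, vec in zip(weights, prompt_vectors)) / total
        for j in range(dim)
    ]


# With a half-life of one turn, each older prompt counts half as much.
weights = decay_weights(3, half_life=1.0)
print(weights)
```

&lt;p&gt;A shorter half-life makes the context depend almost entirely on the latest prompt, while a longer one approaches a plain average over the history.&lt;/p&gt;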

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The above approaches and experiments show that context can be extremely useful for SLU predictions and for improving 
intent F1 scores. These approaches are also not computationally expensive and can be easily deployed at scale for various use-cases.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2017. Sequential Dialogue Context Modeling for Spoken Language Understanding. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 103–114, Saarbrücken, Germany. Association for Computational Linguistics. &lt;br /&gt;
[2] Gupta, R., Rastogi, A., &amp;amp; Hakkani-Tür, D.Z. (2018). An Efficient Approach to Encoding Context for Spoken Language Understanding. ArXiv, abs/1807.00267.&lt;/p&gt;</content><author><name>Sanchit Ahuja</name></author><category term="Machine Learning" /><category term="slu" /><category term="context" /><category term="nlp" /><summary type="html">Introduction In task-oriented dialogue systems, the spoken language understanding, or SLU, refers to the task of parsing the natural language utterances into semantic frames. The problem of contextual SLU majorly focuses on effectively incorporating dialogue information. Current SLU systems that work in tandem with ASR (voice bots) only incorporate the asr transcription as an input for the SLU systems to predict intent. As such, the amount of information these transcripts have is quite less.</summary></entry><entry><title type="html">Investigating Label Noise in intent classification datasets and fixing it</title><link href="/label-noise-intro/" rel="alternate" type="text/html" title="Investigating Label Noise in intent classification datasets and fixing it" /><published>2022-06-19T00:00:00+00:00</published><updated>2022-06-19T00:00:00+00:00</updated><id>/label-noise-intro</id><content type="html" xml:base="/label-noise-intro/">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Label noise has been a consistent problem even in the &lt;a href=&quot;https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/f2217062e9a397a1dca429e7d70bc6ca-Paper-round1.pdf&quot;&gt;most widely used open source datasets&lt;/a&gt;. Several papers have come up with various &lt;a href=&quot;https://arxiv.org/pdf/2007.08199.pdf&quot;&gt;deep learning techniques&lt;/a&gt; to make models more robust to label noise present in their train sets. Even so, identifying label noise in your dataset and investigating its cause is an important process to further understand model behaviour and prevent label noise in future datasets.&lt;/p&gt;

&lt;p&gt;In this blog, we discuss why we decided to fix label noise in our datasets, followed by some statistical cleaning methods we tested to narrow down regions within the dataset where label noise could be present.&lt;/p&gt;

&lt;h2 id=&quot;why-fix-label-noise&quot;&gt;Why fix label noise?&lt;/h2&gt;

&lt;p&gt;Test sets should be clean to serve as a benchmark for future decisions. To measure the impact of noisy train sets, we plot model performance against % label noise. To conduct this experiment, we retagged an old dataset from one of our clients and thoroughly reviewed it to identify and fix mislabelled examples. The total number of mislabelled examples was 13% of the whole dataset (7591 instances). We flipped the gold labels into their noisy counterparts, trained a model on the newly formed dataset and plotted the results.&lt;/p&gt;

&lt;h3 id=&quot;impact-of-train-set-label-noise-on-our-model-performance&quot;&gt;Impact of train set label noise on our model performance&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;../assets/images/label-noise-blog/training_noise.png&quot; alt=&quot;image info&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the above graph, we observe that at 0% label noise the model performance is around 73.8% F1, and at ~13% label noise it drops to 70.8% F1.&lt;/p&gt;

&lt;h2 id=&quot;different-cleaning-methods-to-fix-the-label-noise&quot;&gt;Different cleaning methods to fix the label noise&lt;/h2&gt;

&lt;p&gt;By measuring the reduction in cleaning effort from the baseline, we can assess the efficacy of a cleaning method. We plotted label noise recall against the % of samples retagged (the annotator effort). Relating these metrics to the previous impact graph lets us reach conclusions such as: &lt;em&gt;clean y% of the dataset using a method M, and you will get some x% bump in model performance&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;random-sampling&quot;&gt;Random Sampling&lt;/h3&gt;
&lt;p&gt;Here we sample a fixed number of instances and get them retagged. This serves as our baseline for other methods. Around 13% of each random sample will be label noise, and hence the recall will equal the fraction of the whole dataset that was sampled. On average, the plot will look similar to y=x, like this one for our dataset:&lt;/p&gt;
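&lt;p&gt;A small simulation illustrates why the random baseline follows y=x. The dataset here is synthetic, with exactly 13% of the labels flagged as noisy; the function name and the sizes are assumptions for illustration only.&lt;/p&gt;

```python
import random


def random_sampling_recall(is_noisy, fraction, seed=0):
    """Recall of noisy labels when a random fraction of the data is retagged."""
    rng = random.Random(seed)
    sample_size = round(fraction * len(is_noisy))
    picked = rng.sample(range(len(is_noisy)), sample_size)
    found = sum(1 for i in picked if is_noisy[i])
    return found / sum(is_noisy)


# Synthetic dataset with 13% label noise (flags are made up for illustration).
flags = [i % 100 in range(13) for i in range(10000)]

for fraction in (0.1, 0.3, 0.5, 1.0):
    print(fraction, round(random_sampling_recall(flags, fraction), 3))
```

&lt;p&gt;Each printed recall should sit close to the sampled fraction, which is the diagonal that the other methods are compared against.&lt;/p&gt;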

&lt;p&gt;&lt;img src=&quot;../assets/images/label-noise-blog/random-sampling-plot.png&quot; alt=&quot;image info&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;biased-sampling&quot;&gt;Biased Sampling&lt;/h3&gt;
&lt;p&gt;This requires intermittent involvement from ops (tagging after every sampling iteration).&lt;/p&gt;

&lt;p&gt;In this method, we first randomly retag x% of samples. Then we identify the major tag confusions (as shown in the Dataset section), pick the top 5 noisy tags, and increase the weights associated with these tags in the sampling function. We then sample again, pick the top 5 tags, and repeat.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../assets/images/label-noise-blog/random-sampling-weights-plot.png&quot; alt=&quot;image info&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We see an improvement over the baseline. We can capture around 60% of the total label noise by just tagging around 32% of the total dataset by this heuristic.&lt;/p&gt;

&lt;h3 id=&quot;datamaps&quot;&gt;Datamaps&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2009.10795.pdf&quot;&gt;This&lt;/a&gt; paper introduces datamaps, a tool to diagnose training sets during the training process itself. The authors introduce two metrics, confidence and variability, to understand training dynamics. They plot each instance on a confidence vs variability graph and create hard-to-learn, ambiguous and easy regions. These regions correspond to how easy it is for the model to learn the particular instance. They also observe that the hard-to-learn region corresponds to instances that have label noise.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Confidence&lt;/em&gt; - This is defined as the mean model probability of the true label across epochs.&lt;br /&gt;
&lt;em&gt;Variability&lt;/em&gt; - This measures the spread of model probability across epochs, using the standard deviation.&lt;/p&gt;

&lt;p&gt;The intuition is that instances with consistently lower confidence scores throughout the training process are hard for a model to learn. This could be because the model is not capable of learning the target label or that the target label was incorrect.&lt;/p&gt;

&lt;p&gt;We leverage the training artefacts from the paper to define a label score for each sample - as the Euclidean distance between (0,0) and (confidence, variability). Following the hypothesis of hard-to-learn regions, we expect noisy samples to have a lower label score.&lt;/p&gt;
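&lt;p&gt;The two metrics and the label score can be computed from the per-epoch probabilities that the model assigns to the given label. The sketch below assumes those probabilities have already been logged during training; the function name and the example values are made up, and the population standard deviation is used for the spread.&lt;/p&gt;

```python
import math
import statistics


def label_score(true_label_probs):
    """Return (confidence, variability, score) from per-epoch probabilities."""
    confidence = statistics.fmean(true_label_probs)    # mean across epochs
    variability = statistics.pstdev(true_label_probs)  # spread across epochs
    score = math.hypot(confidence, variability)        # distance from (0, 0)
    return confidence, variability, score


# Made-up training dynamics for two samples over four epochs.
hard_to_learn = label_score([0.1, 0.2, 0.1, 0.2])
easy = label_score([0.9, 0.9, 0.9, 0.9])
print(hard_to_learn, easy)
```

&lt;p&gt;A sample that the model never becomes confident about ends up close to the origin and therefore gets a low label score.&lt;/p&gt;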
&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Threshold on label-score&lt;/strong&gt; &lt;br /&gt;
Fixing a threshold on the label score means that all samples that score below it are considered label noise, and those that score above are considered clean. Assuming we do human retagging of all samples predicted as noisy, fixing a threshold essentially fixes both the % of samples retagged and (given the clean tags) the label noise recall. Varying the threshold, we get a plot for our dataset:&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;../assets/images/label-noise-blog/deja-vu-plot.png&quot; alt=&quot;image info&quot; /&gt;
Looking at our metrics, we see an improvement in the partial recleaning process.
Let’s read the above plot. Say we fix the threshold at 0.43, which means we would be retagging around 28% of our dataset. This corresponds to a label noise recall of 60%, giving us a resulting dataset with 5.2% label noise, down from 13% (= 0.40 * 13).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;n-consecutive correct instances&lt;/strong&gt; &lt;br /&gt;
Here, we use the ordering of the label scores. Our added assumption is that the ordering within the regions is also useful. Based on this, we sort our samples by label score in ascending order. This means the noisy samples should be nearer to the top, and we base our heuristic on this. We start human retagging from the top of the sorted list of samples and stop once we see N consecutive clean samples.
Varying N, we get a plot for our dataset:&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;../assets/images/label-noise-blog/deja-vu-n-consecutive.png&quot; alt=&quot;image info&quot; /&gt;&lt;br /&gt;
Again, we see an improvement in the partial recleaning process. Let’s read the above plot. Say we fix N at ~38, which means we would be retagging around 35% of our dataset. This corresponds to a label noise recall of 76%, that is, we would capture and clean 76% of the label noise, giving us a resulting dataset with 3.1% label noise (= (1 - 0.76) * 13).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
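&lt;p&gt;Both heuristics, and the arithmetic used to read the plots, can be sketched as follows. This is an illustration on made-up scores: the function names are hypothetical, and the &lt;em&gt;is_noisy&lt;/em&gt; list stands in for the verdict a human annotator would give when a sample is reviewed.&lt;/p&gt;

```python
def retag_by_threshold(scores, threshold):
    """Indices of samples whose label score falls below the threshold."""
    return [i for i, s in enumerate(scores) if threshold > s]


def retag_until_n_clean(scores, is_noisy, n):
    """Review samples in ascending score order; stop after n clean ones in a row."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    reviewed, streak = [], 0
    for i in order:
        reviewed.append(i)
        streak = 0 if is_noisy[i] else streak + 1
        if streak == n:
            break
    return reviewed


def summarise(reviewed, is_noisy, noise_rate):
    """Annotator effort, label noise recall, and noise left after cleaning."""
    effort = len(reviewed) / len(is_noisy)
    recall = sum(is_noisy[i] for i in reviewed) / sum(is_noisy)
    return effort, recall, (1 - recall) * noise_rate


# Made-up label scores and annotator verdicts for eight samples.
scores = [0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]
is_noisy = [True, True, False, True, False, False, False, False]

by_threshold = retag_by_threshold(scores, 0.25)
by_streak = retag_until_n_clean(scores, is_noisy, n=2)
print(summarise(by_threshold, is_noisy, 0.13))
print(summarise(by_streak, is_noisy, 0.13))
```

&lt;p&gt;The last value returned by &lt;em&gt;summarise&lt;/em&gt; is the same calculation as in the text: the share of noise that was missed multiplied by the original noise rate.&lt;/p&gt;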

&lt;h3 id=&quot;cleanlab&quot;&gt;Cleanlab&lt;/h3&gt;

&lt;p&gt;Cleanlab is a label noise prediction tool that takes a model’s predicted probabilities as input. For this tool, we evaluated the accuracy of its predictions instead; we won’t be able to capture all the noisy labels with it. Since cleanlab depends on the model’s output probabilities, it can’t be used to correct train sets.&lt;/p&gt;

&lt;h3 id=&quot;confident-learning&quot;&gt;Confident Learning&lt;/h3&gt;

&lt;p&gt;The high-level idea of Confident Learning: when the predicted probability of an example for a class is greater than that class’s threshold, we confidently count the example as actually belonging to that class. The threshold for each class is the average predicted probability of that class over the examples labelled with it.&lt;/p&gt;
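&lt;p&gt;The thresholding idea can be sketched in plain Python. This is a simplified illustration and not cleanlab’s implementation: the function names and the toy probabilities are made up, it treats a probability at or above the threshold as confident, and it assumes every class has at least one labelled example.&lt;/p&gt;

```python
def class_thresholds(pred_probs, given_labels, num_classes):
    """Mean predicted probability of class k over the examples labelled k."""
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, given_labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))
    return thresholds


def confident_labels(pred_probs, thresholds):
    """Most probable class among those clearing their threshold, else None."""
    result = []
    for p in pred_probs:
        candidates = [k for k, t in enumerate(thresholds) if p[k] >= t]
        result.append(max(candidates, key=lambda k: p[k]) if candidates else None)
    return result


def suspected_noise(pred_probs, given_labels, num_classes):
    """Indices whose confident class disagrees with the given label."""
    thresholds = class_thresholds(pred_probs, given_labels, num_classes)
    confident = confident_labels(pred_probs, thresholds)
    return [
        i for i, (c, y) in enumerate(zip(confident, given_labels))
        if c is not None and c != y
    ]


# Toy held-out probabilities for two classes; the last label looks wrong.
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.95, 0.05]]
given_labels = [0, 0, 1, 1, 1]
print(suspected_noise(pred_probs, given_labels, num_classes=2))
```

&lt;p&gt;Examples that clear no threshold are left undecided, which is one reason such a method will not flag every noisy label.&lt;/p&gt;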

&lt;p&gt;Confident Learning estimates a joint distribution between the noisy observed labels and the true latent labels. It assumes that the predicted probabilities are out-of-sample holdout probabilities (e.g. from K-fold cross validation). If this isn’t the case, overfitting may occur. The algorithm also assumes that class-conditional label noise transitions are data independent.&lt;/p&gt;

&lt;p&gt;Metrics using a model trained on noisy labels.&lt;/p&gt;

&lt;p&gt;Tested on a separate test set&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../assets/images/label-noise-blog/clean-set-table.png&quot; alt=&quot;image info&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Results are slightly better when the model is trained on clean data&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../assets/images/label-noise-blog/noisy-set-table.png&quot; alt=&quot;image info&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We expect cleanlab to perform even better once our model test accuracies improve. Cleanlab won’t be very useful if the model is performing poorly even on a clean dataset.&lt;/p&gt;

&lt;h2 id=&quot;minimizing-tagging-errors-at-source&quot;&gt;Minimizing tagging errors at source&lt;/h2&gt;

&lt;p&gt;To understand why our datasets had noisy labels, we conducted several review sessions with our annotators after they retagged datasets across multiple clients. We further classified each mislabelled example into a list of possible reasons as shown below. Here, gold tag refers to the ground truth tag. Each instance was tagged X times (X is the number of annotators) and the tag with the most votes was chosen as the correct tag.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../assets/images/label-noise-blog/tagging-errors-fix.png&quot; alt=&quot;image info&quot; /&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Since our intent classifiers were not multi-label, we wanted to capture the total % of multiple intent scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We observed that the label noise patterns for each of our clients were quite different, which made the problem of generalizing label noise prediction even more difficult.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;To conclude, we quantified how using datamaps helps in reducing effort taken to clean our existing train sets. We also correlated this reduced cleaning effort with the expected improvement in model performance with the help of some plots.&lt;/p&gt;</content><author><name>Kriti Anandan</name></author><category term="Machine Learning" /><category term="label-noise" /><summary type="html">Introduction</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="/assets/images/label-noise-blog/label-noise.png" /><media:content medium="image" url="/assets/images/label-noise-blog/label-noise.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Theory of Mind and Implications for Conversational AI</title><link href="/theory-of-mind/" rel="alternate" type="text/html" title="Theory of Mind and Implications for Conversational AI" /><published>2022-05-19T00:00:00+00:00</published><updated>2022-05-19T00:00:00+00:00</updated><id>/theory-of-mind</id><content type="html" xml:base="/theory-of-mind/">&lt;p&gt;When a diplomat says &lt;em&gt;yes&lt;/em&gt;, he means ‘perhaps’;&lt;br /&gt;
When he says &lt;em&gt;perhaps&lt;/em&gt;, he means ‘no’;&lt;br /&gt;
When he says &lt;em&gt;no&lt;/em&gt;, he is not a diplomat.&lt;/p&gt;

&lt;p&gt;                        —&lt;em&gt;Voltaire&lt;/em&gt; (Quoted, in Spanish, in Escandell 1993.)&lt;/p&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Consider this example: you’re out on a crowded street. A stranger walks up to you and asks for directions in your local language, &lt;em&gt;L&lt;/em&gt;. You respond, but you notice from the stranger’s facial expressions that they seem confused and did not understand what you said. Now you’re confused as well and try to clarify your instructions, but the stranger then reveals that they aren’t very fluent in the language &lt;em&gt;L&lt;/em&gt;. So you ask whether they understand a globally-used language &lt;em&gt;E&lt;/em&gt;; the stranger confirms, and the conversation continues.&lt;/p&gt;

&lt;p&gt;Let’s break down what occurred here.&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;The stranger asked a question in a local language &lt;em&gt;L&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;You now have a belief that the stranger speaks the local language &lt;em&gt;L&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Due to your belief, you respond in the same language &lt;em&gt;L&lt;/em&gt;, &lt;em&gt;expecting&lt;/em&gt; the stranger to understand the information you’re trying to convey. This is a &lt;em&gt;false belief&lt;/em&gt;, as it is later revealed.&lt;/li&gt;
  &lt;li&gt;You look for verbal/non-verbal cues from the stranger that they understood.&lt;/li&gt;
  &lt;li&gt;However, the stranger &lt;em&gt;denies&lt;/em&gt; your expectation by showing no such cues and instead showing cues of confusion.&lt;/li&gt;
  &lt;li&gt;You attempt to further elaborate your instructions taking these cues into account, however the stranger still seems confused.&lt;/li&gt;
  &lt;li&gt;The conversation at this point feels “awkward” since your expectations of the conversation were being denied multiple times.&lt;/li&gt;
  &lt;li&gt;The stranger reveals that they aren’t very fluent in &lt;em&gt;L&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;This confirms that your &lt;em&gt;belief&lt;/em&gt; that the stranger understood &lt;em&gt;L&lt;/em&gt; was &lt;em&gt;false&lt;/em&gt;. This brings a sense of comfort since now you understand why your expectations were being denied.&lt;/li&gt;
  &lt;li&gt;You now correct for your &lt;em&gt;false belief&lt;/em&gt; and ask whether the stranger understands a globally-used language &lt;em&gt;E&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;The stranger confirms.&lt;/li&gt;
  &lt;li&gt;And the conversation continues.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This mechanism of having expectations from the other participant is a basis for successful conversation. If we lacked such an ability, the emergence of mutually accepted meanings of words, and language itself would be impossible. This also applies to non-verbal communication, such as body language and facial expressions, and to some degree, is observed in many animal species.&lt;/p&gt;

&lt;p&gt;We now dive deeper into this aspect of communication and formalize why both human-human and human-machine conversations break down and/or frustrate the participants.&lt;/p&gt;

&lt;h1 id=&quot;theory-of-mind&quot;&gt;Theory of Mind&lt;/h1&gt;

&lt;p&gt;Whenever we converse, we take into account what we expect the other person to understand through our words as well as their possible responses. The ability to conceive such “theories” of other participants’ mental states is termed as the Theory of Mind (ToM).&lt;/p&gt;

&lt;p&gt;Having ToM requires the agent to acknowledge the fact that others (including the agent itself) can &lt;em&gt;believe&lt;/em&gt; in things which are not true. These beliefs are called &lt;em&gt;false beliefs&lt;/em&gt;. An agent possessing ToM can identify its own as well as others’ false beliefs and take actions to confirm and hence correct these false beliefs.&lt;/p&gt;

&lt;h1 id=&quot;what-makes-a-conversation-human-like&quot;&gt;What makes a conversation &lt;em&gt;human-like&lt;/em&gt;?&lt;/h1&gt;

&lt;p&gt;Proposition:&lt;/p&gt;

&lt;p&gt;               &lt;em&gt;A dialogue is human-like if both agents participating have some degree of Theory of Mind.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Theory of Mind is not limited to the content of the speech (such as the words spoken), but also addresses the manner of speech (prosody), facial and other non-verbal cues, etc. It is easy to see that if any one of the agents lacks ToM or has a poor ability, the conversation becomes uncomfortable and frustrating.&lt;/p&gt;

&lt;p&gt;However, Theory of Mind is an acquired skill: human expertise in ToM matures over the lifespan [2] and also depends on the amount of socialization a person partakes in. This makes the degree of expertise in ToM difficult to quantify; hence the degree of &lt;em&gt;human-likeness&lt;/em&gt; is also difficult to quantify, in addition to being subjective.&lt;/p&gt;

&lt;h2 id=&quot;testing-the-presence-of-theory-of-mind&quot;&gt;Testing the Presence of Theory of Mind&lt;/h2&gt;

&lt;p&gt;Testing whether or not an agent is capable of modeling the mental states of others is important for many reasons; one such application is diagnosing mental disorders. Such tests are called false-belief tasks. They check whether the agent can model others’ false beliefs and/or confirm and correct its own false beliefs. We will discuss two popular false-belief tasks: the “Sally-Anne” and “Smarties” tasks.&lt;/p&gt;

&lt;h3 id=&quot;sally-anne-task&quot;&gt;Sally-Anne Task&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/en/a/ac/Sally-Anne_test.jpg&quot; alt=&quot;Illustration of the &amp;quot;Sally-Anne&amp;quot; Test&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The participating agent is told the following scenario:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Sally and Anne are inside a room.&lt;/li&gt;
  &lt;li&gt;Sally has a basket with one marble inside it.&lt;/li&gt;
  &lt;li&gt;Anne has an empty box with her.&lt;/li&gt;
  &lt;li&gt;Sally leaves the room without her basket.&lt;/li&gt;
  &lt;li&gt;Anne takes the marble out of Sally’s basket and puts it in her own box.&lt;/li&gt;
  &lt;li&gt;Sally comes back inside the room.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, the participant is asked, “Where will Sally look for her marble?”. If the participant replies with “the basket”, this means that the participant is able to model the mental state of the fictional character Sally and understands that she doesn’t know Anne took her marble. Children below the age of 3-4 answer with “the box”, whereas older children answer with “the basket”. Some children with mental disabilities such as Down syndrome and Autism are unable to pass this test.&lt;/p&gt;

&lt;h3 id=&quot;smarties-task&quot;&gt;Smarties Task&lt;/h3&gt;
&lt;p&gt;Smarties is a popular brand of candies.&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;The participant is presented with a box labelled “Smarties”.&lt;/li&gt;
  &lt;li&gt;The participant is asked “what is in the box?”.&lt;/li&gt;
  &lt;li&gt;The participant replies with “candies”.&lt;/li&gt;
  &lt;li&gt;The box is opened and it is revealed that the box actually contains pencils.&lt;/li&gt;
  &lt;li&gt;The participant is asked “What would someone else think is inside the box?”.&lt;/li&gt;
  &lt;li&gt;The participant passes the test if they respond with “candies”.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Theory of Mind is an acquired skill, and is not innate, i.e., we aren’t born with the ability to model others’ mental states. A study [1] shows that children first pass false-belief tasks at around 3-4 years of age, around the same time as children first learn to tell lies, suggesting that learning to lie is a precursor to possessing ToM. This does make sense, as lying would &lt;em&gt;only help if the other participant is capable of having false beliefs&lt;/em&gt;. Language and communication are also acquired skills.&lt;/p&gt;

&lt;h2 id=&quot;theory-of-mind-relevance-to-conversational-ai&quot;&gt;Theory of Mind: Relevance to Conversational AI&lt;/h2&gt;
&lt;p&gt;Having ToM allows for certain mechanisms that would not be possible otherwise. Some are listed below:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;The ability of the agent to recognize its own errors in perceiving (mis-hearing), i.e., discover its own false-beliefs and ask for clarifications. This also leads to a higher order reasoning capability of the agent.&lt;/li&gt;
  &lt;li&gt;The ability of the agent to dynamically model its counterpart throughout the conversation and adjust its own behaviour in order to maximize the success of the dialog. Dynamic response and prosody generation, turn-taking, and barge-in handling are examples.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;do-machines-have-a-theory-of-mind&quot;&gt;Do Machines have a Theory of Mind?&lt;/h2&gt;

&lt;p&gt;One of the important goals of AI is to blend into the lives of humans and solve problems &lt;em&gt;with humans-in-the-loop&lt;/em&gt;. Achieving this requires modeling the humans and other machines around the agent, similar to how we humans do.&lt;/p&gt;

&lt;p&gt;Some studies [3, 4] have shown that specially designed Multi-Agent Reinforcement Learning algorithms pass the Sally-Anne False-belief task. However, False-belief tasks have not been designed for/tested against chat/voice bots. In this section, we test multiple Language Models (LM) against the Sally-Anne and Smarties tasks, and check whether they pass the tests or not.&lt;/p&gt;

&lt;h3 id=&quot;methodology&quot;&gt;Methodology&lt;/h3&gt;
&lt;p&gt;All of the experiments were run using the inference interface of the Hugging Face Hub. The experiments can easily be re-run; however, the same results are not guaranteed since the inference is non-deterministic. The tasks are widely used and appear on Wikipedia and in scientific papers on which some or all of the LMs may have been trained; hence these tests are not conclusive.&lt;/p&gt;

&lt;h4 id=&quot;sally-anne-task-1&quot;&gt;Sally-Anne Task&lt;/h4&gt;
&lt;p&gt;Input Text: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sally and Anne are inside a room. Sally has a basket with one marble inside it. Anne has an empty box with her. Sally leaves the room without her basket. Anne takes the marble out of Sally's basket and puts it in her own box. Sally comes back inside the room. Sally will look for her marble in &lt;/code&gt;
If the LMs continue with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;her basket&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;the basket&lt;/code&gt; or anything similar, the LM passes the test, else it doesn’t.&lt;/p&gt;

&lt;h4 id=&quot;smarties-task-1&quot;&gt;Smarties Task&lt;/h4&gt;
&lt;p&gt;Input Text: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sally is presented with a box labeled &quot;Candies&quot;. Sally is asked, &quot;what is in the box?&quot;. Sally replies with &quot;candies&quot;. The box is opened and is revealed that the box actually contains pencils. Sally is asked, &quot;What would someone else think is in the box?&quot;. Sally answers &lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If the LMs continue with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;candies&lt;/code&gt; or anything similar, the LM passes the test, else it doesn’t.&lt;/p&gt;
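
&lt;p&gt;The pass criterion for both tasks can be sketched as a small checker. This is a minimal illustration, not the code used for the experiments: the function names, the normalisation step and the strict prefix match are assumptions, and the prefix match is narrower than the “anything similar” rule described above.&lt;/p&gt;

```python
import re

# Accepted answers are taken from the task descriptions above.
# The normalisation and prefix matching are assumptions of this sketch.
ACCEPTED = {
    "sally-anne": ("her basket", "the basket"),
    "smarties": ("candies",),
}


def normalise(text):
    """Lower-case, replace anything that is not a letter or space, collapse spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def passes(task, continuation):
    """Return True if the model continuation starts with an accepted answer."""
    cleaned = normalise(continuation)
    return any(cleaned.startswith(answer) for answer in ACCEPTED[task])
```

&lt;p&gt;Because inference is non-deterministic, a checker like this would ideally be run over several sampled continuations per model rather than a single one.&lt;/p&gt;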

&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Language Model&lt;/th&gt;
      &lt;th&gt;# Params&lt;/th&gt;
      &lt;th&gt;Sally-Anne&lt;/th&gt;
      &lt;th&gt;Smarties&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;DistilGPT2&lt;/td&gt;
      &lt;td&gt;82M&lt;/td&gt;
      &lt;td&gt;Fail&lt;/td&gt;
      &lt;td&gt;Fail&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-Neo-125M&lt;/td&gt;
      &lt;td&gt;125M&lt;/td&gt;
      &lt;td&gt;Fail&lt;/td&gt;
      &lt;td&gt;Fail&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-Neo-1.3B&lt;/td&gt;
      &lt;td&gt;1.3B&lt;/td&gt;
      &lt;td&gt;Fail&lt;/td&gt;
      &lt;td&gt;Fail&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-2&lt;/td&gt;
      &lt;td&gt;1.5B&lt;/td&gt;
      &lt;td&gt;Pass&lt;/td&gt;
      &lt;td&gt;Pass&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-Neo-2.7B&lt;/td&gt;
      &lt;td&gt;2.7B&lt;/td&gt;
      &lt;td&gt;Pass&lt;/td&gt;
      &lt;td&gt;Pass&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-J-6B&lt;/td&gt;
      &lt;td&gt;6B&lt;/td&gt;
      &lt;td&gt;Pass&lt;/td&gt;
      &lt;td&gt;Pass&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The three largest models pass both tests. This suggests that scale might help LMs achieve some basic reasoning capabilities. This result is not surprising, since larger LMs usually do better on reasoning benchmarks.&lt;/p&gt;

&lt;p&gt;P.S. The most entertaining response award goes to DistilGPT2 for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;I don't give a fig about the box&quot;&lt;/code&gt; for the Sally-Anne task. This is not made up, I swear!&lt;/p&gt;

&lt;h2 id=&quot;implications-for-goal-oriented-conversational-ai&quot;&gt;Implications for Goal-Oriented Conversational AI&lt;/h2&gt;

&lt;p&gt;Open-domain chat has one important goal, &lt;em&gt;engagement with the user&lt;/em&gt;. The user engages with the bot &lt;em&gt;if the bot is entertaining the user&lt;/em&gt;. For this to hold, the bot should respond appropriately, which in turn requires modeling the user, i.e., having a Theory of Mind. The &lt;em&gt;degree of engagement&lt;/em&gt; can be seen as a measure of the bot’s &lt;em&gt;degree of ToM&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Testing ToM is straightforward for open-domain (chit-chat) bots. However, it is tricky for goal-oriented bots, as they are designed to handle dialog within a specific domain. A false-belief task defined on one domain may be out-of-domain for another.&lt;/p&gt;

&lt;p&gt;Open-domain dialog is a strict generalization of goal-oriented dialog. However, goal-oriented dialog may have goals that are defined differently from &lt;em&gt;engagement&lt;/em&gt;. In many call-center settings, &lt;em&gt;call resolution&lt;/em&gt; is the most important goal. However, when voice bots are used in place of human agents in call centers, a new and different user behaviour arises: &lt;em&gt;call drop&lt;/em&gt;. Users simply drop from the call if they:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;get frustrated (due to mishearing, poor reasoning capabilities etc.).&lt;/li&gt;
  &lt;li&gt;think the bot is &lt;em&gt;incapable&lt;/em&gt; of answering their queries, even if the bot is capable. This is a false belief of the user, and the bot is unable to correct it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Call drops occur in a major chunk of the calls (40-50%).&lt;/p&gt;

&lt;p&gt;Most bots in the industry are designed in a way that assumes &lt;em&gt;the user trusts the bot and has infinite patience&lt;/em&gt;. The bot’s behaviour is apparently designed to optimize for resolving the user’s queries, but not to &lt;em&gt;inspire trust in the user that the bot is capable of resolving queries&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;There are two possible ways to “solve” this problem:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Explicit&lt;/code&gt;: Design the product in a way that inspires trust. Come up with the &lt;em&gt;best possible&lt;/em&gt; responses for all possible combinations of dialog history and user states.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Implicit&lt;/code&gt;: Design the product in a top-down fashion rather than bottom-up. Many believe that optimizing components against their local objectives (word error rate for ASR, F1 scores for intent classifiers, etc.) would lead to a higher resolution rate. In biological systems, the higher-order function (survival) dictates the lower-order functions (communication, language). Learning to communicate better cannot ensure survival on its own. However, learning to survive &lt;em&gt;may lead to a better ability to communicate&lt;/em&gt;. In other words, optimize ML models against the global objective (resolution rate) in addition to the local objective. This &lt;em&gt;will&lt;/em&gt; force the bot to behave in a way that inspires trust from the user and effectively learn to have a theory of mind of its users.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first method is the industry standard, and it doesn’t seem to be working well. The second method has the clear advantage of being data-driven and scalable.&lt;/p&gt;

&lt;h1 id=&quot;references&quot;&gt;References&lt;/h1&gt;
&lt;p&gt;[1] Astington, J.W., &amp;amp; Edward, M.J. (2010). The Development of Theory of Mind in Early Childhood.&lt;/p&gt;

&lt;p&gt;[2] Demetriou, A., Mouyi, A., &amp;amp; Spanoudis, G. (2010). “The development of mental processing”, Nesselroade, J. R. (2010). “Methods in the study of life-span human development: Issues and answers.” In W. F. Overton (Ed.), &lt;em&gt;Biology, cognition and methods across the life-span.&lt;/em&gt; Volume 1 of the &lt;em&gt;Handbook of life-span development&lt;/em&gt; (pp. 36–55), Editor-in-chief: R. M. Lerner. Hoboken, New Jersey: Wiley.&lt;/p&gt;

&lt;p&gt;[3] Rabinowitz, N.C., Perbet, F., Song, H.F., Zhang, C., Eslami, S.M., &amp;amp; Botvinick, M.M. (2018). Machine Theory of Mind. &lt;em&gt;ICML&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;[4] Nguyen, T.N., &amp;amp; González, C. (2020). Cognitive Machine Theory of Mind. &lt;em&gt;CogSci&lt;/em&gt;.&lt;/p&gt;</content><author><name>Surya Kant Sahu</name></author><category term="Machine Learning" /><category term="Theory of Mind" /><category term="conversational ai" /><category term="voicebot" /><category term="chatbot" /><category term="voice assistant" /><summary type="html">When a diplomat says yes, he means ‘perhaps’; When he says perhaps, he means ‘no’; When he says no, he is not a diplomat.</summary></entry><entry><title type="html">End of Utterance Detection</title><link href="/end-of-utterance-detection/" rel="alternate" type="text/html" title="End of Utterance Detection" /><published>2022-04-24T00:00:00+00:00</published><updated>2022-04-24T00:00:00+00:00</updated><id>/end-of-utterance-detection</id><content type="html" xml:base="/end-of-utterance-detection/">&lt;blockquote&gt;
  &lt;p&gt;This blog post is based on the work done by &lt;a href=&quot;https://github.com/Anirudh257&quot;&gt;Anirudh 
Thatipelli&lt;/a&gt; as an ML research fellow at Skit.ai&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;end-of-utterance-detection---when-does-a-speaker-stop-speaking&quot;&gt;End Of Utterance Detection - When does a speaker stop speaking?&lt;/h1&gt;

&lt;p&gt;End-of-utterance detection is the problem of detecting when a user has stopped speaking in a conversation.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://user-images.githubusercontent.com/16001446/164991645-fadf9a68-3e75-4077-8050-5aabdc30b2d1.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the above image, there are four time-aligned turns in total. The system initiates the conversation by speaking first (“How may I help you?”), then the user (“I want to go to Miami.”), then the system again (“Miami?”) and finally the user (“Yes.”).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The speaker who utters the first unilateral sound both initiates the conversation and gains possession of the floor. Having gained possession, a speaker maintains it until the first unilateral sounds by another speaker, at which time the latter gains possession of the floor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;motivation&quot;&gt;Motivation&lt;/h1&gt;

&lt;p&gt;Despite many advances, the performance of spoken dialogue systems remains unsatisfactory. For example, turn-taking is a fundamental aspect of natural human conversation that helps to decide which participant has the floor in a conversation and who can speak next. Humans use many multimodal cues, like prosodic features and gaze, to determine who has the floor in a particular conversation. The interaction is very smooth, with very few gaps and overlaps between participants’ speech, which makes it difficult to model. Currently, dialogue systems use a silence threshold to determine whether they should start speaking. This approach is too simplistic and can lead to many issues. The system can interrupt the user mid-utterance, known as a &lt;em&gt;cut-in&lt;/em&gt;. Or it can wait too long, leading to sluggish responses and possible misrecognition, and causing an increase in &lt;em&gt;latency&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As speech-dialogue systems become more ubiquitous, it is essential to design dialogue systems that can predict end of utterance and predict turns.&lt;/p&gt;

&lt;p&gt;A dialogue system designer should also consider the trade-offs between cut-ins and latency. For Skit, an effective turn-taking system will improve customer service and decrease the call-drop rate. Building turn-taking capabilities into our product will make it more natural and improve the conversations with customers.&lt;/p&gt;

&lt;h1 id=&quot;previous-approaches-to-solve-the-problem&quot;&gt;Previous approaches to solve the problem&lt;/h1&gt;

&lt;p&gt;One of the earliest models to study conversations was designed by &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S088523082030111X&quot;&gt;Harvey Sacks et al&lt;/a&gt;, who described a conversation in terms of two concepts: &lt;strong&gt;Turn-constructional units (TCU)&lt;/strong&gt; and &lt;strong&gt;Transition-relevant places (TRP)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://user-images.githubusercontent.com/16001446/164993172-cc7293f1-5267-434a-9f77-a241b44a0421.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Turn-constructional units are utterances from one speaker during which other participants assume the role of listeners. Each TCU is followed by a 
TRP, where a turn-shift can occur according to the following rules:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ol&gt;
    &lt;li&gt;The current speaker may select a next speaker (other-select), using for example gaze or an address term. In the case of
dyadic conversation, this may default to the other speaker.&lt;/li&gt;
    &lt;li&gt;If the current speaker does not select a next speaker, then any participant can self-select. The first to start gains the turn.&lt;/li&gt;
    &lt;li&gt;If no other party self-selects, the current speaker may continue.&lt;/li&gt;
  &lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;To identify these TCUs and TRPs, researchers segment the speech into &lt;strong&gt;Inter-Pausal Units (IPUs)&lt;/strong&gt;, which are stretches of audio from one speaker without any silence exceeding a stipulated amount (say, 200 ms). A voice activity detector (VAD) can detect these IPUs. Hence, a turn can be considered as a sequence of IPUs from one speaker that is not interrupted by IPUs from another speaker.&lt;/p&gt;
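
&lt;p&gt;As a rough illustration of IPU segmentation, the sketch below groups per-frame VAD decisions from a single speaker into IPUs. The function name, the 10 ms frame size and the input format (one boolean per frame) are assumptions made for this example; only the 200 ms pause limit comes from the text above.&lt;/p&gt;

```python
def frames_to_ipus(speech_flags, frame_ms=10, max_pause_ms=200):
    """Group per-frame VAD decisions into inter-pausal units (IPUs).

    speech_flags: iterable of booleans, one per frame (True means speech).
    Returns a list of (start_ms, end_ms) tuples. Silences that do not
    exceed max_pause_ms are absorbed into the surrounding IPU.
    """
    ipus = []
    start = None      # start time of the IPU currently being built
    last_end = None   # end time of the most recent speech frame
    for i, is_speech in enumerate(speech_flags):
        if not is_speech:
            continue
        t = i * frame_ms
        if start is None:
            start = t
        elif t - last_end > max_pause_ms:
            # The silence was too long: close the previous IPU, open a new one.
            ipus.append((start, last_end))
            start = t
        last_end = t + frame_ms
    if start is not None:
        ipus.append((start, last_end))
    return ipus
```

&lt;p&gt;A turn could then be approximated by merging consecutive IPUs of one speaker that are not interrupted by IPUs from the other speaker.&lt;/p&gt;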

&lt;p&gt;To distinguish TRPs (turn-yielding cues) from non-TRPs (turn-holding cues), many cues such as syntactic completion, prosody and non-verbal cues like eye contact have been investigated. However, it is very complicated to detect such cues directly from the data. This problem is compounded by the absence of facial cues in our data. The end-of-utterance task can also be defined as the detection of TRPs, i.e. when the user’s turn is yielded and the system can start to speak. A multitude of work has been done in this regard, which can be divided into three types:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Silence-based models. The end of the user’s utterance is detected using a VAD. A silence duration threshold is used to determine when to take the turn. 
As discussed above, this is too simplistic and can lead to misrecognitions.&lt;/li&gt;
  &lt;li&gt;IPU-based models. Potential turn-taking points (IPUs) are detected using a VAD. Turn-taking cues in the user’s speech are processed to determine whether the turn is yielded or not (potentially also considering the length of the pause).&lt;/li&gt;
  &lt;li&gt;Continuous models. The user’s speech is processed continuously to find suitable places to take the turn, but also for identifying backchannel relevant places (BRP), or for making projections.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;https://user-images.githubusercontent.com/16001446/165028917-d3639f4c-8fa9-44d9-88ec-5dd0928f325a.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We will go through each of the approaches in the following sections:&lt;/p&gt;

&lt;h2 id=&quot;silence-based-models&quot;&gt;Silence-based models&lt;/h2&gt;

&lt;p&gt;As mentioned above, existing architectures use a fixed silence duration detection threshold to determine if the speech has ended. VAD utilizes energy and spectral features to distinguish between noise and speech in the audio. Two types of parameters are taken into consideration while designing these kinds of models.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;After the system has yielded the turn, it awaits a user response, allowing for a certain silence (a gap). If this silence exceeds the
no-input-timeout threshold (such as 5 s), the system should continue speaking, for example by repeating the last question.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the user has started to speak, the end-silence-timeout (such as 700ms) marks the end of the turn. As the figure shows,
this allows for brief pauses (shorter than the end-silence-timeout) within the user’s speech.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;https://user-images.githubusercontent.com/16001446/166442067-1e01892b-de3a-483a-998b-d9aa8b838345.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
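
&lt;p&gt;The two timeouts can be sketched as a small state machine over per-frame VAD decisions. This is a minimal sketch under stated assumptions: the function name, the 10 ms frame size and the event labels are hypothetical, and the default timeouts simply reuse the example values of 5 s and 700 ms given above.&lt;/p&gt;

```python
def endpoint(speech_flags, frame_ms=10,
             no_input_timeout_ms=5000, end_silence_timeout_ms=700):
    """Return (event, time_ms) for a list of per-frame VAD decisions.

    event is "no_input" if the user never starts speaking before the
    no-input timeout, "end_of_turn" once the silence after user speech
    reaches the end-silence timeout, and "pending" if the frames run
    out before either happens.
    """
    user_started = False
    silence_ms = 0
    for i, is_speech in enumerate(speech_flags):
        now = (i + 1) * frame_ms
        if is_speech:
            user_started = True
            silence_ms = 0   # any speech resets the silence counter
            continue
        silence_ms += frame_ms
        if not user_started and silence_ms >= no_input_timeout_ms:
            return ("no_input", now)
        if user_started and silence_ms >= end_silence_timeout_ms:
            return ("end_of_turn", now)
    return ("pending", len(speech_flags) * frame_ms)
```

&lt;p&gt;With these defaults, a 300 ms pause inside the user’s speech is tolerated, while every response is delayed by at least 700 ms after the user actually finishes, which is the cut-in versus latency trade-off discussed earlier.&lt;/p&gt;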

&lt;p&gt;These simplistic models break down if the user takes too long to respond, or when the system interrupts the user mid-utterance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://user-images.githubusercontent.com/16001446/166442480-fced182c-1d42-4af0-be17-842254c4236a.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Tuning the threshold for different domains is extremely difficult and user satisfaction will be affected.&lt;/p&gt;

&lt;h2 id=&quot;ipu-based-models&quot;&gt;IPU-based models&lt;/h2&gt;

&lt;p&gt;These systems are built on an assumption that the system should not start to speak while the user is speaking. Turn-taking cues at the end of pauses are used to determine whether a turn has ended. These approaches run the gamut from hand-crafted rule-based semantic parsers to machine-learning and reinforcement learning models.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.cs.cmu.edu/afs/cs/Web/People/dod/papers/sato-icslp02.pdf&quot;&gt;Sato et al’s&lt;/a&gt; work feeds over 100 different kinds of features (syntactic, semantic, final word, prosody and others) to decision trees to model when to take a turn. Albeit simplistic, their model achieved an accuracy of 83.9%, compared to the baseline of 76.2%. However, this approach can misclassify the IPU as a pause and uses a fixed threshold of 750 ms for pauses. To overcome this limitation, &lt;a href=&quot;https://www.sri.com/wp-content/uploads/2021/12/is_the_speaker_done_yet.pdf&quot;&gt;Ferrer et al&lt;/a&gt; continuously condition a decision-tree classifier on the length of the pause after the IPU, and classify using prosodic features and word n-grams. &lt;a href=&quot;https://aclanthology.org/W08-0101.pdf&quot;&gt;Raux and Eskenazi&lt;/a&gt; cluster silences based on dialogue features and set a single threshold for each cluster, reducing the overall latency by over 50% on the Let’s Go dataset.&lt;/p&gt;

&lt;p&gt;Another shortcoming of the above approaches is that they are trained on human-computer dialogue corpora, but we want to learn a model for human-human dialogues. Transferring models from human-human to human-computer based systems is not feasible. So, some authors, like &lt;a href=&quot;https://aclanthology.org/W08-0101.pdf&quot;&gt;Raux and Eskenazi&lt;/a&gt; and &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.704.2085&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Meena et al.&lt;/a&gt;, use &lt;strong&gt;bootstrapping&lt;/strong&gt;. First, a more simplistic model of turn-taking is implemented in a system and interactions are recorded. Then, the data is manually annotated with suitable TRPs and used to train a machine learning model such as an LSTM. Another approach is a &lt;strong&gt;Wizard-of-Oz&lt;/strong&gt; setup, where a hidden operator controls the system and makes the turn-taking decisions, as used in &lt;a href=&quot;https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/55075/Maier%20et%20al.%202017.%20Towards%20Deep%20End-of-Turn%20Prediction.pdf?sequence=1&quot;&gt;Maier et al.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some previous approaches utilize reinforcement learning as well. For example, &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.6018&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Jonsdottir et al&lt;/a&gt; train two agents to talk to each other, picking up prosodic cues and developing turn-taking skills. &lt;a href=&quot;https://aclanthology.org/W15-4643.pdf&quot;&gt;Khouzaimi et al.&lt;/a&gt; train a dialogue management model intended to minimize the dialogue duration and maximize the task completion ratio. But these approaches are trained in simulated environments and it is unclear if they transfer to real users.&lt;/p&gt;

&lt;h2 id=&quot;continuous-models&quot;&gt;Continuous models&lt;/h2&gt;

&lt;p&gt;Continuous models process utterances incrementally. Their modules process the input frame-by-frame and pass their results to subsequent modules. This enables the system to make continuous TRP predictions and to project turn completions and backchannels. Unlike previous approaches, the processing starts before the input is complete. The processing time is improved, and the output becomes more &lt;em&gt;natural&lt;/em&gt;. There is no need to train the model specifically for end-of-turn detection. This approach also enables a deeper understanding of utterances and lets the system project backchannels and even interrupt the user.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://user-images.githubusercontent.com/16001446/165454581-fceb250f-342f-4ca8-981d-bd635b922478.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;One of the first works in incremental processing was by &lt;a href=&quot;https://aclanthology.org/E09-1085.pdf&quot;&gt;Skantze and Schlangen&lt;/a&gt; on the task of number dictation. A benefit of incremental models is revision, as shown by &lt;a href=&quot;https://www.researchgate.net/profile/Gabriel-Skantze/publication/257267620_Towards_incremental_speech_generation_in_conversational_systems/links/5c473188299bf12be3db10e6/Towards-incremental-speech-generation-in-conversational-systems.pdf&quot;&gt;Skantze and Hjalmarsson&lt;/a&gt;. For example, the word “four” might be amended with more speech, resulting in a revision to the word “forty”.&lt;/p&gt;

&lt;p&gt;Another work by &lt;a href=&quot;https://www.diva-portal.org/smash/get/diva2:1141130/FULLTEXT01.pdf&quot;&gt;Skantze&lt;/a&gt; doesn’t train the model for end-of-turn detection. The audio from the speakers is processed frame-by-frame (20 frames per second) and fed to an LSTM. The LSTM predicts the speech activity for the two speakers for each frame in a future 3 s window. The model outperforms human judges in this task. In an extension to this work, &lt;a href=&quot;https://arxiv.org/pdf/1808.10785.pdf&quot;&gt;Roddy et al.&lt;/a&gt; propose a new LSTM architecture where the acoustic and linguistic features are processed in separate LSTM subnetworks with different timescales.&lt;/p&gt;

&lt;h2 id=&quot;datasets&quot;&gt;Datasets&lt;/h2&gt;

&lt;p&gt;Most of the aforementioned works evaluate their performance on dialogue based datasets like:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.inf.ed.ac.uk/cgi/maptask/estimate.cgi&quot;&gt;HCRC MapTask Corpus&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://mahnob-db.eu/mimicry/&quot;&gt;Mahnob Corpus&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;that have a limited purpose and may not generalize well to our problem.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;While significant work has been done in end-of-utterance detection, most of these models have shortcomings. Firstly, most are trained only on dialogue-based datasets, without accounting for speech-level features. Secondly, these datasets are well-curated with little background noise, which is not the case for our datasets. To account for noise and to model audio and text jointly, we will need to retrain our models with new baselines.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.lti.cs.cmu.edu/sites/default/files/research/thesis/2008/antoine_raux_flexible_turn-taking_for_spoken_dialog_systems.pdf&quot;&gt;Flexible Turn-Taking for Spoken Dialog Systems&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S088523082030111X&quot;&gt;Turn-taking in Conversational Systems and Human-Robot Interaction: A Review&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.cs.cmu.edu/afs/cs/Web/People/dod/papers/sato-icslp02.pdf&quot;&gt;Learning decision trees to determine turn-taking by spoken dialogue systems&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.384.968&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Rhythms of Dialogue.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://pure.mpg.de/rest/items/item_2376846/component/file_2376845/content&quot;&gt;A simplest systematics for the organization of turn-taking for conversation.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.diva-portal.org/smash/get/diva2:1141130/FULLTEXT01.pdf&quot;&gt;Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.sri.com/wp-content/uploads/2021/12/is_the_speaker_done_yet.pdf&quot;&gt;IS THE SPEAKER DONE YET? FASTER AND MORE ACCURATE END-OF-UTTERANCE DETECTION USING PROSODY&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/W08-0101.pdf&quot;&gt;Optimizing Endpointing Thresholds using Dialogue Features in a Spoken Dialogue System&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/55075/Maier%20et%20al.%202017.%20Towards%20Deep%20End-of-Turn%20Prediction.pdf?sequence=1&quot;&gt;Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.6018&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Learning smooth, human-like turntaking in realtime dialogue&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/W15-4643.pdf&quot;&gt;Optimising Turn-Taking Strategies With Reinforcement Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/E09-1085.pdf&quot;&gt;Incremental Dialogue Processing in a Micro-Domain &lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.researchgate.net/profile/Gabriel-Skantze/publication/257267620_Towards_incremental_speech_generation_in_conversational_systems/links/5c473188299bf12be3db10e6/Towards-incremental-speech-generation-in-conversational-systems.pdf&quot;&gt;Towards incremental speech generation in conversational systems&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/1808.10785.pdf&quot;&gt;Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name>anirudhthatipelli</name></author><category term="Machine Learning" /><category term="end-of-utterance" /><category term="turn-taking" /><summary type="html">This blog post is based on the work done by Anirudh Thatipelli as an ML research fellow at Skit.ai</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="/assets/images/demo1.jpg" /><media:content medium="image" url="/assets/images/demo1.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">TTS Enhancement</title><link href="/woc/" rel="alternate" type="text/html" title="TTS Enhancement" /><published>2022-03-09T00:00:00+00:00</published><updated>2022-03-09T00:00:00+00:00</updated><id>/woc</id><content type="html" xml:base="/woc/">&lt;h1 id=&quot;problem-statement&quot;&gt;Problem Statement&lt;/h1&gt;

&lt;p&gt;The Text-To-Speech (TTS) systems at Skit, like TTS systems in general, have a tendency to mix some ambient noise into the speech they output. The aim of this research project was to remove that noise and to quantify how well it has been removed using standard metrics.&lt;/p&gt;

&lt;p&gt;Listen to the clean speech sample here for reference:&lt;/p&gt;

&lt;script&gt;
       $(document).ready(function () {
         var ws4 = WaveSurfer.create({
           container: '#waveform-4',
           backend: 'MediaElement'
         });
         ws4.load('/assets/audios/posts/woc/sp01.wav');

         ws4.on('audioprocess', function () {
           let progressText = ws4.getCurrentTime().toFixed(2) + ' / ' + ws4.getDuration().toFixed(2)
           document.getElementById('player-progress-4').innerHTML = progressText
         });

         ws4.on('ready', function () {
           let progressText = ws4.getCurrentTime().toFixed(2) + ' / ' + ws4.getDuration().toFixed(2)
           document.getElementById('player-progress-4').innerHTML = progressText
         });

         ws4.on('finish', function () {
           let button = $('#controls-4 &gt; [data-action=&quot;play-pause&quot;]')
           button.find('i:first').toggleClass('fa-play')
           button.find('i:first').toggleClass('fa-pause')
           button.toggleClass('btn-dark')
         });

         for (let button of document.getElementById('controls-4').children) {
           button.onclick = function (e) {
             let action = button.getAttribute('data-action')
             switch (action) {
               case 'play-pause':
                 ws4.playPause()
                 $(button).find('i:first').toggleClass('fa-play')
                 $(button).find('i:first').toggleClass('fa-pause')
                 $(button).toggleClass('btn-dark')
                 break
               case 'backward':
                 ws4.skipBackward()
                 break
               case 'forward':
                 ws4.skipForward()
                 break
             }
           }
         }
       });
     &lt;/script&gt;

&lt;style&gt;
  .player-controls {
    margin: 20px 0;
  }
&lt;/style&gt;

&lt;div id=&quot;waveform-4&quot;&gt;&lt;/div&gt;
&lt;div class=&quot;player-controls&quot; id=&quot;controls-4&quot;&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;backward&quot;&gt;&lt;i class=&quot;fa fa-backward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;play-pause&quot;&gt;&lt;i class=&quot;fa fa-play&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;forward&quot;&gt;&lt;i class=&quot;fa fa-forward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;code class=&quot;btn btn-sml disabled&quot; id=&quot;player-progress-4&quot;&gt;&lt;/code&gt;
&lt;/div&gt;

&lt;p&gt;and the distorted sample:&lt;/p&gt;

&lt;script&gt;
       $(document).ready(function () {
         var ws5 = WaveSurfer.create({
           container: '#waveform-5',
           backend: 'MediaElement'
         });
         ws5.load('/assets/audios/posts/woc/sp01_car_sn0.wav');

         ws5.on('audioprocess', function () {
           let progressText = ws5.getCurrentTime().toFixed(2) + ' / ' + ws5.getDuration().toFixed(2)
           document.getElementById('player-progress-5').innerHTML = progressText
         });

         ws5.on('ready', function () {
           let progressText = ws5.getCurrentTime().toFixed(2) + ' / ' + ws5.getDuration().toFixed(2)
           document.getElementById('player-progress-5').innerHTML = progressText
         });

         ws5.on('finish', function () {
           let button = $('#controls-5 &gt; [data-action=&quot;play-pause&quot;]')
           button.find('i:first').toggleClass('fa-play')
           button.find('i:first').toggleClass('fa-pause')
           button.toggleClass('btn-dark')
         });

         for (let button of document.getElementById('controls-5').children) {
           button.onclick = function (e) {
             let action = button.getAttribute('data-action')
             switch (action) {
               case 'play-pause':
                 ws5.playPause()
                 $(button).find('i:first').toggleClass('fa-play')
                 $(button).find('i:first').toggleClass('fa-pause')
                 $(button).toggleClass('btn-dark')
                 break
               case 'backward':
                 ws5.skipBackward()
                 break
               case 'forward':
                 ws5.skipForward()
                 break
             }
           }
         }
       });
     &lt;/script&gt;

&lt;style&gt;
  .player-controls {
    margin: 20px 0;
  }
&lt;/style&gt;

&lt;div id=&quot;waveform-5&quot;&gt;&lt;/div&gt;
&lt;div class=&quot;player-controls&quot; id=&quot;controls-5&quot;&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;backward&quot;&gt;&lt;i class=&quot;fa fa-backward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;play-pause&quot;&gt;&lt;i class=&quot;fa fa-play&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;forward&quot;&gt;&lt;i class=&quot;fa fa-forward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;code class=&quot;btn btn-sml disabled&quot; id=&quot;player-progress-5&quot;&gt;&lt;/code&gt;
&lt;/div&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Speech enhancement can be done using traditional signal processing techniques or using deep learning techniques. In this project, we mainly focused on the signal processing aspects of noise reduction. Signal processing techniques can be further divided into three categories:&lt;/p&gt;

&lt;h2 id=&quot;spectral-subtractive-algorithms&quot;&gt;Spectral Subtractive algorithms&lt;/h2&gt;

&lt;p&gt;The main principle is as follows: assuming additive noise, one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum. The noise spectrum can be estimated and updated during periods when the speech signal is absent. The assumption is that the noise is stationary or slowly varying, so that the noise spectrum does not change significantly between updates. The enhanced signal is obtained by computing the inverse DFT (IDFT) of the estimated signal spectrum, using the phase of the noisy signal.&lt;/p&gt;
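
&lt;p&gt;The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the code used in the project: it assumes non-overlapping rectangular frames, a separate speech-free recording for the noise estimate, and magnitude-domain subtraction. The names &lt;code&gt;spectral_subtract&lt;/code&gt;, &lt;code&gt;noise_only&lt;/code&gt; and &lt;code&gt;frame_len&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import numpy as np


def spectral_subtract(noisy, noise_only, frame_len=256):
    """Magnitude spectral subtraction on non-overlapping frames (illustrative only)."""
    # Average noise magnitude spectrum, estimated from a speech-free segment.
    # Assumption: noise_only holds at least one full frame.
    n_noise = len(noise_only) // frame_len
    noise_frames = noise_only[: n_noise * frame_len].reshape(n_noise, frame_len)
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

    n_frames = len(noisy) // frame_len
    frames = noisy[: n_frames * frame_len].reshape(n_frames, frame_len)
    spectrum = np.fft.rfft(frames, axis=1)

    # Subtract the noise estimate and clip negative magnitudes to zero.
    clean_mag = np.maximum(np.abs(spectrum) - noise_mag, 0.0)

    # Reuse the phase of the noisy signal, then invert the DFT frame by frame.
    phase = np.exp(1j * np.angle(spectrum))
    enhanced = np.fft.irfft(clean_mag * phase, n=frame_len, axis=1)
    return enhanced.reshape(-1)
```

&lt;p&gt;A practical implementation would typically use overlapping windowed frames with overlap-add and update the noise estimate during detected pauses.&lt;/p&gt;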

&lt;h2 id=&quot;statistical-model-based-algorithms&quot;&gt;Statistical Model based algorithms&lt;/h2&gt;

&lt;p&gt;Given a set of measurements that depend on an unknown parameter, we wish to find a nonlinear estimator of the parameter of interest. Here the measurements are the DFT coefficients of the noisy signal, and the parameters of interest are the DFT coefficients of the clean signal. Various techniques from estimation theory, including maximum-likelihood (ML) estimators and Bayesian estimators such as MMSE and MAP, are used for this purpose.&lt;/p&gt;

&lt;h2 id=&quot;subspace-algorithms&quot;&gt;Subspace algorithms&lt;/h2&gt;

&lt;p&gt;These algorithms are based on the principle that the clean signal might be confined to a subspace of the Euclidean space in which the noisy signal lives. Given a method for decomposing the vector space of the noisy signal into a direct sum of a subspace occupied by the clean signal and a subspace occupied by the noise (for example, the SVD), we can estimate the clean signal simply by nulling the component of the noisy vector that lies in the noise subspace.&lt;/p&gt;
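
&lt;p&gt;As a rough sketch of this idea, the snippet below builds a Hankel (trajectory) matrix from the noisy samples, keeps only the dominant singular components, and averages the anti-diagonals to get back a signal. It assumes that the clean signal occupies the leading singular directions, which is reasonable for white noise. The function name and the &lt;code&gt;window&lt;/code&gt; and &lt;code&gt;rank&lt;/code&gt; parameters are hypothetical choices for illustration.&lt;/p&gt;

```python
import numpy as np


def subspace_denoise(noisy, window=32, rank=4):
    """Keep the `rank` dominant SVD components of the trajectory (Hankel) matrix."""
    noisy = np.asarray(noisy, dtype=float)
    n = len(noisy)
    cols = n - window + 1
    # Row i holds noisy[i : i + cols], so entry (i, j) is sample i + j.
    hankel = np.stack([noisy[i : i + cols] for i in range(window)])
    u, s, vt = np.linalg.svd(hankel, full_matrices=False)
    s[rank:] = 0.0  # null the components assigned to the noise subspace
    low_rank = (u * s) @ vt
    # Average the anti-diagonals to map the matrix back to a signal.
    out = np.zeros(n)
    counts = np.zeros(n)
    for i in range(window):
        out[i : i + cols] += low_rank[i]
        counts[i : i + cols] += 1
    return out / counts
```

&lt;p&gt;For a single sinusoid the trajectory matrix has rank 2, so &lt;code&gt;rank=2&lt;/code&gt; recovers it; speech frames generally need a larger rank, chosen per frame.&lt;/p&gt;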

&lt;h1 id=&quot;contributions&quot;&gt;Contributions&lt;/h1&gt;
&lt;h2 id=&quot;filters&quot;&gt;Filters&lt;/h2&gt;

&lt;p&gt;We hypothesised that signal processing techniques would be suitable for the task and tested them out. We implemented some popular speech enhancement methods, suitably modified to tackle the problem at hand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wiener Filter&lt;/strong&gt;&lt;/p&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img alt=&quot;Block Diagram of Wiener Filter&quot; src=&quot;/assets/images/posts/woc/wiener.png&quot; /&gt;
  &lt;figcaption&gt;Block Diagram of Wiener Filter&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;p&gt;The input signal w[n] goes through a linear and time-invariant system to produce an output signal x[n]. We are to design the system in such a way that the output signal, x[n], is as close to the desired signal, s[n], as possible. This can be done by computing the estimation error, e[n], and making it as small as possible. The optimal filter that minimizes the estimation error is called the &lt;em&gt;Wiener filter.&lt;/em&gt;&lt;/p&gt;
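
&lt;p&gt;When the noise is additive and uncorrelated with the speech, minimising the mean-square estimation error in the frequency domain gives a per-bin gain equal to the speech power divided by the sum of the speech and noise powers. The sketch below applies that gain to one frame. It assumes the two power spectra are already available, which in practice they must be estimated; the function names are hypothetical.&lt;/p&gt;

```python
import numpy as np


def wiener_gain(speech_psd, noise_psd):
    """Per-bin Wiener gain H = P_s / (P_s + P_d) for uncorrelated additive noise."""
    speech_psd = np.asarray(speech_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    # The small constant only guards against 0/0 in empty bins.
    return speech_psd / (speech_psd + noise_psd + 1e-12)


def wiener_filter_frame(noisy_frame, speech_psd, noise_psd):
    """Apply the gain to the spectrum of one frame and return the time signal."""
    spectrum = np.fft.rfft(noisy_frame)
    gain = wiener_gain(speech_psd, noise_psd)
    return np.fft.irfft(gain * spectrum, n=len(noisy_frame))
```

&lt;p&gt;The gain approaches 1 in bins dominated by speech and 0 in bins dominated by noise, so the filter attenuates rather than subtracts.&lt;/p&gt;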

&lt;script&gt;
       $(document).ready(function () {
         var ws6 = WaveSurfer.create({
           container: '#waveform-6',
           backend: 'MediaElement'
         });
         ws6.load('/assets/audios/posts/woc/wiener_filtered_sp01_car_sn0.wav');

         ws6.on('audioprocess', function () {
           let progressText = ws6.getCurrentTime().toFixed(2) + ' / ' + ws6.getDuration().toFixed(2)
           document.getElementById('player-progress-6').innerHTML = progressText
         });

         ws6.on('ready', function () {
           let progressText = ws6.getCurrentTime().toFixed(2) + ' / ' + ws6.getDuration().toFixed(2)
           document.getElementById('player-progress-6').innerHTML = progressText
         });

         ws6.on('finish', function () {
           let button = $('#controls-6 &gt; [data-action=&quot;play-pause&quot;]')
           button.find('i:first').toggleClass('fa-play')
           button.find('i:first').toggleClass('fa-pause')
           button.toggleClass('btn-dark')
         });

         for (let button of document.getElementById('controls-6').children) {
           button.onclick = function (e) {
             let action = button.getAttribute('data-action')
             switch (action) {
               case 'play-pause':
                 ws6.playPause()
                 $(button).find('i:first').toggleClass('fa-play')
                 $(button).find('i:first').toggleClass('fa-pause')
                 $(button).toggleClass('btn-dark')
                 break
               case 'backward':
                 ws6.skipBackward()
                 break
               case 'forward':
                 ws6.skipForward()
                 break
             }
           }
         }
       });
     &lt;/script&gt;


&lt;div id=&quot;waveform-6&quot;&gt;&lt;/div&gt;
&lt;div class=&quot;player-controls&quot; id=&quot;controls-6&quot;&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;backward&quot;&gt;&lt;i class=&quot;fa fa-backward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;play-pause&quot;&gt;&lt;i class=&quot;fa fa-play&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;forward&quot;&gt;&lt;i class=&quot;fa fa-forward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;code class=&quot;btn btn-sml disabled&quot; id=&quot;player-progress-6&quot;&gt;&lt;/code&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;MMSE and MMSE Log Filter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These fall under the umbrella of Bayesian estimation techniques. We saw above that the Wiener estimator can be derived by minimizing the error between a linear model of the clean spectrum and the true spectrum. The Wiener estimator is the optimal (in the mean-square-error sense) complex spectral estimator, but it is not the optimal spectral magnitude estimator. Acknowledging the importance of the short-time spectral amplitude (STSA) for speech intelligibility and quality, several authors have proposed optimal methods for obtaining the spectral amplitudes from noisy observations. In particular, an estimator is sought that minimizes the mean-square error between the estimated and true magnitudes:&lt;/p&gt;

\[e = E \{ (\hat{X}_k - X_k)^2 \}\]

&lt;p&gt;where \(\hat{X}_k\) is the estimated spectral magnitude at frequency \(\omega_k\) and \(X_k\) is the true magnitude of the clean signal.&lt;/p&gt;

&lt;p&gt;The MMSE Log estimator is an improvement upon the MMSE estimator. Although a metric
based on the squared error of the magnitude spectra is mathematically tractable, it may not be subjectively meaningful. It has been suggested that a metric based on the squared error of the log-magnitude spectra may be more suitable for speech processing. So we minimize:&lt;/p&gt;

\[e = E \{ (\log \hat{X}_k - \log X_k)^2 \}\]

&lt;p&gt;and we notice a significant improvement in the results compared to the original MMSE estimator.&lt;/p&gt;

&lt;script&gt;
       $(document).ready(function () {
         var ws7 = WaveSurfer.create({
           container: '#waveform-7',
           backend: 'MediaElement'
         });
         ws7.load('/assets/audios/posts/woc/mmse_filtered_sp01_car_sn0.wav');

         ws7.on('audioprocess', function () {
           let progressText = ws7.getCurrentTime().toFixed(2) + ' / ' + ws7.getDuration().toFixed(2)
           document.getElementById('player-progress-7').innerHTML = progressText
         });

         ws7.on('ready', function () {
           let progressText = ws7.getCurrentTime().toFixed(2) + ' / ' + ws7.getDuration().toFixed(2)
           document.getElementById('player-progress-7').innerHTML = progressText
         });

         ws7.on('finish', function () {
           let button = $('#controls-7 &gt; [data-action=&quot;play-pause&quot;]')
           button.find('i:first').toggleClass('fa-play')
           button.find('i:first').toggleClass('fa-pause')
           button.toggleClass('btn-dark')
         });

         for (let button of document.getElementById('controls-7').children) {
           button.onclick = function (e) {
             let action = button.getAttribute('data-action')
             switch (action) {
               case 'play-pause':
                 ws7.playPause()
                 $(button).find('i:first').toggleClass('fa-play')
                 $(button).find('i:first').toggleClass('fa-pause')
                 $(button).toggleClass('btn-dark')
                 break
               case 'backward':
                 ws7.skipBackward()
                 break
               case 'forward':
                 ws7.skipForward()
                 break
             }
           }
         }
       });
     &lt;/script&gt;


&lt;div id=&quot;waveform-7&quot;&gt;&lt;/div&gt;
&lt;div class=&quot;player-controls&quot; id=&quot;controls-7&quot;&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;backward&quot;&gt;&lt;i class=&quot;fa fa-backward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;play-pause&quot;&gt;&lt;i class=&quot;fa fa-play&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;forward&quot;&gt;&lt;i class=&quot;fa fa-forward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;code class=&quot;btn btn-sml disabled&quot; id=&quot;player-progress-7&quot;&gt;&lt;/code&gt;
&lt;/div&gt;

&lt;script&gt;
       $(document).ready(function () {
         var ws8 = WaveSurfer.create({
           container: '#waveform-8',
           backend: 'MediaElement'
         });
         ws8.load('/assets/audios/posts/woc/mmse_log_filtered_sp01_car_sn0.wav');

         ws8.on('audioprocess', function () {
           let progressText = ws8.getCurrentTime().toFixed(2) + ' / ' + ws8.getDuration().toFixed(2)
           document.getElementById('player-progress-8').innerHTML = progressText
         });

         ws8.on('ready', function () {
           let progressText = ws8.getCurrentTime().toFixed(2) + ' / ' + ws8.getDuration().toFixed(2)
           document.getElementById('player-progress-8').innerHTML = progressText
         });

         ws8.on('finish', function () {
           let button = $('#controls-8 &gt; [data-action=&quot;play-pause&quot;]')
           button.find('i:first').toggleClass('fa-play')
           button.find('i:first').toggleClass('fa-pause')
           button.toggleClass('btn-dark')
         });

         for (let button of document.getElementById('controls-8').children) {
           button.onclick = function (e) {
             let action = button.getAttribute('data-action')
             switch (action) {
               case 'play-pause':
                 ws8.playPause()
                 $(button).find('i:first').toggleClass('fa-play')
                 $(button).find('i:first').toggleClass('fa-pause')
                 $(button).toggleClass('btn-dark')
                 break
               case 'backward':
                 ws8.skipBackward()
                 break
               case 'forward':
                 ws8.skipForward()
                 break
             }
           }
         }
       });
     &lt;/script&gt;


&lt;div id=&quot;waveform-8&quot;&gt;&lt;/div&gt;
&lt;div class=&quot;player-controls&quot; id=&quot;controls-8&quot;&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;backward&quot;&gt;&lt;i class=&quot;fa fa-backward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;play-pause&quot;&gt;&lt;i class=&quot;fa fa-play&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;forward&quot;&gt;&lt;i class=&quot;fa fa-forward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;code class=&quot;btn btn-sml disabled&quot; id=&quot;player-progress-8&quot;&gt;&lt;/code&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Berouti’s Oversubtraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This method consists of subtracting an overestimate of the noise power spectrum, while preventing the resultant spectral components from going below a preset minimum value (spectral floor).&lt;/p&gt;

\[|\hat X(\omega)|^2=\begin{cases} |Y(\omega)|^2 - \alpha |\hat D(\omega)|^2 &amp;amp; \text{if } |Y(\omega)|^2 \geq (\alpha + \beta) |\hat D(\omega)|^2 \\ \beta |\hat D(\omega)|^2 &amp;amp; \text{else} \end{cases}\]

&lt;p&gt;where \(\alpha (\geq 1)\) is the oversubtraction factor and \(0 \leq \beta \leq 1\) is the spectral floor parameter.&lt;/p&gt;

&lt;p&gt;When we subtract the estimate of the noise spectrum from the noisy speech
spectrum, there remain peaks in the spectrum. Some of those peaks are broadband (encompassing a wide range of frequencies) whereas others are narrow band, appearing as spikes in the spectrum. By oversubtracting the noise spectrum, that is, by using \(\alpha\), we can reduce the amplitude of the broadband peaks and, in some cases, eliminate them altogether. This by itself, however, is not sufficient because the deep valleys surrounding the peaks still remain in the spectrum. For that reason, spectral flooring is used to “fill in” the spectral valleys and possibly mask the remaining peaks by the neighbouring spectral components of comparable value. The valleys between peaks are no longer deep when \(\beta &amp;gt; 0\) compared to when \(\beta = 0\).&lt;/p&gt;

&lt;p&gt;The parameter \(\beta\) controls the amount of remaining residual noise
and the amount of perceived musical noise. If the spectral floor parameter \(\beta\) is too
large, then the residual noise will be audible but the musical noise will not be perceptible. Conversely, if \(\beta\) is too small, the musical noise will become annoying but the residual noise will be markedly reduced.&lt;/p&gt;

&lt;p&gt;The parameter \(\alpha\) affects the amount of speech spectral distortion caused by
the subtraction. If \(\alpha\) is too large, then the resulting signal will be severely distorted to the point that intelligibility may suffer.&lt;/p&gt;

\[\alpha = \alpha_0 - \frac{3}{20} \textit{SNR} \quad \text{for } -5 \text{ dB} \leq \textit{SNR} \leq 20 \text{ dB}\]

&lt;p&gt;where \(\alpha_0\) is the desired value of \(\alpha\) at 0 dB SNR and \(\textit{SNR}\) is the short-term SNR estimated for each frame.&lt;/p&gt;
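
&lt;p&gt;The two formulas above can be combined into a small per-frame routine. This is a sketch under stated assumptions, not the project code: the frame SNR is taken as the ratio of total noisy power to total estimated noise power, the SNR is clamped to the range where the linear rule is defined, and the defaults &lt;code&gt;alpha0=4.0&lt;/code&gt; and &lt;code&gt;beta=0.01&lt;/code&gt; are illustrative values only.&lt;/p&gt;

```python
import numpy as np


def oversubtraction_factor(snr_db, alpha0=4.0):
    """alpha = alpha0 - (3/20) * SNR, with the SNR clamped to [-5, 20] dB."""
    snr_db = np.clip(snr_db, -5.0, 20.0)  # assumption: hold alpha constant outside the range
    return alpha0 - 3.0 * snr_db / 20.0


def berouti_subtract(noisy_power, noise_power, alpha0=4.0, beta=0.01):
    """Oversubtraction with a spectral floor on one frame of power spectra."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # Short-term SNR of the frame in dB (assumes a non-zero noise estimate).
    snr_db = 10.0 * np.log10(noisy_power.sum() / noise_power.sum())
    alpha = oversubtraction_factor(snr_db, alpha0)
    subtracted = noisy_power - alpha * noise_power
    floor = beta * noise_power
    # Bins with |Y|^2 at least (alpha + beta)|D|^2 keep the subtracted value;
    # all other bins are set to the spectral floor.
    keep = noisy_power >= (alpha + beta) * noise_power
    return np.where(keep, subtracted, floor)
```

&lt;p&gt;Because the floor is proportional to the noise estimate, no output bin ever falls below \(\beta |\hat D(\omega)|^2\), which is what fills in the spectral valleys.&lt;/p&gt;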

&lt;script&gt;
       $(document).ready(function () {
         var ws9 = WaveSurfer.create({
           container: '#waveform-9',
           backend: 'MediaElement'
         });
         ws9.load('/assets/audios/posts/woc/ss_filtered_sp01_car_sn0.wav');

         ws9.on('audioprocess', function () {
           let progressText = ws9.getCurrentTime().toFixed(2) + ' / ' + ws9.getDuration().toFixed(2)
           document.getElementById('player-progress-9').innerHTML = progressText
         });

         ws9.on('ready', function () {
           let progressText = ws9.getCurrentTime().toFixed(2) + ' / ' + ws9.getDuration().toFixed(2)
           document.getElementById('player-progress-9').innerHTML = progressText
         });

         ws9.on('finish', function () {
           let button = $('#controls-9 &gt; [data-action=&quot;play-pause&quot;]')
           button.find('i:first').toggleClass('fa-play')
           button.find('i:first').toggleClass('fa-pause')
           button.toggleClass('btn-dark')
         });

         for (let button of document.getElementById('controls-9').children) {
           button.onclick = function (e) {
             let action = button.getAttribute('data-action')
             switch (action) {
               case 'play-pause':
                 ws9.playPause()
                 $(button).find('i:first').toggleClass('fa-play')
                 $(button).find('i:first').toggleClass('fa-pause')
                 $(button).toggleClass('btn-dark')
                 break
               case 'backward':
                 ws9.skipBackward()
                 break
               case 'forward':
                 ws9.skipForward()
                 break
             }
           }
         }
       });
     &lt;/script&gt;


&lt;div id=&quot;waveform-9&quot;&gt;&lt;/div&gt;
&lt;div class=&quot;player-controls&quot; id=&quot;controls-9&quot;&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;backward&quot;&gt;&lt;i class=&quot;fa fa-backward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;play-pause&quot;&gt;&lt;i class=&quot;fa fa-play&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;forward&quot;&gt;&lt;i class=&quot;fa fa-forward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;code class=&quot;btn btn-sml disabled&quot; id=&quot;player-progress-9&quot;&gt;&lt;/code&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Kalman Filter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Kalman filter is a general recursive state estimation technique, which we adapted to the speech denoising problem.&lt;/p&gt;

&lt;script&gt;
       $(document).ready(function () {
         var ws10 = WaveSurfer.create({
           container: '#waveform-10',
           backend: 'MediaElement'
         });
         ws10.load('/assets/audios/posts/woc/kalman_filtered_sp01_car_sn0.wav');

         ws10.on('audioprocess', function () {
           let progressText = ws10.getCurrentTime().toFixed(2) + ' / ' + ws10.getDuration().toFixed(2)
           document.getElementById('player-progress-10').innerHTML = progressText
         });

         ws10.on('ready', function () {
           let progressText = ws10.getCurrentTime().toFixed(2) + ' / ' + ws10.getDuration().toFixed(2)
           document.getElementById('player-progress-10').innerHTML = progressText
         });

         ws10.on('finish', function () {
           let button = $('#controls-10 &gt; [data-action=&quot;play-pause&quot;]')
           button.find('i:first').toggleClass('fa-play')
           button.find('i:first').toggleClass('fa-pause')
           button.toggleClass('btn-dark')
         });

         for (let button of document.getElementById('controls-10').children) {
           button.onclick = function (e) {
             let action = button.getAttribute('data-action')
             switch (action) {
               case 'play-pause':
                 ws10.playPause()
                 $(button).find('i:first').toggleClass('fa-play')
                 $(button).find('i:first').toggleClass('fa-pause')
                 $(button).toggleClass('btn-dark')
                 break
               case 'backward':
                 ws10.skipBackward()
                 break
               case 'forward':
                 ws10.skipForward()
                 break
             }
           }
         }
       });
     &lt;/script&gt;


&lt;div id=&quot;waveform-10&quot;&gt;&lt;/div&gt;
&lt;div class=&quot;player-controls&quot; id=&quot;controls-10&quot;&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;backward&quot;&gt;&lt;i class=&quot;fa fa-backward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;play-pause&quot;&gt;&lt;i class=&quot;fa fa-play&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;button class=&quot;btn btn-sml&quot; data-action=&quot;forward&quot;&gt;&lt;i class=&quot;fa fa-forward&quot;&gt;&lt;/i&gt;&lt;/button&gt;
  &lt;code class=&quot;btn btn-sml disabled&quot; id=&quot;player-progress-10&quot;&gt;&lt;/code&gt;
&lt;/div&gt;

&lt;h2 id=&quot;intelligibility-metrics&quot;&gt;Intelligibility Metrics&lt;/h2&gt;

&lt;p&gt;Along with techniques for speech enhancement, it is important to quantify the degree of enhancement that our methods provide. For this, we tested the metrics discussed below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perceptual Evaluation of Speech Quality (PESQ)&lt;/strong&gt; is a full-reference algorithm that analyzes the speech signal sample by sample after a temporal alignment of corresponding excerpts of the reference and test signals. PESQ results essentially model the mean opinion score (MOS), which covers a scale from 1 (bad) to 5 (excellent).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short-Time Objective Intelligibility (STOI)&lt;/strong&gt; is an objective metric that shows high correlation (\(\rho=0.95\)) with the intelligibility of both noisy and time-frequency (TF) weighted noisy speech.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gross Pitch Error (GPE)&lt;/strong&gt; is the proportion of frames, considered voiced by both pitch tracker and ground truth, for which the relative pitch error is higher than a certain threshold, which is usually set to 20%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voicing Decision Error (VDE)&lt;/strong&gt; is the proportion of frames for which an incorrect voiced/unvoiced decision is made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;F0 Frame Error (FFE)&lt;/strong&gt; is the proportion of frames for which an error (either according to the GPE or the VDE criterion) is made. FFE can be seen as a single measure for assessing the overall performance of a pitch tracker.&lt;/p&gt;
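
&lt;p&gt;The three pitch-tracking metrics above can be computed together from two frame-aligned F0 tracks. The sketch below assumes that an unvoiced frame is encoded as an F0 of 0, which is a common convention but an assumption here; the function name is hypothetical.&lt;/p&gt;

```python
import numpy as np


def pitch_errors(f0_ref, f0_est, threshold=0.2):
    """Return (GPE, VDE, FFE); a frame counts as unvoiced when its F0 is 0."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    ref_voiced = f0_ref != 0
    est_voiced = f0_est != 0
    both_voiced = np.logical_and(ref_voiced, est_voiced)

    # Relative pitch error, defined only where both tracks are voiced.
    rel_err = np.zeros_like(f0_ref)
    rel_err[both_voiced] = (
        np.abs(f0_est[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
    )
    gross = np.logical_and(both_voiced, rel_err > threshold)
    voicing_wrong = ref_voiced != est_voiced

    n_both = both_voiced.sum()
    gpe = gross.sum() / n_both if n_both else 0.0
    vde = voicing_wrong.mean()
    ffe = np.logical_or(gross, voicing_wrong).mean()
    return gpe, vde, ffe
```

&lt;p&gt;Note that GPE is normalised by the frames voiced in both tracks, while VDE and FFE are normalised by the total number of frames.&lt;/p&gt;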

&lt;p&gt;&lt;strong&gt;Mel Cepstral Distortion (MCD)&lt;/strong&gt; is a measure of how different two sequences of mel cepstra are. It is used in assessing the quality of parametric speech synthesis systems, including statistical parametric speech synthesis systems, the idea being that the smaller the MCD between synthesized and natural mel cepstral sequences, the closer the synthetic speech is to reproducing natural speech.&lt;/p&gt;
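
&lt;p&gt;A commonly used form of MCD scales the Euclidean distance between mel cepstral vectors to decibels and averages it over frames. The sketch below follows that form. It assumes the two sequences are already time-aligned and excludes the 0th coefficient, which mostly carries overall energy; both are conventions assumed here, and the function name is hypothetical.&lt;/p&gt;

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB between two time-aligned (frames, coeffs) arrays."""
    ref_mcep = np.asarray(ref_mcep, dtype=float)
    syn_mcep = np.asarray(syn_mcep, dtype=float)
    # Drop the 0th coefficient before comparing.
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

&lt;p&gt;If the sequences differ in length, they would first need to be aligned, for example with dynamic time warping.&lt;/p&gt;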

&lt;h1 id=&quot;results&quot;&gt;Results&lt;/h1&gt;

&lt;p&gt;We applied our methods to two datasets: first the public &lt;a href=&quot;https://ecs.utdallas.edu/loizou/speech/noizeus/&quot;&gt;NOIZEUS&lt;/a&gt; dataset, and then a dataset created with Skit’s in-house TTS systems. The results on the NOIZEUS dataset are quite satisfactory: the Wiener and Kalman filters perform best, each outperforming the other at different signal-to-noise ratios (SNRs).&lt;/p&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img alt=&quot;MCD metric&quot; src=&quot;/assets/images/posts/woc/res1.png&quot; /&gt;
  &lt;figcaption&gt;Effect of filters with respect to the MCD metric&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img alt=&quot;PESQ metric&quot; src=&quot;/assets/images/posts/woc/res2.png&quot; /&gt;
  &lt;figcaption&gt;Effect of filters with respect to the PESQ metric&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;However, the filters do not perform as well as we would like on the TTS dataset. In fact, we observe that they adversely affect the input speech. Several reasons could explain this, the primary one being that real-life recordings and TTS output contain different types of noise. Real-life noise is either additive, so that it can be subtracted after noise estimation, or such that the noisy signal can be decomposed into a direct sum of a clean-signal subspace and a pure-noise subspace. The noise in TTS systems is much more subtle and cannot be modelled as simply additive, because it is generated along with the speech. Hence most traditional filters, although they work well for real-life noise separation, do not work well for this use case. This is where we planned to resort to deep learning models like the Facebook denoiser and SeGAN.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;

&lt;p&gt;You can find more information in the &lt;a href=&quot;https://github.com/skit-ai/woc-tts-enhancement&quot;&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;references&quot;&gt;References&lt;/h1&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.researchgate.net/publication/224738211_Enhancement_of_speech_corrupted_by_acoustic_noise&quot;&gt;Berouti’s Spectral Subtraction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ccrma.stanford.edu/~orchi/Documents/thesis_KF.pdf&quot;&gt;Kalman Filter for Speech Enhancement&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.perlego.com/book/2193587/speech-enhancement-theory-and-practice-second-edition-pdf&quot;&gt;Speech Enhancement by Philipos C. Loizou&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://hal.archives-ouvertes.fr/hal-00923967/document&quot;&gt;A comparative study of pitch extraction algorithms on a large variety of singing sounds&lt;/a&gt;, &lt;a href=&quot;https://ieeexplore.ieee.org/document/941023&quot;&gt;PESQ&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Author: Ananyapam De, a final year student at IISER Kolkata, majoring in Statistics, while minoring in the Computational Sciences.&lt;/p&gt;</content><author><name></name></author><category term="Machine Learning" /><category term="TTS" /><category term="speech-enhancement" /><summary type="html">Problem Statement</summary></entry><entry><title type="html">Turn Taking Dynamics in Voice Bots</title><link href="/Turn_Taking_Dynamics_in_Voice_Bots/" rel="alternate" type="text/html" title="Turn Taking Dynamics in Voice Bots" /><published>2022-03-07T00:00:00+00:00</published><updated>2022-03-07T00:00:00+00:00</updated><id>/Turn_Taking_Dynamics_in_Voice_Bots</id><content type="html" xml:base="/Turn_Taking_Dynamics_in_Voice_Bots/">&lt;p&gt;One of the challenges in building interactive voice bots is accounting for turn-taking behaviour. Turn-taking is a difficult problem to get right, even for humans. In all our circles, we know at least one person who likes to interrupt a lot and doesn’t have good turn-taking etiquette. Having a conversation with such a person can be quite irritating, as one feels one is not being heard or even getting a chance to finish one’s sentence.&lt;/p&gt;

&lt;p&gt;Turn-taking is even more difficult in a multi-party setting. You might remember the last group call you had and just when you were about to take the turn, someone else jumped right in (because you waited for a tad bit too long) and you never got to speak. Turn-taking behaviour also differs culturally. In some cultures, interruptions and barge-ins are a lot more natural. There is also a difference in the inter-turn pause duration. These factors often lead to an unnatural conversation flow when speaking to a person from a different culture.&lt;/p&gt;

&lt;p&gt;Note: Bots with explicit turn-taking signalling, such as wake-words, are out of scope for this blog post.&lt;/p&gt;

&lt;h2 id=&quot;natural-turn-taking-dynamics&quot;&gt;Natural Turn Taking Dynamics&lt;/h2&gt;

&lt;p&gt;Irrespective of these nuances, there are aspects of turn-taking behaviour that are globally present in natural human-human conversation, and ones that we would want to bring into human-bot interaction as well.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Barge-ins: These are situations when one agent interrupts the other. They occur very commonly. For example, when one feels the other person is making a mistake, or when one feels the need to add some essential information, one naturally barges in.&lt;/li&gt;
  &lt;li&gt;Full Duplex Conversations: A half duplex conversation is one where turns are taken alternately, like in a tennis match. In natural conversations, however, there are often instances when both people are saying something at the same time.
    &lt;ul&gt;
      &lt;li&gt;backchannels: words and fillers like “okay”, “alright” or “hmm” provide a lot of context about the state of the other person (for example, attentiveness), especially when one is talking over the phone and visual cues are absent.&lt;/li&gt;
      &lt;li&gt;corrections: at times, when a person is saying something, one might want to make a small correction. For example, suppose an announcement is being made: “for the next meeting, you are supposed to finish submissions by 12th December, at so and so time….”. When the speaker says 12th December, someone might correct them by saying 13th December. The speaker assimilates this information and often corrects themselves. So, humans have the ability to hear and understand even while speaking, and are active listeners.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
  &lt;center&gt;
    &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;/assets/images/posts/turn-taking-dynamics/duplex-conversations.png&quot; /&gt;
    &lt;figcaption&gt;Fig 1: Full duplex vs half duplex conversations.&lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;

&lt;ul&gt;
  &lt;li&gt;Minimal inter-turn pauses: if you’ve ever spoken with a voice assistant, one of the first things you notice is that it takes too long to start speaking after you are done, and the other way around. Human conversations have a much lower turn-taking latency. If this latency is near optimal, it also creates the feeling that the other person understands you, and leaves the impression of a conversation gone well. Humans have an average pause duration of about 200ms, as shown below, while bots have a much higher latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
  &lt;center&gt;
    &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;/assets/images/posts/turn-taking-dynamics/pause-duration.png&quot; /&gt;
    &lt;figcaption&gt;Fig 2: Turn Taking Pause duration as measured from the Switchboard corpus. Image is taken from [1].&lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;

&lt;ul&gt;
  &lt;li&gt;Turn taking cues : often in natural conversations, people produce small vocal cues like filler words “umm” or “uhhh” to convey that they want to say something and take the turn.&lt;/li&gt;
  &lt;li&gt;Turn yielding cues: there are markers in conversations that signal when a person is done speaking. This is how we are able to distinguish pauses that happen while a person is thinking mid-utterance from the pause that follows the end of their turn.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;turn-taking-dynamics-in-voice-bots&quot;&gt;Turn-Taking Dynamics in Voice Bots&lt;/h2&gt;

&lt;p&gt;Below, we discuss different versions of turn-taking dynamics implemented in voice bots, each with more features and an increasing level of difficulty.&lt;/p&gt;

&lt;h3 id=&quot;version---10&quot;&gt;Version - 1.0&lt;/h3&gt;

&lt;p&gt;These are some characteristics of the bare-bones turn-taking behaviour that one would need in a voice bot deployment.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Initial patience: the time that the bot waits for the person to start speaking.&lt;/li&gt;
  &lt;li&gt;Silence detection : if the bot detects silence for a certain duration after the person has started speaking, it assumes the person’s turn is over.&lt;/li&gt;
  &lt;li&gt;Max turn duration : it doesn’t make sense to just be listening (because of error compounding, loss of context, maybe one is hearing just noise), so usually voice bots have a maximum duration to which they listen to the user.&lt;/li&gt;
&lt;/ul&gt;
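
&lt;p&gt;The three rules above can be sketched as a small loop over per-frame speech/non-speech decisions. This is an illustration under assumptions, not a description of any deployed system: the names &lt;code&gt;TurnConfig&lt;/code&gt; and &lt;code&gt;end_of_turn&lt;/code&gt;, the default durations, and the returned reason strings are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class TurnConfig:
    initial_patience: float = 5.0    # seconds to wait for the user to start
    silence_timeout: float = 1.0     # trailing silence that ends the turn
    max_turn_duration: float = 20.0  # hard cap on listening once speech starts


def end_of_turn(frames, frame_sec, cfg):
    """Return (reason, elapsed_seconds) for an iterable of is_speech booleans."""
    started = False
    silence = 0.0
    turn_time = 0.0
    elapsed = 0.0
    for is_speech in frames:
        elapsed += frame_sec
        if is_speech:
            started = True
            silence = 0.0
        else:
            silence += frame_sec
        if not started:
            # Initial patience: the user has not said anything yet.
            if elapsed >= cfg.initial_patience:
                return "no_input", elapsed
            continue
        turn_time += frame_sec
        if silence >= cfg.silence_timeout:
            return "silence", elapsed
        if turn_time >= cfg.max_turn_duration:
            return "max_duration", elapsed
    return "stream_ended", elapsed
```

&lt;p&gt;Keeping the three durations in one config object makes it straightforward to vary them per dialogue state.&lt;/p&gt;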

&lt;h3 id=&quot;version---20&quot;&gt;Version - 2.0&lt;/h3&gt;

&lt;p&gt;This version adds robustness for real-life situations, makes the bot more human-like and tries to reduce the latency between turns.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;VAD instead of silence detection: Background noise, background speech and other signals often cause the bot to keep listening. Instead of silence detection, one could train a Voice Activity Detection (VAD) system, to be robust to background events and to listen to the user only when they are speaking.&lt;/li&gt;
  &lt;li&gt;Variable thresholds for silence detection and max duration: In some states, for example when the bot is expecting a yes/no answer, it makes sense to use smaller thresholds. In general, dynamic thresholding should be used.&lt;/li&gt;
  &lt;li&gt;For turn-switching, instead of a simple VAD, use an IPU-based (inter-pausal unit) model as discussed &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S088523082030111X&quot;&gt;here&lt;/a&gt;. This uses a smaller VAD threshold plus cues to predict that the turn is over. One could start with some verbal cues, for example phrase completion.&lt;/li&gt;
  &lt;li&gt;Adding backchannels as bot responses : So far we’ve only discussed aspects of perception, but backchannels are a very useful response feature. It makes the user feel that the bot is more attentive and is actively listening.
    &lt;ul&gt;
      &lt;li&gt;One could also add filler words in the main channel when the bot is taking too long to produce a response. This prevents the user from asking a question just to check whether the bot is still there. Without fillers, the user’s speech would increase the latency further, as it would be perceived as the user wanting to take the turn and say something useful.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
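
&lt;p&gt;Two of the ideas above, state-dependent thresholds and ending a turn early when a completion cue is present, can be sketched as follows. The state names, threshold values and the &lt;code&gt;phrase_complete&lt;/code&gt; flag are hypothetical; in practice the cue would come from a separate model or heuristic, and this is not the model from the linked review.&lt;/p&gt;

```python
# Hypothetical default endpointing thresholds, in milliseconds.
DEFAULT = {
    "short_pause_ms": 200,        # pause that closes an inter-pausal unit
    "silence_timeout_ms": 1000,   # fallback when no completion cue is seen
    "max_turn_ms": 20000,
}

# Hypothetical per-state overrides: short answers get tighter thresholds.
STATE_OVERRIDES = {
    "confirm_yes_no": {"silence_timeout_ms": 400, "max_turn_ms": 5000},
    "collect_address": {"silence_timeout_ms": 1500, "max_turn_ms": 30000},
}


def thresholds_for(state):
    """Merge the defaults with any overrides registered for a dialog state."""
    merged = dict(DEFAULT)
    merged.update(STATE_OVERRIDES.get(state, {}))
    return merged


def turn_is_over(silence_ms, state, phrase_complete):
    """Decide whether the user's turn has ended.

    A short pause is enough when a cue (here, phrase completion) suggests
    the user is done; otherwise fall back to the longer silence timeout.
    """
    limits = thresholds_for(state)
    if phrase_complete:
        return silence_ms >= limits["short_pause_ms"]
    return silence_ms >= limits["silence_timeout_ms"]
```

&lt;p&gt;The design choice here is that the cue only shortens the wait and never extends it, so a wrong cue costs at most an early turn switch rather than an unbounded wait.&lt;/p&gt;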

&lt;h3 id=&quot;version---30&quot;&gt;Version - 3.0&lt;/h3&gt;

&lt;p&gt;There are no good baselines for these features yet, so any improvement here would constitute state-of-the-art work.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Multi-party situations : These are a lot more complex and require modelling multiple parties. One application is a bot overseeing an interaction between, say, a call centre agent and another human. Another common case is when, during a typical two-party interaction, someone interrupts the user. This requires the bot to be aware that the user is speaking to someone else, and then to wait.&lt;/li&gt;
  &lt;li&gt;Full-duplex conversations : Unlike humans in a human-human conversation, a bot can listen attentively while it is saying something. This opens up the possibility of redesigning interactions to leverage this feature.&lt;/li&gt;
  &lt;li&gt;Personalisation of turn-taking behaviour : This involves changing the parameters based on user characteristics. One could entrain one’s system to be more in line with the user’s behaviour. For example, when the user is angry, it might involve changing the durations so that they feel they are being heard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;references-&quot;&gt;References :&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S088523082030111X&quot;&gt;Turn-taking in Conversational Systems and Human-Robot Interaction: A Review&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name>Swaraj Dalmia</name></author><category term="Machine Learning" /><category term="Turn-taking" /><category term="barge-in" /><category term="duplex conversations" /><summary type="html">One of the challenges in building an interactive voice bots is accounting for turn taking behaviour. Turn-taking is a difficult problem to get right, even for humans. In all our circles, we’d know of at least one person who likes to interrupt a lot and doesn’t have good turn taking etiquette. Having a conversation with such a person can be quite irritating as one feels one is not getting heard or even getting a chance to finish one’s sentence.</summary></entry><entry><title type="html">Feature Disentanglement - I</title><link href="/feature-disentanglement1/" rel="alternate" type="text/html" title="Feature Disentanglement - I" /><published>2022-02-22T00:00:00+00:00</published><updated>2022-02-22T00:00:00+00:00</updated><id>/feature-disentanglement1</id><content type="html" xml:base="/feature-disentanglement1/">&lt;p&gt;The main advantage of deep learning is the ability to learn from data in an end-to-end manner. At the core of deep learning is representation: the models transform the representation of the data at each layer, typically into a condensed representation with reduced dimension. Deep learning models are often termed black-box models because these representations are difficult to interpret. Understanding them can give us insight into which features of the data are more important and allow us to control the learning process. Recently there has been a lot of interest in representation learning and in controlling the learned representations, which gives an edge on tasks such as controlled synthesis and learning better representations for specific downstream tasks.&lt;/p&gt;

&lt;h1 id=&quot;data-representation-and-latent-code&quot;&gt;Data Representation and Latent Code&lt;/h1&gt;
&lt;p&gt;An image \(x\) from the MNIST dataset has 28x28 = 784 dimensions, a raw pixel representation that can be visualized directly. But not all of these dimensions are required to represent the image. The content of the image can be represented in a condensed form with fewer dimensions, called the latent code. Although the actual image has 784 dimensions, \(x \in \mathbb{R}^{784}\), one way of representing an MNIST image is with just an integer, i.e. \(z \in \{0, 1, 2, \ldots, 9\}\). This representation \(z\) reduces the dimension needed to represent the image \(x\) to 1, capturing which number is present in the image and hence the class-level variability in the dataset. This is one example of a discrete latent code for the MNIST dataset. A continuous latent code can contain more information about the image, such as the style of the image, the position of the number, the size of the number in the image, etc.&lt;/p&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img width=&quot;600&quot; height=&quot;300&quot; alt=&quot;Can't See? Something went wrong!&quot; src=&quot;https://www.mdpi.com/applsci/applsci-09-03169/article_deploy/html/images/applsci-09-03169-g001.png&quot; /&gt;
  &lt;figcaption&gt;Fig 1: Sample images of MNIST from [1]&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;autoencoder&quot;&gt;AutoEncoder&lt;/h2&gt;

&lt;p&gt;Autoencoder [2] models are popularly used to learn such a latent code in an unsupervised manner: an encoder compresses the image to a fixed-dimension code \(z\), and a decoder generates the image back from this latent code.&lt;/p&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;https://d3i71xaburhd42.cloudfront.net/08b0b21725c236fb1860285677a00248f77c7587/2-Figure1-1.png&quot; /&gt;
  &lt;figcaption&gt;Fig 2: Autoencoder architecture from Autoencoders[2]&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;p&gt;The encoder \(q_{\phi}(z \mid x)\) of the autoencoder compresses the image to a latent code \(z\) of fixed dimension \(d\), and the decoder \(p_{\theta}(x \mid z)\) is a conditional image generator. The dimension of \(z\) has to be large enough that the decoder can reconstruct the image from the latent code. Choosing the dimension of the latent code is a problem in its own right [3].&lt;/p&gt;
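
&lt;p&gt;As a rough sketch of what is being trained, assume a squared-error reconstruction loss (a common choice that the description above does not fix) and write the encoder and decoder as deterministic maps \(f_{\phi}\) and \(g_{\theta}\); both symbols are introduced here only for illustration. Training then minimizes:&lt;/p&gt;

\[z=f_{\phi}(x) \in \mathbb{R}^{d}, \qquad \mathcal{L}_{\mathrm{AE}}(\theta, \phi)=\mathbb{E}_{x}\left[\left\|x-g_{\theta}\left(f_{\phi}(x)\right)\right\|^{2}\right]\]

&lt;p&gt;For MNIST, choosing \(d = 2\) would mean that each image of 784 pixel values has to be reconstructed from just two numbers.&lt;/p&gt;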

&lt;p&gt;A trained autoencoder will successfully encode images into a latent code \(z\), but there is no guarantee that the latent space is easy to interpret: we do not know where in the \(d\)-dimensional space the model places a given image, so it is difficult to choose a latent code for generating an image during inference. In short, we have no idea how and where the encoder encodes the images, so we have no control over synthesis during inference. The following figure shows the latent codes learned by an autoencoder across different training runs. As we can observe, the latent space keeps changing its range and quadrant, which makes it difficult to infer.&lt;/p&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;/assets/images/posts/feature-disentanglement/fig1.png&quot; /&gt;
  &lt;figcaption&gt;Fig 3: Latent code of MNIST images learned by an Auto Encoder [4]&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;variational-autoencodervae&quot;&gt;Variational AutoEncoder(VAE)&lt;/h2&gt;

&lt;p&gt;Variational autoencoders (VAEs) [6] solve this problem by forcing the latent code \(z\) to be close to a known prior distribution (a Gaussian), which gives us control over the latent space. During inference, the latent code can be sampled from this known distribution for image generation. The following figure shows the latent codes learned by a VAE across different training runs; in each case the latent space is centered at mean 0 across dimensions.&lt;/p&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img alt=&quot;Can't See? Something went wrong!&quot; src=&quot;/assets/images/posts/feature-disentanglement/fig2.png&quot; /&gt;
  &lt;figcaption&gt;Fig 4: Latent code of MNIST images learned by a VAE [4]&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;p&gt;A VAE gives us control over the latent space and lets us sample from the known prior distribution. But this still does not give us control over what is generated: if you want to generate an image of the number ‘3’ or ‘7’, you cannot do that (at least not directly). This is where the term “disentanglement” comes into play.&lt;/p&gt;
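
&lt;p&gt;The pull toward the prior is usually implemented as a KL divergence penalty on the encoder output. The sketch below assumes a diagonal Gaussian encoder and a standard normal prior, as in [6]; the function names are hypothetical, and plain Python lists stand in for tensors.&lt;/p&gt;

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_latent(mu, log_var, rng=random):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def sample_prior(dim, rng=random):
    """At inference time, draw z directly from the prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

&lt;p&gt;The penalty is zero only when the encoder outputs exactly the prior (zero mean, unit variance) and grows as the code drifts away from it, which is what keeps the latent space centered across training runs.&lt;/p&gt;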

&lt;h1 id=&quot;disentanglement&quot;&gt;Disentanglement&lt;/h1&gt;
&lt;p&gt;Feature disentanglement means isolating the sources of variation in observed data. An MNIST image has many more factors/features than the number itself, such as the location of the number in the image, the size of the number, the angle of the number, etc. These factors are independent of each other.&lt;/p&gt;

&lt;p&gt;Feature disentanglement involves separating the underlying concepts of, say, “big one on the left” into size (big), number (one), and location (left).
Our interest here is to see whether we can isolate these factors in the latent code so that we have control over the generation of the images. We want the encoder to disentangle the representation into different factors, and then we generate the image with the desired factors, say “small seven at the top, rotated 30 degrees”.&lt;/p&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img width=&quot;800&quot; height=&quot;400&quot; alt=&quot;Can't See? Something went wrong!&quot; src=&quot;https://d3i71xaburhd42.cloudfront.net/35da0a2001eea88486a5de677ab97868c93d0824/6-Figure2-1.png&quot; /&gt;
  &lt;figcaption&gt;Fig 5: MNIST images generated by InfoGAN [5] with varied digit, thickness and rotation.&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;beta-vae&quot;&gt;Beta-VAE&lt;/h2&gt;
&lt;p&gt;Beta-VAE [7] is a variant of the VAE that encourages disentanglement of the learned latent code. It adds a hyperparameter \(\beta\) to the loss function, which modulates the learning constraint of the VAE.&lt;/p&gt;

&lt;figure&gt;
&lt;center&gt;
  &lt;img width=&quot;800&quot; alt=&quot;Can't See? Something went wrong!&quot; src=&quot;https://miro.medium.com/max/1400/1*Z6tj5bVoArekVgv65gfkfg.png&quot; /&gt;
  &lt;figcaption&gt;Fig 6: Loss function of beta-VAE [7].&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;p&gt;The first part of the loss function takes care of the reconstruction of the image; the second term shapes the latent code by pulling it toward the prior. The dimensions of a Gaussian with diagonal covariance are independent, so by making the prior such a Gaussian, we push the dimensions of the latent code to be independent of each other. Increasing the weight of the second part of the loss therefore makes the latent code more disentangled and independent. But this also brings a tradeoff between disentanglement and the reconstruction capability of the VAE. Although Beta-VAE models are good at disentangling the features, their reconstruction ability is not the best.&lt;/p&gt;
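
&lt;p&gt;Written in the encoder/decoder notation used earlier in this post, and following the formulation in [7], the beta-VAE objective (to be maximized) has the form below; this is a transcription into this post's symbols rather than a copy of the figure.&lt;/p&gt;

\[\mathcal{L}_{\beta}:=\mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right]-\beta \operatorname{KL}\left(q_{\phi}(z \mid x) \| p(z)\right)\]

&lt;p&gt;Setting \(\beta = 1\) gives back the standard VAE objective, while values of \(\beta\) larger than 1 put more weight on matching the prior \(p(z)\) at the expense of the reconstruction term.&lt;/p&gt;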

&lt;figure&gt;
&lt;center&gt;
  &lt;img width=&quot;800&quot; alt=&quot;Can't See? Something went wrong!&quot; src=&quot;https://production-media.paperswithcode.com/methods/Screen_Shot_2020-06-28_at_4.00.13_PM.png&quot; /&gt;
  &lt;figcaption&gt;Fig 7: Samples generated by beta-VAE [7].&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;beta-tcvae&quot;&gt;Beta-TCVAE&lt;/h2&gt;
&lt;p&gt;Beta-TCVAE decomposes the KL divergence [10] term of the VAE loss function into three parts: the index-code mutual information [8] between the data and the latent variable, the total correlation [9] of \(z\), and the dimension-wise KL divergence [10] of \(z\). Together with the reconstruction term, which comes first, these appear in that order in the following formula. This breaks the overall KL divergence of \(z\) into finer-grained quantities, including terms that focus on each dimension of the latent code \(z\). In this formulation, the \(\beta\) hyperparameter is applied only to the total correlation term, which is the term most important for disentanglement, so the pressure on reconstruction is reduced. As a result, Beta-TCVAE has better reconstruction ability than Beta-VAE with a similar disentanglement property.&lt;/p&gt;

\[\mathcal{L}_{\beta-\mathrm{TC}}:=\mathbb{E}_{q(z \mid n) p(n)}[\log p(n \mid z)]-\alpha I_{q}(z ; n)-\beta \operatorname{KL}\left(q(z) \| \prod_{j} q\left(z_{j}\right)\right)-\gamma \sum_{j} \operatorname{KL}\left(q\left(z_{j}\right) \| p\left(z_{j}\right)\right)\]

&lt;p&gt;where \(n\) indexes the training samples, \(\alpha = \gamma = 1\), and only \(\beta\) is varied as the hyperparameter.&lt;/p&gt;
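
&lt;p&gt;The decomposition behind this objective can be written out explicitly. Following [8], with \(n\) indexing the training samples, \(p(n)\) uniform over them, and a prior that factorizes over dimensions, the averaged KL term of the VAE splits into exactly the three penalties that appear in the formula above:&lt;/p&gt;

\[\mathbb{E}_{p(n)}\left[\operatorname{KL}\left(q(z \mid n) \| p(z)\right)\right]=I_{q}(z ; n)+\operatorname{KL}\left(q(z) \| \prod_{j} q\left(z_{j}\right)\right)+\sum_{j} \operatorname{KL}\left(q\left(z_{j}\right) \| p\left(z_{j}\right)\right)\]

&lt;p&gt;As a consequence, setting \(\alpha = \beta = \gamma = 1\) recovers the standard VAE objective, and with \(\alpha = \gamma = 1\) the beta-TCVAE objective equals the VAE objective minus \((\beta - 1)\) times the total correlation.&lt;/p&gt;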

&lt;figure&gt;
&lt;center&gt;
  &lt;img width=&quot;800&quot; alt=&quot;Can't See? Something went wrong!&quot; src=&quot;https://vitalab.github.io/article/images/IsolatingSourcesOfDisentanglementInVAEs/figure1.jpg&quot; /&gt;
  &lt;figcaption&gt;Fig 8: Samples generated by beta-TCVAE [8].&lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;

&lt;p&gt;In future posts, we will examine many new methods for feature disentanglement and how these methods can be applied to speech signals.&lt;/p&gt;

&lt;h2 id=&quot;references-&quot;&gt;References :&lt;/h2&gt;

&lt;p&gt;[1] : &lt;a href=&quot;https://www.mdpi.com/2076-3417/9/15/3169/htm&quot;&gt;A Survey of Handwritten Character Recognition with MNIST and EMNIST&lt;/a&gt; (2019)&lt;/p&gt;

&lt;p&gt;[2] : &lt;a href=&quot;https://arxiv.org/abs/2003.05991&quot;&gt;Autoencoders&lt;/a&gt; (2021)&lt;/p&gt;

&lt;p&gt;[3] : &lt;a href=&quot;https://www.sciencedirect.com/science/article/abs/pii/S0925231215015994&quot;&gt;Squeezing bottlenecks: Exploring the limits of autoencoder semantic representation capabilities&lt;/a&gt; (2016)&lt;/p&gt;

&lt;p&gt;[4] : &lt;a href=&quot;https://www.youtube.com/watch?v=itOlzH9FHkI&quot;&gt;Disentangled Representations - How to do Interpretable Compression with Neural Models&lt;/a&gt; (2020)&lt;/p&gt;

&lt;p&gt;[5] : &lt;a href=&quot;https://arxiv.org/abs/1606.03657&quot;&gt;InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets&lt;/a&gt; (2016)&lt;/p&gt;

&lt;p&gt;[6] : &lt;a href=&quot;https://arxiv.org/abs/1312.6114&quot;&gt;Auto-Encoding Variational Bayes&lt;/a&gt; (2013)&lt;/p&gt;

&lt;p&gt;[7] : &lt;a href=&quot;https://openreview.net/forum?id=Sy2fzU9gl&quot;&gt;beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework&lt;/a&gt; (2017)&lt;/p&gt;

&lt;p&gt;[8] : &lt;a href=&quot;https://arxiv.org/abs/1802.04942&quot;&gt;Isolating Sources of Disentanglement in Variational Autoencoders&lt;/a&gt; (2018)&lt;/p&gt;

&lt;p&gt;[9] : &lt;a href=&quot;https://en.wikipedia.org/wiki/Total_correlation&quot;&gt;Total correlation (Wikipedia)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[10] : &lt;a href=&quot;https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence&quot;&gt;Kullback–Leibler divergence (Wikipedia)&lt;/a&gt;&lt;/p&gt;</content><author><name>Shangeth Rajaa</name></author><category term="Machine Learning" /><summary type="html">The main advantage of deep learning is the ability to learn from the data in an end-to-end manner. The core of deep learning is representation, the deep learning models transform the representation of the data at each layer into a condensed representation with reduced dimension. Deep Learning models are often also termed as black-box models as these representations are difficult to interpret, understanding these representations can give us an insight about which feature of the data is more important and will allow us to control the learning process. Recently there has been a lot of interest in representation learning and controlling the learned representations which give an edge over multiple tasks like controlled synthesis, better representations for specific downstream tasks.</summary></entry></feed>