Google Summer of Code - 2022
This page contains ideas which we’d like to get help from GSoC Contributors. But before all that, if you haven’t heard about Skit.
What is Skit?
We are a series B funded, AI-first SaaS voice automation company specializing in delivering multilingual voice bots for contact center automation. We have our Speech Bots deployed in major banks and large enterprises in several verticals. We have a foothold in India and are expanding in the US and South-East Asia.
We have been listed in Forbes 30 Under 30 Asia 2021 and have been named by Gartner as a Cool vendor in Conversational and NLT Widen Use Cases.
Our goal is to build the most natural and robust multi lingual voice bot with state of the art human-machine interaction capabilities.
We build voicebots, so that agents don’t have to sit and answer user queries 24/7 for 365 days.
You can get to know more about us over here
Our tech blog is present here
You can reach out to us at our skit-gsoc community discord, link to join here : https://discord.gg/Y9sJwz5Sw8
GSoC 2022 Ideas
Idea 1: Enhancements to dialogy via core code or plugins.
Dialogy is a framework to build machine-learning solutions for speech applications speech dialogue systems.
The main principles which form the backbone of dialogy are:
- Plugin-based: Makes it easy to import/export components to projects.
- Stack-agnostic: No assumptions made on ML stack; your choice of machine learning library will not be affected by using Dialogy.
- Progressive: Minimal boilerplate writing to let you focus on your machine learning problems.
At current shape, dialogy allows one to train models, test performance on metrics and deploy applications. We want to add more features which are desirable in an complete SLU framework. These features could include
- Hyperparameter Tuning- There are multiple hyperparameters involved in using a transformers. A hyperparameter tuning integration would allow developer with a seamless way to experiment with hyperparams and train optimal models.
- Model interpretability via Captum, LIT integration- Interpretability is necessary from both business and development perspectives. Such tools would allow developers to explain why a certain prediction was made, as well as discover biases and faults in the model.
- Integration with experiment tracking platforms- It gets difficult to keep track of models being trained by developers- some are meant for production releases, some are meant for experimentation. An experiment tracking integration would enable developers to manage their models and results easily.
Dialogy as a platform would have following features after the project
- Integrated with experiment tracking for better project management
- Integrated with model explainability and interpretability frameworks allowing users to use these tools
- Integrated with an hyperparameter tuning service/library
Understanding of Machine Learning frameworks like PyTorch, Huggingface, etc. And of course Python.
- Himansu - linkedin, github, email - firstname.lastname@example.org
- Jaivarsan - linkedin, github, email - email@example.com
Idea 2: Speaker Anonymization
The goal of this project is to explore and implement methods to anonymize speech to remove the speaker information from the signal by distorting the speaker prosodic features or with techniques like voice conversion. The speaker information in the signal can be used to attack the speaker verification systems which leads to privacy and security concerns. The idea is to explore simple signal/speech processing techniques which are faster in both implementation and deployment, as well as neural methods which use deep learning models and architectures to resynthesis the speech signal with the original speaker’s information removed. Removing the speaker’s information can help us to use speech datasets without worrying about privacy attacks on the speakers in the dataset. This project involves the implementation of various research papers in the field and improving upon them and creating a python package for speaker anonymization.
- Literature Review of research papers in the field
- Implementation of various research papers which are interesting to the project in python.
- Github repository for experiments and final python package for speaker anonymization with proper documentation.
- Demonstration of the methods implemented and final package.
- Talk/presentation on the work done during the GSoC period.
- Experience in Signal/Speech processing (Preferred)
- Experience in Deep learning for speech tasks like Automatic Speech Recognition, Speech synthesis, speaker-related tasks etc.
- Python and deep learning frameworks like PyTorch/TensorFlow.
- Shangeth Rajaa - linkedin, github, email - firstname.lastname@example.org
- Swaraj - linkedin, github, email - email@example.com
Idea 3: Improving kaldi-serve performance
Kaldi Serve is a plug-and-play abstraction over the Kaldi ASR toolkit, designed for ease of deployment and optimal runtime performance. It currently has the following key features:
- Real-time streaming (uni & bi-directional) audio recognition.
- Thread-safe concurrent Decoder queue for server environments.
- RNNLM lattice rescoring.
- N-best alternatives with AM/LM costs, word-level timings, and confidence scores.
- Easy extensibility for custom applications.
We are mainly looking at improving the runtime performance of the ASR pipeline by offloading Decoder computation to the GPU to be performed in mini-batches, and implementing a request queueing mechanism within the gRPC server to be able to utilize the parallel computing capability in order to boost latencies under high concurrent loads.
Integration with Kaldi Batched Threaded NNet3 CUDA pipeline for enabling batched computation in kaldi-serve gRPC application.
Having some combination of these: Languages: C++, CUDA, Python Frameworks: Kaldi ASR, Pytorch, gRPC, Pybind11 Basics: Speech Recognition, Language Modeling, Deep Learning