Use machine learning to build super low data audio and video codecs

Hello great person,

This week and next you will get two papers on the following question: how can we produce better audio and video encodings? ("Better" here means similar quality at lower bandwidth.) If this works out, it will mean better audio and video quality in all our video calls. Previous approaches compressed the audio/video down to a lower bit rate, e.g. by removing audio signals the human ear cannot hear. This new wave of encodings leverages machine learning and a fascinating paradigm shift: we do not try to compress the original data, but to rebuild it at the destination.

At the input, a machine learning model learns about your voice and the things you said
Only the minimum data required to rebuild your voice is transferred
Another machine learning model rebuilds your voice at the destination
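
To make the three steps above a bit more concrete, here is a minimal toy sketch in Python. It is not the paper's system: the "encoder" just measures the loudness of each frame and the "decoder" just generates noise at that loudness, as stand-ins for the two machine learning models. The sample rate, the 20 ms frame length, the 6-bit budget and all function names are assumptions made up for this illustration.

```python
import math
import random

SAMPLE_RATE = 16_000          # samples per second (assumed)
FRAME_MS = 20                 # frame length in milliseconds (assumed)
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per frame
BITS_PER_FRAME = 6            # toy budget: one 6-bit code per frame (assumed)
MIN_DB, MAX_DB = -60.0, 0.0   # loudness range covered by the toy quantizer
LEVELS = (1 << BITS_PER_FRAME) - 1           # highest code value (63)


def encode(samples):
    """Sender: reduce every full frame to one small integer code.

    A real codec would use a learned feature extractor here; frame
    loudness in decibels is only a stand-in.
    """
    codes = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        rms = math.sqrt(sum(s * s for s in frame) / FRAME_LEN)
        db = 20 * math.log10(max(rms, 1e-6))   # avoid log10(0) on silence
        db = min(max(db, MIN_DB), MAX_DB)      # clamp into quantizer range
        codes.append(round((db - MIN_DB) / (MAX_DB - MIN_DB) * LEVELS))
    return codes


def decode(codes, seed=0):
    """Receiver: generate new audio that matches the received codes.

    A real codec would use a generative model here; uniform noise scaled
    to the transmitted loudness is only a stand-in.
    """
    rng = random.Random(seed)
    out = []
    for code in codes:
        db = MIN_DB + code / LEVELS * (MAX_DB - MIN_DB)
        rms = 10 ** (db / 20)
        amp = rms * math.sqrt(3)   # uniform noise in [-amp, amp] has RMS amp/sqrt(3)
        out.extend(rng.uniform(-amp, amp) for _ in range(FRAME_LEN))
    return out


def bitrate_bps():
    """Bits per second that actually cross the network in this toy setup."""
    return BITS_PER_FRAME * 1000 // FRAME_MS


# One second of a 440 Hz tone as example input.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
codes = encode(tone)        # this small list is all that gets transferred
rebuilt = decode(codes)     # same length as the input, but newly generated
print(len(tone), "samples in,", len(codes), "codes sent,", bitrate_bps(), "bit/s")
```

The point of the sketch is the shape of the pipeline: the receiver never sees the original samples, only the codes, and produces fresh audio from them. The toy output sounds nothing like the input; closing that gap is exactly the job of the learned models.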

This week covers the paper about voice, and next week will be about how to rebuild your face at the destination. Scary times, but also super fascinating technology.

Software exists to create business value

I am Simon Frey, the author of the Weekly CS Paper Newsletter, and I have great news: you can work with me.

As CTO as a Service, I will help you choose the right technology for your company, build up your team and be a deeply technical sparring partner for your product development strategy.

Check out my website simon-frey.com to learn more or to contact me directly.

Let’s work together!

Abstract:

The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.
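
To get a feeling for how little data 3 kb/s is, here is a small back-of-the-envelope calculation in Python. Only the 3 kb/s figure comes from the abstract; the 20 ms frame length and the helper names are assumptions for illustration and not taken from the paper.

```python
BITRATE_BPS = 3_000   # 3 kb/s, the rate quoted in the abstract
FRAME_MS = 20         # assumed frame length, not taken from the paper


def bits_per_frame(bitrate_bps, frame_ms):
    """Bits available to describe one frame of speech."""
    return bitrate_bps * frame_ms / 1000


def bytes_for_duration(bitrate_bps, seconds):
    """Total bytes transferred for a given duration of speech."""
    return bitrate_bps * seconds / 8


print(bits_per_frame(BITRATE_BPS, FRAME_MS))   # 60.0 bits per 20 ms frame
print(bytes_for_duration(BITRATE_BPS, 60))     # 22500.0 bytes per minute
```

Under that assumption, each 20 ms of speech has to be described in 60 bits, and a full minute of talking fits into roughly 22.5 kB.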

Download Link:

https://arxiv.org/pdf/2102.09660.pdf