Generative Speech Restoration - Technical Overview

Karol Duzinkiewicz
Head of Research
Mar 11, 2025
Summary
This article delves into the evolution of speech enhancement technologies, contrasting traditional noise suppression methods with cutting-edge generative techniques. Discover how Revoize's innovative hybrid model seamlessly integrates these approaches to deliver unparalleled clarity and naturalness in speech processing. Through insightful explanations and compelling audio comparisons, you'll gain a deeper understanding of the future of speech enhancement and its potential to revolutionize communication.

At Revoize, we develop cutting-edge real-time generative speech enhancement. While this sounds impressive, what does it actually mean, and how does it work? In this article, we take a deeper dive into speech enhancement, exploring the challenges it addresses and the techniques used to solve them. This is not an overly technical breakdown but rather an exploration of the key concepts that make generative speech enhancement a breakthrough in audio processing.
Speech Enhancement
As we discussed in our previous article (here), speech enhancement has been around for decades. Before 2015, traditional approaches relied on analog signal processing techniques, such as low- and high-pass filters, and on more advanced digital signal processing (DSP) methods like Wiener filtering and spectral subtraction. While effective in some cases, these methods relied on rigid assumptions about the statistical properties of speech and noise, making them unreliable in complex, real-world conditions. Non-stationary noises and multiple overlapping degradations often rendered these techniques ineffective. To address various distortions, engineers had to build intricate pipelines, stringing together multiple DSP components fine-tuned for specific use cases, such as wind noise reduction.
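To make the classic DSP approach concrete, below is a minimal Python sketch of spectral subtraction. It assumes the first few STFT frames contain noise only, precisely the kind of rigid assumption that makes these methods fragile; the parameters are illustrative and not taken from any particular system.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, sr, noise_frames=10, floor=0.05):
    """Basic magnitude spectral subtraction (illustrative sketch)."""
    f, t, spec = stft(noisy, fs=sr, nperseg=512)
    mag, phase = np.abs(spec), np.angle(spec)

    # Estimate the noise magnitude spectrum from the leading frames,
    # assuming they contain no speech.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the estimate and apply a spectral floor to limit the
    # "musical noise" artifacts caused by over-subtraction.
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)

    # Resynthesize using the original (noisy) phase.
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return enhanced
```

When the noise is non-stationary, the leading-frames estimate quickly becomes stale, which is exactly why such pipelines break down in real-world conditions.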
Discriminative Speech Enhancement
The advancements in deep learning in the mid-2010s marked a significant shift in speech enhancement research. By 2015, deep neural networks (DNNs) had begun to replace legacy DSP-based methods, particularly in denoising tasks. These models proved highly effective, eliminating the need for cumbersome, manually designed pipelines: a single, well-trained model could now handle a wide variety of noise conditions [1].
One of the most widely adopted techniques in modern speech enhancement is spectral masking [2]. This approach takes a magnitude spectrogram of noisy speech, extracts relevant features, and feeds them into a neural network. The model then predicts a time-frequency mask that is applied to the input spectrogram, effectively suppressing noise while preserving speech components. These approaches are referred to as discriminative—they identify and filter out noise.
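As a rough illustration of the masking idea, here is a toy PyTorch sketch that predicts a sigmoid mask from a magnitude spectrogram. It is a deliberately simplified model for exposition, not any production architecture.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Toy discriminative denoiser: predicts a time-frequency mask in
    [0, 1] from a magnitude spectrogram of shape (batch, freq, frames)."""

    def __init__(self, freq_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(freq_bins, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(hidden, freq_bins), nn.Sigmoid())

    def forward(self, noisy_mag):
        # The GRU expects (batch, frames, freq), so swap the last two axes.
        features, _ = self.rnn(noisy_mag.transpose(1, 2))
        mask = self.proj(features).transpose(1, 2)
        # Element-wise masking attenuates noise-dominated bins while
        # (ideally) leaving speech-dominated bins untouched.
        return mask * noisy_mag
```

Such a model is typically trained on pairs of noisy and clean recordings, for example by minimizing the distance between the masked output and the clean magnitude spectrogram; the enhanced magnitude is then recombined with the noisy phase for resynthesis.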

When trained on diverse speech and noise datasets, these models learn to differentiate speech from background sounds. Companies like Google, Microsoft, and Zoom have integrated variations of this technology into their teleconferencing applications. Under different names like background noise removal, noise suppression, or voice isolation, spectral masking has become an industry standard for improving voice clarity in calls and meetings.
However, despite their effectiveness, these methods have a fundamental limitation: they can only remove unwanted elements; they cannot restore lost speech components. Degradations common in VoIP communication, such as bandwidth limitations, packet loss, and clipping, result in missing information that spectral masking cannot reconstruct. This is an important characteristic of discriminative methods: they cannot make you sound better than the original recording. They can only remove the background.
Generative Speech Enhancement
The limitations of discriminative approaches led researchers to explore methods that could go beyond noise suppression and actually restore lost speech information. Around the same time, a breakthrough in the field of image processing introduced Generative Adversarial Networks (GANs) [3]. These networks demonstrated the ability to generate realistic images, including filling in missing parts of an image through a process known as inpainting. For example, GAN-based inpainting introduced in 2016 [4] used a DNN to reconstruct missing fragments of a photo so that they closely matched the rest of the picture, as shown in the figure below.

This inspired researchers to apply similar techniques to speech enhancement. Unlike traditional methods, GAN-based speech enhancement is capable of reconstructing degraded portions of a speech signal. This works particularly well because speech spectrograms share similarities with grayscale images, making them suitable for generative modeling. While modern GAN-based speech enhancement does not necessarily follow the inpainting concept exactly, it offers the ability to recover lost speech components [5, 6], making it a fundamentally different approach from spectral masking. Since these methods can recreate missing components of the speech signal, they are called generative.
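To give a flavor of how such models are trained, here is a minimal sketch of the adversarial objective for spectrogram restoration, assuming hypothetical generator and discriminator networks; real systems typically add further loss terms on top of this basic recipe.

```python
import torch
import torch.nn.functional as F

def gan_losses(generator, discriminator, degraded_spec, clean_spec):
    """Losses for one adversarial training step (illustrative sketch).
    `generator` maps a degraded spectrogram to a restored one;
    `discriminator` outputs a realism logit per spectrogram."""
    restored = generator(degraded_spec)

    # Discriminator: classify clean spectrograms as real (1) and generated
    # ones as fake (0); detach() keeps this loss from updating the generator.
    real_logits = discriminator(clean_spec)
    fake_logits = discriminator(restored.detach())
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )

    # Generator: fool the discriminator, plus a reconstruction term that
    # keeps the restored output close to the clean reference.
    adv_logits = discriminator(restored)
    g_loss = (
        F.binary_cross_entropy_with_logits(adv_logits, torch.ones_like(adv_logits))
        + F.l1_loss(restored, clean_spec)
    )
    return d_loss, g_loss
```

The adversarial term is what pushes the generator to produce plausible speech content in regions where the input carries no information at all; a purely reconstructive loss would only encourage blurry averages.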
The best of both worlds
In real-world applications, neither discriminative nor generative methods alone offer a complete solution. Instead, combining the strengths of both approaches provides the best results. At Revoize, we leverage a hybrid model that integrates a DNN-based discriminative denoiser with a GAN-based speech restoration system. This allows us to achieve both effective noise suppression and high-fidelity speech reconstruction.
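Conceptually, the composition is straightforward, as the sketch below shows. To be clear, this is a hypothetical illustration of the two-stage idea, not Revoize's actual architecture; the stage names are placeholders.

```python
import torch

def enhance(noisy_waveform, denoiser, restorer):
    """Hypothetical two-stage hybrid pipeline: a discriminative denoiser
    suppresses background noise, then a generative restoration model
    reconstructs missing speech components."""
    with torch.no_grad():
        denoised = denoiser(noisy_waveform)   # e.g., a mask-based DNN
        restored = restorer(denoised)         # e.g., a GAN generator
    return restored
```

Running the denoiser first means the generative stage sees a cleaner input and can focus its capacity on reconstruction rather than noise suppression.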

Hearing the Difference: Audio Comparisons
To illustrate the impact of generative speech enhancement, we have prepared a few examples. First, let's examine a sample of clean speech contaminated by street noise.
Using a third-party discriminative speech enhancement algorithm, we obtain the following result.
While the background noise is largely removed, artifacts remain, and the speech quality is still affected. This occurs because spectral masking is not always perfect, especially when speech and noise overlap in the spectrogram. Now, let’s listen to the same noisy speech processed by Revoize.
The difference is clear. The Revoize model produces a cleaner, more natural-sounding speech signal with fewer distortions.
However, this is still in the domain of noise removal.
Band-limited speech
In addition to denoising, generative models excel at restoring speech that has been degraded due to bandwidth limitations. Consider the following example of a speech signal that has been band-limited to 2 kHz.
Listening to this example, you can hear that the loss of higher frequencies makes the speech sound muffled and less intelligible. If we inspect its spectrogram, we can clearly see that the upper-band frequencies are missing.
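If you want to reproduce this kind of degradation yourself, the snippet below low-pass filters a wideband recording at 2 kHz using SciPy; the file names are placeholders.

```python
import soundfile as sf
from scipy.signal import butter, sosfilt

# Load a wideband speech recording (placeholder file name).
speech, sr = sf.read("clean_speech.wav")

# An 8th-order Butterworth low-pass filter with a 2 kHz cutoff removes
# the upper-band content, simulating a band-limited channel.
sos = butter(8, 2000, btype="lowpass", fs=sr, output="sos")
band_limited = sosfilt(sos, speech)

sf.write("band_limited_2khz.wav", band_limited, sr)
```

The resulting spectrogram is empty above 2 kHz, which is exactly the gap a generative model must fill.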

Traditional discriminative methods struggle to recover these missing frequencies, as demonstrated in the following processed example.

The enhanced speech remains largely unchanged because discriminative models are not designed to reconstruct lost information.
Now, let’s hear how Revoize handles the same degraded signal.
The improvement is significant. The upper frequency components are partially restored, making the speech clearer and more natural. If we inspect the spectrogram of the Revoize-enhanced sample, we can clearly see that the missing information has been reconstructed, showcasing the advantage of a generative approach.

Conclusion
Generative speech enhancement represents a major step forward in speech processing, offering capabilities that traditional discriminative models lack. By leveraging GANs, we can restore lost details, resulting in clearer, more natural-sounding speech. However, generative models also come with challenges. One of the most significant is the potential for hallucination—where the model reconstructs speech elements that were not originally present. This is an active area of research, and we will explore these challenges in a future blog post.
For now, it’s clear that the combination of discriminative and generative methods provides the best results. With Revoize leading the way, the future of speech enhancement is not just about suppressing noise—it’s about bringing lost speech back to life.
🎙 Make every word count—upgrade your sound with Revoize today.
References
[1] Ochieng P. Deep neural network techniques for monaural speech enhancement and separation: state of the art analysis. Artificial Intelligence Review. 2023 Dec;56(Suppl 3):3651-703.
[2] Saleem N, Khattak MI. Deep Neural Networks for Speech Enhancement in Complex-Noisy Environment. International Journal of Interactive Multimedia and Artificial Intelligence. 2019;5(2):26-32.
[3] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. Advances in neural information processing systems. 2014;27.
[4] Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016 (pp. 2536-2544).
[5] Su J, Jin Z, Finkelstein A. HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features. In 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021 Oct 17 (pp. 166-170). IEEE.
[6] Liu H, Liu X, Kong Q, Tian Q, Zhao Y, Wang D, Huang C, Wang Y. VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration. arXiv preprint arXiv:2204.05841; 2022.