SVCODEC:
A STREAMING VARIABLE NEURAL SPEECH CODEC
Abstract.
Speech coding saves transmission and storage resources and improves communication efficiency. The field is currently moving towards ultra-low bit rates, which places greater demands on coding efficiency, communication delay, and speech quality.
This paper presents a variable bit rate streaming neural speech codec designed for ultra-low bit rate scenarios, based on the SoundStream network framework. The codec employs the vector-quantized variational autoencoder (VQ-VAE) algorithm to capture the temporal structure and spectral characteristics of the audio data. This process learns a latent space codebook that is strongly correlated with the feature matrix of the target signal, so that each continuous feature vector can be mapped to a discrete codebook vector in the latent space for low bit rate speech compression.
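The mapping from continuous feature vectors to discrete codebook entries can be illustrated with a nearest-neighbor lookup. This is a minimal sketch, not the SVCODEC implementation: the function name `quantize`, the toy codebook, and the feature values are all hypothetical, and a trained codec would learn the codebook rather than fix it by hand.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every feature (T, D) and every code (K, D).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)      # shape (T, K)
    indices = np.argmin(dists, axis=1)       # shape (T,)
    return indices, codebook[indices]

# Toy codebook with K = 4 entries of dimension D = 2 (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Three "frames" of encoder output (hypothetical values).
features = np.array([[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]])

indices, quantized = quantize(features, codebook)
# Only the indices need to be transmitted: log2(K) bits per frame.
bits_per_frame = int(np.log2(len(codebook)))
```

Only the integer indices are sent to the receiver, which looks the vectors up in its own copy of the codebook. That is where the compression comes from.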
During the joint training of multiple latent space codebooks with residual vector quantizers (RVQ), a balanced training strategy is introduced to keep the codebook weights balanced, which enables variable bit rate operation at inference time. Considering the harmonic characteristics of speech signals, a multi-period discriminator and a multi-scale discriminator are used to overcome the limitations of single-scale discriminators. The short-time Fourier transform (STFT) spectrum, which provides more accurate time-frequency resolution, is used to construct the reconstruction loss function. Additionally, a codebook loss function is introduced to improve codebook utilization and accelerate model convergence.
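The link between RVQ and variable bit rate can be sketched as follows: each stage quantizes the residual left by the previous stages, so a decoder can use only the first few stages and still produce a coarser reconstruction. This is an illustrative sketch under stated assumptions, not the paper's training procedure. The balanced training strategy is not detailed on this page and is not modeled here. The codebooks are random, and the stage count, codebook size, and latent dimension are hypothetical.

```python
import numpy as np

def rvq_encode(x, codebooks, n_stages):
    """Quantize x with the first n_stages codebooks; each stage codes the residual."""
    residual = x.copy()
    indices = []
    for codebook in codebooks[:n_stages]:
        dists = np.sum((residual[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
        idx = np.argmin(dists, axis=1)
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected entries of however many stages were transmitted."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

rng = np.random.default_rng(0)
# Hypothetical sizes: 8 stages, 16 entries each, 4-dimensional latents.
codebooks = []
for stage in range(8):
    cb = rng.normal(scale=0.5 ** stage, size=(16, 4))
    cb[0] = 0.0  # a zero entry guarantees a stage never increases the error
    codebooks.append(cb)

x = rng.normal(size=(10, 4))  # stand-in for encoder output frames

# Fewer stages means fewer bits per frame and a coarser reconstruction.
errors = []
for n_stages in (1, 4, 8):
    x_hat = rvq_decode(rvq_encode(x, codebooks, n_stages), codebooks)
    errors.append(float(np.mean((x - x_hat) ** 2)))
```

In this toy setup the reconstruction error cannot grow as stages are added, because every codebook contains a zero vector. In a trained codec, the analogous behavior has to come from training all stage prefixes to be usable on their own.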
Objective and subjective experiments demonstrate that the proposed codec outperforms classical speech codecs and existing neural speech codecs in the naturalness and quality of the reconstructed speech, while maintaining the low latency characteristic of neural speech codecs. With a multi-stimulus test with hidden reference and anchor (MUSHRA) score of 87, it is well suited to ultra-low bit rate speech compression applications such as satellite voice communication and narrowband instant messaging. The demo has been publicly released at https://svcodec.github.io/.
This page is for research demonstration purposes only.
SVCODEC Model Architecture
Part I: The generator (a) of the GAN, which consists of an encoder (a1), a vector quantizer (a2), and a decoder (a3), generates the synthesized speech (y) that is fed to the discriminator (b). Part II: The discriminator (b), which consists of a multi-period discriminator (b1) and a multi-scale discriminator (b2).
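To give a sense of what a multi-period discriminator looks at, the sketch below folds a 1-D waveform into a 2-D array for each period, so that samples spaced one period apart line up in a column. This follows the folding used by HiFi-GAN-style multi-period discriminators. Whether SVCODEC folds its input in exactly this way is an assumption. The period set is the one listed for SVCODEC in the model variants table on this page. The function name and the toy waveform are hypothetical.

```python
import numpy as np

def fold_by_period(wave, period):
    """Reshape a 1-D waveform to (frames, period) after right-padding."""
    pad = (-len(wave)) % period                  # samples needed to reach a multiple
    padded = np.pad(wave, (0, pad), mode="reflect")
    return padded.reshape(-1, period)

PERIODS = [2, 3, 5, 7, 11]       # period set listed for SVCODEC on this page
wave = np.arange(16.0)           # toy waveform: sample i has value i

# One 2-D view per period; each would feed its own sub-discriminator.
shapes = {p: fold_by_period(wave, p).shape for p in PERIODS}

# Column 0 of the period-5 view holds samples 0, 5, 10, 15.
column0 = fold_by_period(wave, 5)[:, 0]
```

Using prime periods keeps the views from overlapping in the sample spacings they expose, which is the usual motivation for this choice.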
Subjective Experimental Results
At the same or a similar bit rate, SVCODEC achieves a clearly higher MUSHRA score than the compared codecs.
Objective Experimental Results
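Within the same bit rate range, SVCODEC achieves the highest DNSMOS at every bit rate and the highest ViSQOL at every bit rate except 4 kbps (see the LibriSpeech results table). Its PESQ and STOI scores are competitive, although some baselines score higher at individual bit rates.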
Original Input Speech from LibriSpeech
Test set from LibriSpeech.
| Label | Gender | Speaker | Text | Speech |
|---|---|---|---|---|
| Sample1 | Male | Paul-Gabriel Wiener | HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT | |
| Sample2 | Male | Brad Bush | SATURDAY AUGUST FIFTEENTH THE SEA UNBROKEN ALL ROUND NO LAND IN SIGHT | |
| Sample3 | Male | Taylor Burton-Edward | OUT IN THE WOODS STOOD A NICE LITTLE FIR TREE | |
| Sample4 | Female | Nikolle Doolin | ALSO A POPULAR CONTRIVANCE WHEREBY LOVE MAKING MAY BE SUSPENDED BUT NOT STOPPED DURING THE PICNIC SEASON | |
| Sample5 | Female | Rachelellen | SOMEHOW OF ALL THE DAYS WHEN THE HOME FEELING WAS THE STRONGEST THIS DAY IT SEEMED AS IF SHE COULD BEAR IT NO LONGER | |
Output Speech of SVCODEC (Ours)
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 0.5kbps | |||||
| 1kbps | |||||
| 1.5kbps | |||||
| 2kbps | |||||
| 2.5kbps | |||||
| 3kbps | |||||
| 3.5kbps | |||||
| 4kbps | |||||
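The even 0.5 kbps spacing of the bit rates above is what one would expect if each RVQ stage adds a fixed number of bits per frame. The arithmetic below is only a sketch of that relationship. The frame rate of 50 Hz and the codebook size of 1024 are assumptions chosen because they reproduce the listed rates. They are not values stated on this page.

```python
import math

def rvq_bitrate_bps(n_stages, codebook_size, frame_rate_hz):
    """Bit rate of an RVQ code stream: stages x bits per index x frames per second."""
    return n_stages * math.log2(codebook_size) * frame_rate_hz

FRAME_RATE_HZ = 50     # assumed, not stated in the source
CODEBOOK_SIZE = 1024   # assumed, gives 10 bits per index

# Using 1 to 8 stages under these assumptions.
rates_kbps = [
    rvq_bitrate_bps(n, CODEBOOK_SIZE, FRAME_RATE_HZ) / 1000
    for n in range(1, 9)
]
```

Under these assumed values, each stage contributes 500 bit/s, so choosing how many stages to transmit selects one of eight rates between 0.5 and 4 kbps.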
Output Speech of DAC
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 0.5kbps | |||||
| 1kbps | |||||
| 1.5kbps | |||||
| 2kbps | |||||
| 2.5kbps | |||||
| 3kbps | |||||
| 4kbps | |||||
Output Speech of Encodec
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 1.5kbps | |||||
| 3kbps | |||||
Output Speech of SoundStream
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 0.5kbps | |||||
| 1kbps | |||||
| 1.5kbps | |||||
| 2kbps | |||||
| 2.5kbps | |||||
| 3kbps | |||||
| 3.5kbps | |||||
| 4kbps | |||||
Output Speech of Codec2
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 0.7kbps | |||||
| 1.2kbps | |||||
| 1.6kbps | |||||
| 2.4kbps | |||||
| 3.2kbps | |||||
Output Speech of MELP
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 1.2kbps | |||||
| 2.4kbps | |||||
Output Speech of Opus
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 1.2kbps | |||||
| 2.4kbps | |||||
| 4kbps | |||||
| 6kbps | |||||
Output Speech of Speex
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 1.2kbps | |||||
| 3kbps | |||||
| 4kbps | |||||
| 6kbps | |||||
Output Speech of Lyra
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 3.2kbps | |||||
| 6kbps | |||||
Multiple Comparisons
Similar output quality with lower bitrate.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Speex | 6kbps | |||||
| Opus | 6kbps | |||||
| Lyra | 6kbps | |||||
| SVCODEC | 4kbps | |||||
Similar bitrate (1.2kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Speex | 1.2kbps | |||||
| Opus | 1.2kbps | |||||
| MELP | 1.2kbps | |||||
| Codec2 | 1.2kbps | |||||
| DAC | 1kbps | |||||
| SoundStream | 1kbps | |||||
| SVCODEC | 1kbps | |||||
Similar bitrate (1.5kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| DAC | 1.5kbps | |||||
| Encodec | 1.5kbps | |||||
| SoundStream | 1.5kbps | |||||
| SVCODEC | 1.5kbps | |||||
Similar bitrate (2kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| DAC | 2kbps | |||||
| SoundStream | 2kbps | |||||
| SVCODEC | 2kbps | |||||
Similar bitrate (2.4kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Opus | 2.4kbps | |||||
| MELP | 2.4kbps | |||||
| Codec2 | 2.4kbps | |||||
| DAC | 2.5kbps | |||||
| SoundStream | 2.5kbps | |||||
| SVCODEC | 2.5kbps | |||||
Similar bitrate (3kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Lyra | 3.2kbps | |||||
| Speex | 3kbps | |||||
| Codec2 | 3.2kbps | |||||
| DAC | 3kbps | |||||
| Encodec | 3kbps | |||||
| SoundStream | 3kbps | |||||
| SVCODEC | 3kbps | |||||
Similar bitrate (4kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Speex | 4kbps | |||||
| Opus | 4kbps | |||||
| DAC | 4kbps | |||||
| SoundStream | 4kbps | |||||
| SVCODEC | 4kbps | |||||
Comparison Results on LibriSpeech
Similar output quality with lower bitrate. Similar bitrate with higher output quality.
| Model | Bit Rate(kbps) | DNSMOS↑ | PESQ↑ | STOI↑ | ViSQOL↑ |
|---|---|---|---|---|---|
| Speex | 1.2 | 2.76 | 2.30 | 0.73 | 3.22 |
| Opus | 1.2 | 2.76 | 2.35 | 0.69 | 3.47 |
| MELP | 1.2 | 3.11 | 2.61 | 0.25 | 3.60 |
| Codec2 | 1.2 | 2.97 | 2.57 | 0.65 | 3.36 |
| DAC | 1 | 2.56 | 1.87 | 0.72 | 2.85 |
| SoundStream | 1 | 3.01 | 1.86 | 0.70 | 3.47 |
| SVCODEC | 1 | 3.12 | 2.43 | 0.72 | 3.72 |
| Encodec | 1.5 | 2.86 | 2.51 | 0.72 | 3.13 |
| DAC | 1.5 | 2.65 | 1.95 | 0.73 | 3.38 |
| SoundStream | 1.5 | 3.05 | 1.95 | 0.71 | 3.60 |
| SVCODEC | 1.5 | 3.19 | 2.57 | 0.74 | 3.91 |
| DAC | 2 | 2.95 | 2.30 | 0.74 | 3.73 |
| SoundStream | 2 | 3.08 | 2.00 | 0.72 | 3.66 |
| SVCODEC | 2 | 3.24 | 2.66 | 0.75 | 4.02 |
| Opus | 2.4 | 2.76 | 2.35 | 0.69 | 3.47 |
| MELP | 2.4 | 3.11 | 2.61 | 0.39 | 3.69 |
| Codec2 | 2.4 | 3.08 | 2.72 | 0.65 | 3.48 |
| DAC | 2.5 | 3.05 | 2.42 | 0.75 | 3.98 |
| SoundStream | 2.5 | 3.09 | 2.04 | 0.72 | 3.69 |
| SVCODEC | 2.5 | 3.26 | 2.71 | 0.76 | 4.08 |
| Lyra | 3.2 | 2.99 | 2.83 | 0.67 | 4.00 |
| Codec2 | 3.2 | 3.12 | 2.89 | 0.66 | 3.57 |
| Speex | 3 | 2.76 | 2.30 | 0.73 | 3.22 |
| Encodec | 3 | 3.09 | 2.75 | 0.84 | 3.44 |
| DAC | 3 | 3.12 | 2.60 | 0.78 | 4.10 |
| SoundStream | 3 | 3.09 | 2.06 | 0.73 | 3.71 |
| SVCODEC | 3 | 3.27 | 2.74 | 0.76 | 4.12 |
| SoundStream | 3.5 | 3.10 | 2.07 | 0.73 | 3.73 |
| SVCODEC | 3.5 | 3.29 | 2.76 | 0.76 | 4.15 |
| Speex | 4 | 2.80 | 2.58 | 0.73 | 3.48 |
| Opus | 4 | 2.76 | 2.35 | 0.69 | 3.47 |
| DAC | 4 | 3.26 | 2.83 | 0.85 | 4.25 |
| SoundStream | 4 | 3.10 | 2.08 | 0.73 | 3.75 |
| SVCODEC | 4 | 3.30 | 2.78 | 0.77 | 4.17 |
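Because the table mixes several codecs and four metrics, it can help to read it one bit rate group at a time. The short sketch below does this for the 4 kbps rows only, with the values copied from the table above. The variable names are illustrative.

```python
# (model, DNSMOS, PESQ, STOI, ViSQOL), copied from the 4 kbps rows of the table.
rows = [
    ("Speex", 2.80, 2.58, 0.73, 3.48),
    ("Opus", 2.76, 2.35, 0.69, 3.47),
    ("DAC", 3.26, 2.83, 0.85, 4.25),
    ("SoundStream", 3.10, 2.08, 0.73, 3.75),
    ("SVCODEC", 3.30, 2.78, 0.77, 4.17),
]
metrics = ["DNSMOS", "PESQ", "STOI", "ViSQOL"]

# For each metric (higher is better), find the model with the top score.
best = {
    metric: max(rows, key=lambda row: row[i + 1])[0]
    for i, metric in enumerate(metrics)
}
```

At 4 kbps this gives SVCODEC for DNSMOS and DAC for PESQ, STOI, and ViSQOL. The same loop can be applied to the other bit rate groups by swapping in their rows.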
SVCODEC Model Variants
| Model | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| SVCODEC | |||||
| SVCODEC v1 | |||||
| SVCODEC v2 | |||||
Comparison of model variants at 4kbps.
| Model | Periods | DNSMOS↑ | PESQ↑ | STOI↑ | ViSQOL↑ |
|---|---|---|---|---|---|
| SVCODEC | [2 3 5 7 11] | 3.296 | 2.775 | 0.765 | 4.17 |
| SVCODEC v1 | [2 3 5 7] | 3.275 | 2.747 | 0.756 | 4.13 |
| SVCODEC v2 | [2 3 5 7 11 13] | 3.249 | 2.729 | 0.627 | 4.12 |
SVCODEC Model Generalization
| Dataset | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| LJSpeech | |||||
| LibriTTS | |||||
| Aishell-1 | |||||
Comparison on generalization of SVCODEC
| Dataset | DNSMOS↑ | PESQ↑ | STOI↑ | ViSQOL↑ |
|---|---|---|---|---|
| Aishell-1 | 3.09 | 2.42 | 0.65 | 4.08 |
| LibriSpeech | 3.30 | 2.78 | 0.77 | 4.17 |
| LibriTTS | 3.18 | 2.76 | 0.77 | 4.11 |
| LJSpeech | 3.43 | 2.61 | 0.74 | 4.32 |