SVCODEC:

A STREAMING VARIABLE NEURAL SPEECH CODEC


Huaifeng Zhang 1, Peifei Wu1, Guigeng Li1, Yuan An1, Hao Zhang1*,

1College of Electronic Engineerin, Ocean University of China, QingDao, 266100, China

Abstract.

Speech coding is of crucial importance in saving precious transmission and storage resources and significantly enhancing communication efficacy. Currently, it is evolving towards ultra-low bit rates, which imposes heightened demands on coding efficiency, communication delay, and speech quality. This paper presents a variable bit rate streaming neural speech codec designed for ultra-low bit rate scenarios, based on the SoundStream network framework. The codec employs the vector quantised variational auto-encoder (VQ-VAE) algorithm to capture the temporal structure and spectral characteristics of audio datasets. This process generates a latent space codebook that is strongly correlated with the feature matrix of the target signal and facilitates the mapping of feature vectors to discrete feature vectors in the latent space for low bit rate speech compression. During the joint quantization training of multiple latent space codebooks using residual vector quantizers (RVQ), a balanced training strategy is introduced to ensure the balanced weight of the codebook, facilitating variable bit implementation in the inference process. Considering the harmonic characteristics of speech signals, a multi-period discriminator and a multi-scale discriminator are utilized to overcome the limitations of single-scale discriminators. The short-time fourier transform (STFT) spectrum, which has the ability to provide more accurate time-frequency resolution, is employed to reconstruct the reconstruction loss function. Additionally, a codebook loss function is introduced to improve the utilization rate of the codebook and accelerate the convergence speed of the model. Objective and subjective experiments demonstrate that our proposed new neural speech codec outperforms traditional classical speech codecs and existing neural speech codecs in terms of reconstructed speech naturalness and quality while maintaining the low latency characteristic of neural speech codecs. With a multi-stimulus test with hidden reference and anchor (MUSHRA) score of 87, it is highly suitable for ultra-low bit rate speech compression applications such as satellite voice communication and narrowband instant messaging. The demo has been publicly released at https://svcodec.github.io/.





This page is for research demonstration purposes only.

SVCODEC Model Architecture

Part I:The generator (a) of the GAN network which consists of encoder (a1), vector quantizer (a2) and decoder (a3) and used to generate synthesized speech (y) which will be input to the discriminator (b) of the GAN network. Part II: The efficient discriminator (b), consisting of a multi-period discriminator (b1) and a multi-scale discriminator (b2).

Subjective Experimental Results

When the bit rate is the same or similar, SVCODEC holds an absolute advantage in MUSHRA score

Objective Experimental Results

In the same code rate range, SVCODEC has higher ViSQOL, PESQ, DNSMOS, and STOI scores.

Original Input Speech from LibriSpeech

Test set from LribriSpeech.
Label Gender Speaker Text Speech
Sample1 Male Paul-Gabriel Wiener HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT
Sample2 Male Brad Bush SATURDAY AUGUST FIFTEENTH THE SEA UNBROKEN ALL ROUND NO LAND IN SIGHT
Sample3 Male Taylor Burton-Edward OUT IN THE WOODS STOOD A NICE LITTLE FIR TREE
Sample4 Female Nikolle Doolin ALSO A POPULAR CONTRIVANCE WHEREBY LOVE MAKING MAY BE SUSPENDED BUT NOT STOPPED DURING THE PICNIC SEASON
Sample5 Female Rachelellen SOMEHOW OF ALL THE DAYS WHEN THE HOME FEELING WAS THE STRONGEST THIS DAY IT SEEMED AS IF SHE COULD BEAR IT NO LONGER

Output Speech of SVCODEC (Ours)

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
0.5kbps
1kbps
1.5kbps
2kbps
2.5kbps
3kbps
3.5kbps
4kbps

Output Speech of DAC

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
0.5kbps
1kbps
1.5kbps
2kbps
2.5kbps
3kbps
4kbps

Output Speech of Encodec

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
1.5kbps
3kbps

Output Speech of SoundStream

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
0.5kbps
1kbps
1.5kbps
2kbps
2.5kbps
3kbps
3.5kbps
4kbps

Output Speech of Codec2

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
0.7kbps
1.2kbps
1.6kbps
2.4kbps
3.2kbps

Output Speech of MELP

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
1.2kbps
2.4kbps

Output Speech of Opus

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
1.2kbps
2.4kbps
4kbps
6kbps

Output Speech of Speex

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
1.2kbps
3kbps
4kbps
6kbps

Output Speech of Lyra

Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
3.2kbps
6kbps

Multiple Comparisons

Similar output quality with lower bitrate.
Model Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
Speex 6kbps
Opus 6kbps
Lyra 6kbps
SVCODEC 4kbps


Similar bitrate (1.2kbps) with higher output quality.
Model Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
Speex 1.2kbps
Opus 1.2kbps
MELP 1.2kbps
Codec2 1.2kbps
DAC 1kbps
SoundStream 1kbps
SVCODEC 1kbps


Similar bitrate (2kbps) with higher output quality.
Model Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
DAC 1.5kbps
Encodec 1.5kbps
SoundStream 1.5kbps
SVCODEC 1.5kbps


Similar bitrate (2kbps) with higher output quality.
Model Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
DAC 2kbps
SoundStream 2kbps
SVCODEC 2kbps


Similar bitrate (2.4kbps) with higher output quality.
Model Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
Opus 2.4kbps
MELP 2.4kbps
Codec2 2.4kbps
DAC 2.5kbps
SoundStream 2.5kbps
SVCODEC 2.5kbps


Similar bitrate (3kbps) with higher output quality.
Model Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
Lyra 3.2kbps
Speex 3kbps
Codec2 3.2kbps
DAC 3kbps
Encodec 3kbps
SoundStream 3kbps
SVCODEC 3kbps


Similar bitrate (4kbps) with higher output quality.
Model Bitrate Sample1 Sample2 Sample3 Sample4 Sample5
Speex 4kbps
Opus 4kbps
DAC 4kbps
SoundStream 4kbps
SVCODEC 4kbps

Comparision Results on LibriSpeech

Similar output quality with lower bitrate. Similar bitrate with higher output quality.
Model Bit Rate(kbps) DNSMOS↑ PESQ↑ STOI↑ ViSQOL↑
Speex 1.2 2.76 2.30 0.73 3.22
Opus 1.2 2.76 2.35 0.69 3.47
MELP 1.2 3.11 2.61 0.25 3.60
Codec2 1.2 2.97 2.57 0.65 3.36
DAC 1 2.56 1.87 0.72 2.85
SoundStream 1 3.01 1.86 0.70 3.47
SVCODEC 1 3.12 2.43 0.72 3.72
Encodec 1.5 2.86 2.51 0.72 3.13
DAC 1.5 2.65 1.95 0.73 3.38
SoundStream 1.5 3.05 1.95 0.71 3.60
SVCODEC 1.5 3.19 2.57 0.74 3.91
DAC 2 2.95 2.30 0.74 3.73
SoundStream 2 3.08 2.00 0.72 3.66
SVCODEC 2 3.24 2.66 0.75 4.02
Opus 2.4 2.76 2.35 0.69 3.47
MELP 2.4 3.11 2.61 0.39 3.69
Codec2 2.4 3.08 2.72 0.65 3.48
DAC 2.5 3.05 2.42 0.75 3.98
SoundStream 2.5 3.09 2.04 0.72 3.69
SVCODEC 2.5 3.26 2.71 0.76 4.08
Lyra 3.2 2.99 2.83 0.67 4.00
Codec2 3.2 3.12 2.89 0.66 3.57
Speex 3 2.76 2.30 0.73 3.22
Encodec 3 3.09 2.75 0.84 3.44
DAC 3 3.12 2.60 0.78 4.10
SoundStream 3 3.09 2.06 0.73 3.71
SVCODEC 3 3.27 2.74 0.76 4.12
SoundStream 3.5 3.10 2.07 0.73 3.73
SVCODEC 3.5 3.29 2.76 0.76 4.15
Speex 4 2.80 2.58 0.73 3.48
Opus 4 2.76 2.35 0.69 3.47
DAC 4 3.26 2.83 0.85 4.25
SoundStream 4 3.10 2.08 0.73 3.75
SVCODEC 4 3.30 2.78 0.77 4.17


SVCODEC Model Variants

Model Sample1 Sample2 Sample3 Sample4 Sample5
SVCODEC
SVCODEC V1
SVCODEC V2

Comparison of model variants at 4kbps.

Model Periods DNSMOS↑ PESQ↑ STOI↑ ViSQOL↑
SVCODEC [2 3 5 7 11] 3.296 2.775 0.765 4.17
SVCODEC v1 [2 3 5 7] 3.275 2.747 0.756 4.13
SVCODEC v2 [2 3 5 7 11 13] 3.249 2.729 0.627 4.12

SVCODEC Model Generalization

Dataset Sample1 Sample2 Sample3 Sample4 Sample5
LJSpeech
LibriTTS
Aishell-1

Comparison on generalization of SVCODEC

Dataset DNSMOS↑ PESQ↑ STOI↑ ViSQOL↑
Aishell-1 3.09 2.42 0.65 4.08
LibriSpeech 3.30 2.78 0.77 4.17
LibriTTS 3.18 2.76 0.77 4.11
LJSpeech 3.43 2.61 0.74 4.32