SVCODEC:

A STREAMING VARIABLE NEURAL SPEECH CODEC

Huaifeng Zhang ¹, Peifei Wu¹, Guigeng Li¹, Yuan An¹, Hao Zhang^1*,

¹College of Electronic Engineerin, Ocean University of China, QingDao, 266100, China

Abstract.

Speech coding is of crucial importance in saving precious transmission and storage resources and significantly enhancing communication efficacy. Currently, it is evolving towards ultra-low bit rates, which imposes heightened demands on coding efficiency, communication delay, and speech quality. This paper presents a variable bit rate streaming neural speech codec designed for ultra-low bit rate scenarios, based on the SoundStream network framework. The codec employs the vector quantised variational auto-encoder (VQ-VAE) algorithm to capture the temporal structure and spectral characteristics of audio datasets. This process generates a latent space codebook that is strongly correlated with the feature matrix of the target signal and facilitates the mapping of feature vectors to discrete feature vectors in the latent space for low bit rate speech compression. During the joint quantization training of multiple latent space codebooks using residual vector quantizers (RVQ), a balanced training strategy is introduced to ensure the balanced weight of the codebook, facilitating variable bit implementation in the inference process. Considering the harmonic characteristics of speech signals, a multi-period discriminator and a multi-scale discriminator are utilized to overcome the limitations of single-scale discriminators. The short-time fourier transform (STFT) spectrum, which has the ability to provide more accurate time-frequency resolution, is employed to reconstruct the reconstruction loss function. Additionally, a codebook loss function is introduced to improve the utilization rate of the codebook and accelerate the convergence speed of the model. Objective and subjective experiments demonstrate that our proposed new neural speech codec outperforms traditional classical speech codecs and existing neural speech codecs in terms of reconstructed speech naturalness and quality while maintaining the low latency characteristic of neural speech codecs. With a multi-stimulus test with hidden reference and anchor (MUSHRA) score of 87, it is highly suitable for ultra-low bit rate speech compression applications such as satellite voice communication and narrowband instant messaging. The demo has been publicly released at https://svcodec.github.io/.

This page is for research demonstration purposes only.

SVCODEC Model Architecture

Part I:The generator (a) of the GAN network which consists of encoder (a1), vector quantizer (a2) and decoder (a3) and used to generate synthesized speech (y) which will be input to the discriminator (b) of the GAN network. Part II: The efficient discriminator (b), consisting of a multi-period discriminator (b1) and a multi-scale discriminator (b2).

Subjective Experimental Results

When the bit rate is the same or similar, SVCODEC holds an absolute advantage in MUSHRA score

Objective Experimental Results

In the same code rate range, SVCODEC has higher ViSQOL, PESQ, DNSMOS, and STOI scores.

Original Input Speech from LibriSpeech

Test set from LribriSpeech.

Label	Gender	Speaker	Text
Sample1	Male	Paul-Gabriel Wiener	HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT
Sample2	Male	Brad Bush	SATURDAY AUGUST FIFTEENTH THE SEA UNBROKEN ALL ROUND NO LAND IN SIGHT
Sample3	Male	Taylor Burton-Edward	OUT IN THE WOODS STOOD A NICE LITTLE FIR TREE
Sample4	Female	Nikolle Doolin	ALSO A POPULAR CONTRIVANCE WHEREBY LOVE MAKING MAY BE SUSPENDED BUT NOT STOPPED DURING THE PICNIC SEASON
Sample5	Female	Rachelellen	SOMEHOW OF ALL THE DAYS WHEN THE HOME FEELING WAS THE STRONGEST THIS DAY IT SEEMED AS IF SHE COULD BEAR IT NO LONGER

Output Speech of SVCODEC (Ours)

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
0.5kbps
1kbps
1.5kbps
2kbps
2.5kbps
3kbps
3.5kbps
4kbps

Output Speech of DAC

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
0.5kbps
1kbps
1.5kbps
2kbps
2.5kbps
3kbps
4kbps

Output Speech of Encodec

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
1.5kbps
3kbps

Output Speech of SoundStream

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
0.5kbps
1kbps
1.5kbps
2kbps
2.5kbps
3kbps
3.5kbps
4kbps

Output Speech of Codec2

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
0.7kbps
1.2kbps
1.6kbps
2.4kbps
3.2kbps

Output Speech of MELP

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
1.2kbps
2.4kbps

Output Speech of Opus

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
1.2kbps
2.4kbps
4kbps
6kbps

Output Speech of Speex

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
1.2kbps
3kbps
4kbps
6kbps

Output Speech of Lyra

Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
3.2kbps
6kbps

Multiple Comparisons

Similar output quality with lower bitrate.

Model	Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
Speex	6kbps
Opus	6kbps
Lyra	6kbps
SVCODEC	4kbps

Similar bitrate (1.2kbps) with higher output quality.

Model	Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
Speex	1.2kbps
Opus	1.2kbps
MELP	1.2kbps
Codec2	1.2kbps
DAC	1kbps
SoundStream	1kbps
SVCODEC	1kbps

Similar bitrate (2kbps) with higher output quality.

Model	Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
DAC	1.5kbps
Encodec	1.5kbps
SoundStream	1.5kbps
SVCODEC	1.5kbps

Similar bitrate (2kbps) with higher output quality.

Model	Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
DAC	2kbps
SoundStream	2kbps
SVCODEC	2kbps

Similar bitrate (2.4kbps) with higher output quality.

Model	Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
Opus	2.4kbps
MELP	2.4kbps
Codec2	2.4kbps
DAC	2.5kbps
SoundStream	2.5kbps
SVCODEC	2.5kbps

Similar bitrate (3kbps) with higher output quality.

Model	Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
Lyra	3.2kbps
Speex	3kbps
Codec2	3.2kbps
DAC	3kbps
Encodec	3kbps
SoundStream	3kbps
SVCODEC	3kbps

Similar bitrate (4kbps) with higher output quality.

Model	Bitrate	Sample1	Sample2	Sample3	Sample4	Sample5
Speex	4kbps
Opus	4kbps
DAC	4kbps
SoundStream	4kbps
SVCODEC	4kbps

Comparision Results on LibriSpeech

Similar output quality with lower bitrate. Similar bitrate with higher output quality.

Model	Bit Rate(kbps)	DNSMOS↑	PESQ↑	STOI↑	ViSQOL↑
Speex	1.2	2.76	2.30	0.73	3.22
Opus	1.2	2.76	2.35	0.69	3.47
MELP	1.2	3.11	2.61	0.25	3.60
Codec2	1.2	2.97	2.57	0.65	3.36
DAC	1	2.56	1.87	0.72	2.85
SoundStream	1	3.01	1.86	0.70	3.47
SVCODEC	1	3.12	2.43	0.72	3.72
Encodec	1.5	2.86	2.51	0.72	3.13
DAC	1.5	2.65	1.95	0.73	3.38
SoundStream	1.5	3.05	1.95	0.71	3.60
SVCODEC	1.5	3.19	2.57	0.74	3.91
DAC	2	2.95	2.30	0.74	3.73
SoundStream	2	3.08	2.00	0.72	3.66
SVCODEC	2	3.24	2.66	0.75	4.02
Opus	2.4	2.76	2.35	0.69	3.47
MELP	2.4	3.11	2.61	0.39	3.69
Codec2	2.4	3.08	2.72	0.65	3.48
DAC	2.5	3.05	2.42	0.75	3.98
SoundStream	2.5	3.09	2.04	0.72	3.69
SVCODEC	2.5	3.26	2.71	0.76	4.08
Lyra	3.2	2.99	2.83	0.67	4.00
Codec2	3.2	3.12	2.89	0.66	3.57
Speex	3	2.76	2.30	0.73	3.22
Encodec	3	3.09	2.75	0.84	3.44
DAC	3	3.12	2.60	0.78	4.10
SoundStream	3	3.09	2.06	0.73	3.71
SVCODEC	3	3.27	2.74	0.76	4.12
SoundStream	3.5	3.10	2.07	0.73	3.73
SVCODEC	3.5	3.29	2.76	0.76	4.15
Speex	4	2.80	2.58	0.73	3.48
Opus	4	2.76	2.35	0.69	3.47
DAC	4	3.26	2.83	0.85	4.25
SoundStream	4	3.10	2.08	0.73	3.75
SVCODEC	4	3.30	2.78	0.77	4.17