SVCODEC:
A STREAMING VARIABLE NEURAL SPEECH CODEC
Huaifeng Zhang1, Peifei Wu1, Guigeng Li1, Yuan An1, Hao Zhang1*
1College of Electronic Engineering, Ocean University of China, Qingdao, 266100, China
Abstract.
This page is for research demonstration purposes only.
SVCODEC Model Architecture
Part I: The generator (a) of the GAN consists of an encoder (a1), a vector quantizer (a2), and a decoder (a3). It generates the synthesized speech (y), which is fed to the discriminator (b) of the GAN. Part II: The efficient discriminator (b) consists of a multi-period discriminator (b1) and a multi-scale discriminator (b2).
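The page does not spell out how the multi-period discriminator (b1) views the waveform. In the commonly used HiFi-GAN-style formulation, each sub-discriminator folds the 1D signal into a 2D grid whose width is one period, so that samples spaced a period apart line up in columns. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `fold_by_period` is hypothetical, and zero-padding is used here for simplicity (HiFi-GAN itself uses reflect padding). The period list is the one reported for SVCODEC in the variants table on this page.

```python
from typing import List

# Periods reported for SVCODEC in the model-variants table on this page.
PERIODS = [2, 3, 5, 7, 11]


def fold_by_period(signal: List[float], period: int) -> List[List[float]]:
    """Fold a 1D signal into rows of length `period`.

    The tail is zero-padded so the length becomes a multiple of `period`
    (a simplification; this padding choice is an assumption).
    """
    if period <= 0:
        raise ValueError("period must be positive")
    pad = (period - len(signal) % period) % period
    padded = list(signal) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]


if __name__ == "__main__":
    wave = [float(n) for n in range(12)]  # toy 12-sample "waveform"
    for p in PERIODS:
        grid = fold_by_period(wave, p)
        print(f"period {p}: {len(grid)} rows x {p} columns")
```

Using mutually prime periods means each sub-discriminator sees a different alignment of the same samples, which is the usual motivation for lists such as [2 3 5 7 11].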
Subjective Experimental Results
At the same or similar bit rates, SVCODEC clearly outperforms the other codecs in MUSHRA score.
Objective Experimental Results
Within the same bit rate range, SVCODEC generally has higher ViSQOL, PESQ, DNSMOS, and STOI scores.
Original Input Speech from LibriSpeech
| Label | Gender | Speaker | Text | Speech |
|---|---|---|---|---|
| Sample1 | Male | Paul-Gabriel Wiener | HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT | |
| Sample2 | Male | Brad Bush | SATURDAY AUGUST FIFTEENTH THE SEA UNBROKEN ALL ROUND NO LAND IN SIGHT | |
| Sample3 | Male | Taylor Burton-Edward | OUT IN THE WOODS STOOD A NICE LITTLE FIR TREE | |
| Sample4 | Female | Nikolle Doolin | ALSO A POPULAR CONTRIVANCE WHEREBY LOVE MAKING MAY BE SUSPENDED BUT NOT STOPPED DURING THE PICNIC SEASON | |
| Sample5 | Female | Rachelellen | SOMEHOW OF ALL THE DAYS WHEN THE HOME FEELING WAS THE STRONGEST THIS DAY IT SEEMED AS IF SHE COULD BEAR IT NO LONGER | |
Output Speech of SVCODEC (Ours)
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 0.5kbps | |||||
| 1kbps | |||||
| 1.5kbps | |||||
| 2kbps | |||||
| 2.5kbps | |||||
| 3kbps | |||||
| 3.5kbps | |||||
| 4kbps | |||||
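The bitrates above step in 0.5 kbps increments. The page does not state how SVCODEC reaches them, but for codecs that stack several quantizer codebooks per frame, bitrate is commonly computed as frame rate × number of codebooks × bits per codebook index. The sketch below only illustrates that arithmetic. The frame rate (50 Hz) and codebook size (1024) are hypothetical values chosen so that each codebook contributes 0.5 kbps; they are not taken from the paper.

```python
import math

# Hypothetical values, chosen only for illustration (not from the paper).
FRAME_RATE_HZ = 50.0   # latent frames per second
CODEBOOK_SIZE = 1024   # entries per codebook, i.e. 10 bits per index


def quantizer_bitrate_bps(frame_rate_hz: float, n_codebooks: int,
                          codebook_size: int) -> float:
    """Bits per second when each frame sends one index per codebook."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


if __name__ == "__main__":
    for n in range(1, 9):
        kbps = quantizer_bitrate_bps(FRAME_RATE_HZ, n, CODEBOOK_SIZE) / 1000
        print(f"{n} codebook(s): {kbps} kbps")
```

Under these assumed values, using one to eight codebooks would span 0.5 kbps to 4 kbps, which is one way a single model can serve several bitrates.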
Output Speech of DAC
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 0.5kbps | |||||
| 1kbps | |||||
| 1.5kbps | |||||
| 2kbps | |||||
| 2.5kbps | |||||
| 3kbps | |||||
| 4kbps | |||||
Output Speech of Encodec
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 1.5kbps | |||||
| 3kbps | |||||
Output Speech of SoundStream
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 0.5kbps | |||||
| 1kbps | |||||
| 1.5kbps | |||||
| 2kbps | |||||
| 2.5kbps | |||||
| 3kbps | |||||
| 3.5kbps | |||||
| 4kbps | |||||
Output Speech of Codec2
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 0.7kbps | |||||
| 1.2kbps | |||||
| 1.6kbps | |||||
| 2.4kbps | |||||
| 3.2kbps | |||||
Output Speech of MELP
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 1.2kbps | |||||
| 2.4kbps | |||||
Output Speech of Opus
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 1.2kbps | |||||
| 2.4kbps | |||||
| 4kbps | |||||
| 6kbps | |||||
Output Speech of Speex
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 1.2kbps | |||||
| 3kbps | |||||
| 4kbps | |||||
| 6kbps | |||||
Output Speech of Lyra
| Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| 3.2kbps | |||||
| 6kbps | |||||
Multiple Comparisons
Similar output quality with lower bitrate.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Speex | 6kbps | |||||
| Opus | 6kbps | |||||
| Lyra | 6kbps | |||||
| SVCODEC | 4kbps | |||||
Similar bitrate (1.2kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Speex | 1.2kbps | |||||
| Opus | 1.2kbps | |||||
| MELP | 1.2kbps | |||||
| Codec2 | 1.2kbps | |||||
| DAC | 1kbps | |||||
| SoundStream | 1kbps | |||||
| SVCODEC | 1kbps | |||||
Similar bitrate (1.5kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| DAC | 1.5kbps | |||||
| Encodec | 1.5kbps | |||||
| SoundStream | 1.5kbps | |||||
| SVCODEC | 1.5kbps | |||||
Similar bitrate (2kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| DAC | 2kbps | |||||
| SoundStream | 2kbps | |||||
| SVCODEC | 2kbps | |||||
Similar bitrate (2.4kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Opus | 2.4kbps | |||||
| MELP | 2.4kbps | |||||
| Codec2 | 2.4kbps | |||||
| DAC | 2.5kbps | |||||
| SoundStream | 2.5kbps | |||||
| SVCODEC | 2.5kbps | |||||
Similar bitrate (3kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Lyra | 3.2kbps | |||||
| Speex | 3kbps | |||||
| Codec2 | 3.2kbps | |||||
| DAC | 3kbps | |||||
| Encodec | 3kbps | |||||
| SoundStream | 3kbps | |||||
| SVCODEC | 3kbps | |||||
Similar bitrate (4kbps) with higher output quality.
| Model | Bitrate | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|---|
| Speex | 4kbps | |||||
| Opus | 4kbps | |||||
| DAC | 4kbps | |||||
| SoundStream | 4kbps | |||||
| SVCODEC | 4kbps | |||||
Comparison Results on LibriSpeech
| Model | Bit Rate(kbps) | DNSMOS↑ | PESQ↑ | STOI↑ | ViSQOL↑ |
|---|---|---|---|---|---|
| Speex | 1.2 | 2.76 | 2.30 | 0.73 | 3.22 |
| Opus | 1.2 | 2.76 | 2.35 | 0.69 | 3.47 |
| MELP | 1.2 | 3.11 | 2.61 | 0.25 | 3.60 |
| Codec2 | 1.2 | 2.97 | 2.57 | 0.65 | 3.36 |
| DAC | 1 | 2.56 | 1.87 | 0.72 | 2.85 |
| SoundStream | 1 | 3.01 | 1.86 | 0.70 | 3.47 |
| SVCODEC | 1 | 3.12 | 2.43 | 0.72 | 3.72 |
| Encodec | 1.5 | 2.86 | 2.51 | 0.72 | 3.13 |
| DAC | 1.5 | 2.65 | 1.95 | 0.73 | 3.38 |
| SoundStream | 1.5 | 3.05 | 1.95 | 0.71 | 3.60 |
| SVCODEC | 1.5 | 3.19 | 2.57 | 0.74 | 3.91 |
| DAC | 2 | 2.95 | 2.30 | 0.74 | 3.73 |
| SoundStream | 2 | 3.08 | 2.00 | 0.72 | 3.66 |
| SVCODEC | 2 | 3.24 | 2.66 | 0.75 | 4.02 |
| Opus | 2.4 | 2.76 | 2.35 | 0.69 | 3.47 |
| MELP | 2.4 | 3.11 | 2.61 | 0.39 | 3.69 |
| Codec2 | 2.4 | 3.08 | 2.72 | 0.65 | 3.48 |
| DAC | 2.5 | 3.05 | 2.42 | 0.75 | 3.98 |
| SoundStream | 2.5 | 3.09 | 2.04 | 0.72 | 3.69 |
| SVCODEC | 2.5 | 3.26 | 2.71 | 0.76 | 4.08 |
| Lyra | 3.2 | 2.99 | 2.83 | 0.67 | 4.00 |
| Codec2 | 3.2 | 3.12 | 2.89 | 0.66 | 3.57 |
| Speex | 3 | 2.76 | 2.30 | 0.73 | 3.22 |
| Encodec | 3 | 3.09 | 2.75 | 0.84 | 3.44 |
| DAC | 3 | 3.12 | 2.60 | 0.78 | 4.10 |
| SoundStream | 3 | 3.09 | 2.06 | 0.73 | 3.71 |
| SVCODEC | 3 | 3.27 | 2.74 | 0.76 | 4.12 |
| SoundStream | 3.5 | 3.10 | 2.07 | 0.73 | 3.73 |
| SVCODEC | 3.5 | 3.29 | 2.76 | 0.76 | 4.15 |
| Speex | 4 | 2.80 | 2.58 | 0.73 | 3.48 |
| Opus | 4 | 2.76 | 2.35 | 0.69 | 3.47 |
| DAC | 4 | 3.26 | 2.83 | 0.85 | 4.25 |
| SoundStream | 4 | 3.10 | 2.08 | 0.73 | 3.75 |
| SVCODEC | 4 | 3.30 | 2.78 | 0.77 | 4.17 |
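To read a table like this one, it helps to compare the codecs within a single bitrate group, one metric at a time. The sketch below does that for the 1.5 kbps rows, with the numbers copied from the table above. The helper name `best_per_metric` is illustrative, and "higher is better" is assumed for all four metrics, as the ↑ markers in the header indicate.

```python
from typing import Dict, Tuple

METRICS = ("DNSMOS", "PESQ", "STOI", "ViSQOL")

# Rows copied from the 1.5 kbps group of the LibriSpeech table above,
# in the order (DNSMOS, PESQ, STOI, ViSQOL).
ROWS_1_5_KBPS: Dict[str, Tuple[float, float, float, float]] = {
    "Encodec": (2.86, 2.51, 0.72, 3.13),
    "DAC": (2.65, 1.95, 0.73, 3.38),
    "SoundStream": (3.05, 1.95, 0.71, 3.60),
    "SVCODEC": (3.19, 2.57, 0.74, 3.91),
}


def best_per_metric(rows: Dict[str, Tuple[float, ...]]) -> Dict[str, str]:
    """Return, for each metric, the model with the highest score."""
    return {
        metric: max(rows, key=lambda name: rows[name][i])
        for i, metric in enumerate(METRICS)
    }


if __name__ == "__main__":
    for metric, model in best_per_metric(ROWS_1_5_KBPS).items():
        print(f"{metric}: {model}")
```

The same helper can be pointed at any other bitrate group by swapping in that group's rows; the outcome is not the same in every group, so each one is worth checking separately.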
SVCODEC Model Variants
| Model | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| SVCODEC | |||||
| SVCODEC v1 | |||||
| SVCODEC v2 | |||||
Comparison of model variants at 4kbps.
| Model | Periods | DNSMOS↑ | PESQ↑ | STOI↑ | ViSQOL↑ |
|---|---|---|---|---|---|
| SVCODEC | [2 3 5 7 11] | 3.296 | 2.775 | 0.765 | 4.17 |
| SVCODEC v1 | [2 3 5 7] | 3.275 | 2.747 | 0.756 | 4.13 |
| SVCODEC v2 | [2 3 5 7 11 13] | 3.249 | 2.729 | 0.627 | 4.12 |
SVCODEC Model Generalization
| Dataset | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| LJSpeech | |||||
| LibriTTS | |||||
| Aishell-1 | |||||
Comparison on generalization of SVCODEC
| Dataset | DNSMOS↑ | PESQ↑ | STOI↑ | ViSQOL↑ |
|---|---|---|---|---|
| Aishell-1 | 3.09 | 2.42 | 0.65 | 4.08 |
| LibriSpeech | 3.30 | 2.78 | 0.77 | 4.17 |
| LibriTTS | 3.18 | 2.76 | 0.77 | 4.11 |
| LJSpeech | 3.43 | 2.61 | 0.74 | 4.32 |