Vietnamese Automatic Speech Recognition Using NVIDIA — QuartzNet Model
In this article, we demonstrate the efficacy of transfer learning for automatic speech recognition (ASR) tasks. We start with a pre-trained English ASR model and show that transfer learning can be performed effectively and easily on a different language (here, Vietnamese), even when the dataset available for fine-tuning is small.
Index Terms — Vietnamese, automatic speech recognition, transfer learning, text to speech Vietnamese
Related Works
Transfer learning for ASR was originally used for Gaussian Mixture Model — Hidden Markov Model (GMM-HMM) systems. It relied on the idea that phoneme representation can be shared across different languages.
Anderson et al. applied this idea to acoustic modeling using the International Phonetic Alphabet (IPA). The cross-language acoustic model adaptation was explored in depth in the GlobalPhone project. It was based on two methods:
- Partial model adaptation for languages with limited data.
- Boot-strapping, where the model for a new target language is initialized with a model for another language and then completely re-trained on the target dataset.
Hybrid Deep Neural Network (DNN)-HMM models also made use of transfer learning (TL). The features learned by DNN models tend to be language-independent in the lower layers, so these low-level layers can be shared across languages.
This hypothesis was experimentally confirmed by TL between ASR models for Germanic, Romance, and Slavic languages. Kunze et al. applied TL to DNN-based end-to-end ASR models and adapted an English ASR model for German. In their experiments, they used a Wav2Letter model and froze the lower convolutional layers while retraining the upper layers.
Similarly, Bukhar et al. adapted a multi-language ASR model to two new low-resource languages (Uyghur and Vietnamese) by retraining the network’s last layer. Tong et al. trained a multilingual CTC-based model with an IPA-based phone set and then adapted it to a language with limited data.
They compared three approaches for cross-lingual adaptation:
- Retraining only an output layer
- Retraining all parameters
- Randomly initializing weights of the last layer and then updating the whole network.
They found that updating all the parameters performs better than only retraining the output layer.
Dataset
The dataset we use in this article is the VIVOS dataset, a speech corpus recorded from more than 50 native Vietnamese volunteers.
For training, 46 speakers (22 male and 24 female) recorded 15 hours of speech comprising 11,660 utterances. For testing, a separate set of 19 speakers (12 male and 7 female) recorded 50 minutes of speech with 760 utterances in total.
The dataset provides two kinds of files, which you can verify with the short check after this list:
- Audio files in .wav format.
- A text file containing the transcriptions of all audio files.
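Once the dataset is downloaded (next section), you can inspect this structure with a short snippet like the one below. It assumes the usual VIVOS layout, with vivos/train/prompts.txt holding one "<UTTERANCE_ID> <transcript>" pair per line and the audio under vivos/train/waves/<SPEAKER>/<UTTERANCE_ID>.wav; adjust the paths if your copy is organized differently.

import os

data_dir = '.'
prompts_path = os.path.join(data_dir, 'vivos', 'train', 'prompts.txt')  # assumed location

# Print the first few utterances and the audio files they point to.
with open(prompts_path, encoding='utf-8') as f:
    for line in list(f)[:3]:
        utt_id, transcript = line.strip().split(' ', 1)
        speaker = utt_id.split('_')[0]  # the speaker ID is the prefix of the utterance ID
        wav_path = os.path.join(data_dir, 'vivos', 'train', 'waves', speaker, utt_id + '.wav')
        print(utt_id, '->', wav_path, '|', transcript)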
Walking Through the Code
We recommend using Google Colab for this training section, but if you have all the dependencies and a GPU, you can also run it locally.
First, open a new Python 3 notebook and follow the instructions below.
Install Dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install unidecode
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
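As a quick, optional sanity check, confirm that NeMo imports correctly before moving on:

import nemo
import nemo.collections.asr as nemo_asr  # the ASR collection used throughout this article

print("NeMo version:", nemo.__version__)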
Download Dataset
import glob
import os
import subprocess
import tarfile
import wget
data_dir = '.'
# Download the dataset
print("******")
if not os.path.exists(data_dir + '/vivos.tar.gz'):
    vivos_url = 'https://ailab.hcmus.edu.vn/assets/vivos.tar.gz'
    vivos_path = wget.download(vivos_url, data_dir)
    print(f"Dataset downloaded at: {vivos_path}")
else:
    print("Tarfile already exists.")
    vivos_path = data_dir + '/vivos.tar.gz'
if not os.path.exists(data_dir + '/vivos/'):
    # Untar and convert any .sph files to .wav (using sox).
    # VIVOS already ships .wav audio, so the conversion loop below is only a safeguard.
    tar = tarfile.open(vivos_path)
    tar.extractall(path=data_dir)
    print("Converting .sph to .wav...")
    sph_list = glob.glob(data_dir + '/vivos/**/*.sph', recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        cmd = ["sox", sph_path, wav_path]
        subprocess.run(cmd)
    print("Finished conversion.\n******")
Character Encoding CTC Model
Now that we have a processed dataset, we can begin training an ASR model on this dataset. The following section will detail how we prepare a CTC model which utilizes a Character Encoding scheme.
This section uses a pre-trained QuartzNet 15×5 base model, trained on roughly 7,000 hours of English speech. We will modify the decoder layer (thereby changing the model’s vocabulary).
import nemo.collections.asr as nemo_asr

char_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_quartznet15x5")
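Swapping in a Vietnamese vocabulary is done with the model's change_vocabulary method. A minimal sketch, assuming the train_manifest.json built earlier: collect every character that appears in the training transcripts and pass that list to the model, which re-initializes the decoder accordingly.

import json

# Gather the character set actually used in the Vietnamese training transcripts.
train_chars = set()
with open('train_manifest.json', encoding='utf-8') as f:
    for line in f:
        train_chars.update(json.loads(line)['text'])

new_vocabulary = sorted(train_chars)
print("Vocabulary size:", len(new_vocabulary))

# Replace the English decoder vocabulary with the Vietnamese character set.
char_model.change_vocabulary(new_vocabulary=new_vocabulary)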
Train Low-resource Languages
If the amount of training data or available computational resources is limited, it can be useful to freeze the encoder module of the network and train just the final decoder layer. This also helps when GPU memory is insufficient to train a large network, or when the model might overfit because of its size. For Vietnamese, however, we recommend not freezing the encoder: Vietnamese and English sound very different, so the encoder needs to be fine-tuned as well.
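The cell below refers to an enable_bn_se helper, which keeps batch normalization (and SqueezeExcite) blocks trainable even when the rest of the encoder is frozen. A sketch of such a helper, along the lines of the one used in NVIDIA's NeMo fine-tuning tutorial:

import torch.nn as nn

def enable_bn_se(module):
    """Re-enable training for BatchNorm and SqueezeExcite submodules inside a frozen encoder."""
    if isinstance(module, nn.BatchNorm1d):
        module.train()
        for param in module.parameters():
            param.requires_grad_(True)
    if 'SqueezeExcite' in type(module).__name__:
        module.train()
        for param in module.parameters():
            param.requires_grad_(True)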
from nemo.utils import logging

freeze_encoder = False #@param ["False", "True"] {type:"raw"}
freeze_encoder = bool(freeze_encoder)

if freeze_encoder:
    char_model.encoder.freeze()
    char_model.encoder.apply(enable_bn_se)
    logging.info("Model encoder has been frozen, and batch normalization has been unfrozen")
else:
    char_model.encoder.unfreeze()
    logging.info("Model encoder has been un-frozen")
Set up Augmentation
Remember that the model was trained on several thousands of hours of data, so the regularization provided to it might not suit the current dataset. We can easily change it as we see fit.
Note: For low-resource languages, it might be better to increase augmentation via SpecAugment to reduce overfitting. However, this might, in turn, make it too hard for the model to train in a short number of epochs.
## Uncomment the lines below if you want to augment your data
## (requires: from omegaconf import open_dict)
# with open_dict(char_model.cfg.spec_augment):
#     char_model.cfg.spec_augment.freq_masks = 2
#     char_model.cfg.spec_augment.freq_width = 25
#     char_model.cfg.spec_augment.time_masks = 2
#     char_model.cfg.spec_augment.time_width = 0.05

char_model.spec_augmentation = char_model.from_config_dict(char_model.cfg.spec_augment)
Set up Metrics
Originally, the model was trained on an English corpus. When calculating the Word Error Rate (WER), we can simply use the “space” token as a separator for word boundaries. On the other hand, certain languages such as Japanese and Mandarin do not use “space” tokens, instead opting for different ways to annotate the end of a word.
In cases where the “space” token is not used to denote a word boundary, we can use the Character Error Rate metric instead, which computes the edit distance at a token level rather than a word level.
We might also be interested in noting model predictions during training and inference. As such, we can enable logging of the predictions.
use_cer = True #@param ["False", "True"] {type:"raw"}
log_prediction = True #@param ["False", "True"] {type:"raw"}
char_model._wer.use_cer = use_cer
char_model._wer.log_prediction = log_prediction
And that’s it! We can train the model using the PyTorch Lightning Trainer and the NeMo Experiment Manager, as always.
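A minimal training sketch is given below. It assumes the train_manifest.json and test_manifest.json files built earlier and a single GPU; the Trainer argument names depend on your PyTorch Lightning version, and the batch size, number of epochs, and experiment name are illustrative choices rather than tuned values.

import copy

import pytorch_lightning as ptl
from omegaconf import OmegaConf, open_dict
from nemo.utils import exp_manager

# Point the model's data loaders at the Vietnamese manifests.
cfg = copy.deepcopy(char_model.cfg)
new_labels = list(char_model.decoder.vocabulary)  # the current (Vietnamese) character set
with open_dict(cfg):
    cfg.train_ds.manifest_filepath = 'train_manifest.json'
    cfg.train_ds.labels = new_labels
    cfg.train_ds.batch_size = 16
    cfg.validation_ds.manifest_filepath = 'test_manifest.json'
    cfg.validation_ds.labels = new_labels
    cfg.validation_ds.batch_size = 16
char_model.setup_training_data(cfg.train_ds)
char_model.setup_validation_data(cfg.validation_ds)

# Trainer plus experiment manager (checkpointing and TensorBoard logging).
trainer = ptl.Trainer(devices=1, accelerator='gpu', max_epochs=100,
                      enable_checkpointing=False, logger=False,
                      log_every_n_steps=10, check_val_every_n_epoch=5)
char_model.set_trainer(trainer)

exp_cfg = exp_manager.ExpManagerConfig(
    exp_dir='experiments/', name='QuartzNet15x5_Vietnamese',
    create_tensorboard_logger=True, create_checkpoint_callback=True)
exp_manager.exp_manager(trainer, OmegaConf.structured(exp_cfg))

trainer.fit(char_model)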
Results and Conclusion
We tested the model on several voice recordings in .wav format and obtained good results (see the inference snippet after this list). However, the model still has some weaknesses:
- It does not perform well on voices that differ from those in the training data.
- It is not yet ready for real-time ASR, because the fine-tuning dataset is very small (about 15 hours of audio) compared with the pre-training data (roughly 7,000 hours of English audio).
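For reference, transcribing a recording with the fine-tuned model uses NeMo's transcribe method. In the sketch below, my_recording.wav is a placeholder for any 16 kHz mono .wav file of your own:

# Transcribe a single recording (the path is a placeholder).
transcriptions = char_model.transcribe(['my_recording.wav'])
print(transcriptions[0])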
A transfer learning approach based on reusing a pre-trained QuartzNet encoder turns out to be very effective for various ASR tasks. In all our experiments, we observed that fine-tuning a good baseline yields good results even with a small dataset and a small model.
We used this method to implement Vietnamese Automatic Speech Recognition (ASR) with the QuartzNet 15×5 model, a deep neural network built from 1D time-channel separable convolutional layers. Its small size (about 18.9M parameters) opens new possibilities for speech recognition on mobile and embedded devices.
Resources:
- QuartzNet in the NVIDIA NeMo documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#quartznet
- Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition: https://arxiv.org/pdf/2005.04290.pdf
- QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions: https://arxiv.org/pdf/1910.10261.pdf
Read the original and latest article at: https://www.neurond.com/blog/vietnamese-automatic-speech-recognition-vietasr
NeurondAI is a transformation business.
Website: https://www.neurond.com/