Vietnamese Automatic Speech Recognition Using the NVIDIA QuartzNet Model

Neurond AI
Feb 1, 2023

In this article, we demonstrate the efficacy of transfer learning for automatic speech recognition (ASR) tasks. We start with a pre-trained English ASR model and show that it can be effectively and easily transferred to a different language (in our case, Vietnamese), even when the fine-tuning dataset is small.

Index Terms — Vietnamese, automatic speech recognition, transfer learning, text to speech Vietnamese

Related Works

Transfer learning for ASR was originally used for Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems. It relied on the idea that phoneme representations can be shared across different languages.

Anderson et al. applied this idea to acoustic modeling using the International Phonetic Alphabet (IPA). The cross-language acoustic model adaptation was explored in depth in the GlobalPhone project. It was based on two methods:

  • Partial model adaptation for languages with limited data.
  • Bootstrapping, where the model for a new target language is initialized with a model for another language and then completely re-trained on the target dataset.

Hybrid Deep Neural Network (DNN)-HMM models also made use of transfer learning (TL). The intuition is that the features learned by the lower layers of DNN models tend to be language-independent, so these low-level layers can be shared across languages.

This hypothesis was experimentally confirmed by TL between ASR models for Germanic, Romance, and Slavic languages. Kunze et al. applied TL to DNN-based end-to-end ASR models and adapted an English ASR model for German. In their experiments, they used a Wav2Letter model and froze the lower convolutional layers while retraining the upper layers.

Similarly, Bukhar et al. adapted a multi-language ASR model for two new low-resource languages (Uyghur and Vietnamese) by retraining the network’s last layer. Tong et al. trained a multilingual CTC-based model with an IPA-based phone set and then adapted it for a language with limited data.

They compared three approaches for cross-lingual adaptation:

  • Retraining only an output layer
  • Retraining all parameters
  • Randomly initializing weights of the last layer and then updating the whole network.

They found that updating all the parameters performs better than only retraining the output layer.

Dataset

The dataset we use in this article is the VIVOS dataset, a Vietnamese speech corpus recorded from more than 50 native Vietnamese volunteers.

For training, 46 speakers (22 males and 24 females) recorded 15 hours of speech comprising 11,660 utterances. For testing, a separate set of 19 speakers (12 males and 7 females) recorded 50 minutes of speech with 760 utterances in total.

The dataset provides two kinds of files:

  • Audio files in .wav format
  • A text file containing the transcriptions of all audio files.

Code Walk-Through

We recommend using Google Colab for this training section, but if you have all the dependencies and a GPU, you can run it locally.

First, open a new Python 3 notebook and follow the instructions below.

Install Dependencies

!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install unidecode

BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

Download Dataset

import glob
import os
import subprocess
import tarfile

import wget

data_dir = '.'

# Download the dataset
print("******")
if not os.path.exists(data_dir + '/vivos.tar.gz'):
    vivos_url = 'https://ailab.hcmus.edu.vn/assets/vivos.tar.gz'
    vivos_path = wget.download(vivos_url, data_dir)
    print(f"Dataset downloaded at: {vivos_path}")
else:
    print("Tarfile already exists.")
    vivos_path = data_dir + '/vivos.tar.gz'

if not os.path.exists(data_dir + '/vivos/'):
    # Untar the archive and convert any .sph files to .wav (using sox)
    tar = tarfile.open(vivos_path)
    tar.extractall(path=data_dir)

    print("Converting .sph to .wav...")
    sph_list = glob.glob(data_dir + '/vivos/**/*.sph', recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        cmd = ["sox", sph_path, wav_path]
        subprocess.run(cmd)

print("Finished conversion.\n******")

Character Encoding CTC Model

Now that we have a processed dataset, we can begin training an ASR model on this dataset. The following section will detail how we prepare a CTC model which utilizes a Character Encoding scheme.

This section utilizes a pre-trained QuartzNet 15×5 base model, trained on roughly 7,000 hours of English speech. We will modify the decoder layer (thereby changing the model’s vocabulary).

import nemo.collections.asr as nemo_asr

char_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_quartznet15x5")
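The article does not list the exact Vietnamese character set it used, so as a minimal sketch we can derive the new vocabulary from the training transcripts themselves (using the hypothetical train_manifest.json built earlier) and swap it into the decoder with NeMo’s change_vocabulary method:

import json

# Collect every character (letters with diacritics, digits, space) from the transcripts
chars = set()
with open('train_manifest.json', encoding='utf-8') as f:
    for line in f:
        chars.update(json.loads(line)['text'])

new_vocabulary = sorted(chars)

# Replace the English decoder vocabulary with the Vietnamese one
char_model.change_vocabulary(new_vocabulary=new_vocabulary)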

Train Low-resource Languages

If the amount of training data or available computational resources is limited, it might be useful to freeze the encoder module of the network and train just the final decoder layer. This is also useful in cases where GPU memory is insufficient to train a large network, or where the model might overfit due to its size. We recommend not freezing the encoder for Vietnamese, however, because Vietnamese and English voices differ considerably, so the encoder needs to be fine-tuned as well.
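The freezing snippet below calls an enable_bn_se helper that the article never defines. It comes from NVIDIA’s NeMo fine-tuning tutorial; a version of it looks roughly like this, and its job is to keep BatchNorm and Squeeze-and-Excite blocks trainable inside an otherwise frozen encoder:

import torch.nn as nn
from nemo.utils import logging

def enable_bn_se(module):
    # Keep BatchNorm statistics and parameters trainable
    if isinstance(module, nn.BatchNorm1d):
        module.train()
        for param in module.parameters():
            param.requires_grad_(True)

    # Keep Squeeze-and-Excite blocks trainable as well
    if 'SqueezeExcite' in type(module).__name__:
        module.train()
        for param in module.parameters():
            param.requires_grad_(True)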

freeze_encoder = False #@param ["False", "True"] {type:"raw"}
freeze_encoder = bool(freeze_encoder)

if freeze_encoder:
    char_model.encoder.freeze()
    # Re-enable training of BatchNorm / SqueezeExcite blocks inside the frozen encoder
    char_model.encoder.apply(enable_bn_se)
    logging.info("Model encoder has been frozen, and batch normalization has been unfrozen")
else:
    char_model.encoder.unfreeze()
    logging.info("Model encoder has been un-frozen")

Set up Augmentation

Remember that the model was trained on several thousand hours of data, so the regularization it was trained with might not suit the current dataset. We can easily change it as we see fit.

Note: For low-resource languages, it might be better to increase augmentation via SpecAugment to reduce overfitting. However, this might, in turn, make it too hard for the model to train in a short number of epochs.

## Uncomment the lines below if you want to augment your data
# from omegaconf import open_dict
# with open_dict(char_model.cfg.spec_augment):
#     char_model.cfg.spec_augment.freq_masks = 2
#     char_model.cfg.spec_augment.freq_width = 25
#     char_model.cfg.spec_augment.time_masks = 2
#     char_model.cfg.spec_augment.time_width = 0.05

char_model.spec_augmentation = char_model.from_config_dict(char_model.cfg.spec_augment)

Set up Metrics

Originally, the model was trained on an English corpus. When calculating Word Error Rate (WER), we can use the “space” token as a separator for word boundaries. On the other hand, certain languages such as Japanese and Mandarin do not use “space” tokens, instead opting for different ways to annotate the end of a word.

In cases where the “space” token is not used to denote a word boundary, we can use the Character Error Rate metric instead, which computes the edit distance at a token level rather than a word level.
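As a quick, made-up illustration (these sentences are not from VIVOS): if the reference is “xin chào các bạn” and the model outputs “xin chèo các bạn”, the single wrong diacritic counts as a full word error (WER = 1/4 = 25%) but only one character error (CER = 1/16 = 6.25%). A minimal sketch of both metrics:

def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance between two sequences
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

reference = "xin chào các bạn"
hypothesis = "xin chèo các bạn"

wer = edit_distance(reference.split(), hypothesis.split()) / len(reference.split())
cer = edit_distance(list(reference), list(hypothesis)) / len(reference)

print(f"WER: {wer:.2%}, CER: {cer:.2%}")  # WER: 25.00%, CER: 6.25%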

We might also be interested in noting model predictions during training and inference. As such, we can enable logging of the predictions.

use_cer = True #@param ["False", "True"] {type:"raw"} 

log_prediction = True #@param ["False", "True"] {type:"raw"}

char_model._wer.use_cer = use_cer

char_model._wer.log_prediction = log_prediction

And that’s it! We can now train the model using the PyTorch Lightning Trainer and the NeMo Experiment Manager.
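The article stops short of showing the training call itself. The sketch below fills that gap under the same assumptions as before: the manifest file names and new_vocabulary come from the hypothetical snippets above, and the hyperparameters are illustrative rather than the ones used by the authors.

import pytorch_lightning as ptl
from omegaconf import OmegaConf, open_dict
from nemo.utils import exp_manager

# Point the model at the (hypothetical) manifests and Vietnamese labels built earlier
with open_dict(char_model.cfg):
    char_model.cfg.train_ds.manifest_filepath = 'train_manifest.json'
    char_model.cfg.train_ds.labels = new_vocabulary
    char_model.cfg.validation_ds.manifest_filepath = 'test_manifest.json'
    char_model.cfg.validation_ds.labels = new_vocabulary

char_model.setup_training_data(char_model.cfg.train_ds)
char_model.setup_validation_data(char_model.cfg.validation_ds)

# Illustrative trainer settings -- tune for your own hardware and dataset
trainer = ptl.Trainer(devices=1, accelerator='gpu', max_epochs=100,
                      enable_checkpointing=False, logger=False,
                      log_every_n_steps=10, check_val_every_n_epoch=5)
char_model.set_trainer(trainer)

# Experiment manager handles logging and checkpointing ('val_wer' is the
# validation error-rate key, reported as CER here since use_cer=True)
config = exp_manager.ExpManagerConfig(
    exp_dir='experiments/', name='QuartzNet15x5_Vietnamese',
    checkpoint_callback_params=exp_manager.CallbackParams(
        monitor='val_wer', mode='min', always_save_nemo=True, save_best_model=True),
)
exp_manager.exp_manager(trainer, OmegaConf.structured(config))

trainer.fit(char_model)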

Result and Conclusion

NeurondAI Streamlit demo for ASR

We have tried several voice recordings in .wav format and obtained good results. However, the model still has some weaknesses:

  • The model does not predict well on voices that are not represented in the dataset.
  • It is not yet ready for real-time ASR, because the fine-tuning dataset is very small (about 15 hours of audio) compared to the pre-training data (roughly 7,000 hours of English audio).
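For reference, the kind of quick qualitative test described above can be reproduced with NeMo’s transcription helper; the file names below are placeholders for your own recordings.

# Hypothetical .wav files recorded for a quick qualitative check
files = ['recording_01.wav', 'recording_02.wav']
transcriptions = char_model.transcribe(paths2audio_files=files, batch_size=2)
for path, text in zip(files, transcriptions):
    print(f"{path}: {text}")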

A transfer learning approach based on reusing a pre-trained QuartzNet network encoder turns out to be very effective for various ASR tasks. In all our experiments, we observed that fine-tuning a good baseline yields good results even with a small dataset and a small model.

We introduced this method to implement Vietnamese Automatic Speech Recognition (ASR) using the QuartzNet 15×5 model. This model is based on a deep neural network with 1D time-channel separable convolutional layers. Its small size (about 18.9M parameters) opens new possibilities for speech recognition on mobile and embedded devices.
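As a quick sanity check on that figure (this snippet is ours, not from the original article), the parameter count can be read directly off the restored model:

# Count the parameters of the restored QuartzNet 15x5 model
num_params = sum(p.numel() for p in char_model.parameters())
print(f"Parameters: {num_params / 1e6:.1f}M")  # expected to be roughly 19M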

Resources:

Read the original and latest article at: https://www.neurond.com/blog/vietnamese-automatic-speech-recognition-vietasr

NeurondAI is a transformation business.

Website: https://www.neurond.com/
