End-to-End Deep Neural Models for Automatic Speech Recognition of the Polish Language
Abstract
This article concerns research on deep neural network (DNN) models used for automatic speech recognition (ASR). In such systems, recognition is based on Mel-Frequency Cepstral Coefficient (MFCC) acoustic features and spectrograms. The latest ASR technologies are built on convolutional neural networks (CNNs), recurrent neural networks (RNNs) and Transformers. The article presents an analysis of modern artificial intelligence algorithms adapted for automatic recognition of the Polish language and discusses the differences between conventional architectures and End-to-End (E2E) DNN ASR models. Preliminary tests of five selected models (QuartzNet, FastConformer, Wav2Vec 2.0 XLSR, Whisper and ESPnet Model Zoo) on the Mozilla Common Voice, Multilingual LibriSpeech and VoxPopuli databases are presented. The tests covered clean audio signals, band-limited signals and degraded signals, and the models were evaluated using the Word Error Rate (WER) metric.
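To make the evaluation pipeline concrete, the sketch below shows one plausible way to reproduce a single test case in Python. It extracts the MFCC and mel-spectrogram features mentioned above, simulates a band-limited condition, transcribes the audio with the fine-tuned Polish Wav2Vec 2.0 XLSR model cited in the references, and scores the hypothesis with WER using the jiwer library. The file name sample_pl.wav, the reference transcript, and the 8 kHz resampling step are illustrative assumptions, not the article's exact settings; the article produces its degraded signals with the audio_degrader tool listed in the references.

```python
import librosa
import jiwer
from transformers import pipeline

AUDIO_PATH = "sample_pl.wav"            # hypothetical test utterance
REFERENCE = "przykladowa transkrypcja"  # hypothetical ground-truth transcript

# 1. Load the recording at the 16 kHz rate expected by the tested E2E models.
signal, sr = librosa.load(AUDIO_PATH, sr=16000)

# 2. Acoustic features mentioned in the abstract: MFCCs and a mel spectrogram.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
mel_spec = librosa.feature.melspectrogram(y=signal, sr=sr)

# 3. Simulate a band-limited signal by resampling down to 8 kHz (telephone
#    bandwidth) and back; this stands in for the article's degradation setup.
narrowband = librosa.resample(signal, orig_sr=16000, target_sr=8000)
band_limited = librosa.resample(narrowband, orig_sr=8000, target_sr=16000)

# 4. Transcribe with the fine-tuned Polish wav2vec 2.0 XLSR model from the
#    reference list (fetched from the Hugging Face Hub on first use).
asr = pipeline("automatic-speech-recognition",
               model="jonatasgrosman/wav2vec2-large-xlsr-53-polish")
hypothesis = asr({"raw": band_limited, "sampling_rate": 16000})["text"]

# 5. Score the hypothesis against the reference transcript.
print(f"WER: {jiwer.wer(REFERENCE, hypothesis):.3f}")
```

Passing lists of references and hypotheses to jiwer.wer yields a corpus-level WER over a whole test split, which is how results of this kind are usually reported.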
References
R. Ardila et al., “Common Voice: a Massively-Multilingual Speech Corpus,” Proc. of the 12th Conf. on Language Resources and Evaluation (LREC 2020), pp. 4218–4222, 11–16 May 2020. [Online]. Available: https://www.aclweb.org/anthology/2020.lrec-1.520.pdf
V. Pratap et al., “MLS: A Large-Scale Multilingual Dataset for Speech Research,” in INTERSPEECH 2020, 25–29 Oct. 2020.
C. Wang et al., “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation,” arXiv (Cornell University), Jan. 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2101.00390
“NVIDIA NeMo conversational AI toolkit GitHub repository.” [Online]. Available: https://github.com/NVIDIA/NeMo
“STT Pl Quartznet15x5 model.” [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_pl_quartznet15x5
“NVIDIA FastConformer-Hybrid Large (pl) model.” [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_pl_fastconformer_hybrid_large_pc
“Whisper GitHub repository.” [Online]. Available: https://github.com/openai/whisper
“ESPnet Model Zoo multi-purpose pre-trained models.” [Online]. Available: https://github.com/espnet/espnet_model_zoo
“Fine-tuned wav2vec2-xlsr-53 large model for speech recognition in Polish.” [Online]. Available: https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-polish
L. R. Rabiner and B.-H. Juang, “An introduction to hidden Markov models,” IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, Jan. 1986. [Online]. Available: https://doi.org/10.1109/massp.1986.1165342
W. Cavnar and J. Trenkle, “N-gram-based text categorization,” Proc. of the 3rd Annu. Symp. on Document Analysis and Information Retrieval, pp. 161–175, 11–13 Apr. 1994. [Online]. Available: https://www.let.rug.nl/vannoord/TextCat/textcat.pdf
M. J. F. Gales and S. Young, “The application of hidden Markov models in speech recognition,” Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, Jan. 2007. [Online]. Available: https://doi.org/10.1561/2000000004
J. Holmes, W. Holmes, and P. Garner, “Using formant frequencies in speech recognition,” in Proc. 5th European Conf. on Speech Communication and Technology (Eurospeech 1997). Rhodes, Greece: ISCA Speech, 22-25 Sep. 1997, pp. 2083–2086. [Online]. Available: https://doi.org/10.21437/Eurospeech.1997-551
F. Hönig, G. Stemmer, C. Hacker, and F. Brugnara, “Revising Perceptual Linear Prediction (PLP),” in INTERSPEECH 2005 - Eurospeech, 9th European Conf. on Speech Communication and Technology. Lisbon, Portugal: ISCA Speech, 4–8 Sep. 2005, pp. 2997–3000. [Online]. Available: https://doi.org/10.21437/Interspeech.2005-138
C. Kim and R. M. Stern, “Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1315–1329, Jul. 2016. [Online]. Available: https://doi.org/10.1109/TASLP.2016.2545928
F. Zheng, G. Zhang, and Z. Song, “Comparison of different implementations of MFCC,” Journal of Computer Science and Technology, vol. 16, no. 6, pp. 582–589, Nov. 2001. [Online]. Available: https://doi.org/10.1007/bf02943243
A. Pohl and B. Ziółko, “Using part of speech N-grams for improving automatic speech recognition of Polish,” Jan. 2013. [Online]. Available: https://doi.org/10.1007/978-3-642-39712-7_38
K. Marasek et al., “Spoken language translation for Polish,” arXiv (Cornell University), Nov. 2015. [Online]. Available: https://doi.org/10.48550/arXiv.1511.07788
A. Graves and N. Jaitly, “Towards End-To-End Speech Recognition with Recurrent Neural Networks,” Proc. of 31st Int. Conf. on Machine Learning, pp. 1764–1772, 21–26 Jun. 2014. [Online]. Available: http://proceedings.mlr.press/v32/graves14.pdf
O. Vinyals, S. V. Ravuri, and D. Povey, “Revisiting recurrent neural networks for robust ASR,” in 2012 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan: IEEE, 25–30 Mar. 2012, pp. 4085–4088. [Online]. Available: https://doi.org/10.1109/ICASSP.2012.6288816
D. Jurafsky and J. H. Martin, “Speech and Language Processing (3rd ed. draft).” [Online]. Available: https://web.stanford.edu/~jurafsky/slp3/
Y. Zhang et al., “Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks,” in INTERSPEECH 2016. San Francisco, CA, USA: ISCA Speech, 8–12 Sep. 2016, pp. 410–414. [Online]. Available: https://doi.org/10.21437/Interspeech.2016-1446
A. V. Oppenheim, “Speech spectrograms using the fast Fourier transform,” IEEE Spectrum, vol. 7, no. 8, pp. 57–62, Aug. 1970. [Online]. Available: https://doi.org/10.1109/mspec.1970.5213512
W. Song and J. Cai, “End-to-end deep neural network for automatic speech recognition,” Stanford CS224D Reports, 2015. [Online]. Available: http://cs224d.stanford.edu/reports/SongWilliam.pdf
L. Dong, S. Xu, and B. Xu, “Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 15–20 Apr. 2018, pp. 5884–5888. [Online]. Available: https://doi.org/10.1109/ICASSP.2018.8462506
A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems 30 (NIPS 2017), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
“Common Voice open source, multi-language dataset of voices.” [Online]. Available: https://commonvoice.mozilla.org/en/datasets
“Dataset Card for Common Voice,” Hugging Face. [Online]. Available: https://huggingface.co/datasets/common_voice
V. Panayotov et al., “LibriSpeech: an ASR corpus based on public domain audio books,” in 2015 IEEE 40th Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 19–24 Apr. 2015, pp. 5206–5210. [Online]. Available: https://doi.org/10.1109/ICASSP.2015.7178964
“LibriVox free public domain audiobooks.” [Online]. Available: https://librivox.org/
“Project Gutenberg free eBooks.” [Online]. Available: https://www.gutenberg.org/
“Multilingual LibriSpeech (MLS) Website.” [Online]. Available: https://www.openslr.org/94/
“VoxPopuli GitHub repository.” [Online]. Available: https://github.com/facebookresearch/voxpopuli
“GitHub: the AI-powered developer platform to build, scale, and deliver secure software.” [Online]. Available: https://github.com/
“Hugging Face: the AI community building the future.” [Online]. Available: https://huggingface.co/
S. Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in INTERSPEECH 2018. Hyderabad, India: ISCA Speech, 2–6 Sep. 2018, pp. 2207–2211. [Online]. Available: https://doi.org/10.21437/Interspeech.2018-1456
S. Kriman et al., “QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions,” in 2020 IEEE International Conf. on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, 4–8 May 2020, pp. 6124–6128. [Online]. Available: https://doi.org/10.1109/ICASSP40776.2020.9053889
J. Li et al., “Jasper: An end-to-end convolutional neural acoustic model,” arXiv (Cornell University), Apr. 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1904.03288
A. Waibel et al., “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, Mar. 1989. [Online]. Available: https://doi.org/10.1109/29.21701
A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in INTERSPEECH 2020. Shanghai, China: ISCA Speech, 25–29 Oct. 2020, pp. 5036–5040. [Online]. Available: https://doi.org/10.21437/Interspeech.2020-3015
“NVIDIA Models - NeMo Core.” [Online]. Available: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc
A. Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf
“Meta AI Wav2vec 2.0: Learning the structure of speech from raw audio.” [Online]. Available: https://ai.meta.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/
A. Conneau et al., “Unsupervised cross-lingual representation learning for speech recognition,” in INTERSPEECH 2021. Brno, Czechia: ISCA Speech, 30 Aug.–3 Sep. 2021, pp. 2426–2430. [Online]. Available: https://doi.org/10.21437/Interspeech.2021-329
P. Roach et al., “BABEL: an Eastern European multi-language database,” in Proc. of 4th International Conf. on Spoken Language Processing. ICSLP ’96. IEEE, 3–6 Oct. 1996. [Online]. Available: https://doi.org/10.1109/ICSLP.1996.608002
A. Conneau et al., “FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech,” 2022 IEEE Spoken Language Technology Workshop (SLT), Jan. 2023. [Online]. Available: https://doi.org/10.1109/slt54892.2023.10023141
“ESPnet2: ESPnet major update.” [Online]. Available: https://espnet.github.io/espnet/espnet2_tutorial.html
“List of ESPnet2 corpora.” [Online]. Available: https://github.com/espnet/espnet/blob/master/egs2/README.md
A. Ali and S. Renals, “Word Error Rate Estimation for Speech Recognition: e-WER,” in Proc. of the 56th Annu. Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 20–24. [Online]. Available: https://aclanthology.org/P18-2004/
“Google Colaboratory hosted Jupyter Notebook service.” [Online]. Available: https://colab.google/
“Audio degradation toolbox in Python, with a command-line tool. GitHub repository.” [Online]. Available: https://github.com/emilio-molina/audio_degrader
“Jiwer ASR evaluation library.” [Online]. Available: https://jitsi.github.io/jiwer/
License
Copyright (c) 2024 International Journal of Electronics and Telecommunications
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
1. License
The non-commercial use of the article will be governed by the Creative Commons Attribution license as currently displayed on https://creativecommons.org/licenses/by/4.0/.
2. Author’s Warranties
The author warrants that the article is original, written by stated author/s, has not been published before, contains no unlawful statements, does not infringe the rights of others, is subject to copyright that is vested exclusively in the author and free of any third party rights, and that any necessary written permissions to quote from other sources have been obtained by the author/s. The undersigned also warrants that the manuscript (or its essential substance) has not been published other than as an abstract or doctorate thesis and has not been submitted for consideration elsewhere, for print, electronic or digital publication.
3. User Rights
Under the Creative Commons Attribution license, the author(s) and users are free to share (copy, distribute and transmit the contribution) under the following conditions: 1. they must attribute the contribution in the manner specified by the author or licensor, 2. they may alter, transform, or build upon this work, 3. they may use this contribution for commercial purposes.
4. Rights of Authors
Authors retain the following rights:
- copyright, and other proprietary rights relating to the article, such as patent rights,
- the right to use the substance of the article in own future works, including lectures and books,
- the right to reproduce the article for own purposes, provided the copies are not offered for sale,
- the right to self-archive the article,
- the right to supervision over the integrity of the content of the work and its fair use.
5. Co-Authorship
If the article was prepared jointly with other authors, the signatory of this form warrants that he/she has been authorized by all co-authors to sign this agreement on their behalf, and agrees to inform his/her co-authors of the terms of this agreement.
6. Termination
This agreement can be terminated by the author or the Journal Owner upon two months’ notice where the other party has materially breached this agreement and failed to remedy such breach within a month of being given the terminating party’s notice requesting such breach to be remedied. No breach or violation of this agreement will cause this agreement or any license granted in it to terminate automatically or affect the definition of the Journal Owner. The author and the Journal Owner may agree to terminate this agreement at any time. This agreement or any license granted in it cannot be terminated otherwise than in accordance with this section 6. This License shall remain in effect throughout the term of copyright in the Work and may not be revoked without the express written consent of both parties.
7. Royalties
This agreement entitles the author to no royalties or other fees. To such extent as legally permissible, the author waives his or her right to collect royalties relative to the article in respect of any use of the article by the Journal Owner or its sublicensee.
8. Miscellaneous
The Journal Owner will publish the article (or have it published) in the Journal if the article’s editorial process is successfully completed and the Journal Owner or its sublicensee has become obligated to have the article published. Where such obligation depends on the payment of a fee, it shall not be deemed to exist until such time as that fee is paid. The Journal Owner may conform the article to a style of punctuation, spelling, capitalization and usage that it deems appropriate. The Journal Owner will be allowed to sublicense the rights that are licensed to it under this agreement. This agreement will be governed by the laws of Poland.
By signing this License, Author(s) warrant(s) that they have the full power to enter into this agreement.