Abstract
A viseme is a visual unit of speech that describes the movement of the lips when uttering words or sentences. Accurate viseme classification and word recognition are essential for proper speech understanding, particularly in applications such as aiding communication for the hearing impaired and enhancing human-computer interaction. Previously, various deep learning-based methodologies have been developed for viseme classification and word detection. However, they often struggle with low accuracy in word/sentence detection, fail to predict grammatically and contextually correct sentences/words, and require high computational time. To overcome these issues, a novel Carnegie Mellon Pronouncing Dictionary-based VisemeWNet strategy is proposed for classifying visemes and recognizing words. The proposed VisemeWNet model uses a Gated Dilated Capsule Attention Network for viseme classification that integrates a Gated Dilated Convolution for capturing high-density features and multi-scale information, a Transition layer for improving computational efficiency, and an Adaptive Capsule Attention Network that applies spatial and channel attention so that the model can prioritize relevant information within the input data. Additionally, the framework uses a Contextformer to detect words/sentences that are grammatically and contextually correct; it incorporates a Generative Pre-trained Transformer with relative attention for calculating the perplexity score and a Bidirectional Encoder Representations from Transformers model for measuring semantic coherence. The proposed VisemeWNet method is evaluated on the MIRACL-VC1 dataset. The results demonstrate that the proposed VisemeWNet framework effectively classifies visemes and efficiently detects grammatically and contextually correct words/sentences. Moreover, the proposed model achieves a Word Accuracy Rate of 98.8%, a Sentence Accuracy Rate of 92.5%, and a Viseme Error Rate of only 3.5%. These results highlight the potential of the VisemeWNet framework to improve real-time communication systems for the hearing impaired and enable more natural human-computer interaction.
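For readers who want a concrete picture of the Contextformer stage described above, the following is a minimal Python sketch of how candidate words/sentences decoded from visemes could be reranked by combining a language-model perplexity score with a BERT-based semantic-coherence score, using off-the-shelf Hugging Face transformers checkpoints. The model names ("gpt2", "bert-base-uncased"), the mean-pooled sentence embedding, and the weighting inside rerank are illustrative assumptions and not the authors' implementation; in particular, the paper's Generative Pre-trained Transformer with relative attention is replaced here by plain GPT-2.

# Minimal sketch of a Contextformer-style reranker: candidate sentences decoded
# from viseme sequences are ranked by combining GPT-2 perplexity (fluency /
# grammaticality) with a BERT-based semantic-coherence score against the
# preceding context. Checkpoints, pooling, and the scoring formula are
# illustrative assumptions, not the paper's implementation.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, BertModel, BertTokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    # Lower perplexity = more fluent/grammatical under the language model.
    ids = gpt2_tok(sentence, return_tensors="pt").input_ids
    loss = gpt2(ids, labels=ids).loss  # mean token cross-entropy
    return float(torch.exp(loss))

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    # Mean-pooled BERT hidden states as a simple sentence representation.
    enc = bert_tok(text, return_tensors="pt", truncation=True)
    return bert(**enc).last_hidden_state.mean(dim=1).squeeze(0)

def coherence(candidate: str, context: str) -> float:
    # Cosine similarity between the candidate and the preceding context.
    return float(torch.cosine_similarity(embed(candidate), embed(context), dim=0))

def rerank(candidates, context, alpha=0.5):
    # Combined score: reward coherence, penalize log-perplexity (hypothetical weighting).
    scored = [(alpha * coherence(c, context) - (1 - alpha) * math.log(perplexity(c)), c)
              for c in candidates]
    return max(scored)[1]

if __name__ == "__main__":
    context = "She asked him a question."
    candidates = ["Please give me the answer.", "Police gift me the answer."]
    print(rerank(candidates, context))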
Keywords
Viseme classification / word recognition / Gated Dilated Capsule Attention Network / Contextformer / MIRACL-VC1
Cite this article
R. Sangeetha, D. Malathi.
VisemeWNet: Enhancing Viseme Classification and Word Recognition with Gated Dilated Capsule Attention Networks.
Journal of Systems Science and Systems Engineering, 1-26. DOI: 10.1007/s11518-025-5689-1
RIGHTS & PERMISSIONS
Systems Engineering Society of China and Springer-Verlag GmbH Germany
Just Accepted
This article has passed peer review and final editorial review and will soon move into typesetting, proofreading, and other production stages. The version displayed here is the accepted manuscript. The officially published version will include final formatting, DOI, and citation information. Please watch for subsequent journal notifications and cite the officially published version once it is available. Thank you for your support and cooperation.