VisemeWNet: Enhancing Viseme Classification and Word Recognition with Gated Dilated Capsule Attention Networks

R. Sangeetha, D. Malathi

Journal of Systems Science and Systems Engineering: 1-26. DOI: 10.1007/s11518-025-5689-1



Abstract

A viseme is a visual unit of speech that describes the movement of the lips when uttering words or sentences. Accurate viseme classification and word recognition are essential for proper speech understanding, particularly in applications such as aiding communication for the hearing impaired and enhancing human-computer interaction. Previously, various deep learning-based methodologies have been developed for viseme classification and word detection. However, they often struggle with low accuracy in word/sentence detection, fail to predict grammatically and contextually correct words/sentences, and require high computational time. To overcome these issues, a novel Carnegie Mellon Pronouncing dictionary-based VisemeWNet strategy is proposed for classifying visemes and recognizing words. The proposed VisemeWNet model utilizes a Gated Dilated Capsule Attention Network for viseme classification, which integrates a Gated Dilated Convolution for capturing high-density features and multi-scale information, a Transition layer for enhancing computational efficiency, and an Adaptive Capsule Attention Network that concentrates on pertinent spatial and channel features, improving the model's ability to prioritize relevant information within the input data. Additionally, the VisemeWNet framework utilizes a Contextformer to detect words/sentences that are grammatically and contextually correct; it incorporates a Generative Pre-trained Transformer with relative attention for calculating the perplexity score and a Bidirectional Encoder Representations from Transformers model for measuring semantic coherence. The proposed VisemeWNet method is evaluated on the MIRACL-VC1 dataset. The results demonstrate that the VisemeWNet framework effectively classifies visemes and efficiently detects grammatically and contextually correct words/sentences. Moreover, the proposed model achieves a Word Accuracy Rate of 98.8%, a Sentence Accuracy Rate of 92.5%, and a Viseme Error Rate of only 3.5%. These results highlight the potential of the VisemeWNet framework to improve real-time communication systems for the hearing impaired and to enable more natural interactions in human-computer systems.
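The abstract does not detail how the Carnegie Mellon Pronouncing dictionary underpins viseme classification, but a common pipeline looks up each word's phoneme sequence in the dictionary and collapses the phonemes into viseme classes. The sketch below illustrates that idea only; the PHONEME_TO_VISEME table and the word_to_visemes helper are hypothetical stand-ins, not the paper's actual mapping.

```python
# Minimal sketch: word -> phonemes (CMU dict) -> visemes.
# Assumes nltk is installed and the cmudict corpus has been fetched
# via nltk.download("cmudict"). The viseme table below is illustrative.
from nltk.corpus import cmudict

PHONEME_TO_VISEME = {  # hypothetical grouping by lip shape
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "AA": "open-vowel", "AE": "open-vowel",
    "UW": "rounded-vowel", "OW": "rounded-vowel",
}

CMU = cmudict.dict()  # word -> list of possible phoneme sequences

def word_to_visemes(word: str) -> list[str]:
    prons = CMU.get(word.lower())
    if not prons:
        return []
    phones = [p.rstrip("012") for p in prons[0]]  # drop stress digits
    return [PHONEME_TO_VISEME.get(p, "other") for p in phones]

print(word_to_visemes("begin"))  # ['bilabial', 'other', 'other', 'other', 'other']
```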
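As a rough picture of the Gated Dilated Convolution component (parallel dilated branches that capture multi-scale lip features, fused through a learned gate), here is a hedged PyTorch sketch. The class name, dilation rates, and residual gating are assumptions; the abstract does not give the paper's exact layer configuration.

```python
import torch
import torch.nn as nn

class GatedDilatedConv(nn.Module):
    """Hypothetical gated dilated block: multi-scale branches + sigmoid gate."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding=d keeps the spatial size
        # fixed, and larger dilations widen the receptive field over the lips.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        # 1x1 gate decides, per position and channel, how much of the
        # aggregated multi-scale response to let through.
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = sum(branch(x) for branch in self.branches)
        g = torch.sigmoid(self.gate(x))
        return x + g * multi_scale  # gated residual fusion

block = GatedDilatedConv(channels=32)
print(block(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```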
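The two Contextformer scores can be pictured with off-the-shelf models: a causal language model for perplexity (lower means more grammatical) and BERT embeddings for semantic coherence (higher cosine similarity means a better contextual fit). In this sketch GPT-2 stands in for the paper's GPT variant with relative attention, and mean-pooled cosine similarity stands in for its coherence measure; neither is the paper's exact formulation.

```python
import torch
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          BertModel, BertTokenizerFast)

lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    # exp of the mean token-level cross-entropy under the language model
    ids = lm_tok(sentence, return_tensors="pt").input_ids
    return torch.exp(lm(ids, labels=ids).loss).item()

@torch.no_grad()
def coherence(a: str, b: str) -> float:
    # cosine similarity of mean-pooled BERT token embeddings
    vecs = []
    for text in (a, b):
        enc = bert_tok(text, return_tensors="pt")
        vecs.append(bert(**enc).last_hidden_state.mean(dim=1).squeeze(0))
    return torch.cosine_similarity(vecs[0], vecs[1], dim=0).item()

for cand in ("read my lips", "reed my lips"):
    print(cand, perplexity(cand), coherence(cand, "watch my mouth"))
```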
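Finally, the reported Viseme Error Rate behaves like a word error rate computed over viseme sequences: the edit distance between the predicted and reference sequences, normalized by the reference length. The abstract does not define VER explicitly, so the sketch below assumes this standard formulation.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # Single-row dynamic-programming Levenshtein distance.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def viseme_error_rate(ref: list[str], hyp: list[str]) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

ref = ["bilabial", "open-vowel", "labiodental"]
hyp = ["bilabial", "rounded-vowel", "labiodental"]
print(f"VER = {viseme_error_rate(ref, hyp):.1%}")  # VER = 33.3%
```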

Keywords

Viseme classification / word recognition / lip reading / Gated Dilated Capsule Attention Network / Contextformer / MIRACL-VC1

Cite this article

R. Sangeetha, D. Malathi. VisemeWNet: Enhancing Viseme Classification and Word Recognition with Gated Dilated Capsule Attention Networks. Journal of Systems Science and Systems Engineering: 1-26. DOI: 10.1007/s11518-025-5689-1



RIGHTS & PERMISSIONS

Systems Engineering Society of China and Springer-Verlag GmbH Germany
