Depressive semantic awareness from vlog facial and vocal streams via spatio-temporal transformer

Yongfeng Tao, Minqiang Yang, Yushan Wu, Kevin Lee, Adrienne Kline, Bin Hu

2024, Vol. 10, Issue (3): 577-585. DOI: 10.1016/j.dcan.2023.03.007

Research article


Abstract

With the rapid growth of information transmitted over the Internet, efforts have been made to reduce network load and improve efficiency. One such approach is semantic computing, which extracts and processes the semantic content of communication. Social media enables users to share their current emotions, opinions, and life events through their mobile devices, and people suffering from mental health problems are notably more willing to share their feelings on social networks. It is therefore valuable to extract semantic information from social media vlog data to identify abnormal emotional states and facilitate early identification and intervention. Most studies do not consider spatio-temporal information when fusing multimodal information to identify abnormal emotional states such as depression. To address this, this paper proposes a spatio-temporal squeeze transformer for extracting semantic features of depression. First, a spatio-temporal squeeze module is embedded into the transformer encoder to obtain a representation of spatio-temporal features. Second, a classifier with a voting mechanism is designed so that the model can effectively distinguish depression from non-depression. Experiments on the D-Vlog dataset show that the method is effective, reaching an accuracy of 70.70%. This work provides scaffolding for future research on affect recognition in semantic communication based on social media vlog data.
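The voting-based classifier mentioned in the abstract can be illustrated with a minimal sketch: a vlog is split into segments, the model emits a per-segment depression probability, and a majority vote over segment decisions yields the clip-level label. The segmentation strategy, function names, and tie-breaking rule below are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def vote_clip_label(segment_probs, threshold=0.5):
    """Majority vote over per-segment depression probabilities.

    Each segment probability is thresholded into a per-segment vote,
    and the clip-level label is the majority of those votes.
    Ties are broken toward "non-depression" (an assumption) to
    reduce false positives.
    """
    votes = ["depression" if p >= threshold else "non-depression"
             for p in segment_probs]
    counts = Counter(votes)
    if counts["depression"] > counts["non-depression"]:
        return "depression"
    return "non-depression"
```

For example, `vote_clip_label([0.8, 0.6, 0.3])` returns `"depression"` (two of three segments vote positive), while `vote_clip_label([0.2, 0.6, 0.3])` returns `"non-depression"`.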

Keywords

Emotional computing / Semantic awareness / Depression recognition / Vlog data

Cite this article

Yongfeng Tao, Minqiang Yang, Yushan Wu, Kevin Lee, Adrienne Kline, Bin Hu. Depressive semantic awareness from vlog facial and vocal streams via spatio-temporal transformer. Digital Communications and Networks, 2024, 10(3): 577-585. DOI: 10.1016/j.dcan.2023.03.007


