Six-Writings multimodal processing with pictophonetic coding to enhance Chinese language models
Li WEIGANG, Mayara Chew MARINHO, Denise Leyi LI, Vitor Vasconcelos DE OLIVEIRA
Although large language models (LLMs) have made significant strides in natural language processing (NLP), they still struggle with the intricacies of the Chinese language in certain scenarios. We propose a framework called Six-Writings multimodal processing (SWMP) that integrates morphological and semantic elements directly into Chinese NLP (CNLP). The first part of SWMP, Six-Writings pictophonetic coding (SWPC), encodes radicals and components at a suitable level of granularity, enabling effective representation of Chinese characters and words. We conduct experiments in several scenarios, including the following: (1) We establish an experimental database of images and SWPC codes for Chinese characters, enabling dual-mode processing and matrix generation for CNLP. (2) We characterize various generative modes of Chinese words, including thousands of Chinese idioms, and use them as question-and-answer (Q&A) prompt functions that facilitate analogies by SWPC. These experiments achieve 100% accuracy on all questions in the Chinese morphological data set (CA8-Mor-10177). (3) We propose a fine-tuning mechanism that refines word embedding results using SWPC, yielding an average relative error of ≤25% for 39.37% of the questions in the Chinese wOrd Similarity data set (COS960). The results demonstrate that the SWMP/SWPC methods effectively capture the distinctive features of Chinese and offer a promising mechanism for enhancing CNLP with better efficiency.
Chinese language model / Chinese natural language processing (CNLP) / Generative language model / Multimodal processing / Six-Writings
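The word-similarity result in the abstract (relative error of ≤25% for 39.37% of the COS960 questions) can be read as the share of word pairs whose predicted similarity falls within a relative-error threshold of the human-rated score. The sketch below illustrates that reading only. The exact error definition used by the authors is not given here, so the formula `|predicted - gold| / |gold|`, the function names, and the toy scores are all assumptions for illustration, not the paper's implementation or data.

```python
from typing import Iterable, Tuple


def relative_error(predicted: float, gold: float) -> float:
    """Relative error of a predicted similarity score against a gold score.

    Assumed definition: |predicted - gold| / |gold|.
    """
    if gold == 0:
        raise ValueError("gold score must be non-zero")
    return abs(predicted - gold) / abs(gold)


def share_within_threshold(
    pairs: Iterable[Tuple[float, float]], threshold: float = 0.25
) -> float:
    """Fraction of (predicted, gold) pairs with relative error <= threshold."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    within = sum(1 for p, g in pairs if relative_error(p, g) <= threshold)
    return within / len(pairs)


# Hypothetical (predicted, gold) similarity scores; not taken from COS960.
toy_pairs = [(3.0, 4.0), (1.0, 2.0), (2.2, 2.0), (0.5, 3.0)]
share = share_within_threshold(toy_pairs)
print(f"{share:.2%} of pairs are within 25% relative error")
```

Under this reading, the reported figure would correspond to `share_within_threshold` returning about 0.3937 over the COS960 word pairs.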