Cross-lingual news topic discovery is the automatic grouping of online news articles in different languages that report on the same event. Its core difficulty lies in multilingual text clustering. Many types of associations exist between news events, and these associations can interact with each other and propagate throughout the event network. Building on this idea, this paper proposes a method for Chinese-Vietnamese bilingual news topic discovery based on association graph clustering. First, a Chinese-Vietnamese bilingual association graph is constructed from the associations among elements of the articles. The Chinese and Vietnamese texts are clustered coarsely using the affinity propagation (AP) algorithm, and the clustering results are then adjusted by using the association strengths to update the edge weights dynamically. News in both languages supervises the clustering, weakening the effect of linguistic differences on the results. Finally, locally and globally optimal Chinese-Vietnamese bilingual news clusters are obtained, realizing automated clustering. We use 2000 news texts collected from 15 authoritative Chinese websites and 10 Vietnamese websites as experimental data. The experimental results show that the F-score of the proposed method improves by 8.4% over K-means clustering.
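The coarse clustering step above uses affinity propagation. As a point of reference, a minimal numpy sketch of the standard AP message-passing updates is shown below; it is not the paper's bilingual pipeline, and the damping and iteration settings are illustrative defaults.

```python
import numpy as np

def affinity_propagation(S, damping=0.7, iters=200):
    """Minimal affinity propagation. S is an (n, n) similarity matrix
    whose diagonal holds the exemplar preferences. Returns, for each
    point, the index of its chosen exemplar (points sharing an exemplar
    form one cluster)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    for _ in range(iters):
        # responsibility: r(i,k) = s(i,k) - max_{k'!=k} (a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * R_new
        # availability: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        Rp[np.arange(n), np.arange(n)] = R[np.arange(n), np.arange(n)]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = np.diag(A_new).copy()
        A_new = np.minimum(A_new, 0)
        A_new[np.arange(n), np.arange(n)] = dA
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)
```

With the diagonal preference set to the median pairwise similarity (a common default), well-separated groups emerge as distinct exemplars without specifying the number of clusters in advance.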
Large multimodal models (LMMs) have demonstrated significant success across various tasks but fall short on some basic visual functions, exhibiting inaccurate object counting and imprecise localization. These limitations restrict the application of LMMs in broad scenarios. To enhance the capabilities of LMMs, we propose a novel method to patch their visual perceptual abilities by collaborating with small task-specific models. Our method begins by utilizing an LMM to decompose the user query into a series of visual functions. For each function, the appropriate model, either the LMM itself or a small task-specific model, is invoked. To decide whether to patch the LMM with a small task-specific model, we design a novel question-answering-based reinforcement learning strategy to optimize the decision process. Finally, the LMM generates the answer using the visual perceptual results. The proposed method is evaluated on two standard visual question-answering datasets and two specialized datasets. The experimental results demonstrate that our method effectively enhances the visual abilities of LMMs.
Cardinality and cost estimation are critical components of query optimization, as they directly influence the construction of efficient physical execution plans. While machine learning-based estimators have achieved notable success, they face several challenges: (1) Training data derived from rigid, template-driven benchmarks exhibits significant distributional divergence from real-world query workloads, a challenge further compounded by the difficulty of manually designing templates that exhaustively cover the full spectrum of query patterns. (2) These methods demonstrate limited generalization, especially in scenarios involving sub-plan estimation or queries that deviate significantly from the training query templates. Furthermore, the inherent inefficiency of operator-level cardinality estimation frequently undermines its applicability for accurate cost estimation. (3) These approaches frequently fail to leverage the rich semantic information and dynamic dependencies between operators.
To address these challenges, we propose a novel operator-level cardinality and cost estimator that simultaneously estimates the cardinality and cost of all sub-plans within a query plan. First, we leverage large language models to generate high-quality and diverse SQL queries, which serve as the foundation for pre-training and fine-tuning our model. Second, we introduce a semantic-based operator encoding strategy, augmented with a novel tree-structure-aware neural network, to effectively represent each sub-plan. Third, we propose a specialized loss function tailored for joint cardinality and cost prediction at the operator level, fully utilizing labels from each sub-plan. Extensive experiments on both synthetic and real-world datasets demonstrate that our method consistently outperforms state-of-the-art approaches.
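The idea of a tree-structure-aware encoding that yields one representation per sub-plan can be sketched as a bottom-up traversal. The class and weight matrices below are illustrative stand-ins for the paper's trained network and semantic operator embeddings, not its actual architecture.

```python
import numpy as np

class PlanNode:
    """A plan operator with a semantic embedding (hypothetical structure)."""
    def __init__(self, op_embedding, children=()):
        self.emb = np.asarray(op_embedding, dtype=float)
        self.children = list(children)

def encode_subplans(node, W_self, W_child, out):
    """Bottom-up, tree-structure-aware encoding sketch: the sub-plan
    rooted at `node` gets one vector combining the operator's own
    semantic embedding with its children's representations. Every
    sub-plan in the tree is appended to `out`, so cardinality and cost
    labels can be attached to each one in a single pass."""
    child_sum = np.zeros(W_child.shape[1])
    for c in node.children:
        child_sum += encode_subplans(c, W_self, W_child, out)
    rep = np.tanh(W_self @ node.emb + W_child @ child_sum)
    out.append(rep)  # one representation (and thus one label pair) per sub-plan
    return rep
```

A single traversal of the plan tree thus produces representations for all sub-plans, which is what makes joint operator-level supervision of cardinality and cost possible.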
Dialogue policy, a critical component in multi-domain task-oriented dialogue systems, decides the dialogue acts according to the received dialogue state. We introduce zero-shot reinforcement learning for dialogue policy learning, which aims to learn dialogue policies capable of generalizing to unseen domains without further training. This setup brings forward two challenges: 1) the representation of unseen actions and states, and 2) zero-shot generalization to unseen domains. For the first issue, we propose Unified Representation (UR), an ontology-agnostic representation that effectively infers representations in unseen domains by capturing the underlying semantic relations between unseen actions and states and seen ones. To tackle the second issue, we propose Q-Values Perturbation (QVP), a family of exploration strategies that can be applied either during training or testing. Experiments on MultiWOZ suggest that UR, QVP, and an integrated framework combining the two are all effective.
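Since QVP is described as a family of strategies, one simple member can be sketched as perturbing the Q-values with noise before the greedy argmax; the noise scale `sigma` is a hypothetical parameter, and this is only an illustrative instance, not the paper's exact formulation.

```python
import numpy as np

def select_action(q_values, sigma=0.1, rng=None):
    """Q-value perturbation sketch: add Gaussian noise to the Q-values
    before taking the greedy action, so near-tied actions get explored
    while clearly dominant actions are still chosen almost always."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    noise = rng.normal(0.0, sigma, size=q.shape)
    return int(np.argmax(q + noise))
```

Because the perturbation is applied at action-selection time only, the same mechanism can be used during training (as exploration) or at test time (as a robustness probe), matching the abstract's "training or testing" framing.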
The rapid advancements in Deep Neural Networks (DNNs) have revolutionized generative software engineering tasks, including code summarization, program repair, code generation, and code translation. However, the performance of DNN models in these tasks heavily depends on the quality of their training and evaluation datasets. This systematic literature review examines 70 primary studies to comprehensively analyze dataset construction methodologies, prevalent data quality challenges, and solutions proposed to address these challenges. Our findings reveal that dataset construction processes significantly influence quality, with common issues such as noise, redundancy, imbalance, and insufficient granularity undermining model effectiveness. We identify key strategies to mitigate these problems, including data augmentation, automated cleaning techniques, and standardized validation frameworks. Furthermore, we highlight the critical role of dataset diversity and timeliness in improving model generalization. This study provides actionable insights for researchers and practitioners in the era of generative AI, where high-quality datasets are essential for developing reliable language models as software engineering tools. By emphasizing rigorous dataset curation and innovative quality assurance methods, our work bridges the gap between theoretical advancements and practical applications, enabling the creation of robust, generalizable models for real-world code-related tasks. The synthesized recommendations aim to guide future research in optimizing dataset design, fostering reproducibility, and addressing evolving challenges in data-driven software engineering.
Dialogue discourse parsing is a fundamental task in natural language understanding. It aims to capture the relationships between utterances in a dialogue, facilitating a deeper understanding of dialogue structures and semantics, especially in long and complex dialogues. Existing research often develops separate dialogue discourse parsers for text-only and multimodal scenarios, largely due to the scarcity of parallel multimodal annotated datasets. This separation limits the ability to fully utilize diverse data with different modalities and poses challenges for real-world artificial intelligence applications. To address the limitation, we propose a unified dialogue discourse parsing framework that bridges text-only and multimodal parsing within a single model. We first develop a basic text-only parser, pre-trained on textual datasets. Then, we extend it to multimodal scenarios by adding additional multimodal encoders and fusion modules, while freezing the parameters learned during the text-only stage. We conduct extensive experiments on three datasets, covering both text-only and multimodal dialogues. Experimental results show that our approach achieves significant average improvements over several existing benchmarks. This demonstrates the generalizability and effectiveness of our framework for dialogue discourse parsing across different modalities.
Bipartite graph-based multi-view clustering groups samples according to the relationships between samples and anchors, and has demonstrated significant advancements in recent years. However, predefined bipartite graphs with fixed anchors may not accurately reflect the underlying clustering structure, leading to degraded clustering performance. To address this problem, we propose a Structure Sparsity-Induced Bipartite Graph (SSBG) learning method that dynamically constructs view-specific bipartite graphs with automatically learned anchors. Concretely, representative anchors for each view are learned by integrating key samples, which are selected by introducing a selection matrix with structure sparsity. Meanwhile, the feature matrix of each view is reconstructed from the learned anchors and the corresponding bipartite graph in a self-representation manner. Owing to the representativeness of the anchors and the strength of the self-representation model in capturing complex relationships, the consistent bipartite graph fused from multiple views better represents the underlying clustering structure. A convergent iterative algorithm is developed to optimize the objective function, and the final clustering partition is obtained directly from the connected components of the fused consistent bipartite graph. Extensive experimental results demonstrate the advantages of SSBG in clustering performance across various benchmark datasets.
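The final step, reading the partition off the connected components of a sample-anchor bipartite graph, can be sketched with a small union-find; this is a generic illustration of that step, not SSBG's optimization itself.

```python
import numpy as np

def clusters_from_bipartite(B, eps=1e-8):
    """Given a (samples x anchors) bipartite affinity matrix B, return a
    cluster label per sample: samples are in the same cluster iff they
    are connected through shared anchors (edges with weight > eps)."""
    n, m = B.shape
    parent = list(range(n + m))  # samples 0..n-1, anchors n..n+m-1

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    rows, cols = np.nonzero(B > eps)
    for i, j in zip(rows, cols):
        union(i, n + j)  # connect sample i with anchor j
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```

This illustrates why no extra k-means step is needed once the fused bipartite graph is learned: the number of connected components directly determines the number of clusters.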
Decentralized federated learning (DFL) is inherently vulnerable to data poisoning attacks, as malicious clients can transmit manipulated gradients to neighboring clients. Existing defense methods either reject suspicious gradients per iteration or restart DFL aggregation after excluding all malicious clients; both neglect the potential benefits within contributions from malicious clients. In this paper, we propose a novel gradient purification defense against data poisoning attacks in DFL. It aims to mitigate the harm in gradients while retaining the benefits embedded in model weights, thereby enhancing overall model accuracy. Each benign client maintains, for each of its neighbors, a recording variable that tracks the historically aggregated gradients from that neighbor. This allows benign clients to precisely detect malicious neighbors and mitigate all aggregated malicious gradients at once. Upon mitigation, benign clients optimize model weights using the purified gradients. This optimization not only retains previously beneficial components from malicious clients but also exploits canonical contributions from benign clients. We analyze the convergence of the proposed defense, as well as its ability to attain high accuracy. Extensive experiments demonstrate that it mitigates data poisoning attacks under both iid and non-iid data distributions and significantly outperforms state-of-the-art defense methods in terms of model accuracy.
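The per-neighbor recording idea can be sketched as follows. All names and the averaging/detection details are hypothetical simplifications: the sketch only shows how accumulating each neighbor's aggregated share lets a client undo that neighbor's entire historical contribution in one step.

```python
import numpy as np

class BenignClient:
    """Sketch of a benign DFL client that records, per neighbor, the
    gradient mass it has aggregated from that neighbor, so a flagged
    malicious neighbor's whole contribution can be purged at once."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr
        self.records = {}  # neighbor id -> summed aggregated gradient

    def aggregate(self, neighbor_grads):
        """neighbor_grads: {neighbor_id: gradient vector}. Average the
        neighbors' gradients, record each neighbor's share, then step."""
        avg = np.mean(list(neighbor_grads.values()), axis=0)
        share = 1.0 / len(neighbor_grads)
        for nid, g in neighbor_grads.items():
            self.records[nid] = self.records.get(nid, 0.0) + share * np.asarray(g, dtype=float)
        self.weights -= self.lr * avg

    def purify(self, malicious_id):
        """Undo every update contributed by the flagged neighbor, while
        keeping all other clients' contributions in the weights."""
        self.weights += self.lr * self.records.pop(malicious_id)
```

Note how purification removes only the flagged neighbor's accumulated share and leaves the rest of the training history intact, rather than restarting aggregation from scratch.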
Large language models (LLMs) show impressive capabilities across many NLP tasks, but their enormous size creates major deployment challenges. While single compression methods provide limited solutions, combining approaches such as pruning, quantization, knowledge distillation, and low-rank approximation might be essential for both higher compression rates and better model performance.
This paper studies the synergistic effects of combining multiple LLM compression techniques. Our findings reveal that strategic combinations, such as contextual pruning with quantization, can reduce model size by more than 90% while maintaining performance. We also find that the order of application can impact outcomes and that joint optimization of compression methods can outperform sequential combination. Although promising, existing combination approaches rely on manual design choices and lack a systematic framework for multi-technique compression. To address this, we prototype a formal framework for automated, multi-technique LLM compression that optimizes the combination sequence. Finally, we discuss remaining challenges and outline future research directions for more efficient large language models.
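A sequential combination of two techniques can be made concrete with a toy example: magnitude pruning followed by symmetric uniform quantization on a single weight matrix. This is a generic sketch of the idea, not the paper's contextual-pruning pipeline, and the sparsity/bit-width values are illustrative.

```python
import numpy as np

def prune_then_quantize(W, sparsity=0.9, bits=8):
    """Sequential two-technique compression sketch:
    1) magnitude pruning zeroes the smallest |w| entries,
    2) symmetric uniform quantization maps survivors to `bits`-bit ints.
    Returns the int8 codes, the scale, and the pruning mask."""
    k = int(sparsity * W.size)
    thresh = np.sort(np.abs(W), axis=None)[k]
    mask = np.abs(W) >= thresh
    Wp = W * mask
    scale = np.abs(Wp).max() / (2 ** (bits - 1) - 1)
    q = np.round(Wp / scale).astype(np.int8)
    return q, scale, mask

def dequantize(q, scale):
    """Reconstruct approximate weights from the compressed form."""
    return q.astype(np.float32) * scale
```

Even this toy version shows why order matters: quantizing first would waste the integer range on weights that pruning later discards, whereas pruning first lets the quantizer's scale fit only the surviving weights.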