Large language models (LLMs) have demonstrated remarkable performance across a wide range of downstream tasks. Since 2022, generative AI has shown significant potential in diverse application domains, including gaming, film and television, media, and finance. By 2023, the global AI-generated content (AIGC) industry had attracted over $26 billion in investment. As LLMs become increasingly prevalent, prompt engineering has emerged as a key research area for enhancing user-AI interaction and improving LLM performance. The prompt, i.e., the input instruction given to the LLM, directly shapes the model's responses. Prompt engineering refines the content and structure of prompts, thereby enhancing the performance of LLMs without changing the underlying model parameters. Despite significant advancements in prompt engineering, a comprehensive and systematic summary of existing techniques and their practical applications remains absent. To fill this gap, we investigate existing techniques and applications of prompt engineering. We conduct a thorough review and propose a novel taxonomy that provides a foundational framework for prompt construction. This taxonomy organizes prompt engineering into four distinct aspects: profile and instruction, knowledge, reasoning and planning, and reliability. By providing a structured framework for understanding these dimensions, we aim to facilitate the systematic design of prompts. Furthermore, we summarize existing prompt engineering techniques and explore the applications of LLMs across various domains, highlighting their interrelation with prompt engineering strategies. This survey underscores the progress of prompt engineering and its critical role in advancing AI applications, and ultimately aims to serve as a systematic reference for future research and applications.
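To make the four-aspect taxonomy concrete, here is a minimal sketch of assembling a prompt whose parts map onto profile and instruction, knowledge, reasoning and planning, and reliability. The function name, field labels, and example strings are illustrative assumptions, not templates from the survey itself.

```python
# Hypothetical helper: builds a prompt from the survey's four taxonomy aspects.
def build_prompt(profile: str, instruction: str, knowledge: str,
                 reasoning_hint: str, reliability_hint: str) -> str:
    """Concatenate the four aspects into one structured prompt."""
    return "\n\n".join([
        f"Role: {profile}",                  # profile and instruction
        f"Task: {instruction}",
        f"Context: {knowledge}",             # knowledge (e.g., retrieved facts)
        f"Approach: {reasoning_hint}",       # reasoning and planning
        f"Constraints: {reliability_hint}",  # reliability (e.g., no guessing)
    ])

prompt = build_prompt(
    profile="You are a financial analyst.",
    instruction="Summarize the quarterly report below.",
    knowledge="<retrieved report text>",
    reasoning_hint="Think step by step before writing the summary.",
    reliability_hint="If a figure is absent from the report, say so instead of guessing.",
)
print(prompt)
```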
Recent advancements in large language models (LLMs) have significantly contributed to progress on the Text-to-SQL task. A common requirement in many of these works is post-correction of the generated SQL queries. However, this process mostly entails analyzing error cases to handcraft prompt rules that eliminate model bias, and it lacks execution verification for the SQL queries. In addition, the prevalent techniques primarily depend on GPT-4 and few-shot prompts, resulting in high costs. To investigate effective yet cost-efficient methods for SQL refinement, we introduce Semantic-Enhanced Text-to-SQL with Adaptive Refinement (SEA-SQL), which combines Adaptive Bias Elimination and Dynamic Execution Adjustment to improve performance while minimizing resource expenditure using zero-shot prompts. Specifically, SEA-SQL employs a semantic-enhanced schema to augment database information and optimize SQL queries. During SQL query generation, a fine-tuned adaptive bias eliminator mitigates inherent biases introduced by the LLM, and dynamic execution adjustment guarantees the executability of the bias-eliminated SQL query. We conduct experiments on the Spider and BIRD datasets to demonstrate the effectiveness of this framework. The results show that SEA-SQL achieves state-of-the-art performance in the GPT-3.5 scenario with 9%–58% of the generation cost. Furthermore, SEA-SQL is comparable to GPT-4 with only 0.9%–5.3% of the generation cost. Our code is available at github.com/545999961/SEA-SQL.
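The following is a minimal sketch of an execution-adjustment loop in the spirit of the dynamic execution adjustment described above: run the candidate query and, on failure, ask the model to repair it using the database error. The `generate(prompt)` callable stands in for a zero-shot LLM call; the loop bound and prompt wording are assumptions, not SEA-SQL's actual implementation.

```python
import sqlite3

def adjust_until_executable(sql: str, db_path: str, generate, max_rounds: int = 3) -> str:
    """Repair `sql` until it executes against the SQLite database, or give up."""
    conn = sqlite3.connect(db_path)
    for _ in range(max_rounds):
        try:
            conn.execute(sql)   # executability check only; results are discarded
            break               # the query runs, so stop refining
        except sqlite3.Error as err:
            sql = generate(     # hypothetical zero-shot LLM call
                f"The SQL query below fails with error: {err}\n"
                f"Query: {sql}\n"
                "Return a corrected SQLite query only."
            )
    conn.close()
    return sql
```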
Table synthesis, which generates synthetic tables that resemble real ones, is important for supervised machine learning (ML) and has many practical applications, such as producing additional data for augmentation or publishing synthetic tables while preserving the privacy of the real ones. A fundamental question is: given a real table, can we synthesize a table such that ML models trained on the real table or the synthetic table perform similarly on an unseen test table?
Existing table synthesis works, mostly based on deep generative models (e.g., GANs or VAEs), try to learn the density function (i.e., the true data distribution) of the real table from sampled real records, treating each record separately. In practice, the assumption that records are uncorrelated is often violated (e.g., different purchase records for the same product should be correlated). Failing to capture such cross-record correlation yields a synthesized table that does not resemble the real one; consequently, an ML model trained on the synthetic table behaves very differently from one trained on the real table. In this paper, we explicitly model such record correlation as groups determined by user-specified categorical values; e.g., records with the same (Market, Product) values should be in the same group. Given such groups, we propose to use conditional GANs to simultaneously model the probability densities of the discrete (e.g., categorical) and continuous (e.g., numeric) values that may co-exist within a record, ensuring that both the global data distribution (within a table) and the local data distributions (within groups of a table) are preserved in the synthetic table. Moreover, we extend the previous differentially private GAN (DPGAN), which only ensures that the discriminator of the GAN is differentially private, to further protect the privacy of the original data embeddings and sample frequencies. Experiments show that our approach significantly outperforms state-of-the-art table synthesis methods for supervised ML, and protects privacy well while providing high utility.
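Below is a minimal PyTorch sketch of the conditional-generation idea: the generator is conditioned on a one-hot group label (e.g., a (Market, Product) combination) and emits both continuous columns and logits for a categorical column, so local per-group distributions can be modeled alongside the global one. The layer sizes and the single categorical head are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GroupConditionalGenerator(nn.Module):
    """Generates one record's columns conditioned on its group label."""
    def __init__(self, noise_dim: int, n_groups: int, n_numeric: int, n_categories: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(noise_dim + n_groups, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.numeric_head = nn.Linear(128, n_numeric)      # continuous columns
        self.category_head = nn.Linear(128, n_categories)  # categorical logits

    def forward(self, z: torch.Tensor, group_onehot: torch.Tensor):
        h = self.body(torch.cat([z, group_onehot], dim=1))
        # Gumbel-softmax keeps sampling of the categorical column differentiable.
        cat = nn.functional.gumbel_softmax(self.category_head(h), hard=True)
        return self.numeric_head(h), cat
```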
In recent years, personalized federated learning (PFL) has gained widespread attention for its robust performance on heterogeneous data. However, most PFL methods require client models to share the same architecture, which is impractical in real-world scenarios. Federated distillation has therefore been proposed, allowing clients to train models with different architectures. Nevertheless, these methods do not weigh the importance of the different pieces of distillation knowledge aggregated from clients, resulting in poor client collaboration. In this paper, we propose a novel personalized federated learning method based on partial distillation (FedPD) that assesses, for each client, the relevance of the different distillation and ensemble knowledge, thereby achieving selective knowledge transfer. Specifically, FedPD contains two key modules. One is partial knowledge transfer (PKT), which uses a partial distillation coefficient to identify the importance of each piece of distillation knowledge and select the more valuable ones. The other is the partial knowledge ensemble (PKE), which maintains a server-side model for each client to extract distillation knowledge that guides the client. Extensive experiments on real-world datasets under various experimental settings show that FedPD significantly improves client model performance compared with state-of-the-art federated learning methods.
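As a minimal sketch of selective distillation in the spirit of partial knowledge transfer, the loss below weights each source of distillation knowledge (soft logits) by its own coefficient, so less relevant knowledge contributes less to the student's update. How FedPD actually derives the coefficients is abstracted away here; the temperature and function signature are assumptions.

```python
import torch
import torch.nn.functional as F

def partial_distillation_loss(student_logits, teacher_logits_list, coeffs, T=2.0):
    """KL distillation loss, weighted per knowledge source by `coeffs`."""
    log_p = F.log_softmax(student_logits / T, dim=1)
    loss = 0.0
    for logits, w in zip(teacher_logits_list, coeffs):
        q = F.softmax(logits / T, dim=1)
        # T*T rescaling keeps gradient magnitudes comparable across temperatures.
        loss = loss + w * F.kl_div(log_p, q, reduction="batchmean") * T * T
    return loss
```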
Multi-hop question answering (MHQA) tasks require retrieving and reasoning over multiple relevant supporting facts to answer a question. However, existing MHQA models often rely on a single entity or fact to produce an answer rather than performing true multi-hop reasoning. Moreover, during the reasoning process, models may be misled by irrelevant factors, leading to broken reasoning chains and even incorrect answers. In recent years, causal-inference-based methods have gained widespread attention in debiasing research, but existing models still perform poorly when dealing with the complex causal biases hidden in multi-hop evidence. To address these challenges, we propose CausalBridgeQA, a novel method that integrates multi-hop question answering with causal relationships, effectively mitigating spurious feature correlations and broken reasoning chains. Specifically, we first extract causal relationships from the input context, then compile these relationships into causal questions carrying higher-level semantic information and feed them into the MHQA reasoning system. We further design a knowledge compensation mechanism in the reading comprehension module of the MHQA system that specifically targets questions that are hard to answer or frequently answered incorrectly, significantly improving MHQA performance. Finally, a series of experiments on three real-world QA datasets verifies the effectiveness of our proposed method.
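As a rough sketch of the compilation step described above, extracted causal relations can be turned into natural-language causal questions before being fed to the MHQA system. The relation tuples and question template below are illustrative assumptions; the paper's actual extraction and compilation procedures are more involved.

```python
from typing import List, Tuple

def compile_causal_questions(relations: List[Tuple[str, str]]) -> List[str]:
    """Each (cause, effect) pair becomes a question carrying causal semantics."""
    return [f"Why does {cause} lead to {effect}?" for cause, effect in relations]

# Hypothetical output of an upstream causal-relation extractor.
questions = compile_causal_questions([
    ("the dam's construction", "the decrease in downstream flooding"),
])
```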
Different from most conventional recommendation problems, sequential recommendation (SR) focuses on learning users' preferences by exploiting the internal order and dependency among the interacted items, and has received significant attention from both researchers and practitioners. In recent years, we have witnessed great progress and achievements in this field, necessitating a new survey. In this survey, we study the SR problem from a new perspective (i.e., the construction of an item's properties) and summarize the most recent techniques used in sequential recommendation, such as multi-modal SR, generative SR, LLM-powered SR, ultra-long SR, and data-augmented SR. Moreover, we introduce some frontier research topics in SR, e.g., open-domain SR, data-centric SR, cloud-edge collaborative SR, continuous SR, SR for good, and explainable SR. We believe that our survey can serve as a valuable roadmap for readers in this field.
Privacy-preserving feature selection allows identifying more important features while ensuring data privacy, thus enhancing data quality. Secure multiparty computation (MPC) is a cryptographic method that allows effective data processing without a trusted third party. However, most MPC-based feature selection schemes overlook the correlation between features and perform poorly for model training when handling datasets containing both numerical and categorical attributes. This paper proposes a feature selection scheme, MPC-Relief, to select relevant features while preserving privacy. To achieve security under MPC, we transform all complex computational steps from data-dependent to data-oblivious form with faithful implementations. In detail, we construct bidirectional vectors to partition subsets and propose an MPC-based nonlinear function, MN-Ramp, to calculate the difference between mixed attributes. Besides, we apply a mapping method to the distance calculation to eliminate the need for conditional judgments. We evaluate the computational and communication overhead of the MN-Ramp function in both WAN and LAN environments and validate its effectiveness across various datasets. The comparative analysis demonstrates that our scheme achieves up to an 18% accuracy improvement over other schemes when handling nonlinear datasets. The results of the classification task based on the selected features indicate that our scheme notably enhances the performance of subsequent models while providing strong privacy guarantees.
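To illustrate the flavor of a ramp-style difference for mixed attributes, here is a minimal plaintext sketch written branch-free, in the data-oblivious spirit of MN-Ramp: inside MPC the same shape would be evaluated on secret shares without data-dependent branching. The thresholds `t_eq`/`t_diff` and the clip-based formulation are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def ramp_diff(a: np.ndarray, b: np.ndarray, t_eq: float, t_diff: float) -> np.ndarray:
    """0 below t_eq, 1 above t_diff, linear in between -- with no branching."""
    d = np.abs(a - b)
    return np.clip((d - t_eq) / (t_diff - t_eq), 0.0, 1.0)

# Categorical attributes can be encoded so that |a - b| is 0 for equal codes
# and at least t_diff otherwise, letting one formula cover both attribute types.
```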
In recent years, research has increasingly transformed data into graph representations, using graph neural networks to extract rich relational and interaction information. This enhances a model's ability to understand and process complex data structures. Because certain data are private and sensitive, especially in government and enterprise settings, such high-quality data are often strictly controlled, limiting centralized model training and weakening the generalization of traditional models to unseen data. To address these challenges, this paper proposes a Reinforced Federated Graph Domain Generalization (RFGDG) method, which improves generalization in cross-domain data scenarios while protecting data privacy through multi-party collaboration. We design a mini-batch processing strategy based on graph sampling, combined with GraphSAGE, to build an efficient local graph-based node classification model. This sampling strategy reduces computational overhead while preserving graph structure, improving local model performance. To address data heterogeneity and feature inconsistency across clients, we propose a federated graph domain generalization strategy based on random Fourier feature transformation and weighted covariance matrix optimization, which unifies feature representations, reduces redundancy, and enhances adaptability to inconsistent data. We also propose a dynamic parameter aggregation strategy for federated graph neural networks using deep reinforcement learning. With the Deep Deterministic Policy Gradient (DDPG) algorithm, we dynamically adjust aggregation weights based on each client's contribution, improving global model accuracy and convergence speed. This strategy accounts for graph structure heterogeneity and differences in client contributions, ensuring generalization in multi-client environments. Extensive experiments on three public graph datasets and one dataset from the Weibo platform demonstrate that the proposed RFGDG method significantly improves global model accuracy and shows stronger robustness and adaptability in multi-client environments.
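For reference, here is a minimal sketch of the random Fourier feature (RFF) transformation used to unify client feature representations: projecting features through random frequencies approximates a Gaussian kernel feature map, and sharing the random seed gives every client the same map. The output dimension and bandwidth `sigma` are illustrative assumptions.

```python
import numpy as np

def random_fourier_features(X: np.ndarray, out_dim: int, sigma: float = 1.0,
                            seed: int = 0) -> np.ndarray:
    """Map rows of X into an RFF space approximating a Gaussian kernel."""
    rng = np.random.default_rng(seed)  # shared seed -> identical map across clients
    W = rng.normal(0.0, 1.0 / sigma, size=(X.shape[1], out_dim))
    b = rng.uniform(0.0, 2 * np.pi, size=out_dim)
    return np.sqrt(2.0 / out_dim) * np.cos(X @ W + b)
```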
Data is undoubtedly becoming a commodity like oil, land, and labour in the 21st century. Although there have been many successful marketplaces for data trading, existing data marketplaces lack consideration of the case where buyers want to acquire a collection of datasets (instead of one) and where the overall spatial coverage and connectivity matter. In this paper, we make the first attempt to formulate this problem as Budgeted Maximum Coverage with Connectivity Constraint (BMCC), which aims to acquire a dataset collection with the maximum spatial coverage under a limited budget while maintaining spatial connectivity. To solve the problem, we propose two approximate algorithms with detailed theoretical guarantees and time complexity analysis, followed by two acceleration strategies to further improve their efficiency. Experiments are conducted on five real-world spatial dataset collections to verify the efficiency and effectiveness of our algorithms.
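To convey the shape of the BMCC problem, here is a minimal greedy sketch: grow a connected set of datasets from a seed, always adding the adjacent dataset with the best marginal-coverage-per-cost ratio that still fits the budget. This illustrates the problem, not the paper's approximation algorithms or their guarantees; the data-structure choices are assumptions.

```python
def greedy_bmcc(datasets, costs, adjacency, seed, budget):
    """datasets: id -> set of covered cells; adjacency: id -> neighbor ids.
    Returns a connected, budget-feasible set of dataset ids and its coverage."""
    chosen, covered, spent = {seed}, set(datasets[seed]), costs[seed]
    while True:
        # Only neighbors of already-chosen datasets keep the set connected.
        frontier = {n for d in chosen for n in adjacency[d]} - chosen
        candidates = [(len(datasets[n] - covered) / costs[n], n)
                      for n in frontier if spent + costs[n] <= budget]
        if not candidates:
            return chosen, covered
        gain, best = max(candidates)
        if gain == 0:                    # nothing left that adds coverage
            return chosen, covered
        chosen.add(best)
        covered |= datasets[best]
        spent += costs[best]
```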