Spreadsheets are widely used for information processing to support decision making by both professional developers and non-technical end users. Moreover, business intelligence and artificial intelligence are increasingly popular in industry, where spreadsheets have been used as, or integrated into, intelligent or expert systems in various application domains. However, it has been repeatedly reported that faults often exist in operational spreadsheets, which could severely compromise the quality of conclusions and decisions based on them. With a view to systematically examining this problem through a survey of existing work, we have conducted a comprehensive literature review of the quality issues and related techniques of spreadsheets over a 35.5-year period (from January 1987 to June 2022) for target journals and a 10.5-year period (from January 2012 to June 2022) for target conferences. Among other findings, two major ones are: (a) Spreadsheet quality is best addressed throughout the whole spreadsheet life cycle, rather than by focusing on only a few specific stages of the life cycle. (b) Relatively more studies focus on spreadsheet testing and debugging (related to fault detection and removal) than on spreadsheet specification, modeling, and design (related to development). As prevention is better than cure, more research should be performed on the early stages of the spreadsheet life cycle. Informed by our comprehensive review, we identify the major research gaps and highlight key research directions for future work in the area.
Transformer architectures have facilitated the development of large-scale and general-purpose sequence models for prediction tasks in natural language processing and computer vision, e.g., GPT-3 and Swin Transformer. Although originally designed for prediction problems, it is natural to ask whether they are suitable for sequential decision-making and reinforcement learning (RL) problems, which are typically beset by long-standing issues involving sample efficiency, credit assignment, and partial observability. In recent years, sequence models, especially the Transformer, have attracted increasing interest in the RL community, spawning numerous approaches with notable effectiveness and generalizability. This survey presents a comprehensive overview of recent works aimed at solving sequential decision-making tasks with sequence models such as the Transformer, discussing the connection between sequential decision-making and sequence modeling and categorizing these works based on how they utilize the Transformer. Moreover, this paper puts forth various potential avenues for future research intended to improve the effectiveness of large sequence models for sequential decision-making, encompassing theoretical foundations, network architectures, algorithms, and efficient training systems.
Federated learning is a promising learning paradigm that allows collaborative training of models across multiple data owners without sharing their raw datasets. To enhance privacy in federated learning, multi-party computation can be leveraged for secure communication and computation during model training. This survey provides a comprehensive review on how to integrate mainstream multi-party computation techniques into diverse federated learning setups for guaranteed privacy, as well as the corresponding optimization techniques to improve model accuracy and training efficiency. We also pinpoint future directions to deploy federated learning to a wider range of applications.
Model-based reinforcement learning is a promising direction for improving the sample efficiency of reinforcement learning by learning a model of the environment. Previous model learning methods aim at fitting the transition data and commonly employ a supervised learning approach to minimize the distance between the predicted state and the real state. Such supervised model learning methods, however, diverge from the ultimate goal of model learning, i.e., optimizing the policy learned in the model. In this work, we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment. We find that model learning towards this objective yields a target of enhancing the similarity between the policy gradient computed on generated data and the policy gradient computed on real data. We thus derive the gradient of the model from this target and propose the Model Gradient algorithm (MG) to integrate this novel model learning approach with policy-gradient-based policy optimization. We conduct experiments on multiple locomotion control tasks and find that MG not only achieves high sample efficiency but also leads to better convergence performance than traditional model-based reinforcement learning approaches.
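A minimal Python sketch of the gradient-alignment idea described above, under simplified assumptions (a toy linear-Gaussian policy and REINFORCE-style gradient estimates); it illustrates the kind of target that MG-style model learning would optimize, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's implementation): the gradient-alignment
# target behind MG-style model learning. We estimate a REINFORCE policy
# gradient on real transitions and on model-generated transitions, and
# score the model by how similar the two gradient estimates are.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)            # toy linear-Gaussian policy parameters

def policy_grad_logp(theta, s, a):
    # d/d_theta log pi(a|s) for a unit-variance Gaussian policy a ~ N(theta·s, 1)
    return (a - theta @ s) * s

def pg_estimate(theta, transitions):
    # REINFORCE-style gradient estimate: mean of grad log pi(a|s) * return
    return np.mean([policy_grad_logp(theta, s, a) * ret
                    for s, a, ret in transitions], axis=0)

def gradient_alignment(theta, real_batch, model_batch):
    # The model would be trained to make the gradient on generated data
    # close to the gradient on real data (higher score is better).
    g_real = pg_estimate(theta, real_batch)
    g_model = pg_estimate(theta, model_batch)
    return -np.linalg.norm(g_real - g_model)

# toy batches of (state, action, return) triples
real_batch = [(rng.normal(size=4), rng.normal(), rng.normal()) for _ in range(64)]
model_batch = [(rng.normal(size=4), rng.normal(), rng.normal()) for _ in range(64)]
print("alignment score:", gradient_alignment(theta, real_batch, model_batch))
```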
Internet of Everything (IoE) based cloud computing is one of the most prominent areas of the digital big data world. This approach provides an efficient infrastructure for storing and accessing big real-time data and smart IoE services from the cloud. IoE-based cloud computing services are hosted at remote locations outside the control of the data owner. Data owners mostly depend on an untrusted Cloud Service Provider (CSP) and do not know what security capabilities are implemented. This lack of knowledge about security capabilities and lack of control over data raise several security issues. Deoxyribonucleic Acid (DNA) computing is a biological concept that can improve the security of IoE big data. This paper proposes a DNA-based cryptographic scheme and access control model (DNACDS) to solve IoE big data security and access issues; the scheme builds on the Station-to-Station Key Agreement Protocol (StS KAP) and Feistel cipher algorithms. The experimental results illustrate that DNACDS performs better than other DNA-based security schemes, and the theoretical security analysis shows that DNACDS offers better resistance capabilities.
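To make these building blocks concrete, here is an illustrative Python sketch of a toy Feistel structure whose output is encoded as a DNA string using a simple 2-bit-per-base mapping. This is not the DNACDS specification; the round function, the stand-in keys, and the base mapping are assumptions for illustration.

```python
# Illustrative sketch only (not the DNACDS specification): a toy Feistel
# structure combined with a simple 2-bit-per-base DNA encoding of the
# ciphertext. The round function and keys are placeholders.
import hashlib

BASES = "ACGT"  # assumed encoding: 2 bits of ciphertext per DNA base

def round_fn(half: bytes, key: bytes) -> bytes:
    # placeholder round function: truncated SHA-256 of (half || key)
    return hashlib.sha256(half + key).digest()[: len(half)]

def feistel_encrypt(block: bytes, round_keys):
    half = len(block) // 2
    L, R = block[:half], block[half:]
    for k in round_keys:
        L, R = R, bytes(a ^ b for a, b in zip(L, round_fn(R, k)))
    return R + L  # final swap

def to_dna(data: bytes) -> str:
    return "".join(BASES[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))

keys = [b"k1", b"k2", b"k3", b"k4"]          # stand-ins for StS-KAP-derived keys
ct = feistel_encrypt(b"IoE big data!!!!", keys)
print(to_dna(ct))
```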
Super-resolution (SR) is a long-standing problem in image processing and computer vision and has attracted great attention from researchers over the decades. The main concept of SR is to reconstruct images from low-resolution (LR) to high-resolution (HR). It is an ongoing process in image technology, carried out through up-sampling, de-blurring, and de-noising. Convolutional neural networks (CNNs) have been widely used to enhance the resolution of images in recent years, and many methods use deep learning to advance CNN-based image super-resolution. Here, we review recent findings on single image super-resolution using deep learning, with an emphasis on knowledge distillation used to enhance image super-resolution. We also highlight the potential applications of image super-resolution in security monitoring, medical diagnosis, microscopy image processing, satellite remote sensing, communication transmission, the digital multimedia industry, and video enhancement. Finally, we present the challenges and assess future trends in super-resolution based on deep learning.
A threshold signature is a special digital signature in which the signing key is shared among a group of participants, so that any subset of at least a threshold number of them can jointly produce a valid signature, while any smaller subset cannot.
Visualization and artificial intelligence (AI) are widely applied approaches to data analysis. On one hand, visualization can facilitate human understanding of data through intuitive visual representation and interactive exploration. On the other hand, AI is able to learn from data and perform laborious tasks for humans. In complex data analysis scenarios, such as epidemic traceability and city planning, humans need to understand large-scale data and make decisions, which requires complementing the strengths of both visualization and AI. Existing studies have introduced AI-assisted visualization as AI4VIS and visualization-assisted AI as VIS4AI. However, discussions of how AI and visualization can complement each other and be integrated into data analysis processes are still missing. In this paper, we define three integration levels of visualization and AI. The highest integration level is described as the framework of VIS+AI, which allows AI to learn human intelligence from interactions and communicate with humans through visual interfaces. We also summarize future directions of VIS+AI to inspire related studies.
Graphs, which model real-world entities with vertices and the relationships among entities with edges, have proven to be a powerful tool for describing real-world problems in many applications. In most real-world scenarios, entities and their relationships are subject to constant change. Graphs that record such changes are called dynamic graphs. In recent years, the widespread application scenarios of dynamic graphs have stimulated extensive research on dynamic graph processing systems that continuously ingest graph updates and produce up-to-date graph analytics results. As the scale of dynamic graphs grows, higher performance requirements are placed on dynamic graph processing systems. With their massive parallel processing power and high memory bandwidth, GPUs have become mainstream vehicles for accelerating dynamic graph processing tasks. GPU-based dynamic graph processing systems mainly address two challenges: maintaining the graph data when updates occur (i.e., graph updating) and producing analytics results in time (i.e., graph computing). In this paper, we survey GPU-based dynamic graph processing systems and review their methods for addressing both graph updating and graph computing. To comprehensively discuss existing dynamic graph processing systems on GPUs, we first introduce the terminology of dynamic graph processing and then develop a taxonomy to describe the methods employed for graph updating and graph computing. In addition, we discuss the challenges and future research directions of dynamic graph processing on GPUs.
Uncertain Knowledge Graphs (UKGs) are used to characterize the inherent uncertainty of knowledge and have a richer semantic structure than deterministic knowledge graphs. Research on UKG embedding has only recently begun; the Uncertain Knowledge Graph Embedding (UKGE) model has shown some effectiveness in solving this problem. However, there are still unresolved issues. On the one hand, when reasoning about the confidence of unseen relation facts, the introduced probabilistic soft logic cannot combine multi-path and multi-step global information, leading to information loss. On the other hand, the existing UKG embedding model can only model symmetric relation facts, and the embedding problem of asymmetric relation facts has not been addressed. To address these issues, a Multiplex Uncertain Knowledge Graph Embedding (MUKGE) model is proposed in this paper. First, to combine multiple sources of information and achieve more accurate confidence reasoning, the Uncertain ResourceRank (URR) reasoning algorithm is introduced. Second, the asymmetry in the UKG is defined, and a multi-relation embedding model is proposed to embed asymmetric relation facts of the UKG. Finally, experiments are carried out on different datasets over four tasks to verify the effectiveness of MUKGE. The experimental results demonstrate that MUKGE achieves better overall performance than the baselines and helps advance research on UKG embedding.
With the development of information technology and cloud computing, data sharing has become an important part of scientific research. In traditional data sharing, data is stored on a third-party storage platform, which causes the owner to lose control of the data. As a result, there are issues of intentional data leakage and tampering by third parties, and the private information contained in the data may lead to more significant issues. Furthermore, data is frequently maintained on multiple storage platforms, posing significant hurdles in enlisting multiple parties to engage in data sharing while maintaining consistency. In this work, we propose a new architecture that applies blockchains to data sharing and achieves efficient and reliable data sharing among heterogeneous blockchains. We design a new data sharing transaction mechanism based on the system architecture to protect the security of the raw data and the processing process. We also design and implement a hybrid concurrency control protocol to overcome issues caused by the large differences in blockchain performance in our system and to improve the success rate of data sharing transactions. We take Ethereum and Hyperledger Fabric as examples to conduct cross-blockchain data sharing experiments. The results show that our system achieves data sharing across heterogeneous blockchains with reasonable performance and has high scalability.
Container-based virtualization is becoming increasingly popular in cloud computing due to its efficiency and flexibility. Resource isolation is a fundamental property of containers. Existing works have indicated that weak resource isolation could cause significant performance degradation for containerized applications, and have accordingly enhanced resource isolation. However, current studies have paid little attention to the isolation problems of the page cache, which is a key resource for containers. Containers leverage the memory cgroup to control page cache usage. Unfortunately, the existing policy introduces two major problems in a container-based environment. First, containers can utilize more memory than the limit set by their cgroup, effectively breaking memory isolation. Second, the OS kernel has to evict page cache to make space for newly arrived memory requests, slowing down containerized applications. This paper performs an empirical study of these problems and demonstrates their performance impacts on containerized applications. We then propose pCache (precise control of page cache) to address the problems by dividing the page cache into private and shared parts and controlling both kinds separately and precisely. To do so, pCache leverages two new techniques: fair account (f-account) and evict on demand (EoD). F-account splits the shared page cache charging based on per-container shares to prevent containers from using memory for free, enhancing memory isolation. EoD reduces unnecessary page cache evictions to avoid the performance impacts. The evaluation results demonstrate that our system can effectively enhance memory isolation for containers and achieve substantial performance improvement over the original page cache management policy.
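A minimal sketch of the proportional-charging idea behind f-account, under a simplified model in which every sharer of a page is charged an equal fraction of its size; the real kernel-level accounting in pCache is more involved.

```python
# Minimal sketch, assuming a simplified model: f-account-style charging that
# splits the cost of each shared page-cache page across the containers
# sharing it, so no container uses shared cache "for free".
from collections import defaultdict

def charge_shared_pages(page_sharers, page_size=4096):
    """page_sharers: {page_id: set of container ids accessing the page}."""
    charged = defaultdict(int)
    for page, sharers in page_sharers.items():
        share = page_size // len(sharers)        # equal per-container share
        for container in sharers:
            charged[container] += share
    return dict(charged)

usage = charge_shared_pages({
    "pg0": {"c1", "c2"},        # shared by two containers -> 2 KB charged to each
    "pg1": {"c1"},              # private to c1 -> full 4 KB charged to c1
})
print(usage)
```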
The sixth-generation (6G) wireless communication system is envisioned to be capable of providing highly dependable services by integrating native reliable and trustworthy functionalities. Zero-trust vehicular networks are one of the typical scenarios for 6G dependable services. Under the technical framework of vehicle-and-roadside collaboration, more and more on-board devices and roadside infrastructures will communicate for information exchange. The reliability and security of the vehicle-and-roadside collaboration will directly affect transportation safety. Considering a zero-trust vehicular environment, to prevent malicious vehicles from uploading false or invalid information, we propose a malicious vehicle identity disclosure approach based on the Shamir secret sharing scheme. Meanwhile, a two-layer consortium blockchain architecture and smart contracts are designed to protect the identity and privacy of benign vehicles as well as the security of their private data. After that, in order to improve the efficiency of vehicle identity disclosure, we present an inspection policy based on zero-sum game theory and a roadside unit incentive mechanism that jointly uses contract theory and a subjective logic model. We verify the performance of the entire zero-trust solution through extensive simulation experiments. While preserving vehicle privacy, our solution is demonstrated to significantly improve the reliability and security of 6G vehicular networks.
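The disclosure idea rests on standard (t, n) Shamir secret sharing: a vehicle's pseudonymous identity can only be reconstructed when at least t authorized shares are combined. The Python sketch below shows that mechanism with toy parameters; it is not the paper's concrete protocol or parameter choice.

```python
# A minimal (t, n) Shamir secret-sharing sketch over a prime field,
# illustrating identity disclosure: any t of n shares reconstruct the
# escrowed identity, fewer reveal nothing. Toy parameters only.
import random

P = 2**127 - 1  # prime modulus (Mersenne prime)

def make_shares(secret, t, n):
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    evaluate = lambda x: sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, evaluate(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

vehicle_id = 123456789                       # pseudonymous identity to escrow
shares = make_shares(vehicle_id, t=3, n=5)   # any 3 of 5 shares can disclose it
assert reconstruct(shares[:3]) == vehicle_id
print("disclosed identity:", reconstruct(shares[1:4]))
```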
In this paper, we propose a novel warm restart technique using a new logarithmic step size for the stochastic gradient descent (SGD) approach. For smooth and non-convex functions, we establish an
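As a rough illustration of the idea of warm restarts on top of a logarithmically decaying step size, the sketch below restarts a simple logarithmic schedule at the start of every cycle. The concrete decay formula is an assumption made for illustration and is not claimed to be the paper's definition.

```python
# Illustrative sketch only: a warm-restart schedule built on a logarithmic
# step-size decay. The decay formula below is an assumption for illustration,
# not the paper's exact step size.
import math

def log_step_size(t, T, eta0=0.1):
    # decays from eta0 toward 0 as t goes from 1 to T, with a logarithmic shape
    return eta0 * (1.0 - math.log(t) / math.log(T + 1))

def warm_restart_schedule(num_restarts, cycle_len, eta0=0.1):
    # restart the decay from eta0 at the beginning of every cycle
    for _ in range(num_restarts):
        for t in range(1, cycle_len + 1):
            yield log_step_size(t, cycle_len, eta0)

for step, eta in enumerate(warm_restart_schedule(num_restarts=2, cycle_len=5)):
    print(step, round(eta, 4))
```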
Multiparty private set intersection (PSI) allows several parties, each holding a set of elements, to jointly compute the intersection without leaking any additional information. With the development of cloud computing, PSI has a wide range of applications in privacy protection. However, it is complex to build an efficient and reliable scheme to protect user privacy.
To address this issue, we propose EMPSI, an efficient PSI (with cardinality) protocol in a multiparty setting. EMPSI avoids heavy cryptographic primitives (relying mainly on symmetric-key encryption) to achieve better performance. In addition, both the PSI and PSI-with-cardinality variants of EMPSI are secure against semi-honest adversaries and tolerate any number of colluding clients, as long as at least one client is honest. We also conduct experiments to compare EMPSI with some state-of-the-art works. The experimental results show that the proposed EMPSI(-CA) has better performance and is scalable in the number of clients and the set size.
Heterogeneous information networks (HINs) have recently been widely adopted to describe complex graph structures in recommendation systems, proving their effectiveness in modeling complex graph data. Although existing HIN-based recommendation studies have achieved great success by performing message propagation between connected nodes on the defined metapaths, they have the following major limitations. Existing works mainly convert heterogeneous graphs into homogeneous graphs by defining metapaths, which are not expressive enough to capture the more complicated dependency relationships along the metapath. Besides, the heterogeneous information is more likely to be provided by item attributes, while social relations between users are not adequately considered. To tackle these limitations, we propose a novel social recommendation model, MPISR, which models MetaPath Interaction for Social Recommendation on a heterogeneous information network. Specifically, our model first learns the initial node representations through a pretraining module, and then identifies potential social friends and item relations based on their similarity to construct a unified HIN. We then develop a two-way encoder module with a similarity encoder and an instance encoder to capture the similarity collaborative signals and relational dependencies on different metapaths. Extensive experiments on five real datasets demonstrate the effectiveness of our method.
Deep learning methods based on syntactic dependency trees have achieved great success on Aspect-based Sentiment Analysis (ABSA). However, the accuracy of the dependency parser cannot be guaranteed, which may keep aspect words away from their related opinion words in a dependency tree. Moreover, few models incorporate external affective knowledge for ABSA. Based on this, we propose a novel architecture to tackle the above two limitations and fill the gap in applying heterogeneous graph convolutional networks to ABSA. Specifically, we employ affective knowledge as sentiment nodes to augment the representation of words. Then, sentiment nodes, which have different attributes from word nodes, are linked to word nodes through specific edges to form a heterogeneous graph based on the dependency tree. Finally, we design a multi-level semantic heterogeneous graph convolutional network (Semantic-HGCN) to encode the heterogeneous graph for sentiment prediction. Extensive experiments are conducted on the SemEval 2014 Task 4, SemEval 2015 Task 12, SemEval 2016 Task 5, and ACL 14 Twitter datasets. The experimental results show that our method achieves state-of-the-art performance.
Hybrid memory systems composed of dynamic random access memory (DRAM) and non-volatile memory (NVM) often exploit page migration technologies to take full advantage of the different memory media. Most previous proposals migrate data at a granularity of 4 KB pages, and thus waste memory bandwidth and DRAM resources. In this paper, we propose Mocha, a non-hierarchical architecture that organizes DRAM and NVM in a flat address space physically but manages them in a cache/memory hierarchy. Since the commercial NVM device, the Intel Optane DC Persistent Memory Module (DCPMM), actually accesses the physical media at a granularity of 256 bytes (an Optane block), we manage the DRAM cache at a 256-byte granularity to adapt to this feature of Optane. This design not only enables fine-grained data migration and management for the DRAM cache, but also avoids write amplification for Intel Optane DCPMM. We also create an Indirect Address Cache (IAC) in the Hybrid Memory Controller (HMC) and propose a reverse address mapping table in the DRAM to speed up address translation and cache replacement. Moreover, we exploit a utility-based caching mechanism to filter cold blocks in the NVM and further improve the efficiency of the DRAM cache. We implement Mocha in an architectural simulator. Experimental results show that Mocha can improve application performance by 8.2% on average (up to 24.6%), and reduce energy consumption by 6.9% and data migration traffic by 25.9% on average, compared with HSCC, a typical hybrid memory architecture.
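A minimal sketch of caching NVM data in DRAM at 256-byte block granularity, using a simple LRU policy; the IAC, the reverse address mapping table, and the utility-based filter described above are not modeled here.

```python
# Minimal sketch, assuming a simplified model: a DRAM cache of NVM data
# managed at 256-byte ("Optane block") granularity with LRU replacement.
from collections import OrderedDict

BLOCK = 256  # bytes, matching Optane DCPMM's internal access granularity

class BlockDramCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()      # nvm_block_addr -> data (LRU order)

    def access(self, nvm_addr, read_nvm):
        block_addr = nvm_addr & ~(BLOCK - 1)        # align to 256-byte block
        if block_addr in self.cache:                # DRAM hit
            self.cache.move_to_end(block_addr)
            return self.cache[block_addr]
        data = read_nvm(block_addr)                 # DRAM miss: fetch 256 B only
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)          # evict LRU block
        self.cache[block_addr] = data
        return data

cache = BlockDramCache(capacity_blocks=2)
fake_nvm = lambda addr: b"\x00" * BLOCK
for addr in (0, 300, 0, 1024):
    cache.access(addr, fake_nvm)
print(list(cache.cache.keys()))   # 256-byte blocks currently cached in DRAM
```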
Significant progress has been made in machine learning with large amounts of clean labels and static data. However, in many real-world applications, the data often changes with time and it is difficult to obtain massive clean annotations; that is, noisy labels and time series are faced simultaneously. For example, in product-buyer evaluation, each sample records users' daily behavior over time, but the long transaction period brings difficulties to analysis, and salespeople often erroneously annotate users' purchase behavior. Such a novel setting, to our best knowledge, has not been thoroughly studied yet, and there is still a lack of effective machine learning methods. In this paper, we present RTS, a systematic approach studied both theoretically and empirically, consisting of two components: Noise-Tolerant Time Series Representation and Purified Oversampling Learning. Specifically, we propose reducing the destructive impact of label noise to obtain robust feature representations and potentially clean samples. Then, a novel learning method based on the purified data and time series oversampling is adopted to train an unbiased model. Theoretical analysis proves that our proposal can improve the quality of the noisy dataset. Empirical experiments on diverse tasks, such as the house-buyer evaluation task from real-world applications and various benchmark tasks, clearly demonstrate that our new algorithm robustly outperforms many competitive methods.
Many practical applications exhibit knowledge evolution, i.e., the continuous birth of new knowledge whose formation is influenced by the structure of historical knowledge. This observation gives rise to evolving knowledge graphs, whose structure grows over time. However, both the modal characterization and the algorithmic implementation of evolving knowledge graphs remain unexplored. To this end, we propose EvolveKG – a general framework that enables algorithms for static knowledge graphs to learn evolving ones. EvolveKG quantifies the influence of a historical fact on a current one, called the effectiveness of the fact, and makes knowledge predictions by leveraging all the cross-time knowledge interactions. The novelty of EvolveKG lies in the Derivative Graph – a weighted snapshot of evolution at a certain time. Particularly, each weight quantifies knowledge effectiveness through a temporally decaying function of consistency and attenuation, two proposed factors depicting whether or not the effectiveness of a fact fades away with time. Besides, considering both knowledge creation and loss, we obtain higher prediction accuracy when the effectiveness of all the facts increases with time or remains unchanged. On four real datasets, the superiority of EvolveKG is confirmed in prediction accuracy.
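As a rough illustration of a temporally decaying effectiveness weight shaped by consistency and attenuation, the sketch below uses an exponential decay. The concrete functional form is an assumption for illustration, not EvolveKG's definition.

```python
# Illustrative sketch only: a temporally decaying "effectiveness" weight for
# a historical fact, driven by two factors named in the abstract (consistency
# and attenuation). The exponential form is an assumed stand-in.
import math

def effectiveness(age, consistency, attenuation):
    """age: time elapsed since the fact was created (>= 0)
    consistency in [0, 1]: how consistent the fact stays with newer knowledge
    attenuation >= 0: how fast its influence fades (0 means no fading)"""
    return consistency * math.exp(-attenuation * age)

# a fully consistent, non-fading fact keeps its effectiveness; others decay
for age in (0, 1, 5, 10):
    print(age,
          round(effectiveness(age, consistency=0.9, attenuation=0.3), 3),
          round(effectiveness(age, consistency=1.0, attenuation=0.0), 3))
```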
Deterministic databases are able to reduce coordination costs in a replicated setting. This property has fostered significant interest in the design of efficient deterministic concurrency control protocols. However, Aria, the state-of-the-art deterministic concurrency control protocol, has three issues. First, it is impractical to configure a suitable batch size when the read-write set is unknown. Second, Aria running in low-concurrency scenarios, e.g., a single-thread scenario, suffers from the same conflicts as it does in high-concurrency scenarios. Third, the single-version schema brings write-after-write conflicts.
To address these issues, we propose Gria, an efficient deterministic concurrency control protocol. Gria has the following properties. First, the batch size of Gria is auto-scaling. Second, Gria's conflict probability in low-concurrency scenarios is lower than that in high-concurrency scenarios. Third, Gria has no write-after-write conflicts because it adopts a multi-version structure. To further reduce conflicts, we propose two optimizations: a reordering mechanism and a rechecking strategy. The evaluation results on two popular benchmarks show that Gria outperforms Aria by 13x.
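A minimal sketch of why a multi-version structure removes write-after-write conflicts: each write appends a new version instead of overwriting, so two writes to the same key can both commit. Gria's visibility and commit rules are not modeled here; the snapshot-based read is a simplifying assumption.

```python
# Minimal sketch, assuming a simplified model: a multi-version record where
# each write appends a new version instead of overwriting, so writes to the
# same key no longer abort each other as write-after-write conflicts.
from collections import defaultdict

class MultiVersionStore:
    def __init__(self):
        self.versions = defaultdict(list)   # key -> [(commit_ts, value), ...]

    def write(self, key, value, commit_ts):
        self.versions[key].append((commit_ts, value))   # never overwrite

    def read(self, key, snapshot_ts):
        # return the latest version visible at snapshot_ts
        visible = [(ts, v) for ts, v in self.versions[key] if ts <= snapshot_ts]
        return max(visible)[1] if visible else None

store = MultiVersionStore()
store.write("x", 1, commit_ts=10)
store.write("x", 2, commit_ts=11)   # a second write commits as a new version
print(store.read("x", snapshot_ts=11))
```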
Federated learning (FL) has emerged to break data silos and protect clients' privacy in the field of artificial intelligence. However, the deep leakage from gradients (DLG) attack can fully reconstruct clients' data from the submitted gradients, which threatens the fundamental privacy of FL. Although cryptography and differential privacy can prevent privacy leakage from gradients, they incur extra communication overhead or degrade model performance. Moreover, the original distribution of the local gradients is changed in these schemes, which makes it difficult to defend against adversarial attacks. In this paper, we propose a novel federated learning framework with model decomposition, aggregation, and assembling (FedDAA), along with a training algorithm, to train the federated model, where the local gradient is decomposed into multiple blocks that are sent to different proxy servers for aggregation. To bring better privacy protection performance to FedDAA, an indicator is designed based on image structural similarity to measure privacy leakage under the DLG attack, and an optimization method is given to protect privacy with the fewest proxy servers. In addition, we give defense schemes against adversarial attacks in FedDAA and design an algorithm to verify the correctness of the aggregated results. Experimental results demonstrate that FedDAA can reduce the structural similarity between the reconstructed image and the original image to 0.014 while maintaining model convergence accuracy at 0.952, thus achieving the best privacy protection performance and model training effect. More importantly, the defense schemes against adversarial attacks are compatible with privacy protection in FedDAA, and the defense effects are not weaker than those in traditional FL. Moreover, the verification algorithm for aggregated results introduces only negligible overhead to FedDAA.
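A minimal sketch of the decompose/aggregate/assemble flow, under a simplified model: each client splits its flattened gradient into blocks, each proxy averages one block position across clients, and the aggregated blocks are reassembled. The privacy indicator, defense schemes, and verification algorithm are not modeled.

```python
# Minimal sketch, assuming a simplified model of a decompose/aggregate/
# assemble flow: clients split flattened gradients into blocks, each proxy
# server aggregates one block position across clients, and the results are
# assembled back into a full aggregated gradient.
import numpy as np

def decompose(grad, num_blocks):
    return np.array_split(grad, num_blocks)

def proxy_aggregate(client_blocks):
    # one proxy sees the same block index from every client and averages it
    return np.mean(client_blocks, axis=0)

def assemble(aggregated_blocks):
    return np.concatenate(aggregated_blocks)

clients = [np.random.randn(10) for _ in range(4)]       # toy local gradients
num_proxies = 3
blocks_per_client = [decompose(g, num_proxies) for g in clients]
aggregated = [proxy_aggregate([bc[i] for bc in blocks_per_client])
              for i in range(num_proxies)]
global_grad = assemble(aggregated)
assert np.allclose(global_grad, np.mean(clients, axis=0))
print(global_grad.shape)
```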
Commonsense question answering (CQA) requires understanding and reasoning over the QA context and related commonsense knowledge, such as a structured knowledge graph (KG). Existing studies combine language models and graph neural networks to model inference. However, traditional knowledge graphs are mostly concept-based, ignoring the direct path evidence necessary for accurate reasoning. In this paper, we propose MRGNN (Meta-path Reasoning Graph Neural Network), a novel model that comprehensively captures sequential semantic information from concepts and paths. In MRGNN, meta-paths are introduced as direct inference evidence, and an original graph neural network is adopted to aggregate features from both concepts and paths simultaneously. We conduct extensive experiments on the CommonsenseQA and OpenBookQA datasets, showing the effectiveness of MRGNN. We also conduct further ablation experiments and explain the reasoning behavior through a case study.
In recent years, the demand for real-time data processing has been increasing, and various stream processing systems have emerged. When the amount of data input to a stream processing system fluctuates, the computing resources required by the stream processing job also change. The resources used by stream processing jobs therefore need to be adjusted according to load changes to avoid wasting computing resources. At present, existing works adjust stream processing jobs based on the assumption that there is a linear relationship between operator parallelism and operator resource consumption (e.g., throughput), which leads to significant deviations as the operator parallelism increases. This paper proposes a nonlinear model to represent operator performance. We divide operator performance into three stages: the Non-competition stage, the Non-full competition stage, and the Full competition stage. Using our proposed performance model, given the parallelism of an operator, we can accurately predict its CPU utilization and throughput. In actual experiments, the prediction error of our model is below 5%. We also propose a quick accurate auto-scaling (QAAS) method that uses the operator performance model to implement auto-scaling of the operator parallelism of Flink jobs. Compared to previous work, QAAS is able to maintain stable job performance under load changes, minimizing the number of job adjustments and reducing data backlogs by 50%.
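A rough illustration of a three-stage piecewise throughput model as a function of operator parallelism. The breakpoints and coefficients below are assumptions for illustration, not the paper's fitted model.

```python
# Illustrative sketch only: a three-stage piecewise performance model in the
# spirit of the abstract (non-competition, non-full competition, full
# competition). Breakpoints and formulas are assumed, not fitted.
def predicted_throughput(parallelism, per_task_rate, cores):
    if parallelism <= cores // 2:
        # non-competition: throughput grows linearly with parallelism
        return parallelism * per_task_rate
    if parallelism <= cores:
        # non-full competition: contention slows the marginal gain
        base = (cores // 2) * per_task_rate
        return base + 0.6 * per_task_rate * (parallelism - cores // 2)
    # full competition: CPU is saturated, throughput stops scaling
    return predicted_throughput(cores, per_task_rate, cores)

for p in (1, 2, 4, 6, 8, 12):
    print(p, predicted_throughput(p, per_task_rate=100.0, cores=8))
```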
Recent years have seen the wide application of natural language processing (NLP) models in crucial areas such as finance, medical treatment, and news media, raising concerns about model robustness and vulnerabilities. We find that the prompt paradigm can probe special robustness defects of pre-trained language models. Malicious prompt texts are first constructed for inputs, and a pre-trained language model can then generate adversarial examples for victim models via mask-filling. Experimental results show that the prompt paradigm can efficiently generate more diverse adversarial examples beyond synonym substitution. We then propose a novel robust training approach based on the prompt paradigm, which incorporates prompt texts as alternatives to adversarial examples and enhances robustness under a lightweight minimax-style optimization framework. Experiments on three real-world tasks and two deep neural models show that our approach can significantly improve the robustness of models against adversarial attacks.
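A minimal sketch of mask-filling-based candidate generation using the Hugging Face fill-mask pipeline. The masked sentence below is a simplification for illustration, not the paper's prompt construction, and in the actual attack the candidates would still be filtered against a victim model.

```python
# Minimal sketch: use a pre-trained masked language model to propose
# replacement candidates via mask-filling. This only generates candidates;
# the paper's full attack and prompt construction are not reproduced here.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

original = "the movie was surprisingly good"
masked = "the movie was surprisingly [MASK]"   # mask one position
candidates = fill_mask(masked, top_k=5)

for cand in candidates:
    adv_text = cand["sequence"]
    # in the real attack, keep adv_text only if it flips the victim model's
    # prediction while staying semantically close to the original input
    print(round(cand["score"], 3), adv_text)
```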
Using all samples in the optimization process does not produce robust results on datasets with label noise, because the gradients calculated from the losses of the noisy samples push the optimization in the wrong direction. In this paper, we recommend using only the samples whose loss is below a threshold determined during optimization, instead of using all samples in the mini-batch. Our proposed method, Adaptive-k, aims to exclude label-noise samples from the optimization process and make the process robust. On noisy datasets, we found that a threshold-based approach such as Adaptive-k produces better results than using all samples or a fixed number of low-loss samples in the mini-batch. On the basis of our theoretical analysis and experimental results, we show that the Adaptive-k method is closest to the performance of the Oracle, in which noisy samples are entirely removed from the dataset. Adaptive-k is a simple but effective method. It does not require prior knowledge of the noise ratio of the dataset, does not require additional model training, and does not increase training time significantly. In the experiments, we also show that Adaptive-k is compatible with different optimizers such as SGD, SGDM, and Adam. The code for Adaptive-k is available on GitHub.
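A minimal sketch of the per-mini-batch selection step: compute per-sample losses, keep only samples below a threshold, and back-propagate the loss of the kept samples. The simple mean-based threshold is an illustrative stand-in, not Adaptive-k's exact adaptive rule.

```python
# Minimal sketch of threshold-based sample selection in a mini-batch:
# exclude high-loss (likely noisy) samples before back-propagation.
# The mean-based threshold is a placeholder, not Adaptive-k's rule.
import torch

def select_low_loss(losses: torch.Tensor) -> torch.Tensor:
    threshold = losses.mean()                # placeholder adaptive threshold
    mask = losses < threshold
    if mask.sum() == 0:                      # never drop the whole batch
        mask = torch.ones_like(mask)
    return mask

per_sample_losses = torch.tensor([0.2, 0.3, 2.5, 0.1, 3.0])  # noisy samples stand out
mask = select_low_loss(per_sample_losses)
robust_loss = per_sample_losses[mask].mean()  # back-propagate this instead
print(mask.tolist(), robust_loss.item())
```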