Jun 2024, Volume 18 Issue 3
    

  • Architecture
  • RESEARCH ARTICLE
    Xianghao XU, Fang WANG, Hong JIANG, Yongli CHENG, Dan FENG, Peng FANG

    To analyze and process large graphs cost-efficiently, researchers have in recent years developed a number of out-of-core graph processing systems that run on a single commodity computer. At the same time, the rapidly growing need to analyze real-world graphs means that graph processing systems must efficiently handle massive numbers of concurrent graph processing (CGP) jobs. Unfortunately, because they are inherently designed for a single graph processing job, existing out-of-core graph processing systems usually incur unnecessary data accesses and severe contention for I/O bandwidth when handling CGP jobs. In this paper, we propose GraphCP, a disk I/O optimized out-of-core graph processing system that efficiently supports the processing of CGP jobs. GraphCP proposes a benefit-aware sharing execution model that shares the I/O access and processing of graph data among the CGP jobs and adaptively schedules graph data loading based on the states of vertices, which efficiently overcomes the above challenges faced by existing out-of-core graph processing systems. Moreover, GraphCP adopts a dependency-based future-vertex updating model to reduce disk I/Os in future iterations. In addition, GraphCP organizes the graph data with a Source-Sorted Sub-Block graph representation for better processing capacity and I/O access locality. Extensive evaluation results show that GraphCP is 20.5× and 8.9× faster than the out-of-core graph processing systems GridGraph and GraphZ, and 3.5× and 1.7× faster than the state-of-the-art concurrent graph processing systems Seraph and GraphSO.
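
    To make the sharing idea concrete, here is a minimal sketch, assuming a hypothetical load_block disk reader and per-job process hook, of loading each graph block once per pass and dispatching it to every job that needs it, rather than once per job. It illustrates the general pattern only, not GraphCP's benefit-aware scheduler.

```python
# A minimal sketch of shared block loading for concurrent jobs; load_block(b)
# and process() are hypothetical stand-ins, not GraphCP's implementation.
from collections import defaultdict

def process(job_id, edges):
    pass  # placeholder for the job-specific edge function (e.g., a PageRank step)

def shared_pass(jobs, num_blocks, load_block):
    """jobs: mapping job_id -> set of block ids whose vertices are active."""
    consumers = defaultdict(list)          # block id -> jobs that need it
    for job_id, needed in jobs.items():
        for b in needed:
            consumers[b].append(job_id)
    io_count = 0
    for b in range(num_blocks):
        if not consumers[b]:
            continue                       # no job needs this block: skip the I/O
        edges = load_block(b)              # one disk read, shared by all consumers
        io_count += 1
        for job_id in consumers[b]:
            process(job_id, edges)
    return io_count

# Toy usage: three jobs over four blocks -> 4 shared loads instead of 9 per-job ones.
jobs = {"pagerank": {0, 1, 2}, "wcc": {0, 2, 3}, "sssp": {1, 2, 3}}
print(shared_pass(jobs, 4, load_block=lambda b: []))  # 4
```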

  • RESEARCH ARTICLE
    Tiezheng GUO, Zhiwei ZHANG, Ye YUAN, Xiaochun YANG, Guoren WANG

    With the development of information technology and cloud computing, data sharing has become an important part of scientific research. In traditional data sharing, data is stored on a third-party storage platform, which causes the owner to lose control of it. As a result, third parties may intentionally leak or tamper with the data, and the private information it contains can lead to even more serious problems. Furthermore, data is frequently maintained on multiple storage platforms, which makes it difficult to enlist multiple parties in data sharing while maintaining consistency. In this work, we propose a new architecture for applying blockchains to data sharing and achieve efficient and reliable data sharing among heterogeneous blockchains. We design a new data sharing transaction mechanism based on the system architecture to protect the security of the raw data and of the processing procedure. We also design and implement a hybrid concurrency control protocol to overcome the issues caused by the large differences in blockchain performance in our system and to improve the success rate of data sharing transactions. Taking Ethereum and Hyperledger Fabric as examples, we conducted cross-blockchain data sharing experiments. The results show that our system achieves data sharing across heterogeneous blockchains with reasonable performance and has high scalability.
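
    The paper's concurrency control protocol is not reproduced here, but the coordination problem it addresses can be pictured with a generic two-phase pattern: prepare the transaction on both ledgers, commit only if both prepared, otherwise abort and retry. A hedged sketch, with a hypothetical ChainClient interface standing in for real chain endpoints:

```python
# A generic prepare/commit sketch across two ledgers of very different speeds.
# This is a textbook two-phase pattern, not the paper's hybrid protocol.
class ChainClient:
    """Hypothetical minimal client interface for a blockchain endpoint."""
    def __init__(self, name, fail_prepare=False):
        self.name, self.fail_prepare = name, fail_prepare
    def prepare(self, tx):      # e.g., lock the record / stage a contract call
        return not self.fail_prepare
    def commit(self, tx):
        print(f"{self.name}: committed {tx}")
    def abort(self, tx):
        print(f"{self.name}: aborted {tx}")

def share_data(tx, chains, retries=2):
    for _ in range(retries + 1):
        prepared = []
        for c in chains:
            if not c.prepare(tx):
                break
            prepared.append(c)
        if len(prepared) == len(chains):    # everyone staged: safe to commit
            for c in chains:
                c.commit(tx)
            return True
        for c in prepared:                  # roll back partial state, then retry
            c.abort(tx)
    return False

share_data({"hash": "0xabc", "owner": "A"},
           [ChainClient("ethereum"), ChainClient("fabric")])
```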

  • Software
  • LETTER
    Qianwen GOU, Yunwei DONG, YuJiao WU, Qiao KE
  • Artificial Intelligence
  • LETTER
    Xiran SONG, Hong HUANG, Jianxun LIAN, Hai JIN
  • LETTER
    Weixin LI, Yuhao WU, Yang LIU, Weike PAN, Zhong MING
  • RESEARCH ARTICLE
    Jing ZHANG, Ruidong FAN, Hong TAO, Jiacheng JIANG, Chenping HOU

    Clustering is widely exploited in data mining. It has been shown that embedding weak label priors into clustering effectively promotes its performance. Previous research has mainly focused on only one type of prior. However, in many real scenarios, two kinds of weak label prior information, e.g., pairwise constraints and cluster ratio, are easily obtained or already available. How to incorporate them to improve clustering performance is important but rarely studied. We propose a novel constrained Clustering with Weak Label Prior method (CWLP), which is an integrated framework. Within the unified spectral clustering model, the pairwise constraints are employed as a regularizer in spectral embedding, and the label proportion is added as a constraint in spectral rotation. To approximate a variant of the embedding matrix more precisely, we replace the cluster indicator matrix with its scaled version. Instead of fixing an initial similarity matrix, we propose a new similarity matrix that is more suitable for deriving clustering results. In addition to theoretical convergence and computational complexity analyses, we validate the effectiveness of CWLP on several benchmark datasets, together with its ability to discriminate suspected breast cancer patients from healthy controls. The experimental evaluation illustrates the superiority of our proposed approach.
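
    As background for readers unfamiliar with the recipe, the numpy sketch below shows the textbook form of constraint-regularized spectral clustering: a Gaussian similarity matrix adjusted by must-link/cannot-link pairs, a Laplacian embedding, and k-means. The kernel width sigma and trade-off lam are illustrative knobs; CWLP's actual optimization (scaled indicator matrix, spectral rotation with a ratio constraint, learned similarity) is more involved.

```python
# A sketch of textbook constraint-regularized spectral clustering, not CWLP itself.
import numpy as np
from sklearn.cluster import KMeans

def constrained_spectral(X, k, must, cannot, sigma=1.0, lam=1.0):
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))              # Gaussian similarity
    Q = np.zeros((n, n))                            # weak-label prior
    for i, j in must:
        Q[i, j] = Q[j, i] = 1.0                     # reward putting i, j together
    for i, j in cannot:
        Q[i, j] = Q[j, i] = -1.0                    # penalize putting i, j together
    M = W + lam * Q                                 # constraint-adjusted affinity
    L = np.diag(M.sum(1)) - M                       # graph Laplacian
    _, vecs = np.linalg.eigh(L)
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs[:, :k])

# Toy usage: two well-separated blobs plus one must-link and one cannot-link pair.
X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 4.0])
print(constrained_spectral(X, 2, must=[(0, 1)], cannot=[(0, 19)]))
```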

  • RESEARCH ARTICLE
    Na LIU, Fan ZHANG, Liang CHANG, Fuqing DUAN

    Face attribute classification (FAC) is a high-profile problem in biometric verification and face retrieval. Although recent research has been devoted to extracting more delicate image attribute features and exploiting inter-attribute correlations, significant challenges remain. The wavelet scattering transform (WST) is a promising non-learned feature extractor. It has been shown to yield more discriminative representations and to outperform learned representations in certain tasks. Applied to image classification, WST can enhance subtle image texture information and provide stability to local deformations. This paper designs a scattering-based hybrid block, termed the WS-SE block, that incorporates frequency-domain (WST) and image-domain features in a channel attention manner (Squeeze-and-Excitation, SE). Compared with a CNN, WS-SE achieves more efficient FAC performance and compensates for the model's sensitivity to small-scale affine transforms. In addition, to further exploit the relationships among the attribute labels, we propose a learning strategy from a causal view. Cause attributes, defined using causality-related information, can be utilized to infer effect attributes with a high confidence level. Ablative analysis experiments demonstrate the effectiveness of our model, and our hybrid model obtains state-of-the-art results on two public datasets.
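
    The channel attention side of such a block can be pictured with a short PyTorch sketch: concatenate an image-domain branch and a frequency-domain branch along channels, then let a standard Squeeze-and-Excitation MLP reweight the combined channels. Layer sizes and the reduction ratio are illustrative, not the paper's WS-SE configuration.

```python
# An SE-style fusion of two feature branches; shapes are illustrative only.
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    def __init__(self, c_img, c_freq, reduction=4):
        super().__init__()
        c = c_img + c_freq
        self.fc = nn.Sequential(                 # squeeze-and-excitation MLP
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c), nn.Sigmoid())
    def forward(self, f_img, f_freq):
        x = torch.cat([f_img, f_freq], dim=1)    # concat along channels
        s = x.mean(dim=(2, 3))                   # squeeze: global average pool
        w = self.fc(s)[:, :, None, None]         # excitation: channel weights
        return x * w                             # reweight both domains jointly

# Toy usage: batch of 2, 16 image channels + 8 frequency channels, 32x32 maps.
fusion = SEFusion(16, 8)
out = fusion(torch.randn(2, 16, 32, 32), torch.randn(2, 8, 32, 32))
print(out.shape)  # torch.Size([2, 24, 32, 32])
```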

  • RESEARCH ARTICLE
    Sheng XU, Peifeng LI, Qiaoming ZHU

    The discourse analysis task, which focuses on understanding the semantics of long text spans, has received increasing attention in recent years. As a critical component of discourse analysis, discourse relation recognition aims to identify the rhetorical relations between adjacent discourse units (e.g., clauses, sentences, and sentence groups), called arguments, in a document. Previous works focused on capturing the semantic interactions between arguments to recognize their discourse relations, ignoring important textual information in the surrounding contexts. However, in many cases, capturing the semantic interactions between the texts of the two arguments alone is not enough to identify their rhetorical relations; more contextual clues must be mined. In this paper, we propose a method that converts the RST-style discourse trees in the training set into dependency-based trees and trains a contextual evidence selector on these transformed structures. In this way, the selector learns to automatically pick critical textual information from the context (i.e., evidence) for argument pairs to assist in discriminating their relations. We then encode the arguments concatenated with the corresponding evidence to obtain enhanced argument representations. Finally, we combine the original and enhanced argument representations to recognize their relations. In addition, we introduce auxiliary tasks to guide the training of the evidence selector and strengthen its selection ability. Experimental results on the Chinese CDTB dataset show that our method outperforms several state-of-the-art baselines in both micro and macro F1 scores.
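
    The select-then-encode step can be summarized schematically. In the sketch below, score, encode, and classify are stand-ins for the trained selector, the argument encoder, and the relation classifier; the toy scorer in the usage example just counts word overlap.

```python
# A schematic sketch of select-then-encode; all callables are hypothetical stubs.
def recognize_relation(arg1, arg2, context_sents, score, encode, classify, k=2):
    # Rank surrounding sentences as candidate evidence for this argument pair.
    ranked = sorted(context_sents, key=lambda s: score(arg1, arg2, s), reverse=True)
    evidence = ranked[:k]
    plain = encode(arg1, arg2)                      # original representation
    enhanced = encode(" ".join([arg1] + evidence),  # evidence-enhanced one
                      " ".join([arg2] + evidence))
    return classify(plain, enhanced)                # combine both views

# Toy usage with a word-overlap scorer and dummy encoder/classifier.
overlap = lambda a, b, s: len(set(s.split()) & set((a + " " + b).split()))
print(recognize_relation("sales fell sharply", "the firm cut costs",
                         ["profits shrank as sales fell", "it rained that day"],
                         score=overlap, encode=lambda a, b: (a, b),
                         classify=lambda p, e: ("causal", e)))
```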

  • RESEARCH ARTICLE
    Qi LIU, Qinghua ZHANG, Fan ZHAO, Guoyin WANG

    Uncertain Knowledge Graphs (UKGs) are used to characterize the inherent uncertainty of knowledge and have a richer semantic structure than deterministic knowledge graphs. Research on UKG embedding has only recently begun; the Uncertain Knowledge Graph Embedding (UKGE) model addresses the problem to some extent, but unresolved issues remain. On the one hand, when reasoning about the confidence of unseen relation facts, the probabilistic soft logic it introduces cannot combine multi-path and multi-step global information, leading to information loss. On the other hand, the existing UKG embedding model can only handle symmetric relation facts; the embedding of asymmetric relation facts has not been addressed. To address these issues, a Multiplex Uncertain Knowledge Graph Embedding (MUKGE) model is proposed in this paper. First, to combine multiple sources of information and achieve more accurate confidence reasoning, the Uncertain ResourceRank (URR) reasoning algorithm is introduced. Second, asymmetry in the UKG is defined, and a multi-relation embedding model is proposed to embed asymmetric relation facts. Finally, experiments are carried out on different datasets via four tasks to verify the effectiveness of MUKGE. The results demonstrate that MUKGE achieves better overall performance than the baselines, helping advance research on UKG embedding.
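
    To convey the flavor of multi-step, multi-path confidence reasoning (not the URR algorithm itself), the sketch below propagates resource from a head entity along confidence-weighted edges with a per-hop decay, so a tail reachable by several paths accumulates a higher score. The decay constant and step count are illustrative assumptions.

```python
# A toy resource-propagation sketch over confidence-weighted edges.
import numpy as np

def propagate_confidence(adj_conf, head, steps=3, decay=0.5):
    """adj_conf: n x n matrix, adj_conf[i, j] = confidence of fact (i -> j)."""
    n = adj_conf.shape[0]
    row_sum = adj_conf.sum(1, keepdims=True)
    P = np.divide(adj_conf, row_sum, out=np.zeros_like(adj_conf),
                  where=row_sum > 0)                 # row-normalized transitions
    r = np.zeros(n); r[head] = 1.0                   # all resource starts at the head
    score = np.zeros(n)
    for t in range(1, steps + 1):
        r = r @ P                                    # one more hop
        score += (decay ** t) * r                    # longer paths count for less
    return score

A = np.array([[0.0, 0.9, 0.8, 0.0],
              [0.0, 0.0, 0.0, 0.7],
              [0.0, 0.0, 0.0, 0.6],
              [0.0, 0.0, 0.0, 0.0]])
print(propagate_confidence(A, head=0))  # node 3 scores high: two supporting paths
```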

  • RESEARCH ARTICLE
    Youming GE, Cong HUANG, Yubao LIU, Sen ZHANG, Weiyang KONG

    In this paper, we address the problem of unsupervised social network embedding, which aims to embed network nodes, including node attributes, into a latent low-dimensional space. Recent methods have proposed fusion mechanisms for node attributes and network structure and achieved impressive prediction performance. However, existing methods do not efficiently fuse the non-linear properties of node attributes and network structure, which is potentially helpful in learning a better network embedding. To this end, we propose a novel model called ASM (Adaptive Specific Mapping) based on the encoder-decoder framework. In the encoder, we use kernel mappings to capture the non-linear properties of both node attributes and network structure. In particular, we adopt two feature mapping functions, namely an untrainable function for node attributes and a trainable function for network structure. With these mapping functions, we obtain low-dimensional feature vectors for node attributes and network structure, respectively. We then design an attention layer that combines the two feature vectors and adaptively learns the node embedding. In the decoder, we adopt a reconstruction component to train the learning of node attributes and network structure. We conducted a set of experiments on seven real-world social network datasets. The experimental results verify the effectiveness and efficiency of our method in comparison with state-of-the-art baselines.
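
    One standard example of an untrainable non-linear kernel mapping is random Fourier features, which approximate an RBF kernel with a frozen random projection; the sketch below pairs it with a toy attention combination of two views. Both pieces are illustrative assumptions; the paper does not specify this particular kernel or scorer.

```python
# Random Fourier features (untrainable kernel map) plus a toy attention fusion.
import numpy as np

def random_fourier_features(X, dim=64, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], dim))  # frozen weights
    b = rng.uniform(0, 2 * np.pi, size=dim)
    return np.sqrt(2.0 / dim) * np.cos(X @ W + b)   # z(x), with z(x)·z(y) ≈ k(x, y)

def attention_fuse(h_attr, h_struct):
    """Convexly combine the two views with a per-node attention weight."""
    logits = np.stack([h_attr.sum(1), h_struct.sum(1)], axis=1)  # toy scorer
    a = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)    # softmax
    return a[:, :1] * h_attr + a[:, 1:] * h_struct

X = np.random.rand(5, 10)                # 5 nodes, 10 attributes
Z = random_fourier_features(X)           # non-linear, untrainable mapping
print(attention_fuse(Z, Z * 0.5).shape)  # (5, 64)
```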

  • RESEARCH ARTICLE
    Jiaqi LIU, Zhiwen YU, Bin GUO, Cheng DENG, Luoyi FU, Xinbing WANG, Chenghu ZHOU

    A great many practical applications exhibit knowledge evolution, i.e., the continuous birth of new knowledge, whose formation is influenced by the structure of historical knowledge. This observation gives rise to evolving knowledge graphs, whose structure grows over time. However, both the modeling and the algorithmic implementation of evolving knowledge graphs remain unexplored. To this end, we propose EvolveKG, a general framework that enables algorithms designed for static knowledge graphs to learn evolving ones. EvolveKG quantifies the influence of a historical fact on a current one, called the effectiveness of the fact, and makes knowledge predictions by leveraging all the cross-time knowledge interactions. The novelty of EvolveKG lies in the derivative graph, a weighted snapshot of evolution at a certain time. In particular, each weight quantifies knowledge effectiveness through a temporally decaying function of consistency and attenuation, two proposed factors depicting whether or not the effectiveness of a fact fades away with time. Besides, considering both knowledge creation and loss, we obtain higher prediction accuracy when the effectiveness of all the facts increases with time or remains unchanged. On four real datasets, the superiority of EvolveKG is confirmed in prediction accuracy.
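
    One hedged reading of such a weight is a consistency term damped by an exponential attenuation of elapsed time; with attenuation set to zero, effectiveness never fades. The constants below are illustrative only, not the paper's fitted function.

```python
# An illustrative decaying-effectiveness weight, assuming an exponential form.
import math

def effectiveness(t, t0, consistency=0.9, attenuation=0.1):
    """consistency in (0, 1]: agreement of the fact with current structure;
    attenuation >= 0: how fast its influence fades (0 = never fades)."""
    return consistency * math.exp(-attenuation * (t - t0))

# A fact observed 5 steps ago contributes less than a fresh one:
print(effectiveness(t=10, t0=10), effectiveness(t=10, t0=5))
```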

  • RESEARCH ARTICLE
    Lei LI, Chengyu WANG, Minghui QIU, Cen CHEN, Ming GAO, Aoying ZHOU

    BERT is a representative pre-trained language model that has drawn extensive attention for significant improvements in downstream Natural Language Processing (NLP) tasks. Its complex architecture and massive parameters give BERT competitive performance but also make inference slow. To speed up BERT inference, FastBERT realizes adaptive inference with an acceptable drop in accuracy, based on knowledge distillation and the early-exit technique. However, several factors may limit the performance of FastBERT, such as a teacher classifier that is not knowledgeable enough, batch size shrinkage, and redundant computation by student classifiers. To overcome these limitations, we propose a new BERT inference method with GPU-Efficient Exit Prediction (GEEP). GEEP leverages a shared exit loss to simplify the training process of FastBERT from two steps to one and makes the teacher classifier more knowledgeable by feeding diverse Transformer outputs to it. In addition, an exit-layer prediction technique is proposed that uses a GPU hash table to handle the token-level exit-layer distribution and to sort test samples by predicted exit layer. In this way, GEEP avoids batch size shrinkage and redundant computation by student classifiers. Experimental results on twelve public English and Chinese NLP datasets prove the effectiveness of the proposed approach. The source code of GEEP will be released to the public upon paper acceptance.
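
    The exit-layer prediction idea can be sketched with an ordinary dictionary standing in for the GPU hash table: record at which layer tokens tended to exit, predict a sample's exit layer from its tokens, and sort the batch by that prediction so samples exiting at the same depth run together. The token-averaging rule here is an assumption for illustration.

```python
# A dict-based sketch of exit-layer prediction and batch sorting (not GEEP's
# GPU hash table or its exact aggregation rule).
from collections import defaultdict

class ExitPredictor:
    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.table = defaultdict(list)          # token -> observed exit layers

    def record(self, tokens, exit_layer):       # collected during training
        for tok in tokens:
            self.table[tok].append(exit_layer)

    def predict(self, tokens):                  # used at inference time
        seen = [sum(v) / len(v) for tok in tokens if (v := self.table[tok])]
        return round(sum(seen) / len(seen)) if seen else self.num_layers

    def sort_batch(self, samples):
        return sorted(samples, key=self.predict)

p = ExitPredictor(num_layers=12)
p.record(["cheap", "easy"], 3)
p.record(["rare", "ambiguous"], 11)
print(p.sort_batch([["rare", "easy"], ["cheap", "easy"]]))  # shallow-exit sample first
```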

  • Theoretical Computer Science
  • RESEARCH ARTICLE
    Zhe YUAN, Zhewei WEI, Fangrui LV, Ji-Rong WEN

    Motif-based graph local clustering (MGLC) is a popular method for graph mining tasks due to its wide range of applications. However, the traditional two-phase approach of precomputing motif weights before performing local clustering loses locality and is impractical for large graphs. While some attempts have been made to address this efficiency bottleneck, there is still no applicable algorithm for large-scale graphs with billions of edges. In this paper, we propose a purely local and index-free method called Index-free Triangle-based Graph Local Clustering (TGLC*) to solve the MGLC problem w.r.t. a triangle. TGLC* directly estimates the Personalized PageRank (PPR) vector using random walks with the desired triangle-weighted distribution and obtains the clustering result using a standard sweep procedure. We demonstrate TGLC*'s scalability through theoretical analysis and its practical benefits through a novel visualization layout. TGLC* is the first algorithm to solve the MGLC problem without precomputing the motif weights. Extensive experiments on seven real-world large-scale datasets show that TGLC* is applicable and scalable for large graphs.
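
    The two ingredients named above have simple textbook forms, sketched below: Monte-Carlo PPR estimation by restart-terminated random walks whose steps are biased toward edges that close triangles, followed by a conductance sweep over the visit distribution. This illustrates the general pattern, not the TGLC* estimator or its guarantees.

```python
# Textbook triangle-biased PPR estimation plus a conductance sweep.
import random
from collections import Counter

def approx_ppr(G, seed, walks=2000, alpha=0.15):
    """G: node -> set of neighbors. Walk-endpoint counts estimate the PPR vector."""
    visits = Counter()
    for _ in range(walks):
        u = seed
        while random.random() > alpha:                  # restart w.p. alpha per step
            nbrs = list(G[u])
            if not nbrs:
                break
            w = [1 + len(G[u] & G[v]) for v in nbrs]    # favor triangle-closing edges
            u = random.choices(nbrs, weights=w)[0]
        visits[u] += 1
    return {v: c / walks for v, c in visits.items()}

def sweep(G, ppr):
    """Grow a prefix of nodes ordered by degree-normalized PPR; keep the best cut."""
    total_vol = sum(len(v) for v in G.values())
    order = sorted(ppr, key=lambda v: ppr[v] / len(G[v]), reverse=True)
    best, best_phi, S = set(), float("inf"), set()
    for v in order:
        S.add(v)
        cut = sum(1 for u in S for w in G[u] if w not in S)
        vol = sum(len(G[u]) for u in S)
        phi = cut / max(min(vol, total_vol - vol), 1)   # conductance of the prefix
        if phi < best_phi:
            best, best_phi = set(S), phi
    return best

G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(sweep(G, approx_ppr(G, seed=0)))                  # likely the triangle {0, 1, 2}
```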

  • RESEARCH ARTICLE
    Yongjie YANG, Dinko DIMITROV

    We consider GROUP CONTROL BY ADDING INDIVIDUALS (GCAI) in the setting of group identification for two procedural rules, the consensus-start-respecting rule and the liberal-start-respecting rule. It is known that GCAI is NP-hard for both rules, but whether it is fixed-parameter tractable with respect to the number of distinguished individuals remained open. We resolve both open problems in the affirmative. In addition, we strengthen the NP-hardness of GCAI by showing that, with respect to the natural parameter, the number of added individuals, GCAI for both rules is W[2]-hard. Notably, the W[2]-hardness for the liberal-start-respecting rule holds even in the very special case where the qualifications of individuals satisfy the so-called consecutive ones property. For the consensus-start-respecting rule, however, the problem becomes polynomial-time solvable in this special case. We also study a dual restriction where the disqualifications of individuals fulfill the consecutive ones property, and show that under this restriction GCAI for both rules turns out to be polynomial-time solvable. Our reductions for showing W[2]-hardness also imply several algorithmic lower bounds.

  • Information Systems
  • LETTER
    Yuqi LI, Tao MENG, Zhixiong HE, Haiyan LIU, Keqin LI
  • LETTER
    Wei JIANG, Bo NING, Guanyu LI, Mei BAI, Xiao JIA, Fangliang WEI
  • RESEARCH ARTICLE
    Hengyu LIU, Tiancheng ZHANG, Fan LI, Minghe YU, Ge YU

    Knowledge tracing aims to track students' knowledge status over time in order to predict their future performance accurately. In a real environment, teachers expect knowledge tracing models to provide interpretable results about knowledge status. Markov chain-based knowledge tracing (MCKT) models, such as Bayesian Knowledge Tracing, can track knowledge-concept mastery probability over time. However, as the number of tracked knowledge concepts increases, the time complexity of MCKT performance prediction grows exponentially (the explaining-away problem). When the number of tracked knowledge concepts is large, MCKT therefore cannot be used to track knowledge-concept mastery probability over time. In addition, existing MCKT models only consider the relationship between students' knowledge status and problems when modeling students' responses, ignoring the relationships between knowledge concepts within the same problem. To address these challenges, we propose an inTerpretable pRobAbilistiC gEnerative moDel (TRACED), which can track students' mastery probabilities for numerous knowledge concepts over time. To solve the explaining-away problem, we design long short-term memory (LSTM)-based networks to approximate the posterior distribution and predict students' future performance, and we propose a heuristic algorithm to train the LSTMs and the probabilistic graphical model jointly. To better model students' exercise responses, we propose a log-linear model with three interactive strategies, which models responses by considering the relationships among students' knowledge status, knowledge concepts, and problems. We conduct experiments with four real-world datasets in three knowledge-driven tasks. The experimental results show that TRACED outperforms existing knowledge tracing methods in predicting students' future performance and can learn the relationships among students, knowledge concepts, and problems from students' exercise sequences. Several case studies further show that TRACED exhibits excellent interpretability and thus has the potential for personalized automatic feedback in real-world educational environments.
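
    For readers unfamiliar with the MCKT baseline mentioned above, the classic Bayesian Knowledge Tracing update for a single knowledge concept is shown below, in its standard textbook form with illustrative parameter values; TRACED itself generalizes well beyond this.

```python
# Classic single-concept Bayesian Knowledge Tracing update (textbook form).
def bkt_update(p_mastery, correct, p_learn=0.2, p_slip=0.1, p_guess=0.25):
    # Posterior over mastery given the observed answer (Bayes' rule).
    if correct:
        num = p_mastery * (1 - p_slip)
        den = num + (1 - p_mastery) * p_guess
    else:
        num = p_mastery * p_slip
        den = num + (1 - p_mastery) * (1 - p_guess)
    posterior = num / den
    # Transition: the student may also have learned the concept this step.
    return posterior + (1 - posterior) * p_learn

p = 0.3
for answer in [True, True, False, True]:
    p = bkt_update(p, answer)
    print(round(p, 3))   # mastery estimate rises with correct answers, dips on errors
```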

  • Image and Graphics
  • RESEARCH ARTICLE
    Shuzhe LI, Hongwei XU, Qiong LI, Qi HAN

    Due to its advantages of high transaction volume and low resource consumption, Directed Acyclic Graph (DAG)-based Distributed Ledger Technology (DLT) has been considered a possible next-generation alternative to blockchain. However, the security of DAG-based systems has yet to be comprehensively understood. Aiming at verifying and evaluating the security of DAG-based DLT, we develop a multi-agent based IOTA simulation platform called MAIOTASim. In MAIOTASim, we model honest and malicious nodes and simulate a configurable network environment, including network topology and delay. The double-spending attack is a security issue particular to DLT. We verify the security of the consensus algorithms under multiple double-spending attack strategies. Our simulations show that the consensus algorithms can resist the parasite chain attack and partially resist the splitting attack, but they are ineffective under the large weight attack. We take the cumulative weight difference of transactions as the evaluation criterion and analyze the effect of different consensus algorithms and their parameters under each attack strategy. Besides, MAIOTASim enables users to perform large-scale simulations with multiple nodes and tens of thousands of transactions more efficiently than state-of-the-art simulators.
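
    The evaluation criterion above rests on cumulative weight: a transaction's own weight plus that of every transaction directly or indirectly approving it. A minimal sketch, assuming unit own-weights for simplicity:

```python
# Cumulative weight of a transaction in a DAG ledger (unit own-weights assumed).
def cumulative_weight(approvers, tx):
    """approvers: mapping tx -> set of transactions that directly approve it."""
    seen, stack = set(), [tx]
    while stack:                      # walk the future cone of tx
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        stack.extend(approvers.get(t, ()))
    return len(seen)                  # tx itself plus all of its approvers

# Toy tangle: b and c approve a; d approves both b and c.
approvers = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}}
print(cumulative_weight(approvers, "a"))  # 4
```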

  • Information Security
  • LETTER
    Yang YANG, Guoyin ZHANG, Sizhao LI, Zechao LIU
  • LETTER
    Haifeng QIAN, Cheng LIN, Qiaohan CHU, Jie CHEN
  • REVIEW ARTICLE
    Antonio SANTOS-OLMO, Luis Enrique SÁNCHEZ, David G. ROSADO, Manuel A. SERRANO, Carlos BLANCO, Haralambos MOURATIDIS, Eduardo FERNÁNDEZ-MEDINA

    The information society depends increasingly on risk assessment and management systems as a means to adequately protect its key information assets. The availability of these systems is now vital for the protection and evolution of companies. However, several factors have led to an increasing need for more accurate risk analysis approaches: the speed at which technologies evolve, their global impact, and the growing requirement for companies to collaborate. Risk analysis processes must consequently adapt to these new circumstances and technological paradigms. The objective of this paper is therefore to present the results of an exhaustive analysis of the techniques and methods offered by the scientific community, with the aim of identifying their main weaknesses and providing a new risk assessment and management process. This analysis was carried out using the systematic review protocol and found that existing proposals do not fully meet these new needs. The paper also presents a summary of MARISMA, the risk analysis and management framework designed by our research group. Our framework builds on the main existing risk standards and proposals and seeks to address the weaknesses found in them. MARISMA is in a process of continuous improvement, as it is being applied by customers in several European and American countries. It consists of a risk data management module, a methodology for its systematic application, and a tool that automates the process.

  • RESEARCH ARTICLE
    Xingxing CHEN, Qingfeng CHENG, Weidong YANG, Xiangyang LUO

    With the widespread use of network infrastructures such as 5G and low-power wide-area networks, a large number of Internet of Things (IoT) device nodes are connected to the network, generating massive amounts of data. Achieving anonymous authentication of IoT nodes and secure data transmission is therefore a great challenge. At present, blockchain technology is widely used in authentication and data storage due to its decentralization and immutability. Recently, Fan et al. proposed a secure and efficient blockchain-based IoT authentication and data sharing scheme. We studied it as one of the state-of-the-art protocols and found that it neither resists ephemeral secret compromise attacks nor preserves the anonymity of IoT nodes. To overcome these security flaws, this paper proposes an enhanced authentication and data transmission scheme, which is verified by formal security proofs and informal security analysis. Furthermore, Scyther is applied to prove the security of the proposed scheme. Moreover, the proposed scheme is demonstrated to achieve better performance in terms of communication and computational cost than other related schemes.

  • Interdisciplinary
  • LETTER
    Haixin WANG, Yunhan WANG, Qun JIANG, Yan ZHANG, Shengquan CHEN
  • RESEARCH ARTICLE
    Wenzheng BAO, Bin YANG

    Protein acetylation refers to the process of adding acetyl groups (CH3CO-) to lysine residues on protein chains. As one of the most commonly used protein post-translational modifications, lysine acetylation plays an important role in different organisms. In our study, we developed a human-specific method that uses a cascade classifier based on a complex-valued polynomial model (CVPM), combined with sequence and structural feature descriptors, to address the imbalance between positive and negative samples. Complex-valued gene expression programming and differential evolution are utilized to search for the optimal CVPM model. We also conducted a systematic and comprehensive analysis of the acetylation data and the prediction results. Our proposed method achieves 79.15% Sp, 78.17% Sn, 78.66% ACC, 78.76% F1, and 0.5733 MCC, outperforming other state-of-the-art methods.
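
    For reference, the reported metrics follow from the confusion matrix in the standard way; the sketch below computes them (the counts in the usage line are illustrative, not the paper's data).

```python
# Standard binary-classification metrics from confusion-matrix counts.
import math

def metrics(tp, tn, fp, fn):
    sn  = tp / (tp + fn)                       # sensitivity (recall)
    sp  = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1  = 2 * tp / (2 * tp + fp + fn)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"Sn": sn, "Sp": sp, "ACC": acc, "F1": f1, "MCC": mcc}

print(metrics(tp=78, tn=79, fp=21, fn=22))     # illustrative counts only
```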