Despite significant successes achieved in knowledge discovery, traditional machine learning methods may fail to obtain satisfactory performances when dealing with complex data, such as imbalanced, high-dimensional, noisy data, etc. The reason behind is that it is difficult for these methods to capture multiple characteristics and underlying structure of data. In this context, it becomes an important topic in the data mining field that how to effectively construct an efficient knowledge discovery and mining model. Ensemble learning, as one research hot spot, aims to integrate data fusion, data modeling, and data mining into a unified framework. Specifically, ensemble learning firstly extracts a set of features with a variety of transformations. Based on these learned features, multiple learning algorithms are utilized to produce weak predictive results. Finally, ensemble learning fuses the informative knowledge from the above results obtained to achieve knowledge discovery and better predictive performance via voting schemes in an adaptive way. In this paper, we review the research progress of the mainstream approaches of ensemble learning and classify them based on different characteristics. In addition, we present challenges and possible research directions for each mainstream approach of ensemble learning, and we also give an extra introduction for the combination of ensemble learning with other machine learning hot spots such as deep learning, reinforcement learning, etc.
Text, as one of the most influential inventions of humanity, has played an important role in human life, so far from ancient times. The rich and precise information embodied in text is very useful in a wide range of vision-based applications, therefore text detection and recognition in natural scenes have become important and active research topics in computer vision and document analysis. Especially in recent years, the community has seen a surge of research efforts and substantial progresses in these fields, though a variety of challenges (e.g. noise, blur, distortion, occlusion and variation) still remain. The purposes of this survey are three-fold: 1) introduce up-to-date works, 2) identify state-of-the-art algorithms, and 3) predict potential research directions in the future. Moreover, this paper provides comprehensive links to publicly available resources, including benchmark datasets, source codes, and online demos. In summary, this literature review can serve as a good reference for researchers in the areas of scene text detection and recognition.
There is a trend that, virtually everyone, ranging from big Web companies to traditional enterprisers to physical science researchers to social scientists, is either already experiencing or anticipating unprecedented growth in the amount of data available in their world, as well as new opportunities and great untapped value. This paper reviews big data challenges from a data management respective. In particular, we discuss big data diversity, big data reduction, big data integration and cleaning, big data indexing and query, and finally big data analysis and mining. Our survey gives a brief overview about big-data-oriented research and problems.
Multi-label learning deals with problems where each example is represented by a single instance while being associated with multiple class labels simultaneously. Binary relevance is arguably the most intuitive solution for learning from multi-label examples. It works by decomposing the multi-label learning task into a number of independent binary learning tasks (one per class label). In view of its potential weakness in ignoring correlations between labels, many correlation-enabling extensions to binary relevance have been proposed in the past decade. In this paper, we aim to review the state of the art of binary relevance from three perspectives. First, basic settings for multi-label learning and binary relevance solutions are briefly summarized. Second, representative strategies to provide binary relevancewith label correlation exploitation abilities are discussed. Third, some of our recent studies on binary relevance aimed at issues other than label correlation exploitation are introduced. As a conclusion, we provide suggestions on future research directions.
On June 17, 2013, MilkyWay-2 (Tianhe-2) supercomputer was crowned as the fastest supercomputer in the world on the 41th TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of hardware and software systems. The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, proprietary 16-core processor designed for scientific computing, efficient software stacks that provide high performance file system, emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications from LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.
DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.
Remote photoplethysmography (rPPG) allows remote measurement of the heart rate using low-cost RGB imaging equipment. In this study, we review the development of the field of rPPG since its emergence in 2008. We also classify existing rPPG approaches and derive a framework that provides an overview of modular steps. Based on this framework, practitioners can use our classification to design algorithms for an rPPG approach that suits their specific needs. Researchers can use the reviewed and classified algorithms as a starting point to improve particular features of an rPPG algorithm.
The internet of things (IoT) attracts great interest in many application domains concerned with monitoring and control of physical phenomena. However, application development is still one of the main hurdles to a wide adoption of IoT technology. Application development is done at a low level, very close to the operating system and requires programmers to focus on low-level system issues. The underlying APIs can be very complicated and the amount of data collected can be huge. This can be very hard to deal with as a developer. In this paper, we present a runtime model based approach to IoT application development. First, the manageability of sensor devices is abstracted as runtime models that are automatically connected with the corresponding systems. Second, a customized model is constructed according to a personalized application scenario and the synchronization between the customized model and sensor device runtime models is ensured through model transformation. Thus, all the application logic can be carried out by executing programs on the customized model. An experiment on a real-world application scenario demonstrates the feasibility, effectiveness, and benefits of the new approach to IoT application development.
This paper provides a short review of some of the main topics in which the current research in evolutionary multi-objective optimization is being focused. The topics discussed include new algorithms, efficiency, relaxed forms of dominance, scalability, and alternative metaheuristics. This discussion motivates some further topics which, from the author’s perspective, constitute good potential areas for future research, namely, constraint-handling techniques, incorporation of user’s preferences and parameter control. This information is expected to be useful for those interested in pursuing research in this area.
Graphical models have become the basic framework for topic based probabilistic modeling. Especially models with latent variables have proved to be effective in capturing hidden structures in the data. In this paper, we survey an important subclass Directed Probabilistic Topic Models (DPTMs) with soft clustering abilities and their applications for knowledge discovery in text corpora. From an unsupervised learning perspective, “topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling”. In topic modeling, a document consists of different hidden topics and the topic probabilities provide an explicit representation of a document to smooth data from the semantic level. It has been an active area of research during the last decade. Many models have been proposed for handling the problems of modeling text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We give basic concepts, advantages and disadvantages in a chronological order, existing models classification into different categories, their parameter estimation and inference making algorithms with models performance evaluation measures. We also discuss their applications, open challenges and future directions in this dynamic area of research.
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
RDF is increasingly being used to encode data for the semantic web and data exchange. There have been a large number of works that address RDF data management following different approaches. In this paper we provide an overview of these works. This review considers centralized solutions (what are referred to as warehousing approaches), distributed solutions, and the techniques that have been developed for querying linked data. In each category, further classifications are provided that would assist readers in understanding the identifying characteristics of different approaches.
Robust face representation is imperative to highly accurate face recognition. In this work, we propose an open source face recognition method with deep representation named as VIPLFaceNet, which is a 10-layer deep convolutional neural network with seven convolutional layers and three fully-connected layers. Compared with the well-known AlexNet, our VIPLFaceNet takes only 20% training time and 60% testing time, but achieves 40% drop in error rate on the real-world face recognition benchmark LFW. Our VIPLFaceNet achieves 98.60% mean accuracy on LFW using one single network. An open-source C++ SDK based on VIPLFaceNet is released under BSD license. The SDK takes about 150ms to process one face image in a single thread on an i7 desktop CPU. VIPLFaceNet provides a state-of-the-art start point for both academic and industrial face recognition applications.
Dynamic consolidation of virtual machines (VMs) in a data center is an effective way to reduce the energy consumption and improve physical resource utilization. Determining which VMs should be migrated from an overloaded host directly influences the VM migration time and increases energy consumption for the whole data center, and can cause the service level of agreement (SLA), delivered by providers and users, to be violated. So when designing a VM selection policy, we not only consider CPU utilization, but also define a variable that represents the degree of resource satisfaction to select the VMs. In addition, we propose a novel VM placement policy that prefers placing a migratable VM on a host that has the minimum correlation coefficient. The bigger correlation coefficient a host has, the greater the influence will be on VMs located on that host after the migration. Using CloudSim, we run simulations whose results let draw us to conclude that the policies we propose in this paper perform better than existing policies in terms of energy consumption, VM migration time, and SLA violation percentage.
With the explosive growth of information, more and more organizations are deploying private cloud systems or renting public cloud systems to process big data. However, there is no existing benchmark suite for evaluating cloud performance on the whole system level. To the best of our knowledge, this paper proposes the first benchmark suite
Blockchain has recently emerged as a research trend, with potential applications in a broad range of industries and context. One particular successful Blockchain technology is smart contract, which is widely used in commercial settings (e.g., high value financial transactions). This, however, has security implications due to the potential to financially benefit froma security incident (e.g., identification and exploitation of a vulnerability in the smart contract or its implementation). Among, Ethereum is the most active and arresting. Hence, in this paper, we systematically review existing research efforts on Ethereum smart contract security, published between 2015 and 2019. Specifically, we focus on how smart contracts can be maliciously exploited and targeted, such as security issues of contract program model, vulnerabilities in the program and safety consideration introduced by program execution environment. We also identify potential research opportunities and future research agenda.
Semi-supervised learning constructs the predictive model by learning from a few labeled training examples and a large pool of unlabeled ones. It has a wide range of application scenarios and has attracted much attention in the past decades. However, it is noteworthy that although the learning performance is expected to be improved by exploiting unlabeled data, some empirical studies show that there are situations where the use of unlabeled data may degenerate the performance. Thus, it is advisable to be able to exploit unlabeled data safely. This article reviews some research progress of safe semi-supervised learning, focusing on three types of safeness issue: data quality, where the training data is risky or of low-quality;model uncertainty, where the learning algorithm fails to handle the uncertainty during training; measure diversity, where the safe performance could be adapted to diverse measures.
In this paper, several diversity measurements will be discussed and defined. As in other evolutionary algorithms, first the population position diversity will be discussed followed by the discussion and definition of population velocity diversity which is different from that in other evolutionary algorithms since only PSO has the velocity parameter. Furthermore, a diversity measurement called cognitive diversity is discussed and defined, which can reveal clustering information about where the current population of particles intends to move towards. The diversity of the current population of particles and the cognitive diversity together tell what the convergence/divergence stage the current population of particles is at and which stage it moves towards.
In the real-world applications, most optimization problems are subject to different types of constraints. These problems are known as constrained optimization problems (COPs). Solving COPs is a very important area in the optimization field. In this paper, a hybrid multi-swarm particle swarm optimization (HMPSO) is proposed to deal with COPs. This method adopts a parallel search operator in which the current swarm is partitioned into several subswarms and particle swarm optimization (PSO) is severed as the search engine for each sub-swarm. Moreover, in order to explore more promising regions of the search space, differential evolution (DE) is incorporated to improve the personal best of each particle. First, the method is tested on 13 benchmark test functions and compared with three stateof-the-art approaches. The simulation results indicate that the proposed HMPSO is highly competitive in solving the 13 benchmark test functions. Afterward, the effectiveness of some mechanisms proposed in this paper and the effect of the parameter setting were validated by various experiments. Finally, HMPSO is further applied to solve 24 benchmark test functions collected in the 2006 IEEE Congress on Evolutionary Computation (CEC2006) and the experimental results indicate that HMPSO is able to deal with 22 test functions.
Standard support vector machines (SVMs) training algorithms have O(l3) computational and O(l2) space complexities, where l is the training set size. It is thus computationally infeasible on very large data sets. To alleviate the computational burden in SVM training, we propose an algorithm to train SVMs on a bound vectors set that is extracted based on Fisher projection. For linear separate problems, we use linear Fisher discriminant to compute the projection line, while for non-linear separate problems, we use kernel Fisher discriminant to compute the projection line. For each case, we select a certain ratio samples whose projections are adjacent to those of the other class as bound vectors. Theoretical analysis shows that the proposed algorithm is with low computational and space complexities. Extensive experiments on several classification benchmarks demonstrate the effectiveness of our approach.
As an effective image segmentation method, the standard fuzzy c-means (FCM) clustering algorithm is very sensitive to noise in images. Several modified FCM algorithms, using local spatial information, can overcome this problem to some degree. However, when the noise level in the image is high, these algorithms still cannot obtain satisfactory segmentation performance. In this paper, we introduce a non local spatial constraint term into the objective function of FCM and propose a fuzzy c-means clustering algorithm with non local spatial information (FCM_NLS). FCM_NLS can deal more effectively with the image noise and preserve geometrical edges in the image. Performance evaluation experiments on synthetic and real images, especially magnetic resonance (MR) images, show that FCM_NLS is more robust than both the standard FCM and the modified FCM algorithms using local spatial information for noisy image segmentation.
Mobile ad-hoc networks (MANETs) and wireless sensor networks (WSNs) have gained remarkable appreciation and technological development over the last few years. Despite ease of deployment, tremendous applications and significant advantages, security has always been a challenging issue due to the nature of environments in which nodes operate. Nodes’ physical capture, malicious or selfish behavior cannot be detected by traditional security schemes. Trust and reputation based approaches have gained global recognition in providing additional means of security for decision making in sensor and ad-hoc networks. This paper provides an extensive literature review of trust and reputation based models both in sensor and ad-hoc networks. Based on the mechanism of trust establishment, we categorize the stateof-the-art into two groups namely node-centric trust models and system-centric trust models. Based on trust evidence, initialization, computation, propagation and weight assignments, we evaluate the efficacy of the existing schemes. Finally, we conclude our discussion with identification of some unresolved issues in pursuit of trust and reputation management.
With the enforcement of the removal system for distressed firms and the new Bankruptcy Law in China’s securities market in June 2007, the development of the bankruptcy process for firms in China is expected to create a huge impact. Therefore, identification of potential corporate distress and offering early warnings to investors, analysts, and regulators has become important. There are very distinct differences, in accounting procedures and quality of financial documents, between firms in China and those in the western world. Therefore, it may not be practical to directly apply those models or methodologies developed elsewhere to support identification of such potential distressed situations. Moreover, localized models are commonly superior to ones imported from other environments.
Based on the
In the past decade, recommender systems have been widely used to provide users with personalized products and services. However, most traditional recommender systems are still facing a challenge in dealing with the huge volume, complexity, and dynamics of information. To tackle this challenge, many studies have been conducted to improve recommender system by integrating deep learning techniques. As an unsupervised deep learning method, autoencoder has been widely used for its excellent performance in data dimensionality reduction, feature extraction, and data reconstruction. Meanwhile, recent researches have shown the high efficiency of autoencoder in information retrieval and recommendation tasks. Applying autoencoder on recommender systems would improve the quality of recommendations due to its better understanding of users’ demands and characteristics of items. This paper reviews the recent researches on autoencoder-based recommender systems. The differences between autoencoder-based recommender systems and traditional recommender systems are presented in this paper. At last, some potential research directions of autoencoder-based recommender systems are discussed.
N6-methyladenosine (m 6A) is a prevalent methylation modification and plays a vital role in various biological processes, such as metabolism, mRNA processing, synthesis, and transport. Recent studies have suggested that m 6A modification is related to common diseases such as cancer, tumours, and obesity. Therefore, accurate prediction of methylation sites in RNA sequences has emerged as a critical issue in the area of bioinformatics. However, traditional high-throughput sequencing and wet bench experimental techniques have the disadvantages of high costs, significant time requirements and inaccurate identification of sites. But through the use of traditional experimental methods, researchers have produced many large databases of m 6A sites. With the support of these basic databases and existing deep learning methods, we developed an m 6A site predictor named DeepM6ASeq-EL, which integrates an ensemble of five LSTM and CNN classifiers with the combined strategy of hard voting. Compared to the state-of-the-art prediction method WHISTLE (average AUC 0.948 and 0.880), the DeepM6ASeq-EL had a lower accuracy in m 6A site prediction (average AUC: 0.861 for the full transcript models and 0.809 for the mature messenger RNA models) when tested on six independent datasets.
Hardware/software partitioning is an essential step in hardware/software co-design. For large size problems, it is difficult to consider both solution quality and time. This paper presents an efficient GPU-based parallel tabu search algorithm (GPTS) for HW/SW partitioning. A single GPU kernel of compacting neighborhood is proposed to reduce the amount of GPU global memory accesses theoretically. A kernel fusion strategy is further proposed to reduce the amount of GPU global memory accesses of GPTS. To further minimize the transfer overhead of GPTS between CPU and GPU, an optimized transfer strategy for GPU-based tabu evaluation is proposed, which considers that all the candidates do not satisfy the given constraint. Experiments show that GPTS outperforms state-of-the-art work of tabu search and is competitive with other methods for HW/SW partitioning. The proposed parallelization is significant when considering the ordinary GPU platform.