Despite significant successes in knowledge discovery, traditional machine learning methods may fail to achieve satisfactory performance when dealing with complex data, such as imbalanced, high-dimensional, or noisy data. The reason is that it is difficult for these methods to capture the multiple characteristics and underlying structure of such data. In this context, how to effectively construct an efficient knowledge discovery and mining model becomes an important topic in the data mining field. Ensemble learning, as one research hot spot, aims to integrate data fusion, data modeling, and data mining into a unified framework. Specifically, ensemble learning first extracts a set of features with a variety of transformations. Based on these learned features, multiple learning algorithms are utilized to produce weak predictive results. Finally, ensemble learning fuses the informative knowledge from these results to achieve knowledge discovery and better predictive performance via voting schemes in an adaptive way. In this paper, we review the research progress of the mainstream approaches of ensemble learning and classify them based on different characteristics. In addition, we present challenges and possible research directions for each mainstream approach of ensemble learning, and we also introduce the combination of ensemble learning with other machine learning hot spots such as deep learning and reinforcement learning.
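The fusion step described above can be illustrated with a minimal sketch. It uses plain plurality voting rather than the adaptive voting schemes the abstract mentions, and the three weak classifiers (`weak_models`) are hypothetical stand-ins, not models from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse the labels predicted by several models for one sample."""
    counts = Counter(predictions)
    # most_common sorts by count; equal counts keep first-seen order.
    return counts.most_common(1)[0][0]


def ensemble_predict(models, x):
    """Ask every weak model for a label, then fuse the answers by voting."""
    return majority_vote([model(x) for model in models])


# Hypothetical weak classifiers: each thresholds the input differently.
weak_models = [
    lambda x: "pos" if x > 0 else "neg",
    lambda x: "pos" if x > 2 else "neg",
    lambda x: "pos" if x > -2 else "neg",
]

if __name__ == "__main__":
    print(ensemble_predict(weak_models, 1))   # two of three models vote "pos"
    print(ensemble_predict(weak_models, -1))  # two of three models vote "neg"
```

An adaptive scheme would typically replace the equal votes with per-model weights; that extension is left out here to keep the sketch small.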
Text, as one of the most influential inventions of humanity, has played an important role in human life since ancient times. The rich and precise information embodied in text is very useful in a wide range of vision-based applications; therefore, text detection and recognition in natural scenes have become important and active research topics in computer vision and document analysis. Especially in recent years, the community has seen a surge of research efforts and substantial progress in these fields, though a variety of challenges (e.g., noise, blur, distortion, occlusion and variation) still remain. The purposes of this survey are three-fold: 1) introduce up-to-date works, 2) identify state-of-the-art algorithms, and 3) predict potential research directions in the future. Moreover, this paper provides comprehensive links to publicly available resources, including benchmark datasets, source code, and online demos. In summary, this literature review can serve as a good reference for researchers in the areas of scene text detection and recognition.
There is a trend that virtually everyone, ranging from big Web companies to traditional enterprises to physical science researchers to social scientists, is either already experiencing or anticipating unprecedented growth in the amount of data available in their world, as well as new opportunities and great untapped value. This paper reviews big data challenges from a data management perspective. In particular, we discuss big data diversity, big data reduction, big data integration and cleaning, big data indexing and query, and finally big data analysis and mining. Our survey gives a brief overview of big-data-oriented research and problems.
Multi-label learning deals with problems where each example is represented by a single instance while being associated with multiple class labels simultaneously. Binary relevance is arguably the most intuitive solution for learning from multi-label examples. It works by decomposing the multi-label learning task into a number of independent binary learning tasks (one per class label). In view of its potential weakness in ignoring correlations between labels, many correlation-enabling extensions to binary relevance have been proposed in the past decade. In this paper, we aim to review the state of the art of binary relevance from three perspectives. First, basic settings for multi-label learning and binary relevance solutions are briefly summarized. Second, representative strategies to provide binary relevance with label correlation exploitation abilities are discussed. Third, some of our recent studies on binary relevance aimed at issues other than label correlation exploitation are introduced. In conclusion, we provide suggestions on future research directions.
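The decomposition at the heart of binary relevance can be sketched as follows. The base learner (`train_centroid`, a nearest-class-centroid rule) and the toy data are assumptions chosen only to keep the example self-contained; any binary classifier could take its place, and the sketch deliberately ignores label correlations.

```python
def binary_relevance_split(X, Y, labels):
    """Build one binary task per class label: {label: [(x, 0 or 1), ...]}."""
    return {
        label: [(x, int(label in y)) for x, y in zip(X, Y)]
        for label in labels
    }


def train_centroid(task):
    """Toy base learner (an assumption): nearest class centroid.

    Assumes the task has at least one positive and one negative example.
    """
    def centroid(points):
        return [sum(coord) / len(points) for coord in zip(*points)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    pos_center = centroid([x for x, t in task if t == 1])
    neg_center = centroid([x for x, t in task if t == 0])
    return lambda x: dist2(x, pos_center) < dist2(x, neg_center)


def fit(X, Y, labels):
    """Train the binary learners independently, one per label."""
    tasks = binary_relevance_split(X, Y, labels)
    return {label: train_centroid(task) for label, task in tasks.items()}


def predict(models, x):
    """The predicted label set is the union of labels whose learner fires."""
    return {label for label, clf in models.items() if clf(x)}


# Hypothetical toy data: 2-D instances with label sets.
X = [(0, 0), (0, 1), (5, 5), (5, 6)]
Y = [{"a"}, {"a"}, {"b", "c"}, {"b"}]
models = fit(X, Y, ["a", "b", "c"])

if __name__ == "__main__":
    print(predict(models, (0, 0.5)))
    print(predict(models, (5, 5)))
```

Because each learner is trained in isolation, nothing in `fit` lets the decision for one label inform another, which is exactly the weakness the correlation-enabling extensions address.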
On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned as the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support massively parallel message-passing communications, a proprietary 16-core processor designed for scientific computing, and efficient software stacks that provide a high-performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications, from LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.
DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As datasets are extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. We conduct our evaluation using real large datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
This paper provides a short review of some of the main topics on which current research in evolutionary multi-objective optimization is focused. The topics discussed include new algorithms, efficiency, relaxed forms of dominance, scalability, and alternative metaheuristics. This discussion motivates some further topics which, from the author’s perspective, constitute good potential areas for future research, namely, constraint-handling techniques, incorporation of user’s preferences and parameter control. This information is expected to be useful for those interested in pursuing research in this area.
The internet of things (IoT) attracts great interest in many application domains concerned with monitoring and control of physical phenomena. However, application development is still one of the main hurdles to a wide adoption of IoT technology. Application development is done at a low level, very close to the operating system and requires programmers to focus on low-level system issues. The underlying APIs can be very complicated and the amount of data collected can be huge. This can be very hard to deal with as a developer. In this paper, we present a runtime model based approach to IoT application development. First, the manageability of sensor devices is abstracted as runtime models that are automatically connected with the corresponding systems. Second, a customized model is constructed according to a personalized application scenario and the synchronization between the customized model and sensor device runtime models is ensured through model transformation. Thus, all the application logic can be carried out by executing programs on the customized model. An experiment on a real-world application scenario demonstrates the feasibility, effectiveness, and benefits of the new approach to IoT application development.
Graphical models have become the basic framework for topic-based probabilistic modeling. In particular, models with latent variables have proved to be effective in capturing hidden structures in the data. In this paper, we survey an important subclass, Directed Probabilistic Topic Models (DPTMs) with soft clustering abilities, and their applications for knowledge discovery in text corpora. From an unsupervised learning perspective, “topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling”. In topic modeling, a document consists of different hidden topics and the topic probabilities provide an explicit representation of a document to smooth data from the semantic level. It has been an active area of research during the last decade. Many models have been proposed for handling the problems of modeling text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We present basic concepts, advantages and disadvantages in chronological order, a classification of existing models into different categories, their parameter estimation and inference algorithms, and measures for evaluating model performance. We also discuss their applications, open challenges and future directions in this dynamic area of research.
Remote photoplethysmography (rPPG) allows remote measurement of the heart rate using low-cost RGB imaging equipment. In this study, we review the development of the field of rPPG since its emergence in 2008. We also classify existing rPPG approaches and derive a framework that provides an overview of modular steps. Based on this framework, practitioners can use our classification to design algorithms for an rPPG approach that suits their specific needs. Researchers can use the reviewed and classified algorithms as a starting point to improve particular features of an rPPG algorithm.
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
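As an illustration of the two operations, the sketch below uses edit distance (Levenshtein distance) as the similarity function and a threshold `tau` on the number of allowed edits. It is a naive baseline under those assumptions, with only a length filter for pruning, and is not one of the surveyed algorithms; the function names and sample strings are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def similarity_search(query, strings, tau):
    """Return every string within edit distance tau of the query."""
    return [s for s in strings if edit_distance(query, s) <= tau]


def similarity_join(R, S, tau):
    """Return all pairs (r, s) from R x S within edit distance tau.

    The length filter is safe: strings whose lengths differ by more than
    tau cannot be within edit distance tau of each other.
    """
    return [
        (r, s)
        for r in R
        for s in S
        if abs(len(r) - len(s)) <= tau and edit_distance(r, s) <= tau
    ]


if __name__ == "__main__":
    print(similarity_search("survey", ["survey", "surveys", "sorvey", "server"], 1))
    print(similarity_join(["apple", "join"], ["appel", "joint", "apply"], 1))
```

The join compares every pair, so its cost grows with the product of the two collection sizes; avoiding that enumeration is the point of the filtering and indexing techniques a survey of this area covers.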
Robust face representation is imperative to highly accurate face recognition. In this work, we propose an open source face recognition method with deep representation named VIPLFaceNet, which is a 10-layer deep convolutional neural network with seven convolutional layers and three fully-connected layers. Compared with the well-known AlexNet, our VIPLFaceNet takes only 20% of the training time and 60% of the testing time, but achieves a 40% drop in error rate on the real-world face recognition benchmark LFW. Our VIPLFaceNet achieves 98.60% mean accuracy on LFW using one single network. An open-source C++ SDK based on VIPLFaceNet is released under the BSD license. The SDK takes about 150 ms to process one face image in a single thread on an i7 desktop CPU. VIPLFaceNet provides a state-of-the-art starting point for both academic and industrial face recognition applications.
With the explosive growth of information, more and more organizations are deploying private cloud systems or renting public cloud systems to process big data. However, there is no existing benchmark suite for evaluating cloud performance on the whole system level. To the best of our knowledge, this paper proposes the first benchmark suite
RDF is increasingly being used to encode data for the semantic web and data exchange. There have been a large number of works that address RDF data management following different approaches. In this paper we provide an overview of these works. This review considers centralized solutions (what are referred to as warehousing approaches), distributed solutions, and the techniques that have been developed for querying linked data. In each category, further classifications are provided that would assist readers in understanding the identifying characteristics of different approaches.
In real-world applications, most optimization problems are subject to different types of constraints. These problems are known as constrained optimization problems (COPs). Solving COPs is a very important area in the optimization field. In this paper, a hybrid multi-swarm particle swarm optimization (HMPSO) is proposed to deal with COPs. This method adopts a parallel search operator in which the current swarm is partitioned into several sub-swarms and particle swarm optimization (PSO) serves as the search engine for each sub-swarm. Moreover, in order to explore more promising regions of the search space, differential evolution (DE) is incorporated to improve the personal best of each particle. First, the method is tested on 13 benchmark test functions and compared with three state-of-the-art approaches. The simulation results indicate that the proposed HMPSO is highly competitive in solving the 13 benchmark test functions. Afterward, the effectiveness of some mechanisms proposed in this paper and the effect of the parameter setting are validated by various experiments. Finally, HMPSO is further applied to solve 24 benchmark test functions collected in the 2006 IEEE Congress on Evolutionary Computation (CEC2006) and the experimental results indicate that HMPSO is able to deal with 22 test functions.
In this paper, several diversity measurements are discussed and defined. As in other evolutionary algorithms, the population position diversity is discussed first, followed by the discussion and definition of population velocity diversity, which differs from that in other evolutionary algorithms since only PSO has the velocity parameter. Furthermore, a diversity measurement called cognitive diversity is discussed and defined, which can reveal clustering information about where the current population of particles intends to move. The diversity of the current population of particles and the cognitive diversity together indicate which convergence/divergence stage the current population of particles is at and which stage it is moving towards.
Standard support vector machine (SVM) training algorithms have O(l^3) computational and O(l^2) space complexities, where l is the training set size. They are thus computationally infeasible on very large data sets. To alleviate the computational burden in SVM training, we propose an algorithm to train SVMs on a bound vectors set that is extracted based on Fisher projection. For linearly separable problems, we use the linear Fisher discriminant to compute the projection line, while for non-linearly separable problems, we use the kernel Fisher discriminant to compute the projection line. For each case, we select a certain ratio of samples whose projections are adjacent to those of the other class as bound vectors. Theoretical analysis shows that the proposed algorithm has low computational and space complexities. Extensive experiments on several classification benchmarks demonstrate the effectiveness of our approach.
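A minimal sketch of the linear case follows, assuming two classes of 2-D points so that the within-class scatter matrix can be inverted by hand. The function names, the `ratio` parameter, and the rule of keeping at least one sample per class are illustrative assumptions, not details taken from the paper.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]


def scatter(points, m):
    """2x2 scatter matrix of 2-D points around the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = (p[0] - m[0], p[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(neg, pos):
    """Linear Fisher discriminant direction w = Sw^{-1} (m_pos - m_neg)."""
    m0, m1 = mean(neg), mean(pos)
    s0, s1 = scatter(neg, m0), scatter(pos, m1)
    a, b = s0[0][0] + s1[0][0], s0[0][1] + s1[0][1]
    c, d = s0[1][0] + s1[1][0], s0[1][1] + s1[1][1]
    det = a * d - b * c  # assumed non-zero (Sw invertible)
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    return ((d * dx - b * dy) / det, (-c * dx + a * dy) / det)


def bound_vectors(neg, pos, ratio):
    """Keep the given ratio of each class that projects nearest the other class."""
    w = fisher_direction(neg, pos)

    def proj(p):
        return w[0] * p[0] + w[1] * p[1]

    k_neg = max(1, round(ratio * len(neg)))
    k_pos = max(1, round(ratio * len(pos)))
    # w points from the negative mean towards the positive mean, so the
    # largest negative projections and the smallest positive projections
    # lie closest to the opposite class.
    return (sorted(neg, key=proj, reverse=True)[:k_neg],
            sorted(pos, key=proj)[:k_pos])


if __name__ == "__main__":
    neg = [(0, 0), (1, 0), (0, 1), (1, 1)]
    pos = [(4, 4), (5, 4), (4, 5), (5, 5)]
    print(bound_vectors(neg, pos, 0.25))
```

Only the selected samples would then be passed to the SVM trainer, which is where the savings against the O(l^3) training cost would come from.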
Dynamic consolidation of virtual machines (VMs) in a data center is an effective way to reduce energy consumption and improve physical resource utilization. Determining which VMs should be migrated from an overloaded host directly influences the VM migration time and the energy consumption of the whole data center, and can cause the service level agreement (SLA) between providers and users to be violated. So when designing a VM selection policy, we not only consider CPU utilization, but also define a variable that represents the degree of resource satisfaction to select the VMs. In addition, we propose a novel VM placement policy that prefers placing a migratable VM on a host that has the minimum correlation coefficient. The larger the correlation coefficient of a host, the greater the influence on the VMs located on that host after the migration. Using CloudSim, we run simulations whose results lead us to conclude that the policies we propose in this paper perform better than existing policies in terms of energy consumption, VM migration time, and SLA violation percentage.
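The placement preference can be sketched as below. The abstract does not say which quantities are correlated, so this example assumes the Pearson correlation between the recent CPU utilization history of the migrating VM and that of each candidate host; the function names and sample histories are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series is treated as uncorrelated
    return cov / (sx * sy)


def pick_host(vm_history, host_histories):
    """Choose the host whose load is least correlated with the VM's load."""
    return min(
        host_histories,
        key=lambda host: pearson(vm_history, host_histories[host]),
    )


if __name__ == "__main__":
    vm = [0.2, 0.4, 0.6, 0.8]
    hosts = {
        "h1": [0.3, 0.5, 0.7, 0.9],  # peaks together with the VM
        "h2": [0.9, 0.7, 0.5, 0.3],  # peaks when the VM is idle
        "h3": [0.5, 0.5, 0.5, 0.5],  # flat load
    }
    print(pick_host(vm, hosts))
```

A real policy would also check that the chosen host has enough spare capacity; that constraint is omitted here.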
As an effective image segmentation method, the standard fuzzy c-means (FCM) clustering algorithm is very sensitive to noise in images. Several modified FCM algorithms, using local spatial information, can overcome this problem to some degree. However, when the noise level in the image is high, these algorithms still cannot obtain satisfactory segmentation performance. In this paper, we introduce a non-local spatial constraint term into the objective function of FCM and propose a fuzzy c-means clustering algorithm with non-local spatial information (FCM_NLS). FCM_NLS can deal more effectively with image noise and preserve geometrical edges in the image. Performance evaluation experiments on synthetic and real images, especially magnetic resonance (MR) images, show that FCM_NLS is more robust than both the standard FCM and the modified FCM algorithms using local spatial information for noisy image segmentation.
Mobile ad-hoc networks (MANETs) and wireless sensor networks (WSNs) have gained remarkable appreciation and technological development over the last few years. Despite ease of deployment, tremendous applications and significant advantages, security has always been a challenging issue due to the nature of the environments in which nodes operate. Physical capture of nodes and malicious or selfish behavior cannot be detected by traditional security schemes. Trust and reputation based approaches have gained global recognition in providing additional means of security for decision making in sensor and ad-hoc networks. This paper provides an extensive literature review of trust and reputation based models both in sensor and ad-hoc networks. Based on the mechanism of trust establishment, we categorize the state-of-the-art into two groups, namely node-centric trust models and system-centric trust models. Based on trust evidence, initialization, computation, propagation and weight assignments, we evaluate the efficacy of the existing schemes. Finally, we conclude our discussion with identification of some unresolved issues in pursuit of trust and reputation management.
This paper presents a novel tracking algorithm which integrates two complementary trackers. Firstly, an improved Bayesian tracker (B-tracker) with adaptive learning rate is presented. The classification score of B-tracker reflects tracking reliability, and a low score usually results from large appearance change. Therefore, if the score is low, we decrease the learning rate to update the classifier fast so that B-tracker can adapt to the variation, and vice versa. In this way, B-tracker is more suitable than its traditional version for solving the appearance change problem. Secondly, we present an improved incremental subspace learning method tracker (S-tracker). We propose to calculate projected coordinates using maximum posterior probability, which results in a more accurate reconstruction error than the traditional subspace learning tracker. Instead of updating at every time step, we present a stop-strategy to deal with the occlusion problem. Finally, we present an integrated framework (BAST), in which the pair of trackers run in parallel and return two candidate target states separately. For each candidate state, we define a tracking reliability metric to measure whether the candidate state is reliable or not, and the reliable candidate state will be chosen as the target state at the end of each frame. Experimental results on challenging sequences show that the proposed approach is very robust and effective in comparison to the state-of-the-art trackers.
Hardware/software partitioning is an essential step in hardware/software co-design. For large size problems, it is difficult to consider both solution quality and time. This paper presents an efficient GPU-based parallel tabu search algorithm (GPTS) for HW/SW partitioning. A single GPU kernel of compacting neighborhood is proposed to reduce the amount of GPU global memory accesses theoretically. A kernel fusion strategy is further proposed to reduce the amount of GPU global memory accesses of GPTS. To further minimize the transfer overhead of GPTS between CPU and GPU, an optimized transfer strategy for GPU-based tabu evaluation is proposed, which considers that all the candidates do not satisfy the given constraint. Experiments show that GPTS outperforms state-of-the-art work of tabu search and is competitive with other methods for HW/SW partitioning. The proposed parallelization is significant when considering the ordinary GPU platform.
Very large scale integration (VLSI) circuit partitioning is an important problem in design automation of VLSI chips and multichip systems; it is an NP-hard combinatorial optimization problem. In this paper, an effective hybrid multi-objective partitioning algorithm, based on discrete particle swarm optimization (DPSO) with local search strategy, called MDPSO-LS, is presented to solve VLSI two-way partitioning with simultaneous cutsize and circuit delay minimization. Inspired by the physics of genetic algorithms, uniform crossover and random two-point exchange operators are designed to avoid generating infeasible solutions. Furthermore, the phenotype sharing function of the objective space is applied to circuit partitioning to obtain a better approximation of the true Pareto front, and the theorem of Markov chains is used to prove global convergence. To improve the ability of local exploration, the Fiduccia-Mattheyses (FM) strategy is also applied to further improve the cutsize of each particle, and a local search strategy for improving the circuit delay objective is also designed. Experiments on ISCAS89 benchmark circuits show that the proposed algorithm is efficient.
With the enforcement of the removal system for distressed firms and the new Bankruptcy Law in China’s securities market in June 2007, the development of the bankruptcy process for firms in China is expected to create a huge impact. Therefore, identification of potential corporate distress and offering early warnings to investors, analysts, and regulators has become important. There are very distinct differences, in accounting procedures and quality of financial documents, between firms in China and those in the western world. Therefore, it may not be practical to directly apply those models or methodologies developed elsewhere to support identification of such potential distressed situations. Moreover, localized models are commonly superior to ones imported from other environments.
Based on the
This paper presents a high-quality very large scale integration (VLSI) global router in X-architecture, called XGRouter, that heavily relies on integer linear programming (ILP) techniques, partition strategy and particle swarm optimization (PSO). We propose a new ILP formulation, which can achieve a more uniform routing solution than other formulations and can be effectively solved by the proposed PSO. To effectively use the new ILP formulation, a partition strategy that decomposes a large-sized problem into some small-sized sub-problems is adopted and the routing region is extended progressively from the most congested region. In the post-processing stage of XGRouter, maze routing based on a new routing edge cost is designed to further optimize the total wire length and maintain the congestion uniformity. To the best of our knowledge, XGRouter is the first work to use a concurrent algorithm to solve the global routing problem in X-architecture. Experimental results show that XGRouter can produce solutions of higher quality than other global routers. Moreover, like several state-of-the-art global routers, XGRouter has no overflow.
Cloud computing is becoming a very popular term in industry and is receiving a large amount of attention from the research community. Replica management is one of the most important issues in the cloud, which can offer fast data access time, high data availability and reliability. By keeping all replicas active, the replicas may improve the rate of successful task execution if the replicas and requests are reasonably distributed. However, appropriate replica placement in large-scale, dynamically scalable and totally virtualized data centers is much more complicated. To provide cost-effective availability, minimize the response time of applications and achieve load balancing for cloud storage, a new replica placement strategy is proposed. The replica placement is based on five important parameters: mean service time, failure probability, load variance, latency and storage usage. However, replication should be used wisely because the storage size of each site is limited. Thus, the site must keep only the important replicas. We also present a new replica replacement strategy based on the availability of the file, the last time the replica was requested, the number of accesses, and the size of the replica. We evaluate our algorithm using the CloudSim simulator and find that it offers better performance in comparison with other algorithms in terms of mean response time, effective network usage, load balancing, replication frequency, and storage usage.