This paper investigates human mobility patterns in an urban taxi transportation system. This work focuses on predicting humanmobility fromdiscovering patterns of in the number of passenger pick-ups quantity (PUQ) from urban hotspots. This paper proposes an improved ARIMA based prediction method to forecast the spatial-temporal variation of passengers in a hotspot. Evaluation with a large-scale realworld data set of 4 000 taxis’ GPS traces over one year shows a prediction error of only 5.8%. We also explore the application of the prediction approach to help drivers find their next passengers. The simulation results using historical real-world data demonstrate that, with our guidance, drivers can reduce the time taken and distance travelled, to find their next passenger, by 37.1% and 6.4%, respectively.
Text, as one of the most influential inventions of humanity, has played an important role in human life, so far from ancient times. The rich and precise information embodied in text is very useful in a wide range of vision-based applications, therefore text detection and recognition in natural scenes have become important and active research topics in computer vision and document analysis. Especially in recent years, the community has seen a surge of research efforts and substantial progresses in these fields, though a variety of challenges (e.g. noise, blur, distortion, occlusion and variation) still remain. The purposes of this survey are three-fold: 1) introduce up-to-date works, 2) identify state-of-the-art algorithms, and 3) predict potential research directions in the future. Moreover, this paper provides comprehensive links to publicly available resources, including benchmark datasets, source codes, and online demos. In summary, this literature review can serve as a good reference for researchers in the areas of scene text detection and recognition.
Multi-label learning deals with problems where each example is represented by a single instance while being associated with multiple class labels simultaneously. Binary relevance is arguably the most intuitive solution for learning from multi-label examples. It works by decomposing the multi-label learning task into a number of independent binary learning tasks (one per class label). In view of its potential weakness in ignoring correlations between labels, many correlation-enabling extensions to binary relevance have been proposed in the past decade. In this paper, we aim to review the state of the art of binary relevance from three perspectives. First, basic settings for multi-label learning and binary relevance solutions are briefly summarized. Second, representative strategies to provide binary relevancewith label correlation exploitation abilities are discussed. Third, some of our recent studies on binary relevance aimed at issues other than label correlation exploitation are introduced. As a conclusion, we provide suggestions on future research directions.
DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.
On June 17, 2013, MilkyWay-2 (Tianhe-2) supercomputer was crowned as the fastest supercomputer in the world on the 41th TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of hardware and software systems. The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, proprietary 16-core processor designed for scientific computing, efficient software stacks that provide high performance file system, emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications from LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.
Graphical models have become the basic framework for topic based probabilistic modeling. Especially models with latent variables have proved to be effective in capturing hidden structures in the data. In this paper, we survey an important subclass Directed Probabilistic Topic Models (DPTMs) with soft clustering abilities and their applications for knowledge discovery in text corpora. From an unsupervised learning perspective, “topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling”. In topic modeling, a document consists of different hidden topics and the topic probabilities provide an explicit representation of a document to smooth data from the semantic level. It has been an active area of research during the last decade. Many models have been proposed for handling the problems of modeling text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We give basic concepts, advantages and disadvantages in a chronological order, existing models classification into different categories, their parameter estimation and inference making algorithms with models performance evaluation measures. We also discuss their applications, open challenges and future directions in this dynamic area of research.
This paper provides a short review of some of the main topics in which the current research in evolutionary multi-objective optimization is being focused. The topics discussed include new algorithms, efficiency, relaxed forms of dominance, scalability, and alternative metaheuristics. This discussion motivates some further topics which, from the author’s perspective, constitute good potential areas for future research, namely, constraint-handling techniques, incorporation of user’s preferences and parameter control. This information is expected to be useful for those interested in pursuing research in this area.
There is a trend that, virtually everyone, ranging from big Web companies to traditional enterprisers to physical science researchers to social scientists, is either already experiencing or anticipating unprecedented growth in the amount of data available in their world, as well as new opportunities and great untapped value. This paper reviews big data challenges from a data management respective. In particular, we discuss big data diversity, big data reduction, big data integration and cleaning, big data indexing and query, and finally big data analysis and mining. Our survey gives a brief overview about big-data-oriented research and problems.
Remote photoplethysmography (rPPG) allows remote measurement of the heart rate using low-cost RGB imaging equipment. In this study, we review the development of the field of rPPG since its emergence in 2008. We also classify existing rPPG approaches and derive a framework that provides an overview of modular steps. Based on this framework, practitioners can use our classification to design algorithms for an rPPG approach that suits their specific needs. Researchers can use the reviewed and classified algorithms as a starting point to improve particular features of an rPPG algorithm.
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
As an effective image segmentation method, the standard fuzzy c-means (FCM) clustering algorithm is very sensitive to noise in images. Several modified FCM algorithms, using local spatial information, can overcome this problem to some degree. However, when the noise level in the image is high, these algorithms still cannot obtain satisfactory segmentation performance. In this paper, we introduce a non local spatial constraint term into the objective function of FCM and propose a fuzzy c-means clustering algorithm with non local spatial information (FCM_NLS). FCM_NLS can deal more effectively with the image noise and preserve geometrical edges in the image. Performance evaluation experiments on synthetic and real images, especially magnetic resonance (MR) images, show that FCM_NLS is more robust than both the standard FCM and the modified FCM algorithms using local spatial information for noisy image segmentation.
With the explosive growth of information, more and more organizations are deploying private cloud systems or renting public cloud systems to process big data. However, there is no existing benchmark suite for evaluating cloud performance on the whole system level. To the best of our knowledge, this paper proposes the first benchmark suite
Robust face representation is imperative to highly accurate face recognition. In this work, we propose an open source face recognition method with deep representation named as VIPLFaceNet, which is a 10-layer deep convolutional neural network with seven convolutional layers and three fully-connected layers. Compared with the well-known AlexNet, our VIPLFaceNet takes only 20% training time and 60% testing time, but achieves 40% drop in error rate on the real-world face recognition benchmark LFW. Our VIPLFaceNet achieves 98.60% mean accuracy on LFW using one single network. An open-source C++ SDK based on VIPLFaceNet is released under BSD license. The SDK takes about 150ms to process one face image in a single thread on an i7 desktop CPU. VIPLFaceNet provides a state-of-the-art start point for both academic and industrial face recognition applications.
RDF is increasingly being used to encode data for the semantic web and data exchange. There have been a large number of works that address RDF data management following different approaches. In this paper we provide an overview of these works. This review considers centralized solutions (what are referred to as warehousing approaches), distributed solutions, and the techniques that have been developed for querying linked data. In each category, further classifications are provided that would assist readers in understanding the identifying characteristics of different approaches.
A new meaningful image encryption algorithm based on compressive sensing (CS) and integer wavelet transformation (IWT) is proposed in this study. First of all, the initial values of chaotic system are encrypted by RSA algorithm, and then they are open as public keys. To make the chaotic sequence more random, a mathematical model is constructed to improve the random performance. Then, the plain image is compressed and encrypted to obtain the secret image. Secondly, the secret image is inserted with numbers zero to extend its size same to the plain image. After applying IWT to the carrier image and discrete wavelet transformation (DWT) to the inserted image, the secret image is embedded into the carrier image. Finally, a meaningful carrier image embedded with secret plain image can be obtained by inverse IWT. Here, the measurement matrix is built by both chaotic system and Hadamard matrix, which not only retains the characteristics of Hadamard matrix, but also has the property of control and synchronization of chaotic system. Especially, information entropy of the plain image is employed to produce the initial conditions of chaotic system. As a result, the proposed algorithm can resist known-plaintext attack (KPA) and chosen-plaintext attack (CPA). By the help of asymmetric cipher algorithm RSA, no extra transmission is needed in the communication. Experimental simulations show that the normalized correlation (NC) values between the host image and the cipher image are high. That is to say, the proposed encryption algorithm is imperceptible and has good hiding effect.
N6-methyladenosine (m 6A) is a prevalent methylation modification and plays a vital role in various biological processes, such as metabolism, mRNA processing, synthesis, and transport. Recent studies have suggested that m 6A modification is related to common diseases such as cancer, tumours, and obesity. Therefore, accurate prediction of methylation sites in RNA sequences has emerged as a critical issue in the area of bioinformatics. However, traditional high-throughput sequencing and wet bench experimental techniques have the disadvantages of high costs, significant time requirements and inaccurate identification of sites. But through the use of traditional experimental methods, researchers have produced many large databases of m 6A sites. With the support of these basic databases and existing deep learning methods, we developed an m 6A site predictor named DeepM6ASeq-EL, which integrates an ensemble of five LSTM and CNN classifiers with the combined strategy of hard voting. Compared to the state-of-the-art prediction method WHISTLE (average AUC 0.948 and 0.880), the DeepM6ASeq-EL had a lower accuracy in m 6A site prediction (average AUC: 0.861 for the full transcript models and 0.809 for the mature messenger RNA models) when tested on six independent datasets.
This paper presents a novel tracking algorithm which integrates two complementary trackers. Firstly, an improved Bayesian tracker(B-tracker) with adaptive learning rate is presented. The classification score of B-tracker reflects tracking reliability, and a low score usually results from large appearance change. Therefore, if the score is low, we decrease the learning rate to update the classifier fast so that B-tracker can adapt to the variation and vice versa. In this way, B-tracker is more suitable than its traditional version to solve appearance change problem. Secondly, we present an improved incremental subspace learning method tracker(Stracker). We propose to calculate projected coordinates using maximum posterior probability, which results in a more accurate reconstruction error than traditional subspace learning tracker. Instead of updating at every time, we present a stopstrategy to deal with occlusion problem. Finally, we present an integrated framework(BAST), in which the pair of trackers run in parallel and return two candidate target states separately. For each candidate state, we define a tracking reliability metrics to measure whether the candidate state is reliable or not, and the reliable candidate state will be chosen as the target state at the end of each frame. Experimental results on challenging sequences show that the proposed approach is very robust and effective in comparison to the state-of-the-art trackers.
In the real-world applications, most optimization problems are subject to different types of constraints. These problems are known as constrained optimization problems (COPs). Solving COPs is a very important area in the optimization field. In this paper, a hybrid multi-swarm particle swarm optimization (HMPSO) is proposed to deal with COPs. This method adopts a parallel search operator in which the current swarm is partitioned into several subswarms and particle swarm optimization (PSO) is severed as the search engine for each sub-swarm. Moreover, in order to explore more promising regions of the search space, differential evolution (DE) is incorporated to improve the personal best of each particle. First, the method is tested on 13 benchmark test functions and compared with three stateof-the-art approaches. The simulation results indicate that the proposed HMPSO is highly competitive in solving the 13 benchmark test functions. Afterward, the effectiveness of some mechanisms proposed in this paper and the effect of the parameter setting were validated by various experiments. Finally, HMPSO is further applied to solve 24 benchmark test functions collected in the 2006 IEEE Congress on Evolutionary Computation (CEC2006) and the experimental results indicate that HMPSO is able to deal with 22 test functions.
Mobile ad-hoc networks (MANETs) and wireless sensor networks (WSNs) have gained remarkable appreciation and technological development over the last few years. Despite ease of deployment, tremendous applications and significant advantages, security has always been a challenging issue due to the nature of environments in which nodes operate. Nodes’ physical capture, malicious or selfish behavior cannot be detected by traditional security schemes. Trust and reputation based approaches have gained global recognition in providing additional means of security for decision making in sensor and ad-hoc networks. This paper provides an extensive literature review of trust and reputation based models both in sensor and ad-hoc networks. Based on the mechanism of trust establishment, we categorize the stateof-the-art into two groups namely node-centric trust models and system-centric trust models. Based on trust evidence, initialization, computation, propagation and weight assignments, we evaluate the efficacy of the existing schemes. Finally, we conclude our discussion with identification of some unresolved issues in pursuit of trust and reputation management.
In this paper, several diversity measurements will be discussed and defined. As in other evolutionary algorithms, first the population position diversity will be discussed followed by the discussion and definition of population velocity diversity which is different from that in other evolutionary algorithms since only PSO has the velocity parameter. Furthermore, a diversity measurement called cognitive diversity is discussed and defined, which can reveal clustering information about where the current population of particles intends to move towards. The diversity of the current population of particles and the cognitive diversity together tell what the convergence/divergence stage the current population of particles is at and which stage it moves towards.
This paper presents a high-quality very large scale integration (VLSI) global router in X-architecture, called XGRouter, that heavily relies on integer linear programming (ILP) techniques, partition strategy and particle swarm optimization (PSO). A new ILP formulation, which can achieve more uniform routing solution than other formulations and can be effectively solved by the proposed PSO is proposed. To effectively use the new ILP formulation, a partition strategy that decomposes a large-sized problem into some small-sized sub-problems is adopted and the routing region is extended progressively from the most congested region. In the post-processing stage of XGRouter, maze routing based on new routing edge cost is designed to further optimize the total wire length and mantain the congestion uniformity. To our best knowledge, XGRouter is the first work to use a concurrent algorithm to solve the global routing problem in X-architecture. Experimental results show that XGRouter can produce solutions of higher quality than other global routers. And, like several state-of-the-art global routers, XGRouter has no overflow.
The task of next POI recommendations has been studied extensively in recent years. However, developing a unified recommendation framework to incorporate multiple factors associated with both POIs and users remains challenging, because of the heterogeneity nature of these information. Further, effective mechanisms to smoothly handle cold-start cases are also a difficult topic. Inspired by the recent success of neural networks in many areas, in this paper, we propose a simple yet effective neural network framework, named NEXT, for next POI recommendations. NEXT is a unified framework to learn the hidden intent regarding user’s next move, by incorporating different factors in a unified manner. Specifically, in NEXT, we incorporatemeta-data information, e.g., user friendship and textual descriptions of POIs, and two kinds of temporal contexts (i.e., time interval and visit time). To leverage sequential relations and geographical influence, we propose to adopt DeepWalk, a network representation learning technique, to encode such knowledge. We evaluate the effectiveness of NEXT against other state-of-the-art alternatives and neural networks based solutions. Experimental results on three publicly available datasets demonstrate that NEXT significantly outperforms baselines in real-time next POI recommendations. Further experiments show inherent ability of NEXT in handling cold-start.
Blockchain has recently emerged as a research trend, with potential applications in a broad range of industries and context. One particular successful Blockchain technology is smart contract, which is widely used in commercial settings (e.g., high value financial transactions). This, however, has security implications due to the potential to financially benefit froma security incident (e.g., identification and exploitation of a vulnerability in the smart contract or its implementation). Among, Ethereum is the most active and arresting. Hence, in this paper, we systematically review existing research efforts on Ethereum smart contract security, published between 2015 and 2019. Specifically, we focus on how smart contracts can be maliciously exploited and targeted, such as security issues of contract program model, vulnerabilities in the program and safety consideration introduced by program execution environment. We also identify potential research opportunities and future research agenda.
High-utility itemset mining (HUIM) is a popular data mining task with applications in numerous domains. However, traditional HUIM algorithms often produce a very large set of high-utility itemsets (HUIs). As a result, analyzing HUIs can be very time consuming for users. Moreover, a large set of HUIs also makes HUIM algorithms less efficient in terms of execution time and memory consumption. To address this problem, closed high-utility itemsets (CHUIs), concise and lossless representations of all HUIs, were proposed recently. Although mining CHUIs is useful and desirable, it remains a computationally expensive task. This is because current algorithms often generate a huge number of candidate itemsets and are unable to prune the search space effectively. In this paper, we address these issues by proposing a novel algorithm called CLS-Miner. The proposed algorithm utilizes the utility-list structure to directly compute the utilities of itemsets without producing candidates. It also introduces three novel strategies to reduce the search space, namely chain-estimated utility co-occurrence pruning, lower branch pruning, and pruning by coverage. Moreover, an effective method for checking whether an itemset is a subset of another itemset is introduced to further reduce the time required for discovering CHUIs. To evaluate the performance of the proposed algorithm and its novel strategies, extensive experiments have been conducted on six benchmark datasets having various characteristics. Results show that the proposed strategies are highly efficient and effective, that the proposed CLS-Miner algorithmoutperforms the current state-ofthe- art CHUD and CHUI-Miner algorithms, and that CLSMiner scales linearly.
System management is becoming increasingly complex and brings high costs, especially with the advent of cloud computing. Cloud computing involves numerous platforms often of virtual machines (VMs) and middleware has to be managed to make the whole system work costeffectively after an application is deployed. In order to reduce management costs, in particular for the manual activities, many computer programs have been developed remove or reduce the complexity and difficulty of system mamnagement. These programs are usually hard-coded in languages like Java and C++, which bring enough capability and flexibility but also cause high programming effort and cost. This paper proposes an architecture for developing management programs in a simple but powerful way. First of all, the manageability of a given platform (via APIs, configuration files, and scripts) is abstracted as a runtime model of the platform’s software architecture, which can automatically and immediately propagate any observable runtime changes of the target platforms to the corresponding architecture models, and vice versa. The management programs are developed using modeling languages, instead of those relatively low-level programming languages. Architecture-level management programs bring many advantages related to performance, interoperability, reusability, and simplicity. An experiment on a real-world cloud deployment and comparisonwith traditional programming language approaches demonstrate the feasibility, effectiveness, and benefits of the new model based approach for management program development.
Semi-supervised learning constructs the predictive model by learning from a few labeled training examples and a large pool of unlabeled ones. It has a wide range of application scenarios and has attracted much attention in the past decades. However, it is noteworthy that although the learning performance is expected to be improved by exploiting unlabeled data, some empirical studies show that there are situations where the use of unlabeled data may degenerate the performance. Thus, it is advisable to be able to exploit unlabeled data safely. This article reviews some research progress of safe semi-supervised learning, focusing on three types of safeness issue: data quality, where the training data is risky or of low-quality;model uncertainty, where the learning algorithm fails to handle the uncertainty during training; measure diversity, where the safe performance could be adapted to diverse measures.