String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the past decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
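As a concrete illustration of one widely used similarity function, the following sketch (ours, not drawn from the surveyed algorithms) computes Levenshtein edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A similarity search then returns all strings within a given distance threshold of the query.

```python
# A minimal sketch (not from the survey) of Levenshtein edit distance,
# using a single rolling row of the dynamic-programming table.
def edit_distance(s: str, t: str) -> int:
    dp = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i            # prev holds dp[i-1][j-1]
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                # delete cs from s
                dp[j - 1] + 1,            # insert ct into s
                prev + (cs != ct),        # substitute (free if chars match)
            )
    return dp[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

This dynamic program costs O(|s| * |t|) per pair, which is why much of the surveyed work focuses on pruning candidate strings so that this verification step runs on as few strings as possible.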
RDF is increasingly being used to encode data for the semantic web and data exchange. A large number of works address RDF data management, following different approaches, and in this paper we provide an overview of these works. This review considers centralized solutions (referred to as warehousing approaches), distributed solutions, and the techniques that have been developed for querying linked data. In each category, further classifications are provided to assist readers in understanding the identifying characteristics of the different approaches.
Recent years have witnessed the rapid growth of social multimedia data available over the Internet. This age of huge media collections gives users facilities to share and access data, but it also demands a revolution in data management techniques, since the exponential growth of social multimedia requires more scalable, effective, and robust technologies to manage and index it. Events are among the most important cues for recalling people's past memories, and this reminder value makes them extremely helpful in organizing data. The study of event-based analysis of social multimedia data has drawn intensive attention in the research community. In this article, we provide a comprehensive survey of event-based analysis over social multimedia data, including event enrichment, detection, and categorization. We introduce each paradigm and summarize related research efforts. In addition, we suggest the emerging trends in this research area.
Traditionally, complex engineering applications (CEAs), which consist of numerous software components and require a large amount of computing resources, run in dedicated clusters or high performance computing (HPC) centers. Nowadays, Cloud computing systems, with their ability to provide massive computing resources and customizable execution environments, are becoming an attractive option for CEAs. As a new type of Cloud application, CEAs also bring challenges in dealing with Cloud resources. In this paper, we provide a comprehensive survey of Cloud resource management research for CEAs. The survey puts forward two important questions: 1) what are the main challenges for CEAs to run in Clouds? and 2) what are the prior research topics addressing these challenges? We summarize and highlight the main challenges and prior research topics. Our work should be helpful to scientists and engineers who are interested in running CEAs in Cloud environments.
With the popularity of social networks, the demand for real-time processing of graph data is increasing. However, most existing graph systems adopt a batch processing mode, so the overhead of maintaining and processing dynamic graphs is significantly high. In this paper, we design iGraph, an incremental graph processing system for dynamic graphs with continuous updates. The contributions of iGraph include: 1) a hash-based graph partition strategy to enable fine-grained graph updates (sketched below); 2) a vertex-based graph computing model to support incremental data processing; 3) hotspot detection and rebalancing methods to address the workload imbalance problem during incremental processing. Through its general-purpose API, iGraph can be used to implement various graph processing algorithms such as PageRank. We have implemented iGraph on Apache Spark, and experimental results show that on real-life datasets, iGraph outperforms the original GraphX in terms of both graph update and graph computation.
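To make the first contribution concrete, here is a minimal standalone sketch (ours; iGraph itself is built on Spark/GraphX, and all names here are illustrative) of hash-based vertex partitioning. Because a vertex's owning partition is a pure function of its id, a single edge insertion touches only the partitions owning its two endpoints, which is what makes fine-grained incremental updates possible.

```python
# A toy illustration of hash-based graph partitioning for fine-grained updates.
NUM_PARTITIONS = 8

def partition_of(vertex_id: int) -> int:
    # deterministic hash routing: any worker can recompute the owner locally
    return hash(vertex_id) % NUM_PARTITIONS

partitions = [dict() for _ in range(NUM_PARTITIONS)]  # vertex -> adjacency set

def add_edge(src: int, dst: int) -> None:
    # an incremental update touches only the partitions owning src and dst
    for v, u in ((src, dst), (dst, src)):
        partitions[partition_of(v)].setdefault(v, set()).add(u)

add_edge(1, 2)
add_edge(2, 3)
print(partitions[partition_of(2)][2])  # {1, 3}
```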
A co-location pattern is a set of spatial features whose instances frequently appear in a spatial neighborhood. This paper efficiently mines the top-k probabilistic prevalent co-locations over spatially uncertain data sets and makes the following contributions: 1) the concept of top-k probabilistic prevalent co-locations is defined based on a possible-world model; 2) a framework for discovering the top-k probabilistic prevalent co-locations is set up; 3) a matrix method is proposed to improve the computation of the prevalence probability of a top-k candidate, and two pruning rules for the matrix block are given to accelerate the search for exact solutions; 4) a polynomial matrix is developed to further speed up the top-k candidate refinement process; 5) an approximate algorithm with a compensation factor is introduced so that relatively large quantities of data can be processed quickly. The efficiency of the proposed algorithms, as well as the accuracy of the approximate algorithm, is evaluated with an extensive set of experiments using both synthetic and real uncertain data sets.
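For intuition about the possible-world semantics behind contribution 1), the toy sketch below (ours; the instances, existence probabilities, distance threshold, and prevalence threshold are all made up) computes a pattern's prevalence probability by brute-force enumeration of worlds. The paper's matrix method and pruning rules exist precisely because this enumeration is exponential in the number of uncertain instances.

```python
# Prevalence probability of co-location {A, B} under a possible-world model:
# each uncertain instance exists independently with a given probability, and
# the prevalence probability is the total probability of worlds in which the
# pattern's participation index reaches the threshold.
from itertools import product

# (feature, x, y, existence probability) -- toy uncertain instances
instances = [("A", 0.0, 0.0, 0.9), ("A", 5.0, 5.0, 0.4),
             ("B", 0.1, 0.1, 0.8), ("B", 5.1, 5.0, 0.5)]
NEIGHBOR_DIST, MIN_PREV = 1.0, 0.5

def neighbors(p, q):
    return ((p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2) ** 0.5 <= NEIGHBOR_DIST

def participation_index(world, pattern=("A", "B")):
    # min over features of the fraction of that feature's instances that
    # co-locate with the other feature (toy: worlds hold only these features)
    ratios = []
    for f in pattern:
        mine = [p for p in world if p[0] == f]
        if not mine:
            return 0.0
        others = [q for q in world if q[0] != f]
        part = [p for p in mine if any(neighbors(p, q) for q in others)]
        ratios.append(len(part) / len(mine))
    return min(ratios)

# enumerate all 2^n possible worlds -- viable only for tiny n
prev_prob = 0.0
for mask in product([0, 1], repeat=len(instances)):
    w_prob = 1.0
    for bit, inst in zip(mask, instances):
        w_prob *= inst[3] if bit else 1.0 - inst[3]
    world = [inst for bit, inst in zip(mask, instances) if bit]
    if participation_index(world) >= MIN_PREV:
        prev_prob += w_prob

print(f"P({{A,B}} is prevalent) = {prev_prob:.4f}")
```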
Recent studies have applied different approaches to summarizing software artifacts, yet very few efforts have been made to summarize the source code fragments available on the web. This paper investigates the feasibility of generating code fragment summaries by using supervised learning algorithms. We hire a crowd of ten individuals from the same workplace to extract source code features on a corpus of 127 code fragments retrieved from Eclipse and NetBeans official frequently asked questions (FAQs), and human annotators suggest summary lines. Our machine learning algorithms achieve a precision of 82% and perform statistically better than existing code fragment classifiers; evaluation of the algorithms on several statistical measures endorses this result. The result is promising, suggesting that mechanisms such as data-driven crowd enlistment can improve the efficacy of existing code fragment classifiers.
Social networks are fundamental media for the diffusion of information: contagions appear at some node of the network and propagate over its edges. Prior research mainly focuses on each contagion spreading independently, disregarding the interactions of multiple contagions as they propagate at the same time. In the real world, simultaneous news and events usually have to compete for users' attention to get propagated, while in other cases they can cooperate with each other and achieve greater influence.
In this paper, an evolutionary game theoretic framework is proposed to model the interactions among multiple contagions. The basic idea is that different contagions in a social network are similar to multiple organisms in a population, and the diffusion process resembles organisms interacting and then evolving from one state to another. The framework statistically learns the payoffs of contagions interacting with each other and builds the payoff matrix. Since learning payoffs for all pairs of contagions is almost impossible (quadratic in the number of contagions), a contagion clustering method is proposed to decrease the number of parameters to fit, which makes our approach efficient and scalable. To verify the proposed framework, we conduct experiments using a real-world information-spreading dataset from Digg. Experimental results show that the proposed game theoretic framework helps to better comprehend the information diffusion process and can predict users' forwarding behaviors more accurately than previous studies. The analyses of the evolution dynamics of contagions and of evolutionarily stable strategies reveal whether a contagion can be promoted or suppressed by others in the diffusion process.
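As a minimal illustration of this machinery (with made-up payoff values, not the entries the paper learns from the Digg data), the sketch below runs discrete replicator dynamics on a 2x2 payoff matrix for two competing contagions. The state the attention shares converge to plays the role of the evolutionarily stable strategy, showing how one contagion can suppress another.

```python
import numpy as np

# payoff[i][j]: payoff to contagion i from an interaction with contagion j
# (hypothetical values; the paper learns these statistically from diffusion data)
payoff = np.array([[0.6, 0.2],
                   [0.3, 0.5]])

x = np.array([0.55, 0.45])            # initial shares of users' attention
for _ in range(300):
    fitness = payoff @ x              # expected payoff of each contagion
    avg = x @ fitness                 # population-average payoff
    x += 0.1 * x * (fitness - avg)    # discrete replicator update
    x /= x.sum()                      # keep shares on the simplex

print("long-run attention shares:", np.round(x, 3))  # ~[1. 0.]: contagion 1 wins
```

With these payoffs, the interior equilibrium is unstable, so whichever contagion starts with a slight advantage eventually captures all the attention, an example of one contagion suppressing another.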
In today's data centers supporting Internet-scale computing and input/output (I/O) services, increasingly more network-intensive applications are deployed on the network as a service. It is therefore critical for the applications to quickly retrieve requests from the network and send their responses back to it. To facilitate this network function, the operating system usually provides an event notification mechanism so that the applications (or the library) know whether the network is ready to supply data for them to read or to receive data for them to write. As a widely used and representative notification mechanism, epoll in Linux provides a scalable and high-performance implementation by allowing applications to indicate specifically which connections, and what events on them, need to be watched.
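For readers unfamiliar with the mechanism, the following minimal echo-server sketch (ours, using Python's stdlib epoll binding, which wraps the same Linux interface; the port and buffer size are arbitrary) shows the pattern. Every register, modify, or unregister of the interest set is a separate epoll_ctl system call, which is exactly the traffic identified as a bottleneck below.

```python
import select, socket

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 8080))       # arbitrary port for the sketch
server.listen()
server.setblocking(False)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)    # epoll_ctl: watch for accepts
conns = {}

while True:
    for fd, events in ep.poll(timeout=1.0):     # epoll_wait: fetch ready events
        if fd == server.fileno():
            conn, _ = server.accept()
            conn.setblocking(False)
            conns[conn.fileno()] = conn
            ep.register(conn.fileno(), select.EPOLLIN)  # epoll_ctl per connection
        elif events & select.EPOLLIN:
            data = conns[fd].recv(4096)
            if data:
                conns[fd].send(data)            # echo back (sketch only)
            else:                               # peer closed the connection
                ep.unregister(fd)               # epoll_ctl again
                conns.pop(fd).close()
```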
As epoll is used in major systems, including key-value (KV) systems such as Redis and Memcached, and web server systems such as NGINX, we have identified a substantial performance issue in its use. For the sake of efficiency, applications usually use epoll's system calls to inform the kernel of exactly which events they are interested in and always keep that information up to date. However, in a system with demanding network traffic, such rigid maintenance of the information is unnecessary, and the excessive number of system calls made for this purpose can substantially degrade the system's performance. In this paper, we use Redis as an example to explore the issue. We propose a strategy of informing the kernel of the interest events in a manner adaptive to the current network load, so that the epoll system calls can be reduced and the events can still be efficiently delivered. We have implemented an event-polling library, named FlexPoll, purely at user level, without modifying any kernel code.
Our evaluation on Redis shows that query throughput can be improved by up to 46.9% on micro-benchmarks, and by up to 67.8% on workloads emulating real-world access patterns. FlexPoll is a generic mechanism, so it can be adopted by other applications, such as NGINX and Memcached, in a straightforward manner.
Automated trust negotiation (ATN) offers an attractive means of trust establishment: it establishes mutual trust among strangers wishing to share resources or conduct business, but it comes at the cost of non-trivial computation and communication overheads. Deploying ATN strategies on a resource-constrained mobile device may lead to user-obstructive latency. In this paper, we propose a trust negotiation strategy called the trust target Petri net negotiation strategy (TPNNS). It greatly reduces negotiation latency on mobile devices compared with other negotiation strategies, since it considers all the alternative responses at each step and chooses the best one. TPNNS supports cycle avoidance and employs skipped TPNs, a new approach presented in this paper. Moreover, it is complete and ensures that no irrelevant credentials are disclosed during the trust negotiation.
Enterprises build private clouds to provide IT resources for geographically distributed subsidiaries or product divisions. Public cloud providers like Amazon lease their platforms to enterprise users, so enterprises can also rent a number of virtual machines (VMs) from the data centers in the providers' networks. Unfortunately, the network cannot always guarantee stable connectivity for clients to access the VMs, or low-latency transfer among data centers: both latency and bandwidth are usually unstable, and, affected by background traffic, the network status can be volatile. To reduce the latency uncertainty of client accesses, enterprises should consider the network status when they deploy data centers or rent virtual data centers from cloud providers. In this paper, we first develop a data center deployment and assignment scheme that helps an enterprise meet its users' requirements under uncertain network status. To accommodate changes in network status and user demands, a VM-migration-based redeployment scheme is then adopted. These two schemes work jointly and lay out a framework that helps enterprises make better use of private and public clouds.
There has been growing concern about the energy consumption and environmental impact of datacenters, and some pioneers have begun to power datacenters with renewable energy to offset their carbon footprint. However, it is challenging to integrate intermittent renewable energy into a datacenter power system. Grid-tied systems are widely deployed in renewable-energy-powered datacenters, but the drawbacks of grid-tie inverters (e.g., harmonic disturbance and high cost) plague this design. Besides, the mixture of green load and brown load makes power management depend heavily on software measurement and monitoring, which often suffers from inaccuracy. We propose DualPower, a novel power provisioning architecture that enables green datacenters to integrate renewable power supplies without grid-tie inverters. To optimize DualPower operation, we propose a specially designed power management framework to coordinate workload balancing with power supply switching. We evaluate three optimization schemes (LM, PS, and JO) under different datacenter operation scenarios on our trace-driven simulation platform. The experimental results show that DualPower can be as efficient as a grid-tied system and has good scalability. In contrast to previous works, DualPower integrates renewable power at lower cost and maintains full availability of datacenter servers.