gStore is an open-source native Resource Description Framework (RDF) triple store that answers SPARQL queries by subgraph matching over RDF graphs. However, the original system design has some deficiencies, such as inefficiency in answering simple queries (including one-triple pattern queries). To improve the efficiency of the system, we reconsider the system design in this paper. Specifically, we propose a new query plan generation module that generates different query plans according to the structures of query graphs. Furthermore, we redesign our vertex encoding strategy to achieve more pruning power and propose a new multi-join algorithm to speed up the subgraph matching process. Extensive experiments on synthetic and real RDF datasets show that our method outperforms the state-of-the-art algorithms significantly.
Two traditional recommendation techniques, content-based and collaborative filtering (CF), have been widely used in a broad range of domains. Both methods have their advantages and disadvantages, and some of their defects can be resolved by integrating both techniques in a hybrid model to improve the quality of the recommendation. In this article, we present a problem-oriented approach to designing a hybrid immunizing solution for the job recommendation problem from the applicant’s perspective. The proposed approach aims to recommend the job openings that offer the best chances to an applicant who is searching for a job. It combines the artificial immune system (AIS), which has a powerful exploration capability in polynomial time, with collaborative filtering, which can exploit the neighbors’ interests. We discuss the design issues, as well as the hybridization process that should be applied to the problem. Finally, experimental studies are conducted, and the results show the importance of our approach for solving the job recommendation problem.
Career indecision is a difficult obstacle confronting adolescents. Traditional vocational assessment research measures it by means of questionnaires and diagnoses the potential sources of career indecision. Based on the diagnostic outcomes, career counselors develop treatment plans tailored to students. However, because of personal motives and the architecture of the mind, it may be difficult for students to know themselves, and the outcomes of questionnaires may not fully reflect their inner states. Self-perception theory suggests that students’ behavior could be used as a clue for inference. Thus, we propose a data-driven framework for forecasting students’ career choices upon graduation based on their behavior in and around the campus, which can play an important role in supporting career counseling and career guidance. By evaluating the framework on 10M behavior records of over four thousand students, we show its potential for this purpose.
In the era of big data, the dimensionality of data is increasing dramatically in many domains. To deal with high dimensionality, online feature selection becomes critical in big data mining. Recently, online selection of dynamic features has received much attention. In situations where features arrive sequentially over time, we need to perform online feature selection upon feature arrivals. Meanwhile, when features are grouped, it is necessary to deal with features arriving by groups. To handle these challenges, some state-of-the-art methods for online feature selection have been proposed. In this paper, we first give a brief review of traditional feature selection approaches. Then we discuss specific problems of online feature selection with feature streams in detail. A comprehensive review of existing online feature selection methods is presented, comparing them with one another. Finally, we discuss several open issues in online feature selection.
With the increasing availability of modern mobile devices and location acquisition technologies, massive trajectory data of moving objects are collected continuously in a streaming manner. Clustering streaming trajectories facilitates finding the representative paths or common moving trends shared by different objects in real time. Although data stream clustering has been studied extensively in the past decade, little effort has been devoted to dealing with streaming trajectories. The main challenge lies in the strict space and time complexities of processing the continuously arriving trajectory data, combined with the difficulty of concept drift. To address this issue, we present two novel synopsis structures to extract the clustering characteristics of trajectories, and develop an incremental algorithm for the online clustering of streaming trajectories (called OCluST). It contains a micro-clustering component to cluster and summarize the most recent sets of trajectory line segments at each time instant, and a macro-clustering component to build large macro-clusters based on micro-clusters over a specified time horizon. Finally, we conduct extensive experiments on four real data sets to evaluate the effectiveness and efficiency of OCluST, and compare it with other congeneric algorithms. Experimental results show that OCluST can achieve superior performance in clustering streaming trajectories.
Because of users’ growing utilization of unclear and imprecise keywords when characterizing their information need, it has become necessary to expand their original search queries with additional words that best capture their actual intent. The selection of the terms that are suitable for use as additional words is in general dependent on the degree of relatedness between each candidate expansion term and the query keywords. In this paper, we propose two criteria for evaluating the degree of relatedness between a candidate expansion word and the query keywords: (1) co-occurrence frequency, where more importance is attributed to terms occurring in the largest possible number of documents where the query keywords appear; (2) proximity, where more importance is assigned to terms having a short distance from the query terms within documents. We also employ the strength Pareto fitness assignment in order to satisfy both criteria simultaneously. The results of our numerical experiments on MEDLINE, the online medical information database, show that the proposed approach significantly enhances the retrieval performance as compared to the baseline.
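The two relatedness criteria and the strength-based Pareto ranking described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The following are assumptions made only for illustration: the helper names (`relatedness`, `dominates`, `strength_pareto_fitness`), counting a document as a co-occurrence when it contains the candidate and at least one query keyword, measuring proximity as the mean of 1/(1 + minimum token distance), and using an SPEA2-style raw fitness as one possible reading of "strength Pareto fitness assignment".

```python
def relatedness(term, query_terms, docs):
    """Return (co-occurrence count, proximity score) for one candidate term.

    Assumed definitions (not taken from the paper):
    - co-occurrence: number of documents containing the term and a query keyword
    - proximity: mean of 1 / (1 + minimum token distance) over those documents
    """
    cooc = 0
    prox_total = 0.0
    for doc in docs:
        tokens = doc.lower().split()
        q_pos = [i for i, tok in enumerate(tokens) if tok in query_terms]
        t_pos = [i for i, tok in enumerate(tokens) if tok == term]
        if q_pos and t_pos:
            cooc += 1
            min_dist = min(abs(i - j) for i in q_pos for j in t_pos)
            prox_total += 1.0 / (1 + min_dist)
    proximity = prox_total / cooc if cooc else 0.0
    return cooc, proximity


def dominates(a, b):
    """True if objective vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def strength_pareto_fitness(scores):
    """SPEA2-style raw fitness (lower is better, 0 means non-dominated).

    strength(t) = number of candidates that t dominates
    fitness(t)  = sum of the strengths of all candidates that dominate t
    """
    strength = {
        t: sum(dominates(s, other) for other in scores.values())
        for t, s in scores.items()
    }
    return {
        t: sum(strength[u] for u, other in scores.items() if dominates(other, s))
        for t, s in scores.items()
    }


# Toy corpus, query and candidates (all hypothetical).
docs = [
    "aspirin reduces headache pain quickly",
    "headache relief with aspirin and rest",
    "migraine is a severe headache disorder",
    "rest helps recovery",
]
query = {"headache"}
candidates = ["aspirin", "migraine", "rest"]

scores = {t: relatedness(t, query, docs) for t in candidates}
fitness = strength_pareto_fitness(scores)
ranked = sorted(candidates, key=lambda t: fitness[t])
print(ranked)
```

Ranking by Pareto dominance avoids choosing fixed weights for the two criteria: a candidate is preferred only when no other candidate beats it on both co-occurrence and proximity.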
Indexing is one of the most important techniques to facilitate query processing over a multi-dimensional dataset. A commonly used strategy for such indexing is to keep the tree-structured index balanced. This strategy reduces query processing cost in the worst case, and can handle all different queries equally well. In other words, this strategy assumes that all queries are issued uniformly, partly because the query distribution cannot be known in advance and changes over time in practice. A key issue we study in this work is whether it is best to rely fully on a balanced tree-structured index, particularly as datasets become larger and larger in the big data era. When a dataset becomes very large, it becomes unreasonable to assume that all data in any subspace are equally important and are uniformly accessed by all queries at the index level. Given the existence of query skew and the possible changes of query skew, in this paper, we study how to handle such query skew and its changes at the index level without sacrificing support for any possible query in a well-balanced tree index and without high overhead. To tackle the issue, we propose the index-view at the index level, where an index-view is a short-cut in a balanced tree-structured index to access objects in the subspaces that are more frequently accessed, and propose a new index-view-centric framework for query processing using index-views in a bottom-up manner. We study the index-view selection problem in both static and dynamic settings, and we confirm the effectiveness of our approach using large real and synthetic datasets.
Incomplete data is pervasive in daily life and arises in almost all fields of scientific study, as a result of delivery failure, battery depletion, accidental loss, etc. However, modelling, indexing, and querying incomplete data pose major challenges. For example, queries over incomplete data often return unsatisfactory results because incompleteness is handled improperly. In this paper, we systematically review the management of incomplete data, including modelling, indexing, querying, and handling methods for incomplete data. We also give an overview of several application scenarios of incomplete data, and summarize the existing systems related to incomplete data. It is our hope that this survey could provide insights to the database community on how incomplete data is managed, and inspire database researchers to develop more advanced processing techniques and tools to cope with the issues resulting from incomplete data in the real world.
The importance of product recommendation has been well recognized as a central task in business intelligence for e-commerce websites. Interestingly, what has received less attention is the fact that different products take different time periods for conversion. Here, “conversion” refers to a more general set of pre-defined actions, including, for example, purchases or registrations in recommendation and advertising systems. The mismatch between the product’s actual conversion period and the application’s target conversion period has been the subtle culprit compromising many existing recommendation algorithms.
The challenging question of what products should be recommended for a given time period to maximize conversion has motivated us in this paper to propose a rank-based time-aware conversion prediction model (rTCP), which considers both recommendation relevance and conversion time. We adopt lifetime models in survival analysis to model the conversion time and personalize the temporal prediction by incorporating context information such as user preference. A novel mixture lifetime model is proposed to further accommodate the complexity of conversion intervals. Experimental results on two real-world data sets illustrate the high goodness of fit of our proposed model rTCP and demonstrate its effectiveness in time-aware conversion rate prediction for advertising and product recommendation.
Entity matching, which aims at finding records that belong to the same real-world object, has been studied for decades. In order to avoid verifying every pair of records in a massive data set, a common method, known as the blocking-based method, selects a small proportion of record pairs for verification at a far lower cost than O(n^2), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical, since many more matching records can be found in this way, so that the quality of the query result can be improved significantly.
It is popular to use the MapReduce (MR) framework to improve the performance and the scalability of some complicated queries by running many map (or reduce) tasks in parallel. However, entity matching upon the MapReduce framework is non-trivial due to two inevitable challenges: load balancing and pair deduplication. In this paper, we propose a novel solution, called MrEm, to handle these challenges with the support of multiple blocking functions. Although existing work can deal with load balancing and pair deduplication separately, it cannot deal with both challenges at the same time. Theoretical analysis and experimental results upon real and synthetic data sets illustrate the high effectiveness and efficiency of our proposed solution.
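To make the roles of blocking and pair deduplication concrete, the following is a minimal single-machine Python sketch. It is not MrEm and does not use MapReduce or address load balancing. The function name `candidate_pairs`, the record layout, and the two blocking functions (`by_zip`, `by_surname`) are hypothetical choices made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_functions):
    """Collect candidate record pairs produced by several blocking functions.

    records: dict mapping record id -> record
    blocking_functions: callables mapping a record to a blocking key
    Only records sharing a key under some function are paired, and a set
    keeps each pair once even when several functions produce it.
    """
    pairs = set()
    for block_fn in blocking_functions:
        blocks = defaultdict(list)
        for rid, rec in records.items():
            blocks[block_fn(rec)].append(rid)
        for ids in blocks.values():
            for a, b in combinations(sorted(ids), 2):
                pairs.add((a, b))  # deduplication: repeated pairs collapse
    return pairs


# Hypothetical toy records and blocking functions.
records = {
    1: {"name": "John Smith", "zip": "10001"},
    2: {"name": "Jon Smith", "zip": "10001"},
    3: {"name": "Jane Doe", "zip": "10001"},
    4: {"name": "Mary Jones", "zip": "94105"},
}


def by_zip(rec):
    return rec["zip"]


def by_surname(rec):
    return rec["name"].split()[-1].lower()


pairs = candidate_pairs(records, [by_zip, by_surname])
print(sorted(pairs))
```

In this toy example the pair (1, 2) is generated by both blocking functions but is verified only once, and only 3 of the 6 possible pairs are kept for verification. In a distributed setting the same pair may be produced by different tasks, so a shared in-memory set is not available, which is part of why deduplication becomes harder there.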
Smartphone applications (apps) are becoming increasingly popular all over the world, particularly in the Chinese Generation Y population; however, surprisingly, only a small number of studies on app factors valued by this important group have been conducted. Because the competition among app developers is increasing, app factors that attract users’ attention are worth studying for sales promotion. This paper examines these factors through two separate studies. In the first study, i.e., Experiment 1, which consists of a survey, perceptual rating and verbal protocol methods are employed, and 90 randomly selected app websites are rated by 169 experienced smartphone users according to app attraction. Twelve of the most rated apps (six highest rated and six lowest rated) are selected for further investigation, and 11 influential factors that Generation Y members value are listed. A second study, i.e., Experiment 2, is conducted using the most and least rated app websites from Experiment 1, and eye tracking and verbal protocol methods are used. The eye movements of 45 participants are tracked while browsing these websites, providing evidence about what attracts these users’ attention and the order in which the app components are viewed. The results of these two studies suggest that Chinese Generation Y is a content-centric group when they browse the smartphone app marketplace. Icon, screenshot, price, rating, and name are the dominant and indispensable factors that influence purchase intentions, among which icon and screenshot should be meticulously designed. Price is another key factor that drives Chinese Generation Y’s attention. The recommended apps are the least dominant element. Design suggestions for app websites are also proposed. This research has important implications.
String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a single threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate many more typos than shorter ones. Therefore, we propose a solution for string similarity joins that supports different similarity thresholds in a single operator. In order to support different thresholds, we devise two novel indexing techniques: partition-based indexing and similarity-aware indexing. To utilize the new indices and improve the join performance, we propose new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
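The idea of a join with length-dependent thresholds can be sketched in a few lines of Python. This is a naive nested-loop baseline with a simple length filter, not the partition-based or similarity-aware indexing proposed in the paper. Edit distance as the similarity function, the rule in `threshold_for` (one allowed edit per four characters of the shorter string), and all function names are assumptions chosen only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def threshold_for(s, t):
    """Hypothetical rule: allow one edit per four characters of the shorter string."""
    return min(len(s), len(t)) // 4


def similarity_join(left, right):
    """Return pairs (s, t) whose edit distance is within their own threshold."""
    result = []
    for s in left:
        for t in right:
            tau = threshold_for(s, t)
            # Length filter: the length difference is a lower bound on edit distance.
            if abs(len(s) - len(t)) > tau:
                continue
            if edit_distance(s, t) <= tau:
                result.append((s, t))
    return result


left = ["cat", "international", "database"]
right = ["cut", "internatoinal", "databse"]
matches = similarity_join(left, right)
print(matches)
```

Under this rule the short pair ("cat", "cut") is rejected because three-character strings tolerate no edits, while the longer pairs are accepted despite containing typos, which is the behavior a single global threshold cannot express.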
Bike sharing systems are booming globally as a green and flexible transportation mode, but the flexibility also brings difficulties in keeping the bike stations balanced with enough bikes and docks. Understanding the spatio-temporal bike trip patterns in a bike sharing system, such as the popular trip origins and destinations during rush hours, is important for researchers to design models for bike scheduling and station management. However, due to privacy and operational concerns, bike trip data are usually not publicly available in many cities. Instead, the station feeds about real-time bike and dock numbers in stations are usually public, which we refer to as bike sharing system open data. In this paper, we propose an approach to infer the spatio-temporal bike trip patterns from the public station feeds. Since the number of possible trips (i.e., origin-destination station pairs) is much larger than the number of stations, we define the trip inference as an ill-posed inverse problem. To solve this problem, we identify the sparsity and locality properties of bike trip patterns, and propose a sparse and weighted regularization model to impose both properties in the solution. We evaluate our method using real-world data from Washington, D.C. and New York City. Results show that our method can effectively infer the spatio-temporal bike trip patterns and outperform the baselines in both cities.
Identifying the critical links of an urban road network for actual traffic management and intelligent transportation control is an urgent problem, especially under congestion. Most previous methods focus on static traffic characteristics for traffic planning and design. However, actual traffic management and intelligent control need to identify relevant sections from dynamic traffic information in order to address the problems of a variable transportation system. Therefore, a city-wide traffic model consisting of three related algorithms is proposed to identify significant links of the road network, using the macroscopic fundamental diagram (MFD) to capture dynamic traffic characteristics. First, a weighted traffic flow and density extraction algorithm is built with simulation modeling and regression analysis methods, based on MFD theory. Second, a critical link identification algorithm is designed on top of the first algorithm, under specified principles. Finally, a threshold algorithm is developed using cluster analysis. In addition, the algorithms are analyzed and applied in a simulation experiment on the road network of the central district of Hefei, China. The results show that the model has good operability and overcomes the shortcomings of thresholds chosen by human judgment. It provides an approach to identifying critical links for actual traffic management and intelligent control, and also gives a new method for evaluating the planning and design effect of the urban road network.