The rapid growth of social websites has dramatically increased the volume of social media, including images and videos, and visual understanding has therefore attracted great interest in several areas such as multimedia, computer vision, and pattern recognition. Valuable auxiliary resources available on social websites, such as user-provided tags, aid the tasks of visual understanding. Consequently, several methods have been proposed that exploit these auxiliary resources for tag refinement, image retrieval, and media summarization. This work conducts a comprehensive survey of recent advances in visual understanding by mining social media and discusses their merits and limitations. We then analyze the difficulties and challenges of visual understanding, followed by several possible future research directions.
Wireless indoor localization has attracted growing research interest in the mobile computing community over the last decade. Various available indoor signals, including radio frequency, ambient, visual, and motion signals, are extensively exploited for location estimation in indoor environments. The physical measurements of these signals, however, are still limited by both the resolution of devices and the spatial-temporal variability of the signals. One type of noisy signal complemented by another type of signal can benefit wireless indoor localization in many ways, since these signals are related in their physics yet independent in their noise. In this article, we survey the new trend of integrating multiple chaotic signals to facilitate the creation of a crowdsourced localization system. Specifically, we first present a three-layer framework for crowdsourcing-based indoor localization that integrates multiple signals, and illustrate the basic methodology for making use of the available signals. Next, we study the mainstream signals involved in indoor localization approaches in terms of their characteristics and typical usages. Furthermore, considering the different outputs produced by different signals, we present insights into how to integrate them to achieve localizability in different scenarios.
The verifiable computation (VC) paradigm has gained attention as a practical realization of third-party computation. More explicitly, VC allows resource-constrained clients and organizations to securely outsource expensive computations to untrusted service providers while obtaining publicly or privately verifiable results. Many mainstream solutions have been proposed to address the diverse problems within the VC domain. Some impose assumptions on the computations performed, while others take advantage of interactivity or non-interactivity, zero-knowledge proofs, and arguments. Further proposals utilize probabilistically checkable or computationally sound proofs. In this survey, we present a chronological study and classify VC proposals according to their adopted domains. First, we provide a broad overview of the theoretical advancements and critically analyze them. Subsequently, we present a comprehensive view of their utilization in state-of-the-art VC approaches. Moreover, we briefly overview recent proof-based VC systems that have lifted the VC domain to the verge of practicality. We use this study and the reviewed results to identify the similarities, alterations, modifications, and hybridizations of different approaches, while comparing their advantages and reporting their overheads. Finally, we discuss the implementation of such VC-based systems, their applications, and likely future directions.
In the era of big data, the dimensionality of data is increasing dramatically in many domains. To deal with high dimensionality, online feature selection becomes critical in big data mining. Recently, online selection of dynamic features has received much attention. In situations where features arrive sequentially over time, we need to perform online feature selection upon each feature's arrival. Meanwhile, considering grouped features, it is necessary to deal with features arriving by groups. To handle these challenges, some state-of-the-art methods for online feature selection have been proposed. In this paper, we first give a brief review of traditional feature selection approaches. Then we discuss the specific problems of online feature selection with feature streams in detail. A comprehensive review of existing online feature selection methods is presented, comparing them with each other. Finally, we discuss several open issues in online feature selection.
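To make the feature-stream setting concrete, the sketch below shows a minimal (and deliberately simplistic) online selection loop: each arriving feature is kept only if it is sufficiently relevant to the target and not redundant with features already selected. The correlation criterion and thresholds are illustrative assumptions, not the procedure of any particular method surveyed here.

```python
import numpy as np

def online_feature_selection(feature_stream, y, rel_threshold=0.1, red_threshold=0.9):
    """Toy online feature selection over a stream of features.

    feature_stream yields (name, values) pairs as features arrive over time;
    y is the target vector. Thresholds are illustrative assumptions.
    """
    selected = {}  # name -> values of kept features
    for name, x in feature_stream:
        # Relevance check: keep only features correlated with the target.
        relevance = abs(np.corrcoef(x, y)[0, 1])
        if relevance < rel_threshold:
            continue
        # Redundancy check: skip features highly correlated with one already kept.
        if any(abs(np.corrcoef(x, s)[0, 1]) > red_threshold for s in selected.values()):
            continue
        selected[name] = x
    return list(selected.keys())
```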
Career indecision is a difficult obstacle confronting adolescents. Traditional vocational assessment research measures it by means of questionnaires and diagnoses the potential sources of career indecision. Based on the diagnostic outcomes, career counselors develop treatment plans tailored to students. However, because of personal motives and the architecture of the mind, it may be difficult for students to know themselves, and the outcomes of questionnaires may not fully reflect their inner states. Self-perception theory suggests that students' behavior could be used as a clue for such inference. Thus, we propose a data-driven framework for forecasting students' career choices upon graduation based on their behavior in and around the campus, thereby supporting career counseling and career guidance. By evaluating the framework on 10 million behavior records of over four thousand students, we show its potential for this purpose.
Online prediction is a process that repeatedly predicts the next element in the coming period from a sequence of previously observed elements. This process has a broad range of applications in areas such as medicine, streaming media, and finance. The greatest challenge for online prediction is that the sequence data may have no explicit features because the data is frequently updated, which makes good predictions difficult to maintain. One popular solution is prediction with expert advice, where the challenge is to pick the right experts with minimum cumulative loss. In this research, we use forex trading prediction, a good example of online prediction, as a case study. We also propose an improved expert selection model that selects a good set of forex experts by learning from previously observed sequences. Our model considers not only the average mistakes made by experts but also the average profit they earn, achieving better performance, particularly in terms of financial profit. We demonstrate the merits of our model with extensive experiments on two corpora of real major currency pairs.
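As background for the expert-advice setting, the following sketch implements the standard exponentially weighted (Hedge-style) forecaster, which down-weights experts according to their past losses. It is only a baseline illustration with an assumed squared loss and learning rate; the model described above additionally accounts for the average profit earned by each expert, which is not shown here.

```python
import numpy as np

def hedge_forecaster(expert_predictions, outcomes, eta=0.5):
    """Exponentially weighted average forecaster (Hedge).

    expert_predictions: array of shape (T, K) with K experts' predictions per round.
    outcomes: array of shape (T,) with the realized values.
    Returns the sequence of aggregated predictions.
    """
    T, K = expert_predictions.shape
    weights = np.ones(K)
    aggregated = np.empty(T)
    for t in range(T):
        probs = weights / weights.sum()
        aggregated[t] = probs @ expert_predictions[t]        # weighted prediction
        losses = (expert_predictions[t] - outcomes[t]) ** 2  # squared loss per expert
        weights *= np.exp(-eta * losses)                     # down-weight mistaken experts
    return aggregated
```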
Developer recommendation is an essential task for resolving incoming issues in the evolution of software. Many developer recommendation techniques have been developed in the literature; most of them combine historical commits, as supplementary information, with bug repositories and/or source-code repositories to recommend developers. However, the question of whether the messages in historical commits are always useful has not yet been answered. This article aims to answer this question by conducting an empirical study on four open-source projects. The results show that: (1) the number of meaningful words in a commit description affects the quality of the commit, and a description with more meaningful words generally reflects the developer's expertise better; (2) using commit descriptions to recommend relevant developers works better than using the relevant files recorded in historical commits; (3) developers tend to change the relevant files that they have changed many times before; and (4) developers generally tend to change the files that they have changed recently.
The emerging integrated CPU–GPU architectures make it practical for short computational kernels to utilize GPU acceleration. Evidence has shown that, on such systems, GPU control responsiveness (how soon the host program finds out about the completion of a GPU kernel) is essential for overall performance. This study identifies the GPU responsiveness dilemma: host busy polling responds quickly, but at the expense of high energy consumption and interference with co-running CPU programs; interrupt-based notification minimizes energy and CPU interference costs, but suffers from substantial response delay. We present a program-level solution that wakes up the host program in anticipation of GPU kernel completion. We systematically explore the design space of an anticipatory wakeup scheme through timer-delayed wakeup or kernel-splitting-based pre-completion notification. Experiments show that the proposed technique can achieve the best of both worlds, high responsiveness with low power and CPU costs, for a wide range of GPU workloads.
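The anticipatory-wakeup idea can be sketched in a few lines: instead of spinning on the GPU status from the moment of launch, the host sleeps for most of the kernel's estimated duration and only then enters a short polling window. The snippet below is a conceptual illustration with hypothetical launch_kernel/kernel_done callbacks standing in for a real GPU runtime; in practice the duration estimate would come from timing earlier launches.

```python
import time

def run_with_anticipatory_wakeup(launch_kernel, kernel_done, est_duration_s,
                                 safety_margin=0.9, poll_interval_s=1e-4):
    """Launch a kernel, sleep for most of its estimated duration, then poll briefly.

    launch_kernel / kernel_done are hypothetical callbacks standing in for a
    real GPU runtime; est_duration_s would come from timing previous launches.
    """
    launch_kernel()
    # Yield the CPU for most of the expected kernel duration instead of spinning.
    time.sleep(est_duration_s * safety_margin)
    # A short busy-polling window near the anticipated completion keeps latency low.
    while not kernel_done():
        time.sleep(poll_interval_s)
```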
Decision trees are a kind of off-the-shelf predictive model, and they have been successfully used as the base learners in ensemble learning. To construct a strong classifier ensemble, the individual classifiers should be accurate and diverse. However, how to measure diversity remains a mystery despite many attempts. We conjecture that a deficiency of previous diversity measures lies in the fact that they consider only behavioral diversity, i.e., how the classifiers behave when making predictions, neglecting the fact that classifiers may be potentially different even when they make the same predictions. Based on this recognition, in this paper we advocate considering structural diversity in addition to behavioral diversity, and propose the TMD (tree matching diversity) measure for decision trees. To investigate the usefulness of TMD, we empirically evaluate the performance of selective ensemble approaches with decision forests by incorporating different diversity measures. Our results validate that by considering structural and behavioral diversity together, stronger ensembles can be constructed. This may open a new direction for designing better diversity measures and ensemble methods.
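To illustrate what structural diversity means for trees, the toy function below compares two decision trees node by node and counts positions where they split on the same feature; two trees that make identical predictions can still differ under such a comparison. This is only an illustrative proxy under an assumed nested-tuple tree representation, not the TMD measure defined in the paper.

```python
def structural_overlap(tree_a, tree_b):
    """Toy structural-similarity score between two decision trees.

    Trees are nested tuples (feature, left_subtree, right_subtree), with None at leaves.
    Counts aligned positions where both trees split on the same feature; this is
    only an illustrative proxy, not the TMD measure defined in the paper.
    """
    if tree_a is None or tree_b is None:
        return 0
    same_split = 1 if tree_a[0] == tree_b[0] else 0
    return (same_split
            + structural_overlap(tree_a[1], tree_b[1])
            + structural_overlap(tree_a[2], tree_b[2]))

# Two trees that agree on the root split but differ below: overlap = 1.
t1 = ("x3", ("x1", None, None), None)
t2 = ("x3", ("x7", None, None), None)
print(structural_overlap(t1, t2))
```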
Previous work on the one-class collaborative filtering (OCCF) problem can be roughly categorized into pointwise methods, pairwise methods, and content-based methods. A fundamental assumption of these approaches is that all missing values in the user-item rating matrix are treated as negative. However, this assumption may not hold, because the missing values may contain both negative and positive examples. For example, a user who fails to give positive feedback about an item does not necessarily dislike it; they may simply be unfamiliar with it. Meanwhile, content-based methods, e.g., collaborative topic regression (CTR), usually require textual content information about the items, so their applicability is largely limited when text information is not available. In this paper, we propose to apply the latent Dirichlet allocation (LDA) model to OCCF to address these problems. The basic idea is that items are regarded as words, users are regarded as documents, and the user-item feedback matrix constitutes the corpus. Our model drops the strong assumption that missing values are all negative and utilizes only the observed data to predict a user's interest. Additionally, the proposed model does not need content information about the items. Experimental results indicate that the proposed method outperforms previous methods on various ranking-oriented evaluation metrics. We further combine this method with a matrix factorization-based method to tackle the multi-class collaborative filtering (MCCF) problem, which also achieves better performance in predicting user ratings.
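One possible realization of the words/documents analogy is to fit an off-the-shelf LDA implementation directly on the binary user-item feedback matrix and rank items by the resulting user-topic and topic-item distributions. The sketch below uses scikit-learn with toy data and an arbitrary number of topics; it is an assumption-laden illustration of the idea rather than the paper's exact model.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# X: binary user-item feedback matrix (users as "documents", items as "words").
X = np.random.binomial(1, 0.05, size=(200, 500))  # toy implicit-feedback data

lda = LatentDirichletAllocation(n_components=20, random_state=0)
theta = lda.fit_transform(X)                                         # user-topic proportions
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)   # topic-item distributions

scores = theta @ phi                         # predicted interest of each user in each item
ranking_for_user0 = np.argsort(-scores[0])   # rank items for recommendation to user 0
```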
With the rapid development of location-based services, a particularly important aspect of start-up marketing research is to explore and characterize points of interest (PoIs), such as restaurants and hotels, on maps. However, due to the lack of direct access to PoI databases, it is necessary to rely on existing APIs to query PoIs within a region and to calculate PoI statistics. Unfortunately, public APIs generally impose a limit on the maximum number of queries. Therefore, we propose effective and efficient sampling methods based on road networks to sample PoIs on maps, together with unbiased estimators for calculating PoI statistics. In general, the denser the road network within a region, the denser the distribution of PoIs. Experimental results show that, compared with state-of-the-art methods, our sampling methods improve the efficiency of aggregate statistical estimation.
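A generic way to obtain such unbiased estimates is inverse-probability weighting: if each PoI is drawn with a known selection probability (e.g., proportional to the local road density), a Hansen-Hurwitz-style estimator recovers population totals without bias. The sketch below shows only this generic estimator under those assumptions; the concrete road-network sampling design from the paper is not reproduced here.

```python
import numpy as np

def hansen_hurwitz_total(values, selection_probs):
    """Unbiased estimate of a population total from a with-replacement sample.

    values[i] is the statistic of the i-th sampled PoI (e.g., 1 to count PoIs,
    or its rating to estimate an aggregate); selection_probs[i] is the known
    probability of drawing that PoI, e.g., proportional to local road density.
    """
    values = np.asarray(values, dtype=float)
    p = np.asarray(selection_probs, dtype=float)
    # E[value / p] over one draw equals the population total, so the sample
    # mean of value / p is an unbiased estimator of that total.
    return np.mean(values / p)
```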
Cloud computing provides elastic data storage and processing services. Although existing research has proposed preferred search over plaintext files and search over encrypted data, no method has been proposed that integrates the two techniques to efficiently conduct preferred and privacy-preserving search over large datasets in the cloud.
In this paper, we propose a scheme for preferred search over encrypted data (PSED) that incorporates users' search preferences into the search over encrypted data. In the search process, we ensure the confidentiality of not only the keywords but also the quantified preferences associated with them. PSED constructs its encrypted search index using Lagrange coefficients and employs secure inner-product calculation for both search and relevance measurement. The dynamic and scalable nature of cloud computing is also considered in PSED. A series of experiments demonstrates the efficiency of the proposed scheme when deploying it in real-world scenarios.
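To give a flavor of how relevance can be scored without revealing query or index vectors, the sketch below uses the classic inner-product-preserving (secure kNN-style) transformation with a random invertible matrix: the server computes the same dot product on the encrypted vectors as on the plaintext ones. This is a simplified illustration of secure inner-product calculation in general, not PSED's actual construction with Lagrange coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # dimension of the keyword/preference vector

# Secret key: a random invertible matrix shared by the data owner and query user.
M = rng.standard_normal((d, d))
M_inv = np.linalg.inv(M)

index_vec = rng.random(d)                # per-document vector (keywords + preferences)
query_vec = rng.random(d)                # user's preference-weighted query

enc_index = M.T @ index_vec              # stored at the cloud server
enc_query = M_inv @ query_vec            # submitted as the search trapdoor

# The server ranks documents by this score without seeing the plaintext vectors:
# (M^T p) . (M^{-1} q) = p^T M M^{-1} q = p . q
assert np.isclose(enc_index @ enc_query, index_vec @ query_vec)
```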