During a two-day strategic workshop in February 2018, 22 information retrieval researchers met to discuss the future challenges and opportunities within the field. The outcome is a list of potential research directions, project ideas, and challenges. This report describes the major conclusions we reached during the workshop. A key result is that we need to open our minds to embrace a broader IR field by rethinking the definitions of information, retrieval, user, system, and evaluation in IR. By providing detailed discussions on these topics, this report is expected to inspire IR researchers in both academia and industry and to help the future growth of the IR research community.
Resource planning is becoming an increasingly important and timely problem for cloud users. As more Web services are moved to the cloud, minimizing network usage is often a key driver of cost control. Most existing approaches focus on resources such as CPU, memory, and disk I/O; in particular, CPU receives the most attention from researchers, while bandwidth is somewhat neglected. It is challenging to predict the network throughput of modern Web services, owing to diverse and complex responses, evolving Web services, and complex network transport. In this paper, we propose a what-if analysis methodology, named Log2Sim, to plan the bandwidth resources of Web services. Log2Sim uses a lightweight workload model to describe user behavior, an automated mining approach to obtain characteristics of workloads and responses from massive Web logs, and traffic-aware simulations to predict the impact on bandwidth consumption and response time in changing contexts. We use a real-life Web system and a classic benchmark to evaluate Log2Sim in multiple scenarios. The evaluation shows that Log2Sim performs well in predicting bandwidth consumption: the average relative error is 2% for the benchmark and 8% for the real-life system. As for response time, Log2Sim cannot produce accurate predictions for every single service request, but the simulation results always show similar trends in average response time as workloads increase in different changing contexts. It can thus provide sufficient information for system administrators in proactive bandwidth planning.
Sparse representation has been widely used in signal processing, pattern recognition, computer vision, etc. Excellent achievements have been made in both theoretical research and practical applications. However, there are two limitations on its application to classification. One is that sufficient training samples are required for each class; the other is that the samples should be uncorrupted. To alleviate the above problems, a sparse and dense hybrid representation (SDR) framework has been proposed, in which the training dictionary is decomposed into a class-specific dictionary and a non-class-specific dictionary. SDR puts
Privacy preservation is a primary concern in social networks, which employ a variety of privacy preservation mechanisms to protect sensitive user information including age, location, education, and interests. Matching user identities across different social networks is considered a challenging task. In this work, we propose an algorithm to reveal user identities as a set of linked accounts from different social networks using limited user profile data, i.e., user name and friendship. Thus, we propose a framework, ExpandUIL, that includes three standalone algorithms based on (i) percolation graph matching in the ExpandFullName algorithm, (ii) a supervised machine learning algorithm that works with graph embeddings, and (iii) a combination of the two, the ExpandUserLinkage algorithm. The proposed framework is significant in that (i) it is based on the network topology and requires only the name feature of the nodes, (ii) it requires a considerably small initial seed, as low as a single seed pair, (iii) it is iterative and scalable, with applicability to online incoming stream graphs, and (iv) it has experimental proof of stability over a real ground-truth dataset. Experiments on real datasets from the Instagram and VK social networks show up to 75% recall for linked accounts with 96% accuracy using only one given seed pair.
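The percolation-style expansion behind ExpandFullName can be sketched as follows. This is a hedged simplification under illustrative assumptions: the real algorithm also uses the user-name feature, whereas this sketch matches on topology alone, adding any candidate pair supported by at least `r` already-matched neighbor pairs.

```python
# Minimal percolation graph matching sketch: g1, g2 are adjacency
# dicts (node -> set of neighbors), seeds is a set of (u, v) pairs.
def percolate(g1, g2, seeds, r=2):
    """Expand seed pairs by repeatedly adding candidate pairs that
    have at least r already-matched neighbor pairs."""
    matched = set(seeds)
    changed = True
    while changed:
        changed = False
        used1 = {u for u, _ in matched}
        used2 = {v for _, v in matched}
        for u in g1:
            if u in used1:
                continue
            for v in g2:
                if v in used2:
                    continue
                # count matched neighbor pairs supporting (u, v)
                score = sum((a, b) in matched
                            for a in g1[u] for b in g2[v])
                if score >= r:
                    matched.add((u, v))
                    changed = True
                    break
            else:
                continue
            break  # a pair was added; rescan with updated sets
    return matched
```

Starting from a single seed pair, each newly matched pair can provide evidence for further pairs, which is why a very small seed can suffice on dense topologies.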
Emerging Internet services and applications attract an increasing number of users to engage in diverse video-related activities, such as video searching, downloading, and sharing. These everyday operations lead to an explosive growth of online video volume and inevitably give rise to massive near-duplicate content. Near-duplicate video retrieval (NDVR) has therefore long been a hot topic. The primary purpose of this paper is to present a comprehensive survey and an updated review of advances in large-scale NDVR to provide guidance for researchers. Specifically, we summarize and compare the definitions of near-duplicate videos (NDVs) in the literature, analyze the relationship between NDVR and related research topics, describe its generic framework in detail, and investigate existing state-of-the-art NDVR systems. Finally, we present the development trends and research directions of this topic.
Mobile computing has quickly emerged as a pervasive technology that replaces old computing paradigms with portable computation and context-aware communication. Existing software systems can be migrated (while preserving their data and logic) to mobile computing platforms that support portability, context-sensitivity, and enhanced usability. In recent years, some research and development efforts have focused on a systematic migration of existing software systems to mobile computing platforms.
We investigate the state of the art of research on the migration of existing software systems to mobile computing platforms. We aim to analyze the progression and impact of existing research and to highlight challenges and solutions that reflect dimensions of emerging and future research.
We followed the evidence-based software engineering (EBSE) method to conduct a systematic mapping study (SMS) of the existing research, which has progressed over more than a decade (25 studies published from 1996 to 2017). We have derived a taxonomical classification and a holistic mapping of the existing research to investigate its progress, impact, and potential areas of future research and development.
The SMS has identified three types of migration, namely Static, Dynamic, and State-based Migration of existing software systems to mobile computing platforms. Migration to mobile computing platforms enables existing software systems to achieve portability, context-sensitivity, and high connectivity. However, mobile systems may face challenges such as resource poverty, data security, and privacy. Emerging and future research aims to provide patterns and tool support to automate the migration process. The results of this SMS can benefit researchers and practitioners, by highlighting challenges, solutions, and tools, in conceptualizing the state of the art and future trends that support migration of existing software to mobile computing.
Principal component analysis (PCA) is a widely used method for multivariate data analysis that projects the original high-dimensional data onto a low-dimensional subspace with maximum variance. In practice, however, we are more likely to obtain a few compressed sensing (CS) measurements than the complete high-dimensional data, due to the high cost of data acquisition and storage. In this paper, we propose a novel Bayesian algorithm for learning the PCA solutions of the original data from just these CS measurements. To this end, we utilize a generative latent variable model incorporating a structured prior to model both the sparsity of the original data and the effective dimensionality of the latent space. The proposed algorithm enjoys two important advantages: 1) the effective dimensionality of the latent space can be determined automatically with no need to be pre-specified; and 2) the sparsity modeling makes it unnecessary to employ multiple measurement matrices to maintain the original data space; a single one suffices, which is storage efficient. Experimental results on synthetic and real-world datasets show that the proposed algorithm can accurately learn the PCA solutions of the original data, which can in turn be applied in reconstruction tasks with favorable results.
With the fast development of software-defined networking (SDN), numerous studies have been conducted to maximize the performance of SDN. Currently, flow tables are utilized in the OpenFlow switch for routing. Due to the space limitation of the flow table and switch capacity, various issues exist in dealing with the flows. Existing schemes typically employ a reactive approach, in which the selection of evicted entries occurs only when a timeout or table miss occurs. In this paper, a proactive approach is proposed based on the prediction of the matching probability of the entries. Here, eviction occurs proactively when the utilization of the flow table exceeds a threshold, and the flow entry of the lowest matching probability is evicted. The matching probability is estimated using a hidden Markov model (HMM). Computer simulation reveals that the proposed scheme significantly enhances the prediction accuracy and decreases the number of table misses compared to the standard hard timeout scheme and the Flow master scheme.
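The proactive eviction policy can be sketched as follows. This is a hedged illustration: for brevity, a two-state (active/idle) Markov chain with made-up transition probabilities stands in for the paper's HMM estimator of matching probability, and all names and parameters are illustrative assumptions.

```python
# Per-entry state transition probabilities (illustrative values).
TRANS = {'active': {'active': 0.8, 'idle': 0.2},
         'idle':   {'active': 0.3, 'idle': 0.7}}

def match_probability(state):
    # probability the entry is matched (i.e., active) at the next step
    return TRANS[state]['active']

def insert_flow(table, entry, state, capacity, threshold=0.9):
    """Insert a flow entry into table (dict: entry -> state),
    proactively evicting the least likely entry once utilization
    exceeds threshold * capacity."""
    if len(table) >= capacity * threshold:
        victim = min(table, key=lambda e: match_probability(table[e]))
        del table[victim]  # evict lowest matching probability
    table[entry] = state
    return table
```

The key difference from reactive schemes is that eviction is triggered by the utilization threshold, not by a timeout or a table miss.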
Real-life events emerge and evolve in social and news streams. Recent methods have succeeded in capturing designed features of monolingual events, but lack interpretability and multilingual considerations. To this end, we propose a multilingual event mining model, MLEM, to automatically detect events and generate evolution graphs in multilingual hybrid-length text streams including English, Chinese, French, German, Russian, and Japanese. Specifically, we merge identical entities and similar phrases and present multiple similarity measures based on an incremental word2vec model. We propose an 8-tuple to describe events for correlation analysis and evolution graph generation. We evaluate the MLEM model using a massive human-generated dataset containing real-world events. Experimental results show that our new model MLEM outperforms the baseline method in both efficiency and effectiveness.
A comprehensive understanding of city structures and urban dynamics can greatly improve the efficiency and quality of urban planning and management, whereas traditional approaches, such as manual surveys, usually incur substantial labor and time. In this paper, we propose a data-driven framework to sense urban structures and dynamics from large-scale vehicle mobility data. First, we divide the city into fine-grained grids and cluster grids with similar mobility features into structured urban areas with a proposed distance-constrained clustering algorithm (DCCA). Second, we detect irregular mobility traffic patterns in each area leveraging an ARIMA-based anomaly detection algorithm (ADAM) and correlate them to urban social and emergency events. Finally, we build a visualization system to demonstrate the urban structures and crowd dynamics. We evaluate our framework using real-world datasets collected from Xiamen, China, and the results show that the proposed framework can sense urban structures and crowd dynamics comprehensively and effectively.
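The second step, residual-based anomaly detection, can be sketched as below. This is a hedged simplification: a moving-average forecast stands in for the ARIMA forecast used by ADAM, and the window and threshold values are illustrative assumptions, not the paper's settings.

```python
import statistics

def detect_anomalies(series, window=3, k=2.0):
    """Flag indices whose residual from a moving-average forecast
    exceeds k standard deviations of the recent window (a simplified
    stand-in for an ARIMA-based detector)."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        forecast = sum(hist) / window
        sigma = statistics.pstdev(hist) or 1e-9  # avoid zero division
        if abs(series[i] - forecast) > k * sigma:
            anomalies.append(i)
    return anomalies
```

Flagged time slots in an urban area would then be correlated against known social or emergency events, as the framework describes.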
Visual information is highly advantageous for the evolutionary success of almost all animals. This information is likewise critical for many computing tasks, and visual computing has achieved tremendous successes in numerous applications over the last 60 years or so. In that time, the development of visual computing has repeatedly drawn inspiration from biological mechanisms. In particular, deep neural networks were inspired by the hierarchical processing mechanisms that exist in the visual cortex of primate brains (including ours), and have achieved huge breakthroughs in many domain-specific visual tasks. To better understand biologically inspired visual computing, we present a survey of current work and hope to offer some new avenues for rethinking visual computing and designing novel neural network architectures.
In higher education, the initial studying period of each course plays a crucial role for students and seriously influences subsequent learning activities. However, given the large number of students in a university course, it has become impossible for teachers to keep track of the performance of individual students. In this circumstance, an academic early warning system is desirable, one that automatically detects students with difficulties in learning (i.e., at-risk students) before a course starts. However, previous studies are not well suited to this purpose for two reasons: 1) they have mainly concentrated on e-learning platforms, e.g., massive open online courses (MOOCs), and relied on data about students' online activities, which is hardly accessible in traditional teaching scenarios; and 2) they have only made performance predictions when a course is in progress or even close to the end. In this paper, for traditional classroom teaching scenarios, we investigate the task of pre-course student performance prediction, which refers to detecting at-risk students for each course before its commencement. To better represent a student sample and utilize the correlations among courses, we cast the problem as a multi-instance multi-label (MIML) problem. Moreover, given the problem of data scarcity, we propose a novel multi-task learning method, MIML-Circle, to predict the performance of students from different specialties in a unified framework. Extensive experiments are conducted on five real-world datasets, and the results demonstrate the superiority of our approach over state-of-the-art methods.
A wireless sensor network (WSN), which consists of a large number of sensor nodes with limited energy, is effective for monitoring a target environment. An efficient medium access control (MAC) protocol is thus imperative to maximize the energy efficiency and performance of a WSN. Most existing MAC protocols are based on scheduling the sleep and active periods of the nodes and do not consider the relationship between load conditions and performance. In this paper, a novel scheme is proposed to properly determine the duty cycle of the WSN nodes according to the load, employing the Q-learning technique and function approximation with linear regression. This allows low-latency, energy-efficient scheduling for a wide range of traffic conditions and effectively overcomes the limitation of Q-learning in continuous state-action spaces. NS3 simulation reveals that the proposed scheme significantly improves the throughput, latency, and energy efficiency compared to the existing fully active scheme and S-MAC.
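The combination of Q-learning with linear function approximation can be sketched as below. This is a hedged illustration under stated assumptions: the feature vector (load, duty-cycle action), the candidate duty cycles, and the learning parameters are all made up for the example; the paper's state model and reward are richer.

```python
ACTIONS = [0.1, 0.5, 0.9]  # candidate duty cycles (illustrative)

def q_value(w, load, action):
    # linear function approximation: Q(s, a) = w0 + w1*load + w2*action
    return w[0] + w[1] * load + w[2] * action

def update(w, load, action, reward, next_load, alpha=0.1, gamma=0.9):
    """One TD(0) update of the weight vector w (modified in place).
    Because Q is a linear function of the features, the gradient of Q
    w.r.t. each weight is simply the corresponding feature value."""
    target = reward + gamma * max(q_value(w, next_load, a) for a in ACTIONS)
    error = target - q_value(w, load, action)
    w[0] += alpha * error          # bias feature = 1
    w[1] += alpha * error * load   # load feature
    w[2] += alpha * error * action # action feature
    return w
```

Replacing a discrete Q-table with these weights is what lets the scheme handle a continuous range of load conditions, as the abstract notes.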
As a first attempt, this paper proposes a model for the Chinese high school timetabling problems (CHSTPs) under the new curriculum innovation launched in 2014 by the Chinese government. According to the new curriculum innovation, students in high school can choose subjects that they are interested in instead of being forced to select one of the two traditional study directions, namely Science and Liberal Arts, while still attending compulsory subjects as before. CHSTPs are student-oriented and involve more student constraints, which makes them more complex than the typical "Class-Teacher model," in which the element "Teacher" is the primary constraint. In this paper, we first describe in detail the mathematical model of CHSTPs and then design a new two-part representation for the candidate solution. Based on the new representation, we adopt a two-phase simulated annealing (SA) algorithm to solve CHSTPs. A total of 45 synthetic instances with different numbers of classes and teachers and different levels of student constraints are generated and used to illustrate the characteristics of the CHSTP model and the effectiveness of the designed representation and algorithm. Finally, we apply the proposed model, the designed two-part representation, and the two-phase SA to 10 real high schools.
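The core annealing loop underlying such a solver can be sketched generically. This is a minimal sketch, not the paper's algorithm: the two-part timetable representation, the CHSTP cost function, and the two-phase schedule are abstracted away behind the `cost` and `neighbor` callbacks, and the cooling parameters are illustrative.

```python
import math
import random

def anneal(initial, cost, neighbor, t0=100.0, cooling=0.95,
           steps=500, seed=0):
    """Generic simulated annealing: accept worse candidates with
    probability exp(-delta / t), cooling t geometrically."""
    rng = random.Random(seed)
    current = best = initial
    t = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        delta = cost(cand) - cost(current)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current = cand               # accept the move
            if cost(current) < cost(best):
                best = current           # track the best solution seen
        t *= cooling
    return best
```

For a timetabling problem, `neighbor` would perturb one part of the two-part representation (e.g., swap two lesson slots) and `cost` would count violated student and teacher constraints.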
Network embedding, which aims to embed a given network into a low-dimensional vector space, has been proved effective in various network analysis and mining tasks such as node classification, link prediction, and network visualization. Emerging network embedding methods have shifted their emphasis toward mature deep learning models. Neural-network based network embedding has become a mainstream solution because of its high efficiency and its capability of preserving the nonlinear characteristics of the network. In this paper, we propose Adversarial Network Embedding using Structural Similarity (ANESS), a novel, versatile, low-complexity GAN-based network embedding model which utilizes the inherent vertex-to-vertex structural similarity of the network. ANESS learns robust and effective vertex embeddings via an adversarial training procedure. Specifically, our method aims to exploit the strengths of generative adversarial networks in generating high-quality samples and to utilize the structural similarity identity of vertices to learn the latent representations of a network. Meanwhile, ANESS can dynamically update the strategy of generating samples during each training iteration. Extensive experiments have been conducted on several benchmark network datasets, and empirical results demonstrate that ANESS significantly outperforms other state-of-the-art network embedding methods.
Density-based clustering algorithms (DBCLAs) rely on the notion of density to identify clusters of arbitrary shapes and sizes with varying densities. Existing surveys on DBCLAs cover only a selected set of algorithms and fail to provide extensive information about the variety of DBCLAs proposed to date, including a taxonomy of the algorithms. In this paper we present a comprehensive survey of DBCLAs over the last two decades along with their classification. We group the DBCLAs into four categories: density definition, parameter sensitivity, execution mode, and nature of data, and further divide them into various classes under each of these categories. In addition, we compare the DBCLAs through their common features and variations in citation and conceptual dependencies. We identify various application areas of DBCLAs in domains such as astronomy, earth sciences, molecular biology, geography, and multimedia. Our survey also identifies probable future directions of DBCLAs where density-based methods may lead to favorable results.
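The density notion at the heart of this family is easiest to see in DBSCAN, the archetypal DBCLA: a point is a core point if at least `min_pts` points lie within radius `eps`, and clusters grow by expanding from core points. A minimal sketch over 1-D points (parameters and data are illustrative):

```python
def dbscan(points, eps=1.5, min_pts=3):
    """Minimal DBSCAN over a list of 1-D points.
    Returns {index: cluster id}, with -1 marking noise."""
    labels = {}

    def neighbors(i):
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]

    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        if len(neighbors(i)) < min_pts:
            labels[i] = -1        # noise; may later become a border point
            continue
        labels[i] = cid           # core point: start a new cluster
        queue = [j for j in neighbors(i) if j != i]
        while queue:
            j = queue.pop()
            if j in labels:
                if labels[j] == -1:
                    labels[j] = cid   # former noise joins as border point
                continue
            labels[j] = cid
            if len(neighbors(j)) >= min_pts:
                queue.extend(neighbors(j))  # expand from core points only
        cid += 1
    return labels
```

Because clusters grow only through core points, the algorithm naturally finds clusters of arbitrary shape and leaves sparse outliers as noise, the property the survey's taxonomy revolves around.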
Proteomics has become an important research area in the life sciences since the completion of the Human Genome Project. This field studies the characteristics of proteins at the large-scale data level in order to gain a holistic and comprehensive understanding of disease occurrence and cell metabolism at the protein level. A key issue in proteomics is how to efficiently analyze the massive amounts of protein data produced by high-throughput technologies. Computational technologies, with their low cost and short cycles, are becoming the preferred methods for solving some important problems in the post-genome era, such as predicting protein-protein interactions (PPIs). In this review, we focus on computational methods for PPI detection and show recent advancements in this critical area from multiple aspects. First, we analyze in detail the challenges faced by computational methods for predicting PPIs and summarize the available PPI data sources. Second, we describe the state-of-the-art computational methods recently proposed on this topic. Finally, we discuss some important technologies that can promote the prediction of PPIs and the development of computational proteomics.
Recently, in the area of big data, popular applications such as Web search engines and recommendation systems face the problem of diversifying results during query processing. In this sense, it is both significant and essential to propose methods that deal with big data in order to increase the diversity of the result set. In this paper, we first define the diversity of a set and the ability of an element to improve the overall diversity. Based on these definitions, we propose a diversification framework which has good performance in terms of effectiveness and efficiency, and which has a theoretical guarantee on the probability of success. Second, we design implementation algorithms based on this framework for both numerical and string data. Third, for numerical and string data respectively, we carry out extensive experiments on real data to verify the performance of the proposed framework, and also perform scalability experiments on synthetic data.
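The idea of an element's ability to improve overall diversity can be illustrated with a common max-min diversification heuristic: greedily pick the item farthest (in minimum distance) from everything already selected. This is a hedged illustration only; the paper's framework and its probabilistic success guarantee are more elaborate than this sketch.

```python
def diversify(items, k, dist):
    """Greedy max-min diversification: starting from the first item,
    repeatedly select the item whose minimum distance to the current
    selection is largest, until k items are chosen."""
    selected = [items[0]]
    while len(selected) < k:
        best = max((x for x in items if x not in selected),
                   key=lambda x: min(dist(x, s) for s in selected))
        selected.append(best)
    return selected
```

With a suitable `dist` (e.g., absolute difference for numerical data, edit distance for strings), the same skeleton covers both data types the paper targets.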
This paper presents a comprehensive survey on the development of Intel SGX (Software Guard Extensions) processors and their applications. With the advent of SGX in 2013 and its subsequent development, the corresponding research has also grown rapidly. To obtain a comprehensive literature review of SGX, we made a systematic analysis of the related papers in this area. We first searched five large-scale paper retrieval libraries by keyword (ACM Digital Library, IEEE/IET Electronic Library, SpringerLink, Web of Science, and Elsevier ScienceDirect) and read and analyzed a total of 128 SGX-related papers. A first round of extensive study was conducted to classify them, and a second round of intensive study was carried out to complete a comprehensive analysis of the papers from various aspects. We start with the working environment of SGX and give a conclusive summary of trusted execution environments (TEEs). We then focus on the applications of SGX. We also review and study multifarious attack methods against the SGX framework and some recent security improvements made to SGX. Finally, we summarize the advantages and disadvantages of SGX along with some future research opportunities. We hope this review can help existing and future research on SGX and its applications for both developers and users.
Entity set expansion (ESE) aims to expand an entity seed set to obtain more entities that share common properties. ESE is important for many applications such as dictionary construction and query suggestion. Traditional ESE methods relied heavily on the text and Web information of entities. Recently, some ESE methods have employed knowledge graphs (KGs) to expand entities. However, they fail to effectively and efficiently utilize the rich semantics contained in a KG and ignore the text information of entities in Wikipedia. In this paper, we model a KG as a heterogeneous information network (HIN) containing multiple types of objects and relations. Fine-grained multi-type meta paths are proposed to capture the hidden relations among seed entities in a KG and thus to retrieve candidate entities. We then rank the entities according to a meta path based structural similarity. Furthermore, to utilize the text descriptions of entities in Wikipedia, we propose an extended model, CoMeSE++, which combines the structural information revealed by a KG with the text information in Wikipedia for ESE. Extensive experiments on real-world datasets demonstrate that our model achieves better performance by combining structural and textual information of entities.
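A meta path based structural similarity can be illustrated in the style of PathSim, a standard HIN measure; this is an assumption for illustration, as the paper may define its own measure. Here `pc[x][y]` is the number of meta-path instances from entity `x` to entity `y` under a chosen meta path.

```python
def pathsim(pc, x, y):
    """PathSim-style similarity: twice the cross path count,
    normalized by the two self path counts, so an entity is most
    similar to peers with comparable connectivity."""
    return 2 * pc[x].get(y, 0) / (pc[x].get(x, 0) + pc[y].get(y, 0))
```

Candidate entities reached via the meta paths would then be ranked by this score against the seed set.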
Quality assessment is a critical component in crowdsourcing-based software engineering (CBSE), as software products are developed by the crowd with unknown or varied skills and motivations. In this paper, we propose a novel metric called the project score to measure the performance of projects and the quality of products for competition-based software crowdsourcing development (CBSCD) activities. To the best of our knowledge, this is the first work to deal with the quality issue of CBSE from the perspective of projects instead of contests. In particular, we develop a hierarchical quality evaluation framework for CBSCD projects and propose two metric aggregation models for project scores. The first is a modified Squale model that can locate software modules of poor quality; the second is a clustering-based aggregation model, which takes the different impacts of phases into account. To test the effectiveness of the proposed metrics, we conduct an empirical study on TopCoder, a well-known CBSCD platform. Results show that the proposed project score is a strong indicator of the performance and product quality of CBSCD projects. We also find that the clustering-based aggregation model outperforms the Squale one, improving the performance evaluation criterion of the aggregation models by an additional 29%. Our approach to quality assessment for CBSCD projects could potentially help software managers assess the overall quality of a crowdsourced project consisting of programming contests.
Blockchain has recently emerged as a research trend, with potential applications in a broad range of industries and contexts. One particularly successful blockchain technology is the smart contract, which is widely used in commercial settings (e.g., high-value financial transactions). This, however, has security implications due to the potential to financially benefit from a security incident (e.g., identification and exploitation of a vulnerability in the smart contract or its implementation). Among smart contract platforms, Ethereum is the most active and prominent. Hence, in this paper, we systematically review existing research efforts on Ethereum smart contract security published between 2015 and 2019. Specifically, we focus on how smart contracts can be maliciously exploited and targeted, such as security issues of the contract program model, vulnerabilities in the program, and safety considerations introduced by the program execution environment. We also identify potential research opportunities and a future research agenda.
Next location prediction has aroused great interest in the era of the Internet of Things (IoT). With the ubiquitous deployment of sensor devices, e.g., GPS and Wi-Fi, the IoT environment offers new opportunities for proactively analyzing human mobility patterns and predicting users' future visits at low cost, both outdoors and indoors. In this paper, we consider the problem of next location prediction in the IoT environment in a session-based manner. We suggest that a user's future intention in each session can be better inferred, for more accurate prediction, if the patterns hidden inside both the trajectory and the signal strength sequences collected from IoT devices are jointly modeled, which existing state-of-the-art methods have rarely addressed. To this end, we propose a trajectory and sIgnal sequence (TSIS) model, where the trajectory transition regularities and signal temporal dynamics are jointly embedded in a neural network based model. Specifically, we employ a gated recurrent unit (GRU) to capture the temporal dynamics in the multivariate signal strength sequence. Moreover, we adapt gated graph neural networks (gated GNNs) on location transition graphs to explicitly model the transition patterns of trajectories. Finally, the low-dimensional representations learned from the trajectory and the signal sequence are jointly optimized to construct a session embedding, which is further employed to predict the next location. Extensive experiments on two real-world Wi-Fi based mobility datasets demonstrate that TSIS is effective and robust for next location prediction compared with other competitive baselines.
Multi-label classification aims to assign a set of proper labels to each instance, where distance metric learning can help improve the generalization ability of instance-based multi-label classification models. Existing multi-label metric learning techniques work by utilizing pairwise constraints to enforce that examples with similar label assignments have close distances in the embedded feature space. In this paper, a novel distance metric learning approach for multi-label classification is proposed by modeling structural interactions between the instance space and the label space. On one hand, a compositional distance metric is employed which adopts the representation of a weighted sum of rank-1 PSD matrices based on component bases. On the other hand, the compositional weights are optimized by exploiting triplet similarity constraints derived from both the instance and label spaces. Due to the compositional nature of the employed distance metric, the resulting problem admits a quadratic programming formulation with linear optimization complexity w.r.t. the number of training examples. We also derive the generalization bound for the proposed approach based on algorithmic robustness analysis of the compositional metric. Extensive experiments on sixteen benchmark data sets clearly validate the usefulness of the compositional metric in yielding an effective distance metric for multi-label classification.
In terms of extreme value theory, the unseen novel classes in open-set recognition can be seen as extreme values of the training classes. Following this idea, we introduce the margin and coverage distribution to model the training classes. A novel visual-semantic embedding framework, extreme vocabulary learning (EVoL), is proposed; EVoL embeds the visual features into the semantic space in a probabilistic way. Notably, we adopt the vast open vocabulary in the semantic space to help further constrain the margin and coverage of the training classes. The learned embedding can directly be used to solve supervised learning, zero-shot learning, and open-set recognition simultaneously. Experiments on two benchmark datasets demonstrate the effectiveness of the proposed framework against conventional methods.