Ideal homomorphic encryption is theoretically achievable but impractical in practice because of its tremendous computational overhead. Homomorphically encrypted databases, such as CryptDB, leverage replication with partially homomorphic encryption schemes to support different SQL queries directly over encrypted data. These databases strike a balance between security and efficiency, but they incur considerable storage overhead, especially when making backups. Unfortunately, general data compression techniques that rely on data similarity are inefficient on encrypted data. We present CryptZip, a backup and recovery system that can substantially reduce the backup storage cost of encrypted databases. The key idea is to leverage the metadata of the encryption schemes and selectively back up only one or several columns among semantically redundant columns. The experimental results show that CryptZip reduces backup storage cost by up to 90.5% on the TPC-C benchmark.
Collaborative filtering (CF) is a technique commonly used for personalized recommendation and for Web service quality-of-service (QoS) prediction. However, CF is vulnerable to shilling attackers who inject fake user profiles into the system. In this paper, we first present the shilling attack problem on CF-based QoS recommender systems for Web services. Then, a robust CF recommendation approach is proposed from a user similarity perspective to enhance the resistance of recommender systems to the shilling attack. In the approach, commonly used similarity measures are analyzed, and DegSim (the degree of similarity with the top-k neighbors) based on those measures is selected for grouping and weighting the users. The weights are then used to calculate the service similarities/differences and predictions. We analyzed and evaluated our algorithms using the WS-DREAM and MovieLens datasets. The experimental results demonstrate that shilling attacks influence the prediction of QoS values, and that our proposed features and algorithms achieve a higher degree of robustness against shilling attacks than typical CF algorithms.
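As a rough illustration of the DegSim idea described above, the sketch below computes each user's average similarity to its top-k neighbors and derives a down-weighting for users whose DegSim deviates strongly from the population; the Pearson similarity, the z-score-based weighting, and all function names are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: DegSim-style user weighting for robust CF (assumed details).
import numpy as np

def pearson_sim(ratings):
    """Pairwise Pearson similarity between users (rows of the rating matrix)."""
    centered = ratings - ratings.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True) + 1e-12
    return (centered @ centered.T) / (norms @ norms.T)

def degsim(ratings, k=20):
    """Average similarity with the top-k neighbors for every user."""
    sim = pearson_sim(ratings)
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    topk = np.sort(sim, axis=1)[:, -k:]     # k largest similarities per user
    return topk.mean(axis=1)

def user_weights(ratings, k=20):
    """Down-weight users whose DegSim deviates strongly from the population mean."""
    d = degsim(ratings, k)
    z = np.abs(d - d.mean()) / (d.std() + 1e-12)
    return 1.0 / (1.0 + z)                  # suspicious users get weights near 0

# Toy example: 100 users x 50 services, QoS values in [0, 5]
rng = np.random.default_rng(0)
R = rng.uniform(0, 5, size=(100, 50))
print(user_weights(R, k=10)[:5])
```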
A race condition is a common trigger for concurrency bugs. As a special case, a race condition can also occur across the kernel and user space, causing a double-fetch bug, an area that has received little research attention. In our work, we first analyzed real-world double-fetch bug cases and extracted two specific patterns for double-fetch bugs. Based on these patterns, we proposed an approach of multi-taint parallel tracking to detect double-fetch bugs. We also implemented a prototype called DFTracker (double-fetch bug tracker) and evaluated it with our test suite. Our experiments demonstrated that it could effectively find all the double-fetch bugs in the test suite, including eight real-world cases, with no false negatives and only minor false positives. In addition, we tested it on the Linux kernel and found a new double-fetch bug. The execution overhead is approximately 2x for single-file cases and approximately 9x for the whole-kernel test, which is acceptable. To the best of the authors' knowledge, this work is the first to introduce multi-taint parallel tracking to double-fetch bug detection, an innovative method that is specific to double-fetch bug features and that achieves better path coverage as well as lower runtime overhead than the widely used dynamic approaches.
Distributed virtualization changes the way software systems are built. However, it raises dependability-assurance problems owing to the complex social relationships and interactions between service components. A natural way to address these problems in a distributed virtualized environment is dependable service component selection, which can be modeled as finding a dependable service path, a multiconstrained optimal path problem. In this paper, a service component selection method that searches for a dependable service path in a distributed virtualized environment is proposed from the perspective of dependability assurance. The concept of Quality of Dependability is introduced to describe and constrain software system dependability during dynamic composition. We then model dependable service component selection as a multiconstrained optimal path problem and apply the Adaptive Bonus-Penalty Microcanonical Annealing algorithm to find the optimal dependable service path. The experimental results show that the proposed algorithm has a high search success rate and converges quickly.
Transfer active learning, an emerging learning paradigm, aims to actively select informative instances with the aid of knowledge transferred from related tasks. Several recent studies have addressed this problem. However, how to handle the distributional differences between the source and target domains remains an open problem. In this paper, a novel transfer active learning algorithm is proposed, inspired by the classical query-by-committee algorithm. Diverse committee members from both domains are maintained to improve the classification accuracy, and a mechanism is included to evaluate each member during the iterations. Extensive experiments on both synthetic and real datasets show that our algorithm performs better and is more robust than state-of-the-art methods.
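The following minimal sketch illustrates a query-by-committee style selection with committee members trained on source, target, and mixed data; the vote-entropy disagreement measure and the committee composition are assumptions made for illustration and are not taken from the paper.

```python
# Hedged sketch: QBC-style instance selection with cross-domain committee members.
import numpy as np
from sklearn.linear_model import LogisticRegression

def vote_entropy(committee, X_pool):
    """Disagreement score: entropy of the committee's label votes per instance."""
    votes = np.stack([clf.predict(X_pool) for clf in committee])   # (C, N)
    n_members = votes.shape[0]
    scores = []
    for col in votes.T:
        _, counts = np.unique(col, return_counts=True)
        p = counts / n_members
        scores.append(-(p * np.log(p + 1e-12)).sum())
    return np.array(scores)

def select_query(committee, X_pool):
    """Pick the pool instance the committee disagrees on most."""
    return int(np.argmax(vote_entropy(committee, X_pool)))

# Toy data: a source domain, a small labeled target set, and an unlabeled pool
rng = np.random.default_rng(1)
Xs, ys = rng.normal(0, 1, (200, 5)), rng.integers(0, 2, 200)
Xt, yt = rng.normal(0.5, 1, (20, 5)), rng.integers(0, 2, 20)
X_pool = rng.normal(0.5, 1, (500, 5))

committee = [
    LogisticRegression().fit(Xs, ys),    # source-domain member
    LogisticRegression().fit(Xt, yt),    # target-domain member
    LogisticRegression().fit(np.vstack([Xs, Xt]), np.concatenate([ys, yt])),  # mixed member
]
print("query index:", select_query(committee, X_pool))
```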
Convolutional neural networks (CNNs) have had great success on the object classification problem. For character classification, we found that training and testing CNNs on accurately segmented character regions yields higher accuracy than using roughly segmented regions. We therefore aim to extract complete character regions from scene images. Text in natural scene images contrasts clearly with its surroundings, and many methods attempt to extract characters through different segmentation techniques. However, for blurred, occluded, and complex-background cases, those methods may produce adjoined or over-segmented characters. In this paper, we propose a scene word recognition model that assembles entire words from small pieces after cluster-based segmentation. The segmented connected components are classified into four types: background, individual character proposals, adjoined characters, and stroke proposals. Individual character proposals are directly input to a CNN trained on accurately segmented character images. A sliding-window strategy is applied to adjoined character regions. Stroke proposals are treated as fragments of entire characters whose locations are estimated by a stroke spatial distribution system. The characters estimated from adjoined characters and stroke proposals are then classified by a CNN trained on roughly segmented character images. Finally, a lexicon-driven integration method is applied to obtain the final word recognition results. Compared with other word recognition methods, our method achieves comparable performance on the Street View Text, ICDAR 2003, and ICDAR 2013 benchmark databases. Moreover, our method can recognize occluded text images and improperly segmented text images.
In this paper, we tackle the problem of segmenting out a sequence of actions from videos. The videos contain background and actions that are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Considering the possible overlap between two adjacent semantic units, we propose a bidirectional sliding window method to generate label distributions for the various segments of the video. A label distribution covers a certain number of semantic unit labels, representing the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video, in which each leaf stands for a video clip of background or of a sub-action. The proposed method shows promising results on the THUMOS'14, MSR-II, and UCF101 datasets, and its computational complexity is much lower than that of the compared state-of-the-art video parsing method.
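To make the label-distribution construction concrete, the sketch below assigns soft label weights to a video segment by extending its window symmetrically on both sides with linearly decaying weights; the extension length and the decay are assumptions for illustration rather than the paper's exact bidirectional sliding window.

```python
# Hedged sketch: frame-level semantic-unit labels -> a label distribution for a segment.
import numpy as np

def segment_label_distribution(frame_labels, start, end, n_labels, extend=8):
    """Soft label distribution for frames [start, end), with a symmetric
    extension of `extend` frames on both sides contributing reduced weight."""
    dist = np.zeros(n_labels)
    for t in range(max(0, start - extend), min(len(frame_labels), end + extend)):
        if start <= t < end:
            w = 1.0                                   # inside the segment
        else:
            gap = (start - t) if t < start else (t - end + 1)
            w = max(0.0, 1.0 - gap / extend)          # linearly decaying weight
        dist[frame_labels[t]] += w
    return dist / dist.sum()

# Example: background (0), sub-action A (1), sub-action B (2)
frames = np.array([0]*30 + [1]*40 + [2]*50 + [0]*20)
print(segment_label_distribution(frames, 25, 70, n_labels=3))
```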
Finger vein image retrieval is a biometric identification technology that has recently attracted considerable research attention, since it has the potential to greatly reduce the search space. It is a challenging problem owing to the large number of images in biometric databases and the lack of efficient retrieval schemes. We apply a hierarchical vocabulary tree model-based image retrieval approach because of its good scalability and high efficiency. However, the vocabulary tree (VT) model suffers from a large accumulative quantization error that may degrade retrieval precision. To solve this problem, we improve the vector quantization coding in the VT model by introducing a non-negative locality constraint, yielding the non-negative locality-constrained vocabulary tree-based image retrieval model. The proposed method effectively improves coding performance and the discriminative power of local features. Extensive experiments on a large fused finger vein database demonstrate the superiority of our encoding method. Experimental results also show that our retrieval strategy achieves better performance than other state-of-the-art methods while maintaining low time complexity.
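A minimal reading of the coding step, assuming a local feature is reconstructed from its k nearest visual words under a non-negativity constraint, might look like the sketch below; the NNLS solver and the hard k-nearest selection are illustrative choices, not the paper's exact formulation.

```python
# Hedged sketch: non-negative, locality-constrained coding of a local feature
# against a set of visual words (assumed reading of the coding step).
import numpy as np
from scipy.optimize import nnls

def nn_locality_code(feature, vocabulary, k=5):
    """Return a sparse, non-negative code over the visual vocabulary."""
    dists = np.linalg.norm(vocabulary - feature, axis=1)
    nearest = np.argsort(dists)[:k]                   # locality constraint
    coeffs, _ = nnls(vocabulary[nearest].T, feature)  # non-negativity constraint
    code = np.zeros(len(vocabulary))
    code[nearest] = coeffs
    return code

rng = np.random.default_rng(3)
vocab = rng.normal(size=(200, 64))    # 200 visual words, 64-D descriptors
feat = rng.normal(size=64)
print(np.nonzero(nn_locality_code(feat, vocab, k=5))[0])
```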
Sponsored search auctions have recently been studied extensively, and the auctioneer's revenue is an important consideration in probabilistic single-item second-price auctions. Several papers have analyzed the revenue maximization problem under different methods of bundling contexts. In this paper, we propose a more flexible and natural bundling method, called the bracketing method. We prove that finding a bracketing scheme that maximizes the auctioneer's revenue is strongly NP-hard, and we then give a heuristic algorithm. Experiments on three test cases show that the revenue of the optimal bracketing scheme is very close to the optimal revenue without any bundling constraint, and that the heuristic algorithm performs very well. Finally, we consider a simpler model in which, for each row of the valuation matrix, the non-zero cells have the same value. We prove that the revenue maximization problems with a K-anonymous signaling scheme and with a cardinality-constrained signaling scheme are both NP-hard in this simpler model.
The inverted index is a key component that enables search engines to manage billions of documents and respond quickly to users' queries. Whereas substantial effort has been devoted to reducing space occupancy and improving decoding speed, the encoding speed when constructing the index has been overlooked. Partitioning the index along its clustered distribution can effectively minimize the compressed size while accelerating the construction procedure. In this study, we introduce compression speed as one criterion for evaluating compression techniques, and we thoroughly analyze the performance of different partitioning strategies. Optimizations are also proposed to enhance state-of-the-art methods with faster compression speed and more flexibility in partitioning an index. Experiments show that our methods offer a much better compression speed while retaining excellent space occupancy and decompression speed.
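As a toy illustration of partition-then-compress, the sketch below splits a sorted posting list into fixed-size partitions and packs the delta gaps with variable-byte coding; fixed-size partitions and varbyte are simplifications for illustration, whereas the paper studies partitioning strategies aligned to the clustered docID distribution and faster codecs.

```python
# Hedged sketch: partitioned delta + variable-byte compression of a posting list.
def vbyte_encode(numbers):
    """Pack non-negative integers into 7-bit groups with a continuation bit."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

def compress_postings(doc_ids, partition_size=128):
    """Split sorted docIDs into partitions, delta-encode within each partition,
    and varbyte-pack the deltas."""
    blocks = []
    for i in range(0, len(doc_ids), partition_size):
        part = doc_ids[i:i + partition_size]
        deltas = [part[0]] + [b - a for a, b in zip(part, part[1:])]
        blocks.append(vbyte_encode(deltas))
    return blocks

postings = list(range(1000, 5000, 3))          # a toy clustered posting list
blocks = compress_postings(postings)
print(len(blocks), "partitions,", sum(len(b) for b in blocks), "bytes")
```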
High-utility itemset mining (HUIM) is a popular data mining task with applications in numerous domains. However, traditional HUIM algorithms often produce a very large set of high-utility itemsets (HUIs). As a result, analyzing HUIs can be very time consuming for users. Moreover, a large set of HUIs also makes HUIM algorithms less efficient in terms of execution time and memory consumption. To address this problem, closed high-utility itemsets (CHUIs), a concise and lossless representation of all HUIs, were proposed recently. Although mining CHUIs is useful and desirable, it remains a computationally expensive task, because current algorithms often generate a huge number of candidate itemsets and are unable to prune the search space effectively. In this paper, we address these issues by proposing a novel algorithm called CLS-Miner. The proposed algorithm uses the utility-list structure to compute the utilities of itemsets directly, without producing candidates. It also introduces three novel strategies to reduce the search space: chain-estimated utility co-occurrence pruning, lower-branch pruning, and pruning by coverage. Moreover, an effective method for checking whether an itemset is a subset of another itemset is introduced to further reduce the time required to discover CHUIs. To evaluate the proposed algorithm and its novel strategies, extensive experiments were conducted on six benchmark datasets with various characteristics. The results show that the proposed strategies are highly efficient and effective, that the proposed CLS-Miner algorithm outperforms the current state-of-the-art CHUD and CHUI-Miner algorithms, and that CLS-Miner scales linearly.
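For readers unfamiliar with the utility-list structure that CLS-Miner builds on, the sketch below constructs, for each item, the per-transaction (tid, iutil, rutil) entries used by HUI-Miner-style algorithms; the ascending-TWU item order is an assumption here, and CLS-Miner's pruning strategies are not shown.

```python
# Hedged sketch: building single-item utility-lists (HUI-Miner-style structure).
from collections import defaultdict

def build_utility_lists(db):
    """db: list of transactions, each a list of (item, utility) pairs."""
    # Order items by transaction-weighted utilization (TWU), ascending.
    twu = defaultdict(int)
    for tx in db:
        tu = sum(u for _, u in tx)
        for item, _ in tx:
            twu[item] += tu
    order = {item: rank for rank, item in enumerate(sorted(twu, key=twu.get))}

    ulists = defaultdict(list)          # item -> [(tid, iutil, rutil), ...]
    for tid, tx in enumerate(db):
        tx = sorted(tx, key=lambda p: order[p[0]])
        for i, (item, util) in enumerate(tx):
            rutil = sum(u for _, u in tx[i + 1:])   # utility of items ordered after it
            ulists[item].append((tid, util, rutil))
    return ulists

db = [[('a', 5), ('b', 2), ('c', 1)],
      [('a', 10), ('c', 6)],
      [('b', 4), ('c', 3)]]
for item, ul in build_utility_lists(db).items():
    print(item, ul)
```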
The count of a column in a high-dimensional dataset, i.e., the number of records containing that column, is widely used in numerous applications, such as analyzing popular spots based on check-in location information and mining valuable items from shopping records. However, directly publishing this information poses a privacy threat. Differential privacy (DP), a notable paradigm for strong privacy guarantees, is therefore adopted to publish all column counts. Prior studies have verified that truncating records or grouping columns can effectively improve the accuracy of the published results. To exploit the advantages of both techniques, we combine them to further boost accuracy. However, the traditional penalty function, which measures the error introduced by a given pair of parameters (truncating length and group size), is so sensitive that the derived parameters deviate significantly from the optimal ones. To output preferable parameters, we first design a smart penalty function that is less sensitive than the traditional one. Moreover, a two-phase selection method is proposed to compute these parameters efficiently, together with a further improvement in accuracy. Extensive experiments on a broad spectrum of real-world datasets validate the effectiveness of our proposals.
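A minimal sketch of the truncate-then-group publication step, assuming each record is truncated to at most L columns (so the sensitivity of a group sum is L) and Laplace noise of scale L/epsilon is added per group, is given below; the parameter names and the even spreading of a group sum over its columns are illustrative choices, not the paper's exact mechanism.

```python
# Hedged sketch: differentially private column counts with truncation and grouping.
import numpy as np

def dp_column_counts(records, n_cols, L, g, eps, seed=0):
    rng = np.random.default_rng(seed)
    # Truncate: keep at most L columns per record to bound sensitivity.
    truncated = [r[:L] for r in records]
    counts = np.zeros(n_cols)
    for r in truncated:
        counts[list(r)] += 1
    # Group columns and publish one noisy sum per group, spread evenly.
    noisy = np.empty(n_cols)
    for start in range(0, n_cols, g):
        idx = slice(start, min(start + g, n_cols))
        group_sum = counts[idx].sum() + rng.laplace(scale=L / eps)
        noisy[idx] = group_sum / (idx.stop - idx.start)
    return noisy

records = [[0, 2, 5], [1, 2], [2, 3, 4, 5], [0, 5]]
print(dp_column_counts(records, n_cols=6, L=3, g=2, eps=1.0))
```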
We present a very different cause of search engine user behavior: fascination. Fascination is generally identified as the initial effect of a product attribute on users' interest and purchase intentions. Given that in most cases the cursor is driven directly by the hand via a mouse (or touchpad), we use cursor movement as the critical feature for analyzing a user's reaction to fascinating search results. This paper provides a deep insight into the goal-directed cursor movement that occurs within a remarkably short period of time (<30 milliseconds), the interval between a user's click-through and decision-making behaviors. Rather than the fundamentals, we focus on revealing the characteristics of this split-second cursor movement. Our empirical findings show that a user may push or pull the mouse with slightly greater strength when fascinated by a search result; as a result, the cursor slides toward the search result with increased momentum. We model this momentum through a combination of translational and angular kinetic energy, and, based on Fitts' law, we implement goal-directed cursor movement identification. Using the momentum together with other physical features, we built different fascination-based search result re-ranking systems. Our experiments show that goal-directed cursor momentum is an effective feature for detecting fascination and is feasible in both the personalized and cross-media cases. In addition, we detail the advantages and disadvantages of both click-through rate and cursor momentum for re-ranking search results.
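The sketch below shows one way to turn a short (x, y, t) cursor trace into a momentum-like score combining translational and angular kinetic energy terms; the unit mass and moment of inertia, and the way the two terms are averaged and summed, are assumptions for illustration, not the paper's exact model.

```python
# Hedged sketch: a kinetic-energy-style score for a goal-directed cursor trace.
import numpy as np

def cursor_energy(xs, ys, ts, mass=1.0, inertia=1.0):
    xs, ys, ts = map(np.asarray, (xs, ys, ts))
    dt = np.diff(ts)
    vx, vy = np.diff(xs) / dt, np.diff(ys) / dt
    speed2 = vx**2 + vy**2                        # squared translational speed
    heading = np.arctan2(vy, vx)                  # direction of motion per step
    omega = np.diff(np.unwrap(heading)) / dt[1:]  # angular velocity of the heading
    e_trans = 0.5 * mass * speed2.mean()
    e_rot = 0.5 * inertia * (omega**2).mean()
    return e_trans + e_rot

# A short, fast, nearly straight trace (timestamps in seconds)
t = np.linspace(0, 0.03, 10)
x = np.linspace(100, 160, 10) + np.random.default_rng(2).normal(0, 0.5, 10)
y = np.linspace(200, 205, 10)
print("cursor energy:", cursor_energy(x, y, t))
```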
Network-layer multicast is a highly efficient one-to-many transmission mode. The data rates supported by different group members may differ if these members are located in different network environments. Currently, there are roughly two types of methods for solving this problem: one limits the data rate so that every group member can sustain the transmission, and the other builds multiple trees to increase the provision of network bandwidth. The former is inefficient in bandwidth usage, and the latter adds too much state to the network, which is a serious problem in Software-Defined Networks. In this paper, we propose building localized extra path(s) for each bottleneck link in the tree. By providing extra bandwidth to reinforce the bottleneck links, the overall data rate is increased. Because extra paths are only built in small areas around the bottleneck links, the amount of state added to the network is kept as small as possible. Experiments on Mininet verify the effectiveness of our solution.
Collective motion is one of the most fascinating phenomena and is mainly caused by interactions between individuals. Physical barriers, facilities that divide a crowd into different lanes, greatly affect the measurement of such interactions. In this paper, we propose the physical-barrier detection based collective motion analysis (PDCMA) approach. The main idea is that an interaction between spatially adjacent pedestrians does not actually exist if they are separated by a physical barrier. First, the physical barriers are extracted by two-stage clustering, and the scene is automatically divided into several motion regions. Second, local region collectiveness is calculated to represent the interactions between pedestrians in each region. Finally, extensive evaluations of three typical methods, i.e., PDCMA, Collectiveness, and the average normalized Velocity, show the efficiency and efficacy of our approach in scenes with and without physical barriers. Moreover, several escalator scenes are selected as typical physical-barrier test scenes to demonstrate the performance of our approach. Compared with current collective motion analysis methods, our approach adapts better to scenes with physical barriers.