Oct 2019, Volume 13 Issue 5
    

  • Select all
  • PERSPECTIVE
    Lihong LI
  • RESEARCH ARTICLE
    Linjun MEI, Dan FENG, Lingfang ZENG, Jianxi CHEN, Jingning LIU

    Redundant array of independent SSDs (RAIS) is generally based on the traditional RAID design and implementation. The random small write problem is a serious challenge of RAIS. Random small writes in parity-based RAIS systems generate significantly more pre-reads and writes which can degrade RAIS performance and shorten SSD lifetime. In order to overcome the well-known write-penalty problem in the parity-based RAID5 storage systems, several logging techniques such as Parity Logging and Data Logging have been put forward. However, these techniques are originally based on mechanical characteristics of the HDDs, which ignore the properties of the flash memory.

    In this article, we firstly propose RAISL, a flash-aware logging method that improves the small write performance of RAIS storage systems. RAISL writes new data instead of new data and pre-read data to the log SSD by making full use of the invalid pages on the SSD of RAIS. RAISL does not need to perform the pre-read operations so that the original characteristics of workloads are kept. Secondly, we propose AGCRL on the basis of RAISL to further boost performance. AGCRL combines RAISL with access characteristic to guide read and write cost regulation to improve the performance of RAIS storage systems. Our experiments demonstrate that the RAISL significantly improves write performance and AGCRL improves both of write performance and read performance. AGCRL on average outperforms RAIS5 and RAISL by 39.15% and 16.59% respectively.

  • RESEARCH ARTICLE
    Libing WU, Lei NIE, Samee U. KHAN, Osman KHALID, Dan WU

    Adaptive traffic light scheduling based on realtime traffic information processing has proven effective for urban traffic congestion management. However, fine-grained information regarding individual vehicles is difficult to acquire through traditional data collection techniques and its accuracy cannot be guaranteed because of congestion and harsh environments. In this study, we first build a pipeline model based on vehicle-to-infrastructure communication, which is a salient technique in vehicular adhoc networks. This model enables the acquisition of fine-grained and accurate traffic information in real time via message exchange between vehicles and roadside units. We then propose an intelligent traffic light scheduling method (ITLM) based on a “demand assignment” principle by considering the types and turning intentions of vehicles. In the context of this principle, a signal phase with more vehicles will be assigned a longer green time. Furthermore, a green-way traffic light scheduling method (GTLM) is investigated for special vehicles (e.g., ambulances and fire engines) in emergency scenarios. Signal states will be adjusted or maintained by the traffic light control system to keep special vehicles moving along smoothly. Comparative experiments demonstrate that the ITLM reduces average wait time by 34%–78% and average stop frequency by 12%–34% in the context of traffic management. The GTLM reduces travel time by 22%–44% and 30%–55% under two types of traffic conditions and achieves optimal performance in congested scenarios.

  • RESEARCH ARTICLE
    Xin CHEN, He JIANG, Zhenyu CHEN, Tieke HE, Liming NIE

    In crowdsourced mobile application testing, workers are often inexperienced in and unfamiliar with software testing. Meanwhile, workers edit test reports in descriptive natural language on mobile devices. Thus, these test reports generally lack important details and challenge developers in understanding the bugs. To improve the quality of inspected test reports, we issue a new problem of test report augmentation by leveraging the additional useful information contained in duplicate test reports. In this paper, we propose a new framework named test report augmentation framework (TRAF) towards resolving the problem. First, natural language processing (NLP) techniques are adopted to preprocess the crowdsourced test reports. Then, three strategies are proposed to augment the environments, inputs, and descriptions of the inspected test reports, respectively. Finally, we visualize the augmented test reports to help developers distinguish the added information. To evaluate TRAF, we conduct experiments over five industrial datasets with 757 crowdsourced test reports. Experimental results show that TRAF can recommend relevant inputs to augment the inspected test reports with 98.49% in terms of NDCG and 88.65% in terms of precision on average, and identify valuable sentences from the descriptions of duplicates to augment the inspected test reports with 83.58% in terms of precision, 77.76% in terms of recall, and 78.72% in terms of F-measure on average. Meanwhile, empirical evaluation also demonstrates that augmented test reports can help developers understand and fix bugs better.

  • RESEARCH ARTICLE
    Zhuo WANG, Qun CHEN, Bo SUO, Wei PAN, Zhanhuai LI

    MapReduce, a parallel computational model, has been widely used in processing big data in a distributed cluster. Consisting of alternate map and reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge of ensuring balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on mapped data.

    In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers the statistics on their sizes in the process of mapping. The micropartitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps, micro-partition scheduling and micro-partition allocation. We propose a Markov decision process (MDP) model to optimize the problem of multiple-round micropartition scheduling for allocation commitment. We present an optimal solution with the time complexity of O(K · N2), in which K represents the number of allocation rounds and N represents the number of micro-partitions. Alternatively, we also present a greedy but more efficient algorithm with the time complexity of O(K · N ln N). Then, we propose a minmax programming model to handle the allocation mapping between micro-partitions and reducers, and present an effective heuristic solution due to its NP-completeness. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that compared with the state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as overall better parallel performance.

  • RESEARCH ARTICLE
    Shuai ZHANG, Xinjun MAO, Fu HOU, Peini LIU

    In diverse and self-governed multiple clouds context, the service management and discovery are greatly challenged by the dynamic and evolving features of services. How to manage the features of cloud services and support accurate and efficient service discovery has become an open problem in the area of cloud computing. This paper proposes a field model of multiple cloud services and corresponding service discovery method to address the issue. Different from existing researches, our approach is inspired by Bohr atom model. We use the abstraction of energy level and jumping mechanism to describe services status and variations, and thereby to support the service demarcation and discovery. The contributions of this paper are threefold. First, we propose the abstraction of service energy level to represent the status of services, and service jumping mechanism to investigate the dynamic and evolving features as the variations and re-demarcation of cloud services according to their energy levels. Second, we present user acceptable service region to describe the services satisfying users’ requests and corresponding service discovery method, which can significantly decrease services search scope and improve the speed and precision of service discovery. Third, a series of algorithms are designed to implement the generation of field model, user acceptable service regions, service jumping mechanism, and user-oriented service discovery.We have conducted an extensive experiments on QWS dataset to validate and evaluate our proposed models and algorithms. The results show that field model can well support the representation of dynamic and evolving aspects of services in multiple clouds context and the algorithms can improve the accuracy and efficiency of service discovery.

  • RESEARCH ARTICLE
    Xu-Ying LIU, Sheng-Tao WANG, Min-Ling ZHANG

    The problem of limited minority class data is encountered in many class imbalanced applications, but has received little attention. Synthetic over-sampling, as popular class-imbalance learning methods, could introduce much noise when minority class has limited data since the synthetic samples are not i.i.d. samples of minority class. Most sophisticated synthetic sampling methods tackle this problem by denoising or generating samples more consistent with ground-truth data distribution. But their assumptions about true noise or ground-truth data distribution may not hold. To adapt synthetic sampling to the problem of limited minority class data, the proposed Traso framework treats synthetic minority class samples as an additional data source, and exploits transfer learning to transfer knowledge from them to minority class. As an implementation, TrasoBoost method firstly generates synthetic samples to balance class sizes. Then in each boosting iteration, the weights of synthetic samples and original data decrease and increase respectively when being misclassified, and remain unchanged otherwise. The misclassified synthetic samples are potential noise, and thus have smaller influence in the following iterations. Besides, the weights of minority class instances have greater change than those of majority class instances to be more influential. And only original data are used to estimate error rate to be immune from noise. Finally, since the synthetic samples are highly related to minority class, all of the weak learners are aggregated for prediction. Experimental results show TrasoBoost outperforms many popular class-imbalance learning methods.

  • RESEARCH ARTICLE
    Ying LI, Xiangwei KONG, Haiyan FU, Qi TIAN

    Image reranking is an effective post-processing step to adjust the similarity order in image retrieval. As key components of initialized ranking lists, top-ranked neighborhoods of a given query usually play important roles in constructing dissimilarity measure. However, the number of pertinent candidates varies with respect to different queries. Thus the images with short lists of ground truth suffer from insufficient contextual information. It consequently introduces noises when using k-nearest neighbor rule to define the context. In order to alleviate this problem, this paper proposes auxiliary points which are added as assistant neighbors in an unsupervised manner. These extra points act on revealing implicit similarity in the metric space and clustering matched image pairs. By isometrically embedding each constructed metric space into the Euclidean space, the image relationships on underlying topological manifolds are locally represented by distance descriptions. Furthermore, by combining Jaccard index with our auxiliary points, we present a contextual modeling on auxiliary points (CMAP) method for image reranking.With richer contextual activations, the Jaccard similarity coefficient defined by local distribution achieves more reliable outputs as well as more stable parameters. Extensive experiments demonstrate the robustness and effectiveness of the proposed method.

  • RESEARCH ARTICLE
    Weinan ZHANG, Ting LIU, Qingyu YIN, Yu ZHANG

    Dropped pronouns (DPs) are ubiquitous in prodrop languages like Chinese, Japanese etc. Previous work mainly focused on painstakingly exploring the empirical features for DPs recovery. In this paper, we propose a neural recovery machine (NRM) to model and recover DPs in Chinese to avoid the non-trivial feature engineering process. The experimental results show that the proposed NRM significantly outperforms the state-of-the-art approaches on two heterogeneous datasets. Further experimental results of Chinese zero pronoun (ZP) resolution show that the performance of ZP resolution can also be improved by recovering the ZPs to DPs.

  • RESEARCH ARTICLE
    Yudong QIN, Deke GUO, Lailong LUO, Geyao CHENG, Zeliu DING

    The visible light communication (VLC) has the potential to provide dense and fast connectivity at low cost. In this paper, we propose a novel VLC enabled Wireless Small-World Data Center (WSWDC). It employs VLC links to achieve a fully wireless data center network (DCN) across racks for the first time. The using of VLC links eliminates hierarchical switches and inter-rack cables, and thus reducing hardware investment, as well as maintenance cost. More precisely, to simplify the configuration and control operations, we propose three DCN design rationales: (1) fully-wireless, all inter-rack links are wireless; (2) easy-deployable, it is not necessary to change the existing infrastructure inside data center; (3) plug-and-play, no extra centralized control operations are required. Previous proposals, however, cannot achieve the three rationales simultaneously. To this end, we first use regular VLC links to interconnect racks as a regular grid DCN and optimize the rack placement to shorten the average path length and the network diameter. To further exploiting the benefits of VLC links, a few random VLC links are carefully introduced to update the wireless grid DCN as a wireless small-world DCN. To avoid the potential interference among VLC links, we deploy VLC transceivers at different heights on the top of each rack. In this way, VLC links would not interfere with others at each height level. Moreover, we design a greedy but efficient routing method for any pair of racks using their identifiers as inputs. Comprehensive evaluation results indicate that our WSWDC exhibits good topological properties and network performance.

  • RESEARCH ARTICLE
    Fei WANG, Tieyun QIAN, Bin LIU, Zhiyong PENG

    Patent prior art search uses dispersed information to retrieve all the relevant documents with strong ambiguity from the massive patent database. This challenging task consists in patent reduction and patent expansion. Existing studies on patent reduction ignore the relevance between technical characteristics and technical domains, and result in ambiguous queries. Works on patent expansion expand terms from external resource by selecting words with similar distribution or similar semantics. However, this splits the relevance between the distribution and semantics of the terms. Besides, common repository hardly meets the requirement of patent expansion for uncommon semantics and unusual terms. In order to solve these problems, we first present a novel composite-domain perspective model which converts the technical characteristic of a query patent to a specific composite classified domain and generates aspect queries. We then implement patent expansionwith double consistency by combining distribution and semantics simultaneously.We also propose to train semantic vector spaces via word embedding under the specific classified domains, so as to provide domain-aware expanded resource. Finally, multiple retrieval results of the same topic are merged based on perspective weight and rank in the results. Our experimental results on CLEP-IP 2010 demonstrate that our method is very effective. It reaches about 5.43% improvement in recall and nearly 12.38% improvement in PRES over the state-of-the-art. Our work also achieves the best performance balance in terms of recall, MAP and PRES.

  • RESEARCH ARTICLE
    Zhefan ZHONG, Xin LIN, Liang HE, Jing YANG

    Being decades of study, the usability of database systems have received more attention in recent years. Now it is especially able to explain missing objects in a query result, which is called “why-not” questions, and is the focus of concern. This paper studies the problem of answering whynot questions on KNN queries. In our real life, many users would like to use KNN queries to investigate the surrounding circumstances. Nevertheless, they often feel disappointed when finding the result not including their expected objects. In this paper, we use the query refinement approach to resolve the problem. Given the original KNN query and a set of missing objects as input, our algorithm offer a refined KNN query that includes the missing objects to the user. The experimental results demonstrate the efficiency of our proposed optimizations and algorithms.

  • RESEARCH ARTICLE
    Chengcheng YU, Fan XIA, Weining QIAN, Aoying ZHOU

    A social stream refers to the data stream that records a series of social entities and the dynamic interactions between two entities. It can be employed to model the changes of entity states in numerous applications. The social streams, the combination of graph and streaming data, pose great challenge to efficient analytical query processing, and are key to better understanding users’ behavior. Considering of privacy and other related issues, a social stream generator is of great significance. A framework of synthetic social stream generator (SSG) is proposed in this paper. The generated social streams using SSG can be tuned to capture several kinds of fundamental social stream properties, including patterns about users’ behavior and graph patterns. Extensive empirical studies with several real-life social stream data sets show that SSG can produce data that better fit to real data. It is also confirmed that SSG can generate social stream data continuously with stable throughput and memory consumption. Furthermore, we propose a parallel implementation of SSG with the help of asynchronized parallel processing model and delayed update strategy. Our experiments verify that the throughput of the parallel implementation can increase linearly by increasing nodes.

  • RESEARCH ARTICLE
    Zhenxue HE, Limin XIAO, Fei GU, Li RUAN, Zhisheng HUO, Mingzhe LI, Mingfa ZHU, Longbing ZHANG, Rui LIU, Xiang WANG

    Delay optimization has recently attracted significant attention. However, few studies have focused on the delay optimization of mixed-polarity Reed-Muller (MPRM) logic circuits. In this paper, we propose an efficient delay optimization approach (EDOA) for MPRM logic circuits under the unit delay model, which can derive an optimal MPRM logic circuit with minimum delay. First, the simplest MPRM expression with the fewest number of product terms is obtained using a novel Reed-Muller expression simplification approach (RMESA) considering don’t-care terms. Second, a minimum delay decomposition approach based on a Huffman tree construction algorithm is utilized on the simplestMPRM expression. Experimental results on MCNC benchmark circuits demonstrate that compared to the Berkeley SIS 1.2 and ABC, the EDOA can significantly reduce delay for most circuits. Furthermore, for a few circuits, while reducing delay, the EDOA incurs an area penalty.

  • RESEARCH ARTICLE
    Kang LI, Fazhi HE, Haiping YU, Xiao CHEN

    This paper presents a novel tracking algorithm which integrates two complementary trackers. Firstly, an improved Bayesian tracker(B-tracker) with adaptive learning rate is presented. The classification score of B-tracker reflects tracking reliability, and a low score usually results from large appearance change. Therefore, if the score is low, we decrease the learning rate to update the classifier fast so that B-tracker can adapt to the variation and vice versa. In this way, B-tracker is more suitable than its traditional version to solve appearance change problem. Secondly, we present an improved incremental subspace learning method tracker(Stracker). We propose to calculate projected coordinates using maximum posterior probability, which results in a more accurate reconstruction error than traditional subspace learning tracker. Instead of updating at every time, we present a stopstrategy to deal with occlusion problem. Finally, we present an integrated framework(BAST), in which the pair of trackers run in parallel and return two candidate target states separately. For each candidate state, we define a tracking reliability metrics to measure whether the candidate state is reliable or not, and the reliable candidate state will be chosen as the target state at the end of each frame. Experimental results on challenging sequences show that the proposed approach is very robust and effective in comparison to the state-of-the-art trackers.

  • RESEARCH ARTICLE
    Xuan LI, Jin LI, Siuming YIU, Chongzhi GAO, Jinbo XIONG

    Internet of Things (IoT) has drawn much attention in recent years. However, the image data captured by IoT terminal devices are closely related to users’ personal information, which are sensitive and should be protected. Though traditional privacy-preserving outsourced computing solutions such as homomorphic cryptographic primitives can support privacy-preserving computing, they consume a significant amount of computation and storage resources. Thus, it becomes a heavy burden on IoT terminal devices with limited resources. In order to reduce the resource consumption of terminal device, we propose an edge-assisted privacy-preserving outsourced computing framework for image processing, including image retrieval and classification. The edge nodes cooperate with the terminal device to protect data and support privacy-preserving computing on the semitrusted cloud server. Under this framework, edge-assisted privacy-preserving image retrieval and classification schemes are proposed in this paper. The security analysis and performance evaluation show that the proposed schemes greatly reduce the computational, communication and storage burden of IoT terminal device while ensuring image data security.

  • LETTER
    Yongyi YAN, Jumei YUE, Zhumu FU, Zengqiang CHEN