Jan 2018, Volume 12 Issue 1

  • PERSPECTIVE
    Doheon LEE
  • REVIEW ARTICLE
    Xiaoye MIAO, Yunjun GAO, Su GUO, Wanqi LIU

    Incomplete data accompanies our life processes and covers almost all fields of scientific studies, arising from delivery failures, battery exhaustion, accidental loss, and so on. However, modelling, indexing, and querying incomplete data pose significant challenges. For example, queries over incomplete data often return unsatisfactory results because of improper incompleteness-handling methods. In this paper, we systematically review the management of incomplete data, including modelling, indexing, querying, and handling methods for incomplete data. We also give an overview of several application scenarios of incomplete data, and summarize the existing systems related to incomplete data. It is our hope that this survey provides insights to the database community on how incomplete data is managed, and inspires database researchers to develop more advanced processing techniques and tools to cope with the issues arising from incomplete data in the real world.
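
    As an editorial illustration of one incompleteness-handling method such surveys cover, the sketch below evaluates a predicate under three-valued logic, distinguishing certain from possible answers; the relation, predicate, and helper names are our assumptions, not the paper's.

```python
# A minimal sketch of three-valued logic over tuples with missing
# attributes, one common way queries handle incomplete data.
# None marks a missing value.
import operator

def satisfies(tup, attr, op, const):
    """Return True, False, or None (unknown) for a comparison predicate."""
    value = tup.get(attr)
    if value is None:               # missing value: truth is unknown
        return None
    return op(value, const)

def certain_answers(relation, attr, op, const):
    """Tuples satisfying the predicate in every possible completion."""
    return [t for t in relation if satisfies(t, attr, op, const) is True]

def possible_answers(relation, attr, op, const):
    """Tuples satisfying the predicate in at least one completion."""
    return [t for t in relation if satisfies(t, attr, op, const) is not False]

sensors = [{"id": 1, "temp": 25.0},
           {"id": 2, "temp": None},    # e.g., battery exhausted
           {"id": 3, "temp": 18.5}]
print(certain_answers(sensors, "temp", operator.gt, 20.0))   # id 1 only
print(possible_answers(sensors, "temp", operator.gt, 20.0))  # ids 1 and 2
```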

  • REVIEW ARTICLE
    Ying-Ying XU, Li-Xiu YAO, Hong-Bin SHEN

    Subcellular localization of proteins provides key hints for inferring their functions and structures in cells. With breakthroughs in recent molecular imaging techniques, 2D bioimages have become increasingly popular for automatically analyzing protein subcellular location patterns. Compared with the widely used 1D amino acid sequence data, images of protein distribution are more intuitive and interpretable, making them a better choice for many applications that reveal the dynamic characteristics of proteins, such as detecting protein translocation and quantifying proteins. In this paper, we systematically review recent progress in the field of automated image-based protein subcellular location prediction, and classify it into four categories: the growth of bioimage databases, the description of subcellular location distribution patterns, classification methods, and applications of the prediction systems. We also discuss some potential directions in this field.
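
    As a hedged illustration of the generic pipeline this review surveys (describe each bioimage with a feature vector, then classify the location pattern), the sketch below pairs a crude intensity histogram with an SVM; the toy images, labels, and feature choice are our assumptions, not the authors' system.

```python
# A minimal sketch of image-based subcellular location prediction:
# feature description followed by supervised classification.
import numpy as np
from sklearn.svm import SVC

def intensity_histogram(image, bins=16):
    """Crude location-pattern descriptor: normalized intensity histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

# Toy stand-ins for fluorescence microscopy images (values in [0, 1]).
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(40)]
labels = [i % 2 for i in range(40)]      # e.g., 0 = nucleus, 1 = cytoplasm

X = np.array([intensity_histogram(img) for img in images])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:5]))
```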

  • REVIEW ARTICLE
    Fei TIAN, Tao QIN, Tie-Yan LIU

    Pricing plays a central role in a company’s profitability, and has therefore been studied extensively in the economics literature. When designing a pricing mechanism/model, an important principle to consider is “price discrimination”, which refers to selling the same resources at different prices according to the different values buyers place on them. To meet the “price discrimination” principle, especially when the number of buyers is large, computational methods, which act in a more accurate and principled way, are usually needed to determine the optimal allocation of sellers’ resources (whom to sell to) and the optimal payment of buyers (what to charge). Nowadays, in the Internet era, in which many buying and selling processes are conducted through the Internet, the design of computational pricing models faces both new challenges and new opportunities, considering that (i) nearly real-time interactions between people enable buyers to reveal their needs and sellers to expose their information in a more expressive manner, and (ii) the large-scale interaction data require powerful methods for more efficient processing and enable sellers to model different buyers more precisely. In this paper, we review recent advances in the analysis and design of computational pricing models for representative Internet industries, e.g., online advertising and cloud computing. In particular, we introduce how computational approaches can be used to analyze buyers’ behaviors (i.e., equilibrium analysis), improve resource utilization (i.e., social welfare analysis), and boost sellers’ profit (i.e., revenue analysis). We also discuss how machine learning techniques can be used to better understand buyers’ behaviors and design more effective pricing mechanisms, given the availability of large-scale data. Moreover, we discuss future research directions in computational pricing, which we hope will inspire more researchers to contribute to this important domain.
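
    As an illustration of computational pricing with “price discrimination”, the sketch below implements a generalized second-price (GSP) allocation of ad slots, a canonical rule in the online advertising setting this review covers; the bids and click-through rates are illustrative assumptions.

```python
# A minimal sketch of generalized second-price (GSP) slot allocation:
# rank bidders by bid; each winner pays the next-highest bid per click.

def gsp(bids, slot_ctrs):
    """bids: {bidder: bid per click}; slot_ctrs: click rates, best first.
    Returns [(bidder, slot_index, price_per_click)]."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    outcome = []
    for slot in range(min(len(slot_ctrs), len(ranked))):
        bidder, _ = ranked[slot]
        # price discrimination: each winner pays the bid just below theirs
        price = ranked[slot + 1][1] if slot + 1 < len(ranked) else 0.0
        outcome.append((bidder, slot, price))
    return outcome

print(gsp({"a": 3.0, "b": 5.0, "c": 1.0}, slot_ctrs=[0.10, 0.05]))
# [('b', 0, 3.0), ('a', 1, 1.0)]
```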

  • RESEARCH ARTICLE
    Xiong FU, Juzhou CHEN, Song DENG, Junchang WANG, Lin ZHANG

    Due to the increasing sizes of cloud data centers, the number of virtual machines (VMs) and applications rises quickly. The rapid growth of large-scale Internet services results in an unbalanced load on network resources: the bandwidth utilization rate of some physical hosts is too high, which causes network congestion. This paper presents a layered VM migration algorithm (LVMM). The algorithm first divides the cloud data center into several regions according to the bandwidth utilization rate of the hosts. We then balance the network load of each region by VM migrations, ultimately achieving network load balance across the cloud data center. Through simulation experiments in different environments, we show that the LVMM algorithm can effectively balance the network load in cloud computing.
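
    A minimal sketch of the layered idea as described here: partition hosts into regions by bandwidth utilization, then migrate VMs from hot hosts toward cold ones. The thresholds and the greedy choice of which VM to move are our assumptions, not the paper's exact LVMM rules.

```python
# A minimal sketch of region-based VM migration for network load balance.

HIGH, LOW = 0.8, 0.3    # assumed bandwidth-utilization thresholds

def regions(hosts):
    """hosts: {name: [used_bw, capacity]} -> (hot, cold) host names."""
    util = {h: used / cap for h, (used, cap) in hosts.items()}
    hot = [h for h, r in util.items() if r > HIGH]
    cold = [h for h, r in util.items() if r < LOW]
    return hot, cold

def migrate_one(hosts, vms):
    """Move the most bandwidth-hungry VM from a hot host to a cold one."""
    hot, cold = regions(hosts)
    if not hot or not cold:
        return None                             # balanced enough
    src, dst = hot[0], cold[0]
    candidates = [v for v, info in vms.items() if info["host"] == src]
    vm = max(candidates, key=lambda v: vms[v]["bw"], default=None)
    if vm is None:
        return None
    vms[vm]["host"] = dst                       # the migration itself
    hosts[src][0] -= vms[vm]["bw"]
    hosts[dst][0] += vms[vm]["bw"]
    return vm, src, dst

hosts = {"h1": [90, 100], "h2": [10, 100]}
vms = {"vm1": {"host": "h1", "bw": 40}, "vm2": {"host": "h1", "bw": 50}}
print(migrate_one(hosts, vms))   # ('vm2', 'h1', 'h2')
```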

  • RESEARCH ARTICLE
    Anil Kumar KARNA, Jinbo DU, Haihao SHEN, Hao ZHONG, Jiong GONG, Haibo YU, Xiangning MA, Jianjun ZHAO

    Symbolic execution is widely used in many code analysis, testing, and verification tools. As symbolic execution exhaustively explores all feasible paths, it is quite time-consuming. To address this problem, researchers have parallelized existing symbolic execution tools (e.g., KLEE). In particular, Cloud9 is a widely used parallelized symbolic execution tool, and researchers have used it to analyze real code. However, critics point out that tools such as Cloud9 still cannot analyze large-scale code. In this paper, we conduct a field study on Cloud9, in which we use KLEE and Cloud9 to analyze benchmarks in C. Our results confirm the criticism. Based on the results, we identify three bottlenecks that hinder the performance of Cloud9: the communication time gap, the job transfer policy, and the cache management of solved constraints. To address these problems, we tune the communication time gap with better parameters, modify the job transfer policy, and implement an approach for cache management of solved constraints. We conduct two evaluations on our benchmarks and a real application to assess our improvements. Our results show that our tuned Cloud9 reduces the execution time significantly, both on our benchmarks and on the real application. Furthermore, our evaluation results show that our tuning techniques improve effectiveness on all the devices, with improvements of up to five times, depending on the tuning value of our approach and the behaviour of the program under test.
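
    As a hedged illustration of the third bottleneck fix, the sketch below caches solutions to already-solved constraint sets; the LRU eviction policy and the key canonicalization are our assumptions, not Cloud9's actual implementation.

```python
# A minimal sketch of a solved-constraint cache for symbolic execution.
from collections import OrderedDict

class ConstraintCache:
    """LRU cache mapping a canonicalized constraint set to a solution."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()

    def key(self, constraints):
        return tuple(sorted(constraints))     # order-insensitive key

    def lookup(self, constraints):
        k = self.key(constraints)
        if k in self.entries:
            self.entries.move_to_end(k)       # refresh recency
            return self.entries[k]
        return None

    def store(self, constraints, solution):
        self.entries[self.key(constraints)] = solution
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = ConstraintCache()
cache.store(["x > 0", "x < 10"], {"x": 5})
print(cache.lookup(["x < 10", "x > 0"]))      # {'x': 5}, any order
```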

  • RESEARCH ARTICLE
    Jayanthi MANICASSAMY, Dinesh KARUNANIDHI, Sujatha POTHULA, Vengattaraman THIRUMAL, Dhavachelvan PONNURANGAM, Subramanian RAMALINGAM

    The multiple knapsack problem (MKP) forms a base for resolving many real-life problems. It has also been considered with multiple objectives in genetic algorithms (GAs) to prove their efficiency. GAs use self-adaptability to effectively solve complex problems with constraints, but in certain cases self-adaptability fails by converging toward an infeasible region. This pitfall can be resolved with various existing repair techniques; however, these cannot guarantee convergence toward the optimal solution. To overcome this issue, gene position-based suppression (GPS) has been modeled and embedded as a new phase in a classical GA. This phase works on the genes of a newly generated individual after the recombination phase, to keep the solution vector within its feasible region and to improve it toward the optimal solution. Genes holding the highest expressibility are reserved in a subset, as the best genes identified from the current individuals, replacing the weaker genes in the subset. This subset is used by the next generated individual to improve the solution vector and to retain the best genes of the individuals. Each gene’s positional point and its genotype exposure for each region in an environment are used to fit the best unique genes. Further, suppression of expression in conflicting genes relies on the required level of exposure in the environment, or on eliminating duplicate genes from the environment. The MKP benchmark instances from the OR-Library are used to test the new model experimentally. The outcome shows that GPS in a classical GA is superior in most cases to the other existing repair techniques.
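
    As a hedged illustration of a repair-style phase after recombination in a knapsack GA, the sketch below greedily drops low-profit genes until the solution vector returns to the feasible region; this generic heuristic stands in for, and is not, the paper's GPS operator.

```python
# A minimal sketch of a post-recombination repair phase for 0/1 MKP.

def repair(individual, profits, weights, capacities):
    """individual: 0/1 gene list; weights[j][i]: weight of item i in
    knapsack j; capacities[j]: capacity of knapsack j."""
    def load(j):
        return sum(w * x for w, x in zip(weights[j], individual))
    # Drop the least profitable selected gene until all constraints hold.
    while any(load(j) > capacities[j] for j in range(len(capacities))):
        selected = [i for i, x in enumerate(individual) if x]
        worst = min(selected, key=lambda i: profits[i])
        individual[worst] = 0
    return individual

ind = [1, 1, 1, 1]
print(repair(ind, profits=[9, 2, 7, 4],
             weights=[[3, 5, 2, 6]], capacities=[10]))  # -> [1, 0, 1, 0]
```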

  • RESEARCH ARTICLE
    Geng ZHANG, Yangmin LI, Yuhui SHI

    Particle swarm optimizer (PSO) is an effective tool for solving many optimization problems. However, it may easily get trapped in a local optimum when solving complex multimodal nonseparable problems. This paper presents a novel algorithm called distributed learning particle swarm optimizer (DLPSO) to solve multimodal nonseparable problems. The strategy of DLPSO is to extract good vector information from local vectors distributed around the search space and then to form a new vector that can jump out of local optima and be optimized further. Experimental studies on a set of test functions show that DLPSO outperforms several peer algorithms in solving optimization problems with few interactions between variables.
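
    As a hedged illustration of the distributed-learning strategy, the sketch below assembles a new guidance vector dimension by dimension from several local best vectors scattered over the search space; the random per-dimension choice is our simplification, not the exact DLPSO operator.

```python
# A minimal sketch of building an exemplar vector from distributed
# local bests, so a particle can escape any single local optimum.
import random

def assemble_exemplar(local_bests):
    """Pick each dimension's value from some local best vector."""
    dim = len(local_bests[0])
    return [random.choice(local_bests)[d] for d in range(dim)]

local_bests = [[0.1, 3.2, -1.0], [0.0, 2.9, -0.8], [0.3, 3.0, -1.2]]
print(assemble_exemplar(local_bests))   # e.g., [0.0, 3.2, -1.2]
```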

  • RESEARCH ARTICLE
    Weimin WANG, Dan ZHOU

    The problem of spam short message (SMS) recognition involves many aspects of natural language processing. A good solution to this problem can not only improve the quality of the mobile experience, but also plays a positive role in promoting the analysis of short text in current mobile applications, such as WeChat and microblogs. As spam SMSes are characterized by sparsity, transformation, and real-time arrival, we propose three methods at different levels, i.e., recognition based on symbolic features, recognition based on text similarity, and recognition based on pattern matching. By combining these methods, we obtain a multi-level approach to spam SMS recognition. In order to enrich the pattern base and reduce manual labor and time, we propose a quasi-pattern learning method, which utilizes quasi-pattern matching results in the pattern matching process. The method can learn many interesting and new patterns from the SMS corpus. Finally, a comprehensive analysis indicates that our spam SMS recognition approach achieves a precision rate as high as 95.18%, and a recall rate of 95.51%.
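
    As a hedged illustration of the multi-level combination, the sketch below chains the three levels named in the abstract: symbolic features, text similarity, and pattern matching. The concrete tokens, similarity threshold, and patterns are toy assumptions, not the paper's corpus or rules.

```python
# A minimal sketch of multi-level spam SMS recognition.
import re
from difflib import SequenceMatcher

SPAM_PATTERNS = [re.compile(r"win .* prize"), re.compile(r"click .* link")]
KNOWN_SPAM = ["you win a free prize now", "click this link to claim cash"]

def is_spam(sms, sim_threshold=0.8):
    # Level 1: symbolic features (e.g., suspicious tokens).
    if any(tok in sms for tok in ("$$$", "100% free")):
        return True
    # Level 2: text similarity against known spam messages.
    if any(SequenceMatcher(None, sms, s).ratio() >= sim_threshold
           for s in KNOWN_SPAM):
        return True
    # Level 3: pattern matching.
    return any(p.search(sms) for p in SPAM_PATTERNS)

print(is_spam("click here, you win a big prize"))  # True (pattern level)
```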

  • RESEARCH ARTICLE
    Weihuang HUANG, Jeffrey Xu YU, Zechao SHANG

    Indexing is one of the most important techniques for facilitating query processing over a multi-dimensional dataset. A commonly used strategy for such indexing is to keep the tree-structured index balanced. This strategy reduces query processing cost in the worst case and handles all queries equally well. In other words, it implicitly assumes that queries are uniformly issued, partly because the query distribution cannot be known in advance and may change over time in practice. A key issue we study in this work is whether it is best to rely fully on a balanced tree-structured index, particularly as datasets become larger and larger in the big data era. When a dataset becomes very large, it becomes unreasonable to assume, at the index level, that all data in any subspace are equally important and uniformly accessed by all queries. Given the existence of query skew and the possible changes in query skew, in this paper we study how to handle such skew and its changes at the index level, without sacrificing support for arbitrary queries in a well-balanced tree index and without high overhead. To tackle the issue, we propose the index-view, a short-cut in a balanced tree-structured index that reaches objects in the more frequently accessed subspaces, and we propose a new index-view-centric framework for query processing that uses index-views in a bottom-up manner. We study the index-view selection problem in both static and dynamic settings, and we confirm the effectiveness of our approach using large real and synthetic datasets.
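
    As a hedged illustration of the index-view idea, the sketch below keeps a balanced index (stood in for by a sorted array) intact but materializes short-cuts to frequently queried ranges; the hit counter and promotion rule are our assumptions, not the paper's selection algorithm.

```python
# A minimal sketch of short-cuts over a balanced index under query skew.
import bisect

class ShortcutIndex:
    def __init__(self, keys, promote_after=3):
        self.keys = sorted(keys)          # stands in for the balanced tree
        self.hits = {}                    # query-skew statistics
        self.views = {}                   # range -> materialized answer
        self.promote_after = promote_after

    def range_query(self, lo, hi):
        if (lo, hi) in self.views:        # short-cut: skip the tree descent
            return self.views[(lo, hi)]
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        answer = self.keys[i:j]
        self.hits[(lo, hi)] = self.hits.get((lo, hi), 0) + 1
        if self.hits[(lo, hi)] >= self.promote_after:
            self.views[(lo, hi)] = answer # promote to an index-view
        return answer

idx = ShortcutIndex(range(0, 1000, 7))
for _ in range(4):
    print(len(idx.range_query(100, 200)))  # 4th call is served by the view
```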

  • RESEARCH ARTICLE
    Ilyes KHENNAK, Habiba DRIAS

    Because users increasingly use unclear and imprecise keywords to characterize their information needs, it has become necessary to expand their original search queries with additional words that better capture their actual intent. The selection of terms suitable for use as additional words generally depends on the degree of relatedness between each candidate expansion term and the query keywords. In this paper, we propose two criteria for evaluating this degree of relatedness: (1) co-occurrence frequency, where more importance is attributed to terms occurring in the largest possible number of documents in which the query keywords appear; and (2) proximity, where more importance is assigned to terms at a short distance from the query terms within documents. We also employ strength Pareto fitness assignment to satisfy both criteria simultaneously. The results of our numerical experiments on MEDLINE, the online medical information database, show that the proposed approach significantly enhances retrieval performance compared to the baseline.
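
    As a hedged illustration of selecting expansion terms under both criteria, the sketch below keeps the candidates on the Pareto front of (co-occurrence, proximity); plain Pareto dominance stands in for the strength Pareto fitness assignment, and the scores are toy values.

```python
# A minimal sketch of bi-criteria expansion-term selection.

def dominates(a, b):
    """a, b: (co_occurrence, proximity) scores; higher is better."""
    return all(x >= y for x, y in zip(a, b)) and a != b

def pareto_front(candidates):
    """candidates: {term: (co_occurrence, proximity)}."""
    return [t for t, s in candidates.items()
            if not any(dominates(o, s) for o in candidates.values())]

scores = {"cardiac": (0.9, 0.4), "heart": (0.7, 0.8),
          "attack": (0.6, 0.3), "valve": (0.2, 0.9)}
print(pareto_front(scores))   # ['cardiac', 'heart', 'valve']
```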

  • RESEARCH ARTICLE
    Wei GAO, Guilin WANG, Kefei CHEN, Xueli WANG

    Using Shamir’s secret sharing scheme to indirectly share the identity-based private key in the form of a pairing group element, we propose an efficient identity-based threshold decryption scheme from pairings and prove its security in the random oracle model. This new pairing-based scheme offers several improvements over other schemes in the literature. The two most noticeable features are its efficiency, achieved by drastically reducing the number of pairing computations, and the ability it gives the user to share the identity-based private key without requiring any access to a private key generator. Thanks to this ability, our ID-based threshold decryption (IBTD) scheme, the second of its kind, is significantly more efficient than the first such scheme, developed by Baek and Zheng, at the expense of a slightly increased ciphertext length. In fact, our IBTD scheme uses as few bilinear pairings as possible, and in particular does not depend on the suite of Baek–Zheng secret sharing tools based on pairings.
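
    As a hedged illustration of the building block the scheme relies on, the sketch below implements Shamir's (t, n) secret sharing over a prime field; the real scheme shares a pairing-group element, which this toy, with its assumed modulus, omits.

```python
# A minimal sketch of Shamir's (t, n) secret sharing over GF(P).
import random

P = 2**127 - 1   # assumed prime modulus for this toy example

def share(secret, t, n):
    """Split the secret into n shares; any t of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation of the polynomial at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = share(123456789, t=3, n=5)
print(reconstruct(shares[:3]))   # 123456789, from any 3 of the 5 shares
```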