Please wait a minute...

Frontiers of Computer Science

Current Issue

, Volume 12 Issue 1 Previous Issue   
For Selected: View Abstracts Toggle Thumbnails
Incomplete data management: a survey
Xiaoye MIAO, Yunjun GAO, Su GUO, Wanqi LIU
Front. Comput. Sci.. 2018, 12 (1): 4-25.
Abstract   PDF (728KB)

Incomplete data accompanies our life processes and covers almostall fields of scientific studies, as a result of delivery failure,no power of battery, accidental loss, etc. However, how to model,index, and query incomplete data incurs big challenges. For example,the queries struggling with incomplete data usually have dissatisfyingquery results due to the improper incompleteness handling methods.In this paper, we systematically review the management of incompletedata, including modelling, indexing, querying, and handling methodsin terms of incomplete data. We also overview several applicationscenarios of incomplete data, and summarize the existing systems relatedto incomplete data. It is our hope that this survey could provideinsights to the database community on how incomplete data is managed,and inspire database researchers to develop more advanced processingtechniques and tools to cope with the issues resulting from incompletedata in the real world.

References | Related Articles | Metrics
Bioimage-based protein subcellular location prediction:a comprehensive review
Ying-Ying XU, Li-Xiu YAO, Hong-Bin SHEN
Front. Comput. Sci.. 2018, 12 (1): 26-39.
Abstract   PDF (643KB)

Subcellular localization of proteins can provide key hints toinfer their functions and structures in cells. With the breakthroughof recent molecule imaging techniques, the usage of 2D bioimages hasbecome increasingly popular in automatically analyzing the proteinsubcellular location patterns. Compared with the widely used protein1D amino acid sequence data, the images of protein distribution aremore intuitive and interpretable, making the images a better choiceat many applications for revealing the dynamic characteristics ofproteins, such as detecting protein translocation and quantificationof proteins. In this paper, we systematically reviewed the recentprogresses in the field of automated image-based protein subcellularlocation prediction, and classified them into four categories includinggrowing of bioimage databases, description of subcellular locationdistribution patterns, classification methods, and applications ofthe prediction systems. Besides, we also discussed some potentialdirections in this field.

References | Supplementary Material | Related Articles | Metrics
Computational pricing in Internet era
Fei TIAN, Tao QIN, Tie-Yan LIU
Front. Comput. Sci.. 2018, 12 (1): 40-54.
Abstract   PDF (338KB)

Pricing plays a central rule to a company’s profitability,and therefore has been extensively studied in the literature of economics.When designing a pricing mechanism/model, an important principle toconsider is “price discrimination”, which refers to sellingthe same resources with different prices according to different valuesof buyers. To meet the “price discrimination” principle,especially when the number of buyers is large, computational methods,which act in a more accurate and principled way, are usually neededto determine the optimal allocation of sellers’ resources (whomto sell to) and the optimal payment of buyers (what to charge).Nowadays,in the Internet era in which quite a lot of buy and sell processesare conducted through Internet, the design of computational pricingmodels faces both new challenges and opportunities, considering that(i) nearly realtime interactions between people enable the buyersto reveal their needs and enable the sellers to expose their informationin a more expressive manner, (ii) the large-scale interaction datarequire powerful methods for more efficient processing and enablethe sellers to model different buyers in a more precise manner. Inthis paper, we review recent advances on the analysis and design ofcomputational pricing models for representative Internet industries,e.g., online advertising and cloud computing. In particular, we introducehow computational approaches can be used to analyze buyer’sbehaviors (i.e., equilibrium analysis), improve resource utilization(i.e., social welfare analysis), and boost seller’s profit (i.e.,revenue analysis). We also discuss how machine learning techniquescan be used to better understand buyer’s behaviors and designmore effective pricing mechanisms, given the availability of largescale data. Moreover, we make discussions on future research directionson computational pricing, which hopefully can inspire more researchersto contribute to this important domain.

References | Supplementary Material | Related Articles | Metrics
A retrospective of knowledge graphs
Jihong YAN, Chengyu WANG, Wenliang CHENG, Ming GAO, Aoying ZHOU
Front. Comput. Sci.. 2018, 12 (1): 55-74.
Abstract   PDF (542KB)

Information on the Internet is fragmented and presented in differentdata sources, which makes automatic knowledge harvesting and understandingformidable for machines, and even for humans. Knowledge graphs havebecome prevalent in both of industry and academic circles these years,to be one of the most efficient and effective knowledge integrationapproaches. Techniques for knowledge graph construction can mine informationfrom either structured, semi-structured, or even unstructured datasources, and finally integrate the information into knowledge, representedin a graph. Furthermore, knowledge graph is able to organize informationin an easy-to-maintain, easy-to-understand and easy-to-use manner.

In this paper, we give a summarization of techniques for constructingknowledge graphs. We review the existing knowledge graph systems developedby both academia and industry. We discuss in detail about the processof building knowledge graphs, and survey state-of-the-art techniquesfor automatic knowledge graph checking and expansion via logical inferringand reasoning. We also review the issues of graph data managementby introducing the knowledge data models and graph databases, especiallyfrom a NoSQL point of view. Finally, we overview current knowledgegraph systems and discuss the future research directions.

References | Related Articles | Metrics
Layered virtual machine migration algorithm fornetwork resource balancing in cloud computing
Xiong FU, Juzhou CHEN, Song DENG, Junchang WANG, Lin ZHANG
Front. Comput. Sci.. 2018, 12 (1): 75-85.
Abstract   PDF (674KB)

Due to the increasing sizes of cloud data centers, the numberof virtual machines (VMs) and applications rises quickly. The rapidgrowth of large scale Internet services results in unbalanced loadof network resource. The bandwidth utilization rate of some physicalhosts is too high, and this causes network congestion. This paperpresents a layered VM migration algorithm (LVMM). At first, the algorithmwill divide the cloud data center into several regions according tothe bandwidth utilization rate of the hosts. Then we balance the loadof network resource of each region by VM migrations, and ultimatelyachieve the load balance of network resource in the cloud data center.Through simulation experiments in different environments, it is provedthat the LVMMalgorithm can effectively balance the load of networkresource in cloud computing.

References | Supplementary Material | Related Articles | Metrics
Tuning parallel symbolic execution engine forbetter performance
Anil Kumar KARNA, Jinbo DU, Haihao SHEN, Hao ZHONG, Jiong GONG, Haibo YU, Xiangning MA, Jianjun ZHAO
Front. Comput. Sci.. 2018, 12 (1): 86-100.
Abstract   PDF (859KB)

Symbolic execution is widely used in many code analysis, testing,and verification tools. As symbolic execution exhaustively exploresall feasible paths, it is quite time consuming. To handle the problem,researchers have paralleled existing symbolic execution tools (e.g.,KLEE). In particular, Cloud9 is a widely used paralleled symbolicexecution tool, and researchers have used the tool to analyze realcode. However, researchers criticize that tools such as Cloud9 stillcannot analyze large scale code. In this paper, we conduct a fieldstudy on Cloud9, in which we use KLEE and Cloud9 to analyze benchmarksin C. Our results confirm the criticism. Based on the results, weidentify three bottlenecks that hinder the performance of Cloud9:the communication time gap, the job transfer policy, and the cachemanagement of the solved constraints. To handle these problems, wetun the communication time gap with better parameters, modify thejob transfer policy, and implement an approach for cache managementof solved constraints. We conduct two evaluations on our benchmarksand a real application to understand our improvements. Our resultsshow that our tuned Cloud9 reduces the execution time significantly,both on our benchmarks and the real application. Furthermore, ourevaluation results show that our tuning techniques improve the effectivenesson all the devices, and the improvement can be achieved upto fivetimes, depending upon a tuning value of our approach and the behaviourof program under test.

References | Supplementary Material | Related Articles | Metrics
GPS: a constraint-based gene position procurementin chromosome for solving large-scale multiobjective multiple knapsackproblems
Jayanthi MANICASSAMY, Dinesh KARUNANIDHI, Sujatha POTHULA, Vengattaraman THIRUMAL, Dhavachelvan PONNURANGAM, Subramanian RAMALINGAM
Front. Comput. Sci.. 2018, 12 (1): 101-121.
Abstract   PDF (839KB)

The multiple knapsack problem (MKP) forms a base for resolvingmany real-life problems. This has also been considered with multipleobjectives in genetic algorithms (GAs) for proving its efficiency.GAs use selfadaptability to effectively solve complex problems withconstraints, but in certain cases, self-adaptability fails by convergingtoward an infeasible region. This pitfall can be resolved by usingdifferent existing repairing techniques; however, this cannot assureconvergence toward attaining the optimal solution. To overcome thisissue, gene position-based suppression (GPS) has been modeled andembedded as a new phase in a classical GA. This phase works on thegenes of a newly generated individual after the recombination phaseto retain the solution vector within its feasible region and to improvethe solution vector to attain the optimal solution. Genes holdingthe highest expressibility are reserved into a subset, as the bestgenes identified from the current individuals by replacing the weakergenes from the subset. This subset is used by the next generated individualto improve the solution vector and to retain the best genes of theindividuals. Each gene’s positional point and its genotype exposurefor each region in an environment are used to fit the best uniquegenes. Further, suppression of expression in conflicting gene’srelies on the requirement toward the level of exposure in the environmentor in eliminating the duplicate genes from the environment. The MKPbenchmark instances from the OR-library are taken for the experimentto test the new model. The outcome portrays that GPS in a classicalGA is superior in most of the cases compared to the other existingrepairing techniques.

References | Supplementary Material | Related Articles | Metrics
Distributed learning particle swarm optimizerfor global optimization of multimodal problems
Geng ZHANG, Yangmin LI, Yuhui SHI
Front. Comput. Sci.. 2018, 12 (1): 122-134.
Abstract   PDF (598KB)

Particle swarm optimizer (PSO) is an effective tool for solvingmany optimization problems. However, it may easily get trapped intolocal optimumwhen solving complex multimodal nonseparable problems.This paper presents a novel algorithm called distributed learningparticle swarm optimizer (DLPSO) to solve multimodal nonseparableproblems. The strategy for DLPSO is to extract good vector informationfrom local vectors which are distributed around the search space andthen to form a new vector which can jump out of local optima and willbe optimized further. Experimental studies on a set of test functionsshow that DLPSO exhibits better performance in solving optimizationproblems with few interactions between variables than several otherpeer algorithms.

References | Supplementary Material | Related Articles | Metrics
A multi-level approach to highly efficient recognitionof Chinese spam short messages
Weimin WANG, Dan ZHOU
Front. Comput. Sci.. 2018, 12 (1): 135-145.
Abstract   PDF (589KB)

The problem of spam short message (SMS) recognition involvesmany aspects of natural language processing. A good solution to solvingthe problem can not only improve the quality of people experiencingthe mobile life, but also has a positive role on promoting the analysisof short text occurring in current mobile applications, such as Webchatand microblog. As spam SMSes have characteristics of sparsity, transformationand real-timedness, we propose three methods at different levels,i.e., recognition based on symbolic features, recognition based ontext similarity, and recognition based on pattern matching. By combiningthese methods, we obtain a multi-level approach to spam SMS recognition.In order to enrich the pattern base to reduce manual labor and time,we propose a quasi-pattern learning method, which utilizes quasi-patternmatching results in the pattern matching process. Themethod can learnmanyinteresting and new patterns from the SMS corpus. Finally, a comprehensiveanalysis indicates that our spam SMS recognition approach achievesa precision rate as high as 95.18%, and a recall rate of 95.51%.

References | Supplementary Material | Related Articles | Metrics
Handling query skew in large indexes: a viewbased approach
Weihuang HUANG, JeffreyXu YU, Zechao SHANG
Front. Comput. Sci.. 2018, 12 (1): 146-162.
Abstract   PDF (1136KB)

Indexing is one of the most important techniques to facilitatequery processing over a multi-dimensional dataset. A commonly usedstrategy for such indexing is to keep the tree-structured index balanced.This strategy reduces query processing cost in the worst case, andcan handle all different queries equally well. In other words, thisstrategy implies that all queries are uniformly issued, which is partiallybecause the query distribution is not possibly known and will changeover time in practice. A key issue we study in this work is whetherit is the best to fully rely on a balanced tree-structured index inparticular when datasets become larger and larger in the big dataera. This means that, when a dataset becomes very large, it becomesunreasonable to assume that all data in any subspace are equally importantand are uniformly accessed by all queries at the index level. Giventhe existence of query skew and the possible changes of query skew,in this paper, we study how to handle such query skew and such queryskew changes at the index level without sacrifice of supporting anypossible queries in a wellbalanced tree index and without a high overhead.To tackle the issue, we propose index-view at the index level, wherean index-view is a short-cut in a balanced tree-structured index toaccess objects in the subspaces that are more frequently accessed,and propose a new index-view-centric framework for query processingusing index-views in a bottom-up manner. We study index-views selectionproblem in both static and dynamic setting, and we confirm the effectivenessof our approach using large real and synthetic datasets.

References | Supplementary Material | Related Articles | Metrics
Strength Pareto fitness assignment for pseudo-relevancefeedback: application to MEDLINE
Front. Comput. Sci.. 2018, 12 (1): 163-176.
Abstract   PDF (354KB)

Because of users’ growing utilization of unclear and imprecisekeywords when characterizing their information need, it has becomenecessary to expand their original search queries with additionalwords that best capture their actual intent. The selection of theterms that are suitable for use as additional words is in generaldependent on the degree of relatedness between each candidate expansionterm and the query keywords. In this paper, we propose two criteriafor evaluating the degree of relatedness between a candidate expansionword and the query keywords: (1) co-occurrence frequency, where moreimportance is attributed to terms occurring in the largest possiblenumber of documents where the query keywords appear; (2) proximity,where more importance is assigned to terms having a short distancefrom the query terms within documents. We also employ the strengthPareto fitness assignment in order to satisfy both criteria simultaneously.The results of our numerical experiments on MEDLINE, the online medicalinformation database, show that the proposed approach significantlyenhances the retrieval performance as compared to the baseline.

References | Supplementary Material | Related Articles | Metrics
Efficient identity-based threshold decryptionscheme from bilinear pairings
Wei GAO, Guilin WANG, Kefei CHEN, Xueli WANG
Front. Comput. Sci.. 2018, 12 (1): 177-189.
Abstract   PDF (350KB)

Using Shamir’s secret sharing scheme to indirectly sharethe identity-based private key in the form of a pairing group element,we propose an efficient identity-based threshold decryption schemefrom pairings and prove its security in the random oracle model. Thisnew paring-based scheme features a few improvements compared withother schemes in the literature. The two most noticeable featuresare its efficiency, by drastically reducing the number of pairingcomputations, and the ability it gives the user to share the identity-basedprivate key without requiring any access to a private key generator.With the ability it gives the user to share the identity-based privatekey, our ID-based threshold decryption (IBTD) scheme, the second ofits kind, is significantly more efficient than the first scheme, whichwas developed by Baek and Zheng, at the expense of a slightly increasedciphertext length. In fact, our IBTD scheme tries to use as few bilinearpairings as possible, especially without depending on the suite ofBaek–Zheng secret sharing tools based on pairings.

References | Supplementary Material | Related Articles | Metrics
13 articles