MapReduce, a parallel computational model, has been widely used in processing big data in a distributed cluster. Consisting of alternate map and reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge of ensuring balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on mapped data.
In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers the statistics on their sizes in the process of mapping. The micropartitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps, micro-partition scheduling and micro-partition allocation. We propose a Markov decision process (MDP) model to optimize the problem of multiple-round micropartition scheduling for allocation commitment. We present an optimal solution with the time complexity of O(K · N2), in which K represents the number of allocation rounds and N represents the number of micro-partitions. Alternatively, we also present a greedy but more efficient algorithm with the time complexity of O(K · N ln N). Then, we propose a minmax programming model to handle the allocation mapping between micro-partitions and reducers, and present an effective heuristic solution due to its NP-completeness. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that compared with the state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as overall better parallel performance.
AADL (architecture analysis and design language) concentrates on the modeling and analysis of application system architectures. It is quite popular for its simple syntax, powerful functionality and extensibility and has been widely applied in embedded systems for its advantage. However, it is not enough for AADL to model cyber-physical systems (CPS) mainly because it cannot be used to model the continuous dynamic behaviors. This paper proposes an approach to construct a new sublanguage of AADL called AADL+, to facilitate the modeling of not only the discrete and continuous behavior of CPS, but also interaction between cyber components and physical components. The syntax and semantics of the sublanguage are provided to describe the behaviors of the systems. What’s more, we develop a plug-in to OSATE (open-source AADL tool environment) for the modeling of CPS. And the plug-in supports syntax checking and simulation of the system model through linking with modelica. Finally, the AADL+ annex is successfully applied to model a lunar rover control system.
The problem of limited minority class data is encountered in many class imbalanced applications, but has received little attention. Synthetic over-sampling, as popular class-imbalance learning methods, could introduce much noise when minority class has limited data since the synthetic samples are not i.i.d. samples of minority class. Most sophisticated synthetic sampling methods tackle this problem by denoising or generating samples more consistent with ground-truth data distribution. But their assumptions about true noise or ground-truth data distribution may not hold. To adapt synthetic sampling to the problem of limited minority class data, the proposed Traso framework treats synthetic minority class samples as an additional data source, and exploits transfer learning to transfer knowledge from them to minority class. As an implementation, TrasoBoost method firstly generates synthetic samples to balance class sizes. Then in each boosting iteration, the weights of synthetic samples and original data decrease and increase respectively when being misclassified, and remain unchanged otherwise. The misclassified synthetic samples are potential noise, and thus have smaller influence in the following iterations. Besides, the weights of minority class instances have greater change than those of majority class instances to be more influential. And only original data are used to estimate error rate to be immune from noise. Finally, since the synthetic samples are highly related to minority class, all of the weak learners are aggregated for prediction. Experimental results show TrasoBoost outperforms many popular class-imbalance learning methods.
Performance variability, stemming from nondeterministic hardware and software behaviors or deterministic behaviors such as measurement bias, is a well-known phenomenon of computer systems which increases the difficulty of comparing computer performance metrics and is slated to become even more of a concern as interest in Big Data analytic increases. Conventional methods use various measures (such as geometric mean) to quantify the performance of different benchmarks to compare computers without considering this variability which may lead to wrong conclusions. In this paper, we propose three resampling methods for performance evaluation and comparison: a randomization test for a general performance comparison between two computers, bootstrapping confidence estimation, and an empirical distribution and five-number-summary for performance evaluation. The results show that for both PARSEC and highvariance BigDataBench benchmarks 1) the randomization test substantially improves our chance to identify the difference between performance comparisons when the difference is not large; 2) bootstrapping confidence estimation provides an accurate confidence interval for the performance comparison measure (e.g., ratio of geometric means); and 3) when the difference is very small, a single test is often not enough to reveal the nature of the computer performance due to the variability of computer systems.We further propose using empirical distribution to evaluate computer performance and a five-number-summary to summarize computer performance. We use published SPEC 2006 results to investigate the sources of performance variation by predicting performance and relative variation for 8,236 machines. We achieve a correlation of predicted performances of 0.992 and a correlation of predicted and measured relative variation of 0.5. Finally, we propose the utilization of a novel biplotting technique to visualize the effectiveness of benchmarks and cluster machines by behavior. We illustrate the results and conclusion through detailed Monte Carlo simulation studies and real examples.
In crowdsourced mobile application testing, workers are often inexperienced in and unfamiliar with software testing. Meanwhile, workers edit test reports in descriptive natural language on mobile devices. Thus, these test reports generally lack important details and challenge developers in understanding the bugs. To improve the quality of inspected test reports, we issue a new problem of test report augmentation by leveraging the additional useful information contained in duplicate test reports. In this paper, we propose a new framework named test report augmentation framework (TRAF) towards resolving the problem. First, natural language processing (NLP) techniques are adopted to preprocess the crowdsourced test reports. Then, three strategies are proposed to augment the environments, inputs, and descriptions of the inspected test reports, respectively. Finally, we visualize the augmented test reports to help developers distinguish the added information. To evaluate TRAF, we conduct experiments over five industrial datasets with 757 crowdsourced test reports. Experimental results show that TRAF can recommend relevant inputs to augment the inspected test reports with 98.49% in terms of NDCG and 88.65% in terms of precision on average, and identify valuable sentences from the descriptions of duplicates to augment the inspected test reports with 83.58% in terms of precision, 77.76% in terms of recall, and 78.72% in terms of F-measure on average. Meanwhile, empirical evaluation also demonstrates that augmented test reports can help developers understand and fix bugs better.
Most traditional collaborative filtering (CF) methods only use the user-item rating matrix to make recommendations, which usually suffer from cold-start and sparsity problems. To address these problems, on the one hand, some CF methods are proposed to incorporate auxiliary information such as user/item profiles; on the other hand, deep neural networks, which have powerful ability in learning effective representations, have achieved great success in recommender systems. However, these neural network based recommendation methods rarely consider the uncertainty of weights in the network and only obtain point estimates of the weights. Therefore, they maybe lack of calibrated probabilistic predictions and make overly confident decisions. To this end, we propose a new Bayesian dual neural network framework, named BDNet, to incorporate auxiliary information for recommendation. Specifically, we design two neural networks, one is to learn a common low dimensional space for users and items from the rating matrix, and another one is to project the attributes of users and items into another shared latent space. After that, the outputs of these two neural networks are combined to produce the final prediction. Furthermore, we introduce the uncertainty to all weights which are represented by probability distributions in our neural networks to make calibrated probabilistic predictions. Extensive experiments on real-world data sets are conducted to demonstrate the superiority of our model over various kinds of competitors.
In diverse and self-governed multiple clouds context, the service management and discovery are greatly challenged by the dynamic and evolving features of services. How to manage the features of cloud services and support accurate and efficient service discovery has become an open problem in the area of cloud computing. This paper proposes a field model of multiple cloud services and corresponding service discovery method to address the issue. Different from existing researches, our approach is inspired by Bohr atom model. We use the abstraction of energy level and jumping mechanism to describe services status and variations, and thereby to support the service demarcation and discovery. The contributions of this paper are threefold. First, we propose the abstraction of service energy level to represent the status of services, and service jumping mechanism to investigate the dynamic and evolving features as the variations and re-demarcation of cloud services according to their energy levels. Second, we present user acceptable service region to describe the services satisfying users’ requests and corresponding service discovery method, which can significantly decrease services search scope and improve the speed and precision of service discovery. Third, a series of algorithms are designed to implement the generation of field model, user acceptable service regions, service jumping mechanism, and user-oriented service discovery.We have conducted an extensive experiments on QWS dataset to validate and evaluate our proposed models and algorithms. The results show that field model can well support the representation of dynamic and evolving aspects of services in multiple clouds context and the algorithms can improve the accuracy and efficiency of service discovery.
Redundant array of independent SSDs (RAIS) is generally based on the traditional RAID design and implementation. The random small write problem is a serious challenge of RAIS. Random small writes in parity-based RAIS systems generate significantly more pre-reads and writes which can degrade RAIS performance and shorten SSD lifetime. In order to overcome the well-known write-penalty problem in the parity-based RAID5 storage systems, several logging techniques such as Parity Logging and Data Logging have been put forward. However, these techniques are originally based on mechanical characteristics of the HDDs, which ignore the properties of the flash memory.
In this article, we firstly propose RAISL, a flash-aware logging method that improves the small write performance of RAIS storage systems. RAISL writes new data instead of new data and pre-read data to the log SSD by making full use of the invalid pages on the SSD of RAIS. RAISL does not need to perform the pre-read operations so that the original characteristics of workloads are kept. Secondly, we propose AGCRL on the basis of RAISL to further boost performance. AGCRL combines RAISL with access characteristic to guide read and write cost regulation to improve the performance of RAIS storage systems. Our experiments demonstrate that the RAISL significantly improves write performance and AGCRL improves both of write performance and read performance. AGCRL on average outperforms RAIS5 and RAISL by 39.15% and 16.59% respectively.
Multi-label learning is more complicated than single-label learning since the semantics of the instances are usually overlapped and not identical. The effectiveness of many algorithms often fails when the correlations in the feature and label space are not fully exploited. To this end, we propose a novel non-negative matrix factorization (NMF) based modeling and training algorithm that learns from both the adjacencies of the instances and the labels of the training set. In the modeling process, a set of generators are constructed, and the associations among generators, instances, and labels are set up, with which the label prediction is conducted. In the training process, the parameters involved in the process of modeling are determined. Specifically, an NMF based algorithm is proposed to determine the associations between generators and instances, and a non-negative least square optimization algorithm is applied to determine the associations between generators and labels. The proposed algorithm fully takes the advantage of smoothness assumption, so that the labels are properly propagated. The experimentswere carried out on six set of benchmarks. The results demonstrate the effectiveness of the proposed algorithms.
The propagation of information in online social networks plays a critical role in modern life, and thus has been studied broadly. Researchers have proposed a series of propagation models, generally, which use a single transition probability or consider factors such as content and time to describe the way how a user activates her/his neighbors. However, the research on the mechanism how social ties between users play roles in propagation process is still limited. Specifically, comprehensive summary of factors which affect user’s decision whether to share neighbor’s content was lacked in existing works, so that the existing models failed to clearly describe the process a user be activated by a neighbor. To this end, in this paper, we analyze the close correspondence between social tie in propagation process and communication channel, thus we propose to exploit the communication channel to describe the information propagation process between users, and design a social tie channel (STC) model. The model can naturally incorporate many factors affecting the information propagation through edges such as content topic and user preference, and thus can effectively capture the user behavior and relationship characteristics which indicate the property of a social tie. Extensive experiments conducted on two real-world datasets demonstrate the effectiveness of our model on content sharing prediction between users.