As the complexity of software systems is increasing; software maintenance is becoming a challenge for software practitioners. The prediction of classes that require high maintainability effort is of utmost necessity to develop cost-effective and high-quality software. In research of software engineering predictive modeling, various software maintainability prediction (SMP) models are evolved to forecast maintainability. To develop a maintainability prediction model, software practitioners may come across situations in which classes or modules requiring high maintainability effort are far less than those requiring low maintainability effort. This condition gives rise to a class imbalance problem (CIP). In this situation, the minority classes’ prediction, i.e., the classes demanding high maintainability effort, is a challenge. Therefore, in this direction, this study investigates three techniques for handling the CIP on ten open-source software to predict software maintainability. This empirical investigation supports the use of resampling with replacement technique (RR) for treating CIP and develop useful models for SMP.
Coverage based fuzzing is a widespread vulnerability detection technique, and it has exposed many bugs in many real-world programs. However, its attention is to eliminate the testing on the repeated paths, yet it still employs random mutation to generate inputs, which is blind to penetrate complex comparisons in the program. As a result, the testing coverage is limited. Despite some solution proposals are presented, this problem is still partially solved. This paper argues that random mutation is mainly limited by two challenges, the sizable search spaceand the lack of a useful feedback to direct the search. Then we present an augmented fuzzing technique by addressing these two challenges. First of all, we point out a black relationship between input contents and comparison operands, which is dubbed connection. Second, we present a novel method to collect the comparison operands during execution, which is leveraged to infer the connections. Based on the connections, the fuzzer can learn about which input byte affects on which comparison instruction to establish a smaller search space. Third, the connection provides a useful feedback to direct the search. We resort to a modern meta-heuristic algorithm to satisfy this searching requirement. We developed a prototype
A sequent is a pair
Understanding the influencing mechanism of the urban streetscape on crime is fairly important to crime prevention and urban management. Recently, the development of deep learning technology and big data of street view images, makes it possible to quantitatively explore the relationship between streetscape and crime. This study computed eight streetscape indexes of the street built environment using Google Street View images firstly. Then, the association between the eight indexes and recorded crime events was revealed with a poisson regression model and a geographically weighted poisson regression model. An experiment was conducted in downtown and uptown Manhattan, New York. Global regression results show that the influences of Motorization Index on crimes are significant and positive, while the effects of the Light View Index and Green View Index on crimes depend heavily on the socio-economic factors. From a local perspective, the Pedestrian Space Index, Green View Index, Light View IndexandMotorization Index have a significant spatial influence on crimes, while the same visual streetscape factors have different effects on different streets due to the combination differences of socio-economic, cultural and streetscape elements. The key streetscape elements of a given street that affect a specific criminal activity can be identified according to the strength of the association. The results provide both theoretical and practical implications for crime theories and crime prevention efforts.
Temporal action proposal generation aims to output the starting and ending times of each potential action for long videos and often suffers from high computation cost. To address the issue, we propose a new temporal convolution network called Multipath Temporal ConvNet (MTCN). In our work, one novel high performance ring parallel architecture based is further introduced into temporal action proposal generation in order to respond to the requirements of large memory occupation and a large number of videos. Remarkably, the total data transmission is reduced by adding a connection between multiplecomputing load in the newly developed architecture. Compared to the traditional Parameter Server architecture, our parallel architecture has higher efficiency on temporal action detection tasks with multiple GPUs. We conduct experiments on ActivityNet-1.3 and THUMOS14, where our method outperformsother state-of-art temporal action detection methods with high recall and high temporal precision. In addition, a time metric is further proposed here to evaluate the speed performancein the distributed training process.
Due to the lack of parallel data in current grammatical error correction (GEC) task, models based on sequence to sequence framework cannot be adequately trained to obtain higher performance. We propose two data synthesis methods which can control the error rate and the ratio of error types on synthetic data. The first approach is to corrupt each word in the monolingual corpus with a fixed probability, including replacement, insertion and deletion. Another approach is to train error generation models and further filtering the decoding results of the models. The experiments on different synthetic data show that the error rate is 40% and that the ratio of error types is the same can improve the model performance better. Finally, we synthesize about 100 million data and achieve comparable performance as the state of the art, which uses twice as much data as we use.
Parameter server (PS) as the state-of-the-art distributed framework for large-scale iterative machine learning tasks has been extensively studied. However, existing PS-based systems often depend on memory implementations. With memory constraints, machine learning (ML) developers cannot train large-scale ML models in their rather small local clusters. Moreover, renting large-scale cloud servers is always economically infeasible for research teams and small companies. In this paper, we propose a disk-resident parameter server system named DRPS, which reduces the hardware requirement of large-scale machine learning tasks by storing high dimensional models on disk. To further improve the performance of DRPS, we build an efficient index structure for parameters to reduce the disk I/O cost. Based on this index structure, we propose a novel multi-objective partitioning algorithm for the parameters. Finally, a flexible workerselection parallel model of computation (WSP) is proposed to strike a right balance between the problem of inconsistent parameter versions (staleness) and that of inconsistent execution progresses (straggler). Extensive experiments on many typical machine learning applications with real and synthetic datasets validate the effectiveness of DRPS.
There are many new and potential drug targets in G protein-coupled receptors (GPCRs) without sufficient ligand associations, and accurately predicting and interpreting ligand bioactivities is vital for screening and optimizing hit compounds targeting these GPCRs. To efficiently address the lack of labeled training samples, we proposed a multi-task regression learning with incoherent sparse and low-rank patterns (MTR-ISLR) to model ligand bioactivities and identify their key substructures associated with these GPCRs targets. That is, MTR-ISLR intends to enhance the performance and interpretability of models under a small size of available training data by introducing homologous GPCR tasks. Meanwhile, the low-rank constraint term encourages to catch the underlying relationship among homologous GPCR tasks for greater model generalization, and the entry-wise sparse regularization term ensures to recognize essential discriminative substructures from each task for explanative modeling. We examined MTR-ISLR on a set of 31 important human GPCRs datasets from 9 subfamilies, each with less than 400 ligand associations. The results show that MTR-ISLR reaches better performance when compared with traditional single-task learning, deep multi-task learning and multi-task learning with joint feature learning-based models on most cases, where MTR-ISLR obtains an average improvement of 7% in correlation coefficient (r2) and 12% in root mean square error (RMSE) against the runner-up predictors. The MTR-ISLR web server appends freely all source codes and data for academic usages. ①
A promoter is a short region of DNA that can bind RNA polymerase and initiate gene transcription. It is usually located directly upstream of the transcription initiation site. DNA promoters have been proven to be the main cause of many human diseases, especially diabetes, cancer or Huntington’s disease. Therefore, the classification of promoters has become an interesting problem and has attracted the attention of many researchers in the field of bioinformatics. Various studies have been conducted in order to solve this problem, but their performance still needs further improvement. In this research, we segmented the DNA sequence in a k-mers manner, then trained the word vector model, inputted it into long short-term memory(LSTM) and used the attention mechanism to predict. Our method can achieve 93.45% and 90.59% cross-validation accuracy in the two layers, respectively. Our results are better than others based on the same data set, and provided some ideas for accurately predicting promoters. In addition, this research suggested that natural language processing can play a significant role in biological sequence prediction.
Deciding bisimulation equivalence of two normed pushdown automata is one of the most fundamental problems in formal verification. The problem is proven to be ACKERMANN-complete recently. Both the upper bound and the lower bound results indicate that the number of control states is an important parameter. In this paper, we study the parametric complexity of this problem. We refine previous results in two aspects. First, we prove that the bisimulation equivalence of normed PDA with two states is EXPTIME-hard. Second, we prove that the bisimulation equivalence of normed PDA with
Blockchain platform Ethereum has involved millions of accounts due to its strong potential for providing numerous services based on smart contracts. These massive accounts can be divided into diverse categories, such as miners, tokens, and exchanges, which is termed as account diversity in this paper. The benefit of investigating diversity are multi-fold, including understanding the Ethereum ecosystem deeper and opening the possibility of tracking certain abnormal activities. Unfortunately, the exploration of blockchain account diversity remains scarce. Even the most relevant studies, which focus on the deanonymization of the accounts on Bitcoin, can hardly be applied on Ethereum since their underlying protocols and user idioms are different. To this end, we present the first attempt to demystify the account diversity on Ethereum. The key observation is that different accounts exhibit diverse behavior patterns, leading us to propose the heuristics for classification as the premise. We then raise the coverage rate of classification by the statistical learning model Maximum Likelihood Estimation (MLE). We collect real-world data through extensive efforts to evaluate our proposed method and show its effectiveness. Furthermore, we make an in-depth analysis of the dynamic evolution of the Ethereum ecosystem and uncover the abnormal arbitrage actions. As for the former, we validate two sweeping statements reliably: (1) standalone miners are gradually replaced by the mining pools and cooperative miners; (2) transactions related to the mining pool and exchanges take up a large share of the total transactions. The latter analysis shows that there are a large number of arbitrage transactions transferring the coins from one exchange to another to make a price difference.
Image forgery detection remains a challenging problem. For the most common copy-move forgery detection, the robustness and accuracy of existing methods can still be further improved. To the best of our knowledge, we are the first to propose an image copy-move forgery passive detection method by combining the improved pulse coupled neural network (PCNN) and the self-selected sub-images. Our method has the following steps: First, contour detection is performed on the input color image, and bounding boxes are drawn to frame the contours to form suspected forgery sub-images. Second, by improving PCNN to perform feature extraction of sub-images, the feature invariance of rotation, scaling, noise adding, and so on can be achieved. Finally, the dual feature matching is used to match the features and locate the forgery regions. What's more, the self-selected sub-images can quickly obtain suspected forgery sub-images and lessen the workload of feature extraction, and the improved PCNN can extract image features with high robustness. Through experiments on the standard image forgery datasets CoMoFoD and CASIA, it is effectively verified that the robustness score and accuracy of proposed method are much higher than the current best method, which is a more efficient image copy-move forgery passive detection method.
Nighttime image dehazing aims to remove the effect of haze on the images captured in nighttime, which however, raises new challenges such as severe color distortion, more complex lighting conditions, and lower contrast. Instead of estimating the transmission map and atmospheric light that are difficult to be accurately acquired in nighttime, we propose a nighttime image dehazing method composed of a color cast removal and a dual path multi-scale fusion algorithm. We first propose a human visual system (HVS) inspired color correction model, which is effective for removing the color deviation on nighttime hazy images. Then, we propose to use dual path strategy that includes an underexposure and a contrast enhancement path for multi-scale fusion, where the weight maps are achieved by selecting appropriate exposed areas under Gaussian pyramids. Extensive experiments demonstrate that the visual effect of the hazy nighttime images in real-world datasets can be significantly improved by our method regarding contrast, color fidelity, and visibility. In addition, our method outperforms the state-of-the-art methods qualitatively and quantitatively.
Despite Kerckhoff's principle, there are secret ciphers with unknown components for diplomatic or military usages. The side-channel analysis of reverse engineering (SCARE) is developed for analyzing secret ciphers. Considering the side-channel leakage, SCARE attacks enable the recovery of some secret parts of a cryptosystem, e.g., the substitution box table. However, based on idealized leakage assumption, most of these attacks have a few limitations on prior knowledge or implementations. In this paper, we focus on AES-like block ciphers with a secret S-box and demonstrate an attack which recovers both the secret key and the secret S-box. On the one hand, the key is recovered under profiled circumstance by leakage analysis and collision attack. On the other hand, the SCARE attack is based on mathematical analysis. It relies on Hamming weight of MixColumns intermediate results in the first round, which can restore the secret S-box. Experiments are performed on real power traces from a software implementation of AES-like block cipher. Moreover, we evaluate the soundness and efficiency of our method by simulations and compare with previous approaches. Our method has more advantages in intermediate results location and the required number of traces. For simulated traces with gaussian noise, our method requires 100000 traces to fully restore the secret S-box, while the previous method requires nearly 300000 traces to restore S-box.
Because of its closeness to users, fog computing responds faster than cloud computing. Thus, it has been deployed to various applications, such as healthcare system. Recently, to ensure the secure communication of the fog-based healthcare system, Jia et al. proposed an authenticated key agreement scheme. Moreover, in view of the high computation cost existing in Jia et al.’s scheme, Ma et al. presented an efficient one using elliptic curve cryptography. In this paper, we observe that both the two schemes may potentially risk ephemeral key compromise attacks and need improving. Therefore, to overcome this potential risk, we propose a new authenticated scheme based on Jia et al.’s scheme using elliptic curve computational Diffie-Hellman hypothesis and hash functions. Additionally, we provide provable security under the adopted adversarial model and ProVerif simulation, and also analyze the performance in terms of computation and communication costs by comparisons. The analysis results show that the improved scheme resists the common attacks, reduces computation overhead, and has a certain significance.
Currently, security-critical server programs are well protected by various defense techniques, such as Address Space Layout Randomization(ASLR), eXecute Only Memory(XOM), and Data Execution Prevention(DEP), against modern code-reuse attacks like Return-oriented Programming(ROP) attacks. Moreover, in these victim programs, most syscall instructions lack the following ret instructions, which prevents attacks to stitch multiple system calls to implement advanced behaviors like launching a remote shell. Lacking this kind of gadget greatly constrains the capability of code-reuse attacks. This paper proposes a novel code-reuse attack method called Signal Enhanced Blind Return Oriented Programming(SeBROP) to address these challenges. Our SeBROP can initiate a successful exploit to server-side programs using only a stack overflow vulnerability. By leveraging a side-channel that exists in the victim program, we show how to find a variety of gadgets blindly without any pre-knowledges or reading/disassembling the code segment. Then, we propose a technique that exploits the current vulnerable signal checking mechanism to realize the execution flow control even when ret instructions are absent. Our technique can stitch a number of system calls without returns, which is more superior to conventional ROP attacks. Finally, the SeBROP attack precisely identifies many useful gadgets to constitute a Turing-complete set. SeBROP attack can defeat almost all state-of-the-art defense techniques. The SeBROP attack is compatible with both modern 64-bit and 32-bit systems. To validate its effectiveness, We craft three exploits of the SeBROP attack for three real-world applications, i.e., 32-bit Apache 1.3.49, 32-bit ProFTPD 1.3.0, and 64-bit Nginx 1.4.0. Experimental results demonstrate that the SeBROP attack can successfully spawn a remote shell on Nginx, ProFTPD, and Apache with less than 8500/4300/2100 requests, respectively.
The packing of genomic DNA from double helix into highly-order hierarchical assemblies has a great impact on chromosome flexibility, dynamics and functions. The open and accessible regions of chromosomes are primary binding positions for regulatory elements and are crucial to nuclear processes and biological functions. Motivated by the success of flexibility-rigidity index (FRI) in biomolecular flexibility analysis and drug design, we propose an FRI-based model for quantitatively characterizing chromosome flexibility. Based on Hi-C data, a flexibility index for each locus can be evaluated. Physically, flexibility is tightly related to packing density. Highly compacted regions are usually more rigid, while loosely packed regions are more flexible. Indeed, a strong correlation is found between our flexibility index and DNase and ATAC values, which are measurements for chromosome accessibility. In addition, the genome regions with higher chromosome flexibility have a higher chance to be bound by transcription factors. Recently, the Gaussian network model (GNM) is applied to analyze the chromosome accessibility and a mobility profile has been proposed to characterize chromosome flexibility. Compared with GNM, our FRI is slightly more accurate (1% to 2% increase) and significantly more efficient in both computational time and costs. For a 5Kb resolution Hi-C data, the flexibility evaluation process only takes FRI a few minutes on a single-core processor. In contrast, GNM requires 1.5 hours on 10 CPUs. Moreover, interchromosome interactions can be easily combined into the flexibility evaluation, thus further enhancing the accuracy of our FRI. In contrast, the consideration of interchromosome information into GNM will significantly increase the size of its Laplacian (or Kirchhoff) matrix, thus becoming computationally extremely challenging for the current GNM. The software and supplementary document are available at