RESEARCH ARTICLE

Pusher: an augmented fuzzer based on the connection between input and comparison operand

  • Bin ZHANG ,
  • Jiaxi YE ,
  • Ruilin LI ,
  • Chao FENG ,
  • Yunfei SU ,
  • Chaojing TANG
  • College of Electronic Science, National University of Defense Technology, Changsha 410072, China

Received date: 21 Feb 2020

Accepted date: 06 Jan 2021

Published date: 15 Aug 2022

Copyright

2022 Higher Education Press

Abstract

Coverage-based fuzzing is a widespread vulnerability detection technique, and it has exposed many bugs in real-world programs. However, it focuses on eliminating tests of repeated paths while still employing random mutation to generate inputs, which is blind when it comes to penetrating complex comparisons in the program. As a result, the testing coverage is limited. Although several solutions have been proposed, this problem is only partially solved. This paper argues that random mutation is mainly limited by two challenges: the sizable search space and the lack of useful feedback to direct the search. We then present an augmented fuzzing technique that addresses these two challenges. First, we point out an implicit, black-box relationship between input contents and comparison operands, which we dub a connection. Second, we present a novel method to collect the comparison operands during execution, which is leveraged to infer the connections. Based on the connections, the fuzzer learns which input byte affects which comparison instruction and thereby establishes a smaller search space. Third, the connection provides useful feedback to direct the search; we resort to a modern meta-heuristic algorithm to satisfy this searching requirement. We developed a prototype, Pusher, and evaluated its performance on several benchmarks and four real-world programs. The experimental results demonstrate that Pusher works better than several other state-of-the-art fuzzers on bug detection and achieves higher testing coverage. Moreover, detailed statistics about the execution overhead in Pusher indicate that the overhead introduced by our approach is within an acceptable scope.

Cite this article

Bin ZHANG , Jiaxi YE , Ruilin LI , Chao FENG , Yunfei SU , Chaojing TANG . Pusher: an augmented fuzzer based on the connection between input and comparison operand[J]. Frontiers of Computer Science, 2022 , 16(4) : 164206 . DOI: 10.1007/s11704-021-0075-8

1 Introduction

Fuzzing has been an extensively used technology for vulnerability discovery since it was proposed [1, 2], and it has proved more scalable on large real-world software than symbolic execution and taint analysis. Its fundamental idea is to run randomly or strategically generated inputs against the target program and monitor abnormal behaviors, such as crashes.
Coverage-based fuzzing is a remarkable innovation in the fuzzing community [3]. It constructs a testing loop in which new inputs are continuously generated by mutating seeds from a seed pool. It then evaluates the coverage of each input through lightweight instrumentation, and only the inputs hitting new coverage are added to the seed pool. With this testing loop, the engine is skilled at eliminating inputs on duplicated paths and can discover bugs faster than traditional blind fuzzing. The state-of-the-art coverage-based fuzzers, such as AFL (proposed by Michał Zalewski), LibFuzzer in LLVM, and Honggfuzz (from Google), have found many bugs in real-world programs [4].
However, since fuzzers have no prior knowledge of the target program, they have to produce testing inputs by random mutation. As a result, these fuzzers are weak at penetrating deeper code areas that are guarded by complex constraints, which reduces the bug discovery efficiency of coverage-based fuzzers.
To address this challenge, symbolic execution is introduced to bypass such complex constraints. It solves the path constraints via SMT solvers (such as Z3 [5]) to generate the corresponding inputs that exercise the new paths [6–8]. The core idea behind such hybrid testing methods is promising yet currently limited, mainly because symbolic execution still has unsolved challenges of its own, such as state explosion [9, 10].
Taint analysis [11–13] is also leveraged to handle the limitation of random mutation. These techniques recognize the input contents that can affect the program execution, and apply various strategies to these input contents to achieve better path exploration. However, deploying taint analysis still requires more conditions than applying fuzzing. For example, taint analysis needs to identify and understand the instructions, and DataFlowSanitizer requires extra descriptions for external libraries [14]. In a nutshell, their practicability on large real-world programs is also limited.
To avoid the disadvantages introduced by symbolic execution and taint analysis, some lightweight techniques have been presented to address the blindness of random mutation, such as program-state aware techniques [15, 16] and statistical analysis based techniques [17, 18]. For example, a Markov-chain based power schedule is proposed [17] to gravitate the testing energy toward inputs that execute low-frequency paths. A targeted strategy is put forward [18] to test rare branches more. A program-state based approach is leveraged to apply adaptive mutation [15]. However, these lightweight techniques only partially relieve the limitation of random mutation, and it warrants further research.
Among the related technologies, improving fuzzing performance via lightweight search-based input generation is promising [19]. The core idea is to formulate the testing process as an optimization problem and leverage searching algorithms to tackle it [20]. For example, EvoSuite [21] adopts a hybrid searching approach to steer whole test suites toward satisfying more comparisons.
Challenges Two significant challenges need to be addressed to make such search-based fuzzing techniques more efficient. The first is that the search for the target inputs is usually typified by a sizable search space, which may introduce serious overhead and make the testing ineffective. The second is that most searching algorithms depend heavily on feedback to direct the search.
This paper presents a lightweight search-based approach to augment fuzzing by addressing the two challenges mentioned above. First, we observe that the comparison operands are assigned concrete values when the program executes on an input, i.e., there is an implicit, black-box relationship between the input contents and the comparison operands; we dub this relationship a connection, and our approach is principally based on these connections. Second, we design a novel method to obtain the connections during execution. A connection inference is then presented to recognize which input bytes can affect which comparison operands, which finally locates a smaller search space and addresses the first challenge. Third, these connections offer feedback indicating the difference between the current input and the expected input (one satisfying the comparison), so an off-the-shelf algorithm can be used to search for such inputs.
During implementation, the proposed techniques are integrated with the original coverage-based fuzzing, acting as a search-based module that helps the engine achieve higher coverage. Many modern searching algorithms can be used in our approach, but this is not our emphasis in this paper; therefore, we simply select the vanilla genetic algorithm [22] as our searching engine.
We developed a prototype, Pusher, based on AFL, and evaluated its performance on LAVA-M [23], CGC binaries, and four real-world programs. On the LAVA-M database, Pusher beats the other nine state-of-the-art fuzzers on bug detection. On the CGC binaries, Pusher finds more bugs and achieves higher coverage than other fuzzers. On the four real-world programs, Pusher performs best according to the testing coverage. Besides, an in-depth analysis of the execution overhead demonstrates that our approach introduces only a small overhead and is scalable for testing real-world programs.
The rest of this paper is organized as follows. In Section 2, a motivation example and the framework of our approach are described. In Section 3, the techniques are depicted in detail. Section 4 reports the experimental evaluation results. Section 5 discusses the related work and Section 6 concludes the whole paper.

2 Overview

2.1 Motivation example

We use a motivating example to describe how our approach regulates the input contents and solves the comparisons.
Program Figure 1 is an example containing two representative complex path constraints. It reads contents from the input FILE *fp, and assigns values to the magic and result variables. Then, the program performs two checks on these variables:
Fig.1 An example containing two representative path constraints at lines 8 and 19


● C1: A magic number check at lines 3–9. The value of magic is taken directly from the 0th, 1st, 2nd, and 3rd bytes of the input.
● C2: A simple format check at lines 11–20. The value of result is returned by the check_format function; assume it is calculated from the 8th, 12th, 16th, and 20th bytes of the input fp.
An error will be raised for an input that satisfies either of the two checks.
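Since the figure itself cannot be reproduced here, the sketch below illustrates the shape of such a program. It is a minimal reconstruction from the description above: the byte offsets and the 0xdeadbeef constant follow the text, whereas the body of check_format and the C2 target value (0x42) are hypothetical placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for check_format(): derives a value from the
 * 8th, 12th, 16th, and 20th input bytes, as assumed in check C2. */
static unsigned int check_format(const unsigned char *buf) {
    return buf[8] ^ buf[12] ^ buf[16] ^ buf[20];
}

int main(int argc, char **argv) {
    unsigned char buf[64] = {0};
    unsigned int magic, result;

    if (argc < 2)
        return 1;
    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL || fread(buf, 1, sizeof(buf), fp) < 21)
        return 1;

    memcpy(&magic, buf, 4);          /* magic comes from bytes 0-3              */
    result = check_format(buf);      /* result depends on bytes 8, 12, 16, 20   */

    if (magic == 0xdeadbeef)         /* C1: magic number check                  */
        abort();                     /* error raised                            */
    if (result == 0x42)              /* C2: format check (target value assumed) */
        abort();                     /* error raised                            */
    return 0;
}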
Coverage-based fuzzing These two checks highlight the challenge of the random mutation used in coverage-based fuzzers. Because random mutation is unaware of the program, it has to randomly select input contents to mutate. As a result, it is hard to satisfy the C1 or C2 check. For example, to handle the C1 check, the random mutation would have to select the 0th, 1st, 2nd, and 3rd bytes for mutation, and the obtained input contents would need to become 0xdeadbeef by chance.
Pusher Our approach generates the inputs that can satisfy C1 and C2 checks via the following steps.
● The program executes on an input s, and the magic and result variables are assigned concrete values. We write V_m(s) for magic and V_c(s) for result. The combination (s, V_m(s), V_c(s)) is therefore a connection instance between the input contents and the comparison operands.
● The engine produces a new input s0 by mutating the 0th byte of s. The program executes on s0 and obtains a new connection instance (s0, V_m(s0), V_c(s0)). By comparing the two instances, the engine infers whether the 0th byte affects any comparison operand; in this case it finds that V_m(s0) differs from V_m(s), i.e., the 0th byte affects the magic variable.
● After traversing all the bytes of input s, the engine learns that the magic variable is affected by the 0th, 1st, 2nd, and 3rd bytes, and the result variable is based on the 8th, 12th, 16th, and 20th bytes (if the input is longer than 20 bytes). These recognized bytes locate a reduced space for the searching, addressing the challenge of searching in a sizable space.
● Finally, it adjusts the recognized bytes so that V_m equals 0xdeadbeef and V_c satisfies the C2 check, producing the corresponding inputs. In this process, the distance between V_m and 0xdeadbeef (or between V_c and its target value) acts as feedback to direct the searching.

2.2 The framework of our approach

This paper designs a new fuzzer, which integrates the original coverage-based testing with our connection-based searching. Figure 2 depicts the framework of our approach.
Fig.2 The workflow of our approach


We instrument the target source code by adding comparison-operand collection code and also generate static information for the search-based testing component at compile time. Coverage-based fuzzing is then used as one component of our fuzzer Pusher to quickly create a seed pool for the searching component.
In the connection-based searching, the engine runs the inputs from the seed pool to collect the comparison operands, which compose the connection instances between the input contents and comparison operands. Then, an inference component is leveraged to locate the bytes that can affect the comparison operands. Afterward, a heuristic searching engine is introduced to solve the unsatisfied comparisons and generate the target inputs by mutating the recognized bytes. This process is iterated until every input in the seed pool has been processed once.
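To make the workflow concrete, the control-flow sketch below outlines how the two components could interleave; all type and function names here are illustrative placeholders rather than the actual implementation.

#include <stddef.h>

typedef struct { unsigned char *data; size_t len; int searched; } seed_t;

/* Assumed components (Fig. 2): an AFL-style coverage loop and the two
 * stages of the connection-based searching from Sections 3.2 and 3.3. */
extern size_t coverage_fuzz_round(seed_t *pool, size_t n);
extern void   infer_connections_for(seed_t *s);
extern void   ga_search_for(seed_t *s, seed_t *pool);

void pusher_loop(seed_t *pool, size_t n) {
    for (;;) {
        /* 1. Coverage-based fuzzing keeps filling the seed pool. */
        n = coverage_fuzz_round(pool, n);

        /* 2. Connection-based searching processes each seed exactly once. */
        for (size_t i = 0; i < n; i++) {
            if (pool[i].searched)
                continue;
            infer_connections_for(&pool[i]);   /* collect operands, infer CI(s) */
            ga_search_for(&pool[i], pool);     /* solve unsatisfied comparisons */
            pool[i].searched = 1;
        }
    }
}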

3 Technology

In this section, the applied techniques are described in detail. In Section 3.1, a novel instrumentation method is presented to collect the connection instances between the input contents and the comparison operands. Then, Section 3.2 describes the process of how to make the connection inference to identify the input bytes that affect the comparison operands. In Section 3.3, we take the genetic algorithm [22] to search for the target inputs.

3.1 Connection instance collection

The cmp and test instructions in the binary embody the comparisons in the source code. Without loss of generality, we take the cmp instruction as representative of comparisons in this paper.
To discriminate the cmp instructions, we assign each of them a unique label as its identity, calculated from three related basic blocks: the block where the target cmp instruction resides and its two successor blocks according to the static control flow. Previous work has introduced a method to allocate a unique ID to each block, e.g., id_A is assigned to block A. Therefore, the cmp instruction labels can be determined based on these IDs.
Formally, the label calculation is shown in Eq. (1), where μ denotes the label of a cmp instruction, id_c is the ID of the basic block containing the cmp instruction, and id_t and id_f are the IDs of its two successor basic blocks:
μ = id_c ⊕ (id_t ≫ 1) ⊕ (id_f ≫ 2).    (1)
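A one-line helper makes the calculation concrete; note that the XOR/shift combination below is our reading of Eq. (1) and is an assumption rather than the exact implementation.

#include <stdint.h>

/* Compute the label of a cmp instruction from the IDs of its basic block
 * (id_c) and its two successor blocks (id_t, id_f). The operator choice
 * mirrors Eq. (1) as reconstructed above and is an assumption. */
static uint32_t cmp_label(uint32_t id_c, uint32_t id_t, uint32_t id_f) {
    return id_c ^ (id_t >> 1) ^ (id_f >> 2);
}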
Moreover, the cmp instructions are reconstructed to ensure a uniform shape. In brief, each cmp instruction takes one of the three forms shown in Table 1: imm means an immediate value, reg refers to a value from a register, and mem is a value from memory. The first form, with two immediate values, is ruled out because it rarely occurs. The second form is the expected shape, with a dynamic operand and a fixed operand; the fixed operand is chosen as the target value (required in the searching). In the third form, both operands are dynamic, so it is reconstructed into the second form, creating a dynamic operand and a fixed operand. Afterwards, all the cmp instructions hold the same form, i.e., "cmp op, imm".
Tab.1 The reconstruction of the cmp instructions
Form Example Reconstructed Target value
cmp imm, imm cmp 5, 7 N/A N/A
cmp reg/mem, imm cmp eax, 7 N/A 7
cmp reg/mem, reg/mem cmp eax, ebx cmp eax-ebx, 0 0
During execution, every input encounters some cmp instructions and thus yields their concrete comparison operands. The input and these concrete comparison operands jointly make up a connection instance. Formally, given an input s, COPMap(s) collects all the concrete comparison operands, as depicted in Eq. (2). Every combination (s, COPMap(s)) is a connection instance.
COPMap(s) = {(μ, Op_μ) | μ ∈ Labelset(s)},    (2)
where Labelset(s) denotes the set of labels of the cmp instructions that appear in the execution trace; μ is the label of a cmp instruction, and Op_μ is its dynamic operand value, i.e., the op in "cmp op, imm".
We design a novel method to collect COPMap (run via instrumentation), as Algorithm 1 shows. First, a memory block is initialized (line 2); during execution, the input encounters some cmp instructions, whose labels are collected (lines 3 and 4). Then, for each cmp instruction with label μ, its operand value Op_μ is obtained (line 6) and stored into the memory unit indexed by μ (line 7). At last, the memory block preserves all the specific operand values of the encountered cmp instructions, i.e., COPMap(s).
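A minimal sketch of the runtime side of Algorithm 1 is given below, assuming a fixed-size memory block indexed by the cmp label (the 65,536-entry size follows the AFL trace map mentioned in Section 4); the function names and map layout are illustrative, not the actual implementation.

#include <stdint.h>
#include <string.h>

#define COP_MAP_SIZE 65536               /* one slot per cmp label */

/* Memory block holding the dynamic operand value of every executed cmp,
 * i.e., COPMap(s) after the run (Algorithm 1, lines 2 and 7). */
static uint64_t cop_map[COP_MAP_SIZE];

/* Called before each execution: initialize the memory block (line 2). */
void cop_map_reset(void) {
    memset(cop_map, 0, sizeof(cop_map));
}

/* Instrumentation callback placed before every reconstructed cmp:
 * mu is the instruction label, op is its dynamic operand value; the
 * operand is stored into the memory unit indexed by mu (lines 6-7). */
void cop_map_record(uint32_t mu, uint64_t op) {
    cop_map[mu % COP_MAP_SIZE] = op;
}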

3.2 Connection inference

Since not all input contents can affect the comparison operands (in the example in Fig. 1, only the 0th, 1st, 2nd, and 3rd bytes are used to compute the magic variable), mutating only such related bytes is a promising way to generate inputs that can overcome complex path constraints.
In our study, the connection between the input contents and the comparison operands means that, once the input contents change, the corresponding comparison operands change as a result. Therefore, we present a sequential analysis method to perform the connection inference, pointing out which input content affects which comparison operand.
As a common choice, the sequential analysis is done byte-wise. Given an input s and its COPMap(s), a new input s0 is produced by mutating the 0th byte, and COPMap(s0) is obtained after execution. Whether the mutated byte belongs to a connection is decided by comparing COPMap(s) and COPMap(s0): if they hold different values for the same comparison operand, the 0th byte is involved in that comparison. After traversing all the bytes, the connection inference for this input is complete.
Formally, we take CI(s) to represent the connection inference results for the input s. It is a key-value storage shown in Eq. (3).
CI(s) = {(i, labelset_i) | 0 ≤ i < len(s)},    (3)
where len(s) is the input length; i is the key, denoting the index of each byte; and labelset_i is the value, a set containing the labels of all the cmp instructions connected to the i-th byte. The number of labels in labelset_i can be 0, 1, or more.
Figure 3 shows part of the inference process. The left part is the memory block of COPMap(s), and the right is COPMap(s0) (s0 is produced by mutating the 0th byte of s). Their values at the 5th and 14th memory units differ (marked in gray), which means the 0th byte can affect the operands of the cmp instructions with labels 5 and 14. This yields the connection shown in Eq. (4).
Fig.3 An example to show the process of the connection inference


CI(s) = {(0, {5, 14})}.    (4)
The total connection inference is made by traversing all the bytes of the input. Intuitively, an accurate way is to assign every value from 0x00 to 0xFF to each byte to produce new inputs. However, this inevitably results in too much overhead. As a trade-off between efficiency and accuracy, each byte is only assigned a few random values (e.g., 10 values).
Algorithm 2 demonstrates the process of connection inference. It reads an input s and finally outputs the result CI(s). Given an input, the engine traverses all its bytes (lines 2–11), and every byte is checked N times (lines 5–10); each time, a new input s′ is produced by mutating one byte of s, and its COPMap(s′) is compared to COPMap(s) to decide whether this byte belongs to a connection (lines 6–8); CI(s) is updated after checking each byte (line 9).
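A sketch of Algorithm 2 under these assumptions follows; run_and_collect is an assumed helper that executes the target and fills a COPMap array, and the dense conn matrix is used purely for illustration.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define COP_MAP_SIZE 65536
#define N_TRIES 10                          /* random values tried per byte */

/* Assumed helper: run the target on `input` and fill `map` with the
 * collected comparison operands, i.e., COPMap(input). */
extern void run_and_collect(const uint8_t *input, size_t len, uint64_t *map);

/* Connection inference (Algorithm 2): mutate every byte N_TRIES times and
 * mark the labels whose operands changed. conn[i][mu] != 0 afterwards means
 * byte i is connected to the cmp instruction with label mu, i.e., CI(s). */
void infer_connections(const uint8_t *s, size_t len,
                       uint8_t (*conn)[COP_MAP_SIZE]) {
    uint64_t *base   = calloc(COP_MAP_SIZE, sizeof(uint64_t));
    uint64_t *probe  = calloc(COP_MAP_SIZE, sizeof(uint64_t));
    uint8_t  *mutant = malloc(len);

    run_and_collect(s, len, base);                   /* COPMap(s)       */
    for (size_t i = 0; i < len; i++) {               /* lines 2-11      */
        for (int t = 0; t < N_TRIES; t++) {          /* lines 5-10      */
            memcpy(mutant, s, len);
            mutant[i] = (uint8_t)rand();             /* mutate byte i   */
            run_and_collect(mutant, len, probe);     /* COPMap(s')      */
            for (uint32_t mu = 0; mu < COP_MAP_SIZE; mu++)
                if (probe[mu] != base[mu])           /* operand changed */
                    conn[i][mu] = 1;                 /* update CI(s)    */
        }
    }
    free(mutant);
    free(probe);
    free(base);
}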

3.3 Connection based searching

Based on the connection inference results, many modern meta-heuristic algorithms could be used to search for the target inputs and penetrate the complex comparisons. However, our primary goal in this paper is not to optimize the meta-heuristic algorithms themselves, but rather to address the challenges in random mutation. Therefore, we simply adopt the genetic algorithm (GA for short) to verify the feasibility of our techniques. In a nutshell, this section illustrates how we handle the collected information and resort to the GA engine to produce the target inputs.
First, for an input s, its connection inference result CI(s) is obtained. We transform CI(s) into another key-value storage, in which the cmp instruction label acts as the key, and the set of input bytes that can affect the corresponding cmp instruction operand acts as the value. This storage is named TCI(s) and is formally shown in Eq. (5).
TCI(s) = {(μ, Posset_μ) | μ ∈ Labelset(P)},    (5)
where P is the target program, Labelset(P) is the set of all the cmp instruction labels in the program, μ is a cmp instruction label, and Posset_μ is the set of all the input byte positions connected to the cmp instruction with label μ.
For example, the CI(s) in Eq. (4) is transformed into Eq. (6). Specifically, the cmp instruction with label 5 is connected to the 0th input byte, and the cmp instruction with label 14 is connected to the 0th input byte as well.
TCI(s) = {(5, {0}), (14, {0})}.    (6)
Furthermore, there is a target value for each cmp instruction (see Table 1). When compiling, all comparison instructions are instrumented at the LLVM IR level to output the corresponding target values. For clarity, we use V_t^μ to denote the target value of the cmp instruction with label μ. For example, V_t^μ for the comparison at line 8 in Fig. 1 is 0xdeadbeef.
Based on TCI(s), Algorithm 3 is designed to apply a GA engine to satisfy our searching requirements. It reads an input s, and outputs several target inputs inputset.
For a target program P, given an input s, its TCI(s) is obtained by execution and conversion (lines 2, 3); for each cmp instruction contained in TCI(s), an individual GA-based process is run to search for target inputs (lines 4–10); the set Posset_μ of connected input bytes and the target value V_t^μ of the cmp instruction with label μ are determined (lines 5, 6); then, following the GA model, Posset_μ is treated as the GAGenes and V_t^μ as the GATarget (lines 7, 8), and both are delivered to the GA engine as parameters; finally, the GA engine does its best to produce the target inputs (line 9).
More specifically, in the GA engine, the input bytes in Posset_μ are decomposed into bits, each of which acts as a gene in the GA model. In the fitness function, we take the Euclidean distance between the two comparison operands, op and imm, as feedback to decide the survival probability of each input in the evolution [22].
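The sketch below shows one way such a fitness function could be wired up; run_and_get_operand is an assumed helper, and the scoring simply negates the absolute distance between op and the target value (for scalar operands the Euclidean distance reduces to this), so that a distance of zero means the comparison is satisfied.

#include <stdint.h>
#include <stddef.h>

/* Assumed helper: run the candidate input and return the dynamic operand
 * observed at the cmp instruction with label mu. */
extern uint64_t run_and_get_operand(const uint8_t *input, size_t len, uint32_t mu);

/* Fitness of one GA individual for the cmp with label mu and target value
 * `target` (the imm side in Table 1). The bytes listed in posset are the
 * genes; they are written into a copy of the seed before execution.
 * This is an illustrative objective, not the exact GAlib one we ship. */
double fitness(uint8_t *candidate, size_t len,
               const size_t *posset, size_t n_pos, const uint8_t *genes,
               uint32_t mu, uint64_t target) {
    for (size_t k = 0; k < n_pos; k++)      /* apply genes to the connected bytes */
        candidate[posset[k]] = genes[k];
    uint64_t op = run_and_get_operand(candidate, len, mu);
    double dist = (op > target) ? (double)(op - target)
                                : (double)(target - op);
    return -dist;                           /* distance 0 => comparison satisfied */
}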

4 Implementation and evaluation

Based on the proposed methods, we developed a prototype, Pusher, built on top of AFL, LLVM instrumentation [24], and GAlib, a genetic algorithm library by Matthew Wall. AFL performs the state-of-the-art coverage-based fuzzing, LLVM instrumentation is used to record the comparison operands, and GAlib provides the genetic algorithm based searching. Besides, we add some extra code to conduct the connection inference. For the size of Labelset(s) of an input s, we adopt the size of the AFL trace map (i.e., 65,536), which covers the requirement of most real-world programs.

4.1 Evaluation setup

We surveyed the evaluation experiments of other fuzzers in their published papers; the results are shown in Table 2. Their experiments suggest LAVA-M [23], CGC, and some real-world programs as a representative choice. Therefore, we also perform the evaluation experiments on these three benchmark databases.
Tab.2 The evaluation experiments in other fuzzers
Fuzzer Time Benchmark Compared Fuzzer
Steelix 2017 RWP AFL-dyninst
CGC AFL-dyninst
LAVA-M FUZZER, SES, VUzzer, AFL-lafintel
VUzzer 2017 RWP AFL
CGC AFLPIN
LAVA-M FUZZER, SES
Angora 2018 RWP AFL
CGC
LAVA-M FUZZER, SES, AFL, VUzzer, Steelix
T-fuzz 2018 RWP AFL
CGC AFL, Driller
LAVA-M FUZZER, SES, VUzzer, Steelix
Profuzzer 2019 RWP AFL, AFLFast, Driller, Vuzzer, QSYM
CGC
LAVA-M AFL, AFLFast, Driller, VUzzer, Angora, QSYM

Note: RWP is short for real-world program

Experimental infrastructure All experiments were conducted on a server equipped with 2 Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz (72 threads in total), running Linux Ubuntu 18.04.3 LTS AMD64. Besides, the fuzzers were launched in single mode, to avoid the performance fluctuation caused by the different parallel modes in various fuzzers [16, 25].
Research questions The following experiments were designed to answer three research questions:
RQ1: How effective is Pusher at bug detection?
RQ2: How much testing coverage does Pusher achieve?
RQ3: How much execution overhead does Pusher introduce?

4.2 Bug detection on LAVA (RQ1)

In 2016, to address the shortage of ground-truth corpora for vulnerability discovery evaluation, a technique was developed in [23] to produce such corpora, LAVA-1 and LAVA-M. LAVA-1 contains 69 instances of the file program, each injected with a unique bug, whereas LAVA-M contains four programs, each injected with more than one bug. For bug discovery evaluation, LAVA-M is the more appropriate benchmark in the long run.
Both the authors and users of LAVA-M have adopted five hours of testing for bug detection. Following this standard, we also run Pusher on LAVA-M for five hours and compare its results with other fuzzers.
In particular, each bug in LAVA-M has a unique ID, which is realized in the lava_get function as Fig. 4 shows. Once either of the two comparisons at line 2 is satisfied, the input triggers the crash and the program outputs a bug_num as the ID of the bug. This mechanism can be leveraged to distinguish crashes, so the number of unique bugs can be counted easily.
Fig.4 Each bug owns a unique ID
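The injected trigger has roughly the following shape; this is our paraphrase of the pattern in Fig. 4 (the "lava" magic compared in both byte orders), not the exact listing.

#include <stdio.h>

extern unsigned int lava_get(unsigned int bug_num);   /* provided by LAVA */

/* Paraphrase of a LAVA-injected trigger: lava_get() returns four
 * attacker-controlled input bytes; if they match the magic value in either
 * byte order, the bug fires and its ID (bug_num) is printed. */
void lava_trigger(unsigned int bug_num) {
    unsigned int v = lava_get(bug_num);
    if (v == 0x6c617661 || v == 0x6176616c) {          /* "lava" in either endianness */
        printf("Successfully triggered bug %u\n", bug_num);
        *(volatile char *)0 = 0;                       /* force the crash */
    }
}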


After five hours of testing, the number of bugs detected by Pusher is shown in the last column of Table 3. The second column shows the number of inserted bugs according to the provided documentation (this number is actually incomplete; some inserted bugs are not listed [14]). The other columns summarize the evaluation results of other fuzzers.
Tab.3 The number of detected bugs on LAVA-M database
Program Listed bugs FUZZER SES AFL AFL-lafintel Angora Pusher
base64 44 7 9 9 28 44+4 44+4
md5sum 57 2 0 0 0 57 57+4
uniq 28 7 0 0 24 28+1 28+1
who 2136 0 18 1 2 1443+98 1847+205
The results indicate that Pusher found 44 listed bugs and 4 unlisted bugs in base64, 57+4 bugs in md5sum, 28+1 bugs in uniq, and 1847+205 bugs in who. The unlisted bug IDs were acquired by running the crashes and collecting them from the outputs. Overall, Pusher is the best among the 10 compared fuzzers on bug discovery.
We investigated why some bugs cannot be detected in the who program. The reason is that the inserted code is too concentrated. For example, all the code from line 743 to line 1353 (610 lines) in the readutmp.c file is inserted by the LAVA techniques; more importantly, the bugs among this code are connected to the same four input bytes, and such nested conditional branches make it difficult to find all the bugs. However, this scenario is not common in real-world programs, so missing some bugs in the who program is acceptable.
Furthermore, Pusher is built on top of AFL by introducing a search-based method to handle the complex path constraints. The results show that Pusher outperforms AFL on bug discovery, which is evidence that our proposed approach is effective in solving more path constraints.

4.3 Bug detection on CGC (RQ1)

In 2016, the Defense Advanced Research Projects Agency (DARPA) held the Cyber Grand Challenge (CGC) and finally released a benchmark database for vulnerability discovery evaluation, which has been widely used, as Table 2 shows.
However, the original CGC binaries can only run under the DARPA Experimental Cyber Research Evaluation Environment (DECREE), a customized system. Due to its instrumentation dependency, Pusher cannot yet run on DECREE. Thanks to the work of Trail of Bits, a version migrated to Linux has been created, so we evaluate Pusher on these migrated binaries.
Though these binaries are small, the vulnerabilities in them are quite diverse. We tested these binaries and made the following observations:
● I. Because of migration problems [15] and infinite reading loops, some binaries cannot be tested by fuzzers.
● II. In some binaries, the vulnerabilities cannot be detected even though the testing coverage is high, because covering the related code does not guarantee finding these bugs; other detection techniques [26] are required.
● III. Some binaries contain special paths guarded by complex condition predicates, such as magic number comparisons. Once these comparisons are penetrated, the testing coverage rises considerably, and the possibility of finding bugs also increases.
Pusher is designed to handle such comparisons, i.e., it targets the binaries in observation III. Therefore, we tested some selected binaries with Pusher and collected the branch coverage and the number of crashes. The results are shown in Table 4 (we failed to build the CGC binaries with Angora).
Tab.4 The experimental results on CGC binaries (each cell lists the branch coverage followed by the number of crashes)
Program Size AFL AFLFast AFL-lafintel FairFuzz Pusher
CGC_File_System 88 KB 7.2% 0 7.2% 0 59.5% 0 7.2% 0 75.1% 10
CGC_Hangman_Game 5.2 MB 12% 0 12% 0 12% 0 12% 0 94% 1
CGC_Image_Parser 145 KB 4.6% 0 4.6% 0 4.6% 0 4.6% 0 60.6% 12
CGC_Video_Format_Parser_and_Viewer 82 KB 12% 0 12% 0 53.1% 0 12% 0 53.1% 0
CNMP 56 KB 27.5% 0 27.5% 0 42.5% 26 47.5% 3 80% 30
FASTLANE 52 KB 25.9% 0 17.5% 0 17.5% 0 27.7% 0 51.2% 19
Gridder 164 KB 84.2% 10 70.3% 0 82.9% 23 84.2% 28 84.2% 56
Griswold 95 KB 3.8% 0 3.8% 0 45% 13 3.8% 0 55.3% 27
Barcoder 172 KB 59% 0 59% 0 57.2% 0 59% 0 68% 25
The results indicate that on all 9 binaries, Pusher achieves higher branch coverage than the other fuzzers. For example, it achieves 94% branch coverage on CGC_Hangman_Game, 7.8× that of the other tools. This is because a simple check, strcmp(inbuf, "HANGEMHIGH!"), at the beginning of the main function prevents the other tools from testing further. By leveraging the GA searching method, Pusher can quickly pass through this check and uncover more code areas. Besides, Pusher finds more bugs on these binaries (a total of 180 crashes in 7 binaries).
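The gate described above is essentially a single string comparison; a self-contained sketch of that shape (only the strcmp target is taken from the text, the rest is illustrative) is:

#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for the rest of the binary. */
static void play_hangman(void) {
    puts("game starts here");
}

int main(void) {
    char inbuf[32] = {0};
    if (fgets(inbuf, sizeof(inbuf), stdin) == NULL)
        return 1;
    inbuf[strcspn(inbuf, "\n")] = '\0';

    if (strcmp(inbuf, "HANGEMHIGH!") != 0)
        return 0;          /* random mutation almost never gets past this line */
    play_hangman();        /* roughly 94% of the code sits behind this check   */
    return 0;
}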
Take Barcoder as an example: when it tries to create a barcode from the input bitmap file, it first invokes the function validate_bmp_headers to check whether the input file has a correct header. Only an input file that passes this check is processed further. The check function is shown in Fig. 5. To pass this function, the input file needs to satisfy many validation checks, such as magic number, file size, image size, etc. Fuzzers like AFL, AFL-lafintel, and FairFuzz have no knowledge of the input–branch relationship, so they cannot bypass these complex checks. In contrast, by extracting the target values at compile time and leveraging a heuristic search method to solve them, Pusher can quickly generate a valid bitmap file that satisfies all the checks in this function.
Fig.5 The validate_bmp_headers function implementation in the Barcoder program


4.4 Bug detection on real-world programs (RQ1)

We selected several real-world programs to evaluate our system's ability to discover unknown bugs. The input types of this dataset cover ELF binaries, images, and packet captures. To demonstrate the efficiency of our system, we also compared our results with AFL, AFL-lafintel, FairFuzz, and Angora. All the experiments lasted 72 hours to ensure soundness. The initial seeds were identical across fuzzers and taken from fuzz-data [27].
The testing results are shown in Table 5. The table shows that Pusher outperforms all the other fuzzers on these four real-world programs. Especially in jhead and tcpdump, Pusher successfully uncovers bugs while the other fuzzers like AFL, AFL-lafintel, and FairFuzz cannot. We illustrate the reason for this performance improvement by examining the testing coverage in Section 4.5.
Tab.5 The number of detected bugs on real-world programs
Program AFL AFL-lafintel FairFuzz Angora Pusher
jhead 0 0 0 16 114
nm 77 69 27 3 83
objdump 5 22 29 34 51
tcpdump 0 0 0 0 3
total 82 91 56 53 251
In conclusion, Pusher works well on the bug detection, especially the bugs guarded by the complex condition constraints.

4.5 Testing coverage on real-world programs (RQ2)

Since it is hard to predict where a bug may appear in software, maximizing the testing coverage is usually chosen as a proxy for increasing the probability of finding bugs. Therefore, we make a deep investigation of the testing coverage (i.e., branch coverage, line coverage, and function coverage) to illustrate the bug detection improvement in Table 5. The results are shown in Table 6.
Tab.6 The final testing coverage after 72 hours of testing
Program AFL AFLFast AFL-lafintel Fairfuzz Angora Pusher
BC LC FC BC LC FC BC LC FC BC LC FC BC LC FC BC LC FC
jhead 213 313 18 213 313 18 213 313 18 213 313 18 479 687 26 568 714 26
0% 0% 0% 0% 0% 0% 0% 0% 0% +125% +119% +44% +167% +128% +44%
nm 4397 7097 391 4342 6993 390 4046 6856 405 4468 7230 392 4002 6679 404 4710 7851 433
−1% −1% 0% −8% −3% +4% +2% +2% 0% −9% −6% +3% +7% +11% +11%
objdump 2602 4375 273 2378 3949 258 2694 4990 309 2595 4426 278 3337 6329 358 3737 7034 372
−9% −10% −5% +4% +14% +13% 0% +1% +2% +28% +45% +31% +44% +61% +36%
tcpdump 15474 23452 706 13501 20645 682 12791 20475 698 16264 24530 753 2482 4834 269 18754 27861 846
−13% −12% −3% −17% −13% −1% +5% +5% +7% −84% −81% −62% +21% +19% +20%

Note: BC is the branch coverage number. LC is the line coverage number. FC is the function coverage number

Table 6 shows the final testing coverage results, and Fig. 6 shows how the coverage increases over time. From an overall perspective, Pusher achieves the highest testing coverage among the six compared fuzzers on the four programs. Besides, although Pusher is developed on top of AFL, it achieves an 11% to 128% increase in line coverage over the original AFL.
Fig.6 The increasing process of branch coverage number in the 72 hours. (a) jhead; (b) nm; (c) objdump; (d) tcpdump


After an adequate test (72 hours), Pusher behaves better than AFLFast and AFL-lafintel. This is because both AFLFast and AFL-lafintel still rely on random mutation to produce new inputs, which is powerless to penetrate complex paths; although AFLFast introduces a power-based schedule and AFL-lafintel exploits program transformation to help the testing, their improvements become weak after long testing. In contrast, Pusher is aware of the connection between the inputs and the comparisons, so it successfully achieves higher testing coverage.
FairFuzz and Angora perform better than AFL on these four programs except tcpdump, because they provide techniques to augment the mutation capacity: FairFuzz leverages statistical results of executions to direct mutation, and Angora adopts DataFlowSanitizer to do so. In comparison, Pusher acts better than these two fuzzers, because the statistical results in FairFuzz are not always accurate in individual cases, and the external library descriptions required by Angora can also bring inaccuracy. Especially in tcpdump, since input functions such as fread are implemented in the external library libpcap, which we do not instrument, we had to patch tcpdump so that taint data could be introduced.
In contrast, Pusher is easily deployed to test many programs, and it can conduct the input searching based on an accurate connection between the inputs and the comparisons. Therefore, it can achieve a higher testing coverage.
Case study An interesting case is the testing of the jhead program. There, AFL, AFLFast, AFL-lafintel, and FairFuzz reach their top coverage at the very beginning, and no new coverage is discovered in the following 72 hours. On the contrary, Pusher and Angora continually exercise new code during the testing.
More specifically, this difference is due to a magic number check. Figure 7 shows the related code snippet: the program reads some data and compares it to the "Exif" string; if the check is satisfied, the process_EXIF() function is invoked, otherwise the program quits quickly.
Fig.7 The code snippet in the jhead program
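In place of the figure, a schematic version of the gate is shown below; the buffer handling is simplified and only the "Exif" comparison and the process_EXIF call reflect the description above.

#include <stdio.h>
#include <string.h>

/* Illustrative stub; in jhead this parses the EXIF section. */
static void process_EXIF(unsigned char *data, unsigned int length) {
    printf("parsing %u bytes of EXIF data\n", length);
}

/* Schematic gate: only inputs whose section payload starts with "Exif"
 * reach process_EXIF(); otherwise the program gives up quickly. */
void handle_section(unsigned char *Data, unsigned int itemlen) {
    if (itemlen >= 6 && memcmp(Data + 2, "Exif", 4) == 0)
        process_EXIF(Data, itemlen);   /* new code is reachable only here */
}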


If this comparison cannot be solved during the testing, the program ends directly, and the fuzzer cannot discover new code. Because AFL, AFLFast, and AFL-lafintel have no idea which input bytes are connected to this comparison, they have to search in an enormous space and ultimately fail. The strategy in FairFuzz is ineffective here due to inadequate statistical information, so it also fails to penetrate this comparison.
Contrariwise, both Pusher and Angora know which input bytes are connected to this comparison and can direct the mutation to produce the expected inputs quickly. Even so, Pusher still achieves higher coverage than Angora.
In conclusion, Pusher improves the testing coverage.

4.6 Execution overhead evaluation (RQ3)

Execution overhead is an essential metric for vulnerability detection tools, and it seriously influences their applicability to large software; e.g., symbolic execution is limited partly because of its heavy overhead. Therefore, we analyze the execution overhead of Pusher in this section.
In fuzzers, the execution overhead mainly comes from two aspects: instrumentation and analysis. For example, the execution overhead in AFL consists of the coverage collection and the testing schedule, as Table 7 shows. Built on AFL, Pusher additionally introduces overhead from the comparison operand collection, the connection inference, and the GA based searching, which is also shown in Table 7.
Tab.7 The overhead in AFL and Pusher
Overhead AFL Pusher
Instrumentation coverage collection coverage collection, comparison operands collection
Analysis testing schedule testing schedule, connection inference, GA based searching
We utilize the total number of executions after thorough testing as an indicator of the execution overhead of the fuzzers, i.e., within a fixed time, a smaller execution number denotes a more significant overhead. Because Pusher is based on AFL, we only compare Pusher with AFL and its variants (Angora is not taken into comparison) to avoid the effect of other unrelated factors.
The execution numbers are shown in Fig. 8. Taking the execution number of AFL as the baseline, Pusher performs fewer executions: 39% of AFL's executions on jhead, 96% on nm, 80% on objdump, and 63% on tcpdump. Nevertheless, the execution number of Pusher remains on the same order of magnitude as AFL's. Moreover, the execution numbers of the AFL-variant fuzzers, which are regarded as lightweight techniques, are also on the same order of magnitude as AFL's. Therefore, the overhead in Pusher is acceptable, and it belongs to the lightweight techniques.
Fig.8 The final execution numbers. (a) jhead; (b) nm; (c) objdump; (d) tcpdump


We also compile in-depth statistics on the execution numbers of the coverage-based testing and the search-based testing (consisting of the connection inference and the GA based searching), in an attempt to understand the execution distribution between these two stages.
For clarity, the execution numbers are transformed into normalized rates, as Fig. 9 shows. The results indicate that the execution distribution between the two testing stages is reasonable.
Fig.9 The execution distribution in Pusher between coverage-based testing and search-based testing


For example, on the jhead program, the search-based testing occupies only 3% of the total executions: because the number of inputs in the seed pool is small, the search-based component quickly finishes its work and waits for new inputs generated by the coverage-based testing. Since tcpdump contains many switch-case statements in its source code whose outcomes are affected by the input file (as shown in Fig. 10), Pusher spends more time overcoming such cases via search-based testing, which accounts for 46% of the total executions. Nevertheless, the introduced overhead stays within an acceptable scope, and the testing coverage is improved as a result.
Fig.10 The code coverage of tcpdump for vanilla AFL (left part) and Pusher (right part)


Moreover, we take a further detailed statistic about the execution distribution between the connection inference and the GA based searching. The results in Table 8 indicate that the connection inference accounts for the primary execution overhead in the search-based testing. For example, the number of connection-inference executions on the tcpdump binary is 81659k, whereas the number of GA based searching executions is just 1016k, which is comparatively negligible.
Tab.8 The execution distribution in Pusher between connection inference and GA searching
Program Connection inference (# executions) GA based searching (# executions)
jhead 4748k 41k
nm 8352k 1720k
objdump 8842k 969k
tcpdump 81659k 1016k
As far as we know, other connection inference techniques, such as taint analysis and symbolic execution, carry heavier burdens than our approach when applied to specific program structures such as symbolic loops and implicit data flows that commonly exist in real-world software, whereas our approach remains lightweight and more adaptive to complex program structures. In other words, our method is insensitive to the scale of the target program (i.e., lines of code); its overhead and efficiency are only affected by the quantity and complexity of the branch instructions within the target program.
In addition, the execution distribution is imbalanced between the connection inference and the GA based searching, which explains why our research emphasis is on the connection inference and why we simply select one off-the-shelf modern search algorithm as a representative to handle the comparisons.
In conclusion, the execution overhead introduced by our approach in Pusher is within an acceptable scope.

5 Related Work

5.1 Input aware fuzzing

VUzzer [12] proposed an application-aware evolutionary fuzzing strategy that does not require any prior knowledge of the application or input format. Instead, it leverages control- and data-flow features based on static and dynamic analysis to infer the critical bytes of comparison instructions, maximizing coverage and exploring deeper paths. However, even though the critical bytes are located, its mutation strategies are still random. To relieve the overhead introduced by VUzzer's taint analysis, Steelix [15] presented a program-state based fuzzing method that leverages lightweight static analysis and binary instrumentation to provide coverage information and comparison progress information, helping the fuzzer locate the magic bytes. However, its mutation methods also remain random. Angora [14] leverages a search-based mutation method (i.e., gradient descent) similar to our GA method, which can improve the quality of the generated test cases. However, as demonstrated in [28], Angora uses the expensive taint tracking step only sparsely to overcome hard-to-solve conditions, which brings more overhead. What is worse, one has to compromise between precision and usability: to successfully build a target program with Angora's compiler, one has to ignore taint propagation in many libraries [14]. Similar to our method, REDQUEEN [28] implements a lightweight "taint analysis" to locate the bytes related to magic bytes or checksums. For magic bytes, REDQUEEN employs several specific mutation strategies; for checksums, it uses a "patch-solve" method, which introduces heavyweight symbolic execution. In contrast, our method uses a lightweight search method to solve the constraints, which we consider more scalable.

5.2 Hybrid fuzz testing

Hybrid fuzz testing uses concolic execution to discover frontier nodes that represent unique paths in the program. For example, Driller is an up-to-date hybrid testing tool that leverages fuzz testing and concolic execution in a complementary manner to find deeper bugs [6], and is more practical than previous hybrid tools. DeepFuzz [29] defined a novel method to assign probabilities to program paths so as to maximize both the execution path depth and the degree of freedom in input parameters for exploitation. SAFL leverages symbolic analysis to generate high-quality seed inputs for the fuzzer and prioritizes fuzzing toward seeds that have exercised rarely executed branches, helping the fuzzing process exercise rare and deep paths with higher probability [30]. DigFuzz designs a novel Monte Carlo based probabilistic path prioritization model to quantify each path's difficulty and prioritize paths for concolic execution, which improves both the number of detected bugs and the code coverage [7]. Intriguer proposed field-level constraint solving, which optimizes symbolic execution with field-level knowledge. It performs instruction-level taint analysis and records execution traces without data transfer instructions like mov; it then reduces the execution traces for tainted instructions that accessed a wide range of input bytes, and infers input fields to build field transition trees [31].

5.3 Machine learning based fuzzing

Gong et al. trained a deep learning model on samples generated by AFL, labeled by whether they changed the program state [32]. The trained model can then predict whether a sample generated in a new round of AFL will change the program state, so AFL can skip samples that are predicted not to generate new states, improving the efficiency of the fuzzer. NeuFuzz [33] learns hidden vulnerability patterns from known vulnerable programs through an LSTM to discover execution paths containing vulnerabilities. Then, NeuFuzz preferentially executes the seed files that can cover paths containing vulnerabilities, and assigns more mutation energy to these seed files based on the prediction results. NEUZZ proposed a novel program smoothing technique using surrogate neural network models that can incrementally learn smooth approximations of a complex real-world program's branching behaviors to improve the efficiency of fuzzing [34].

5.4 Checksum bypass

TaintScope is the first method that is aware of checksums such as CRC and can bypass them automatically [35]. It uses a heuristic method to locate the checksum point and then replaces the checksum part of the input with the calculated checksum value to bypass the check. However, this method may fail on the example in Fig. 1: TaintScope bypasses a checksum by adjusting the checksum part of the input, whereas magic and FORM in Fig. 1 are both constant values. TaintScope can bypass path constraints of the form f(x) = c, where c is a constant value, with the help of constraint solvers, but that makes it a kind of hybrid fuzz testing method. CAFA presents an approach based on taint analysis to identify the checksum point automatically; the application is then statically patched in an anti-logical manner at the checksum point to bypass the checksum verification [36].

6 Conclusion

This paper presents a new prototype that addresses two challenges in the random mutation used by coverage-based fuzzing: the sizable search space and the lack of useful feedback. In our design, a novel instrumentation method is utilized to collect the comparison operands, which are then used to infer the connection between the input contents and the comparison operands. In this way, the engine knows which input byte is pivotal to penetrate which comparison. Furthermore, adopting the distance between the two operands as feedback, we employed the genetic algorithm as a representative to search for the expected inputs and penetrate the complex constraints. We developed a new tool and evaluated it on LAVA-M, CGC binaries, and four real-world programs. The experimental results demonstrate that our approach improves bug detection and testing coverage, while the introduced execution overhead remains within an acceptable scope, ensuring its practicability on large programs.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61702540) and Hunan Provincial Natural Science Foundation of China (2018jj3615).
1
Miller B P , Fredriksen L , So B . An empirical study of the reliability of UNIX utilities. Communications of the ACM, 1990, 33( 12): 32– 44

2
Liang H , Pei X , Jia X , Shen W , Zhang J . Fuzzing: state of the art. IEEE Transactions on Reliability, 2018, 67( 3): 1199– 1218

3
Serebryany K. Continuous fuzzing with libFuzzer and AddressSanitizer. In: Proceedings of IEEE Cybersecurity Development. 2016, 157

4
Gan S, Zhang C, Qin X, Tu X, Li K, Pei Z, Chen Z. CollAFL: path sensitive fuzzing. In: Proceedings of IEEE Symposium on Security and Privacy. 2018, 679−696

5
De Moura L, Bjørner N. Z3: an efficient SMT solver. In: Proceedings of Tools and Algorithms for the Construction and Analysis of Systems. 2008, 337–340

6
Stephens N, Grosen J, Salls C, Dutcher A, Wang R, Corbetta J, Shoshitaishvili Y, Kruegel C, Vigna G. Driller: augmenting fuzzing through selective symbolic execution. In: Proceedings of Network and Distributed System Security Symposium. 2016

7
Zhao L, Duan Y, Yin H, Xuan J. Send hardest problems my way: probabilistic path prioritization for hybrid fuzzing. In: Proceedings 2019 Network and Distributed System Security Symposium. 2019

8
Pak B S. Hybrid fuzz testing: discovering software bugs via fuzzing and symbolic execution. PhD thesis, Carnegie Mellon University Pittsburgh, PA, 2012

9
Baldoni R, Coppa E, D'Elia D C, Demetrescu C, Finocchi I. A survey of symbolic execution techniques. ACM Computing Surveys, 2018, 51(3): 1–39

10
Peng H, Shoshitaishvili Y, Payer M. T-Fuzz: fuzzing by program transformation. In: Proceedings of IEEE Symposium on Security and Privacy. 2018, 697– 710

11
Newsome J, Song D X. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In: Proceedings of the Network and Distributed System Security Symposium. 2005

12
Rawat S, Jain V, Kumar A, Cojocar L, Giuffrida C, Bos H. VUzzer: application-aware evolutionary fuzzing. In: Proceedings of Network and Distributed System Security Symposium. 2017

13
Haller I, Slowinska A, Neugschwandtner M, Bos H. Dowsing for overflows: a guided fuzzer to find buffer boundary violations. In: Proceedings of the USENIX Security Symposium. 2013

14
Chen P, Chen H. Angora: efficient fuzzing by principled search. In: Proceedings of IEEE Symposium on Security and Privacy. 2018, 711– 725

15
Li Y, Chen B, Chandramohan M, Lin S W, Liu Y, Tiu A. Steelix: program-state based binary fuzzing. In: Proceedings of the Joint Meeting on Foundations of Software Engineering. 2017, 627– 637

16
Ye J , Zhang B , Li R , Feng C , Tang C . Program state sensitive parallel fuzzing for real world software. IEEE Access, 2019, 7 : 42557– 42564

17
Böhme M, Pham V T, Roychoudhury A. Coverage-based greybox fuzzing as Markov chain. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security. 2016, 1032–1043

18
Lemieux C, Sen K. FairFuzz: a targeted mutation strategy for increasing greybox fuzz testing coverage. In: Proceedings of ACM/IEEE International Conference on Automated Software Engineering. 2018, 475– 485

19
Dave M, Agrawal R. Search based techniques and mutation analysis in automatic test case generation: a survey. In: Proceedings of IEEE International Advance Computing Conference. 2015, 795– 799

20
Harman M, Jia Y, Zhang Y. Achievements, open problems and challenges for search based software testing. In: Proceedings of IEEE International Conference on Software Testing, Verification and Validation. 2015, 1– 12

21
Fraser G, Arcuri A. EvoSuite: automatic test suite generation for object-oriented software. In: Proceedings of ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. 2011, 416– 419

22
Rowe J E. Genetic algorithm theory. In: Proceedings of Conference Companion on Genetic and Evolutionary Computation. 2007, 3585

23
Dolan-Gavitt B, Hulin P, Kirda E, Leek T, Mambretti A, Robertson W, Ulrich F, Whelan R. LAVA: large-scale automated vulnerability addition. In: Proceedings of IEEE Symposium on Security and Privacy. 2016, 110–121

24
Lattner C, Adve V. LLVM: a compilation framework for lifelong program analysis & transformation. In: Proceedings of IEEE International Symposium on Code Generation and Optimization. 2004, 75– 86

25
Liang J, Jiang Y, Chen Y, Wang M, Zhou C, Sun J. PAFL: extend fuzzing optimizations of single mode to industrial parallel mode. In: Proceedings of ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2018, 809– 814

26
Serebryany K, Bruening D, Potapenko A, Vyukov D. AddressSanitizer: a fast address sanity checker. In: Proceedings of USENIX Annual Technical Conference. 2012, 309– 318

27
Mozilla Security. fuzzdata: fuzzing resources for feeding various fuzzers with input. Mozilla Security, December 2017

28
Aschermann C, Schumilo S, Blazytko T, Gawlik R, Holz T. REDQUEEN: fuzzing with input-to-state correspondence. In: Proceedings of Annual Network and Distributed System Security Symposium. 2019

29
Böttinger K, Eckert C. Deepfuzz: triggering vulnerabilities deeply hidden in binaries. In: Proceedings of International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. 2016, 25– 34

30
Wang M, Liang J, Chen Y, Jiang Y, Jiao X, Liu H, Zhao X, Sun J. SAFL: increasing and accelerating testing coverage with symbolic execution and guided fuzzing. In: Proceedings of International Conference on Software Engineering: Companion Proceedings. 2018

31
Cho M, Kim S, Kwon T. Intriguer: field-level constraint solving for hybrid fuzzing. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security. 2019, 515– 530

32
Gong W, Zhang G, Zhou X. Learn to accelerate identifying new test cases in fuzzing. In: Proceeding of Security, Privacy, and Anonymity in Computation, Communication, and Storage. 2017, 298– 307

33
Wang Y, Wu Z, Wei Q, Wang Q. NeuFuzz: efficient fuzzing with deep neural network. IEEE Access, 2019, 7: 36340–36352

34
She D, Pei K, Epstein D, Yang J, Ray B, Jana S. NEUZZ: efficient fuzzing with neural program smoothing. In: Proceedings of IEEE Symposium on Security and Privacy. 2019, 803– 817

35
Wang T, Wei T, Gu G, Zou W. TaintScope: a checksum-aware directed fuzzing tool for automatic software vulnerability detection. In: Proceedings of IEEE Symposium on Security and Privacy. 2010, 497–512

36
Liu X, Wei Q, Wang Q, Zhao Z, Yin Z. CAFA: a checksum-aware fuzzing assistant tool for coverage improvement. Security and Communication Networks, 2018, 2018: 1–13
