83
Program Analysis and Black-box GUI Testing Dissertation zur Erlangung des Doktorgrades der Technischen Fakult¨ at der Albert-Ludwigs-Universit¨ at Freiburg vorgelegt von Stephan Arlt

Program Analysis and Black-box GUI Testing

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Program Analysis and Black-box GUI Testing

Program Analysis and Black-box GUI Testing

Dissertationzur Erlangung des Doktorgrades

der Technischen Fakultatder Albert-Ludwigs-Universitat Freiburg

vorgelegt vonStephan Arlt

Page 2: Program Analysis and Black-box GUI Testing

Tag der Disputation:26. Marz 2014

Dekan:Prof. Dr. Yiannos Manoli

Referenten:Prof. Dr. Andreas PodelskiProf. Dr. Mauro PezzeProf. Dr. Hannah BastProf. Dr. Peter Thiemann

2

Page 3: Program Analysis and Black-box GUI Testing

Abstract

In GUI testing, a program is tested by performing user interactions on its GUI. Userinteractions are encoded as events, and a test case is encoded as a sequence of eventstogether with input data. One challenge in GUI testing is the judicious selection ofevent sequences and input data, in order to detect as many bugs as possible.

A well-established approach to systematically select event sequences uses a black-box model. The black-box model is inferred by recording possible sequences of userinteractions from the running GUI program (ignoring the source code of the GUIprogram). The model informs the tester about the executability of event sequences,but not about their relevance for finding bugs.

In a white-box approach, one analyzes the source code of a GUI program in orderto infer the relevance of event sequences. However, since large parts of the GUI pro-gram are generally not amenable to the program analysis, no information is obtainedabout the executability of event sequences.

This thesis shows that it is possible to gain information about both the relevanceand the executability of the event sequences by integrating program analysis intoblack-box GUI testing.

In the first part of this thesis we present a new approach to select relevant eventsequences among the event sequences selected by a black-box approach. Departingfrom a black-box model, we apply a def-use program analysis to the source code ofthe GUI program. This allows us to infer a dependency graph, which we call EventDependency Graph (EDG). We use the EDG together with the black-box model toconstruct a set of relevant event sequences among the executable ones. We evaluatethe approach on several open source programs. The results indicate that the approachscales to large programs and, at the same time, leads to an informed selection of eventsequences. We are able to find previously undetected bugs.

In the second part of this thesis we present a generalization of the previous ap-proach. Departing from a black-box model, we apply program slicing to the sourcecode of the GUI program. The inferred slices of the GUI program allow us to identify

3

Page 4: Program Analysis and Black-box GUI Testing

and to eliminate redundant event sequences from the set of test cases. The evaluationof this approach on several open source programs shows a significant reduction of thenumber of event sequences, leading to a strong reduction in overall execution time.We are able to find more previously undetected bugs.

In the third part of this thesis we address an orthogonal issue to the selection ofevent sequences, namely the selection of suitable input data for test cases. First, weselect event sequences from a black-box model and convert them into parameterizedGUI tests (i. e., event sequences that are parameterized by input data). Then, we in-stantiate the parameters by applying symbolic execution. We evaluate the approachon several open source programs. The results indicate that the approach can achievehigher code coverage.

We present a new tool, called Gazoo, which embodies the three approaches above.The tool is available under a GPL license.

4

Page 5: Program Analysis and Black-box GUI Testing

Zusammenfassung

Im GUI-Testing wird ein Programm durch das Ausfuhren von Benutzerinteraktionenauf der GUI getestet. Benutzerinteraktionen sind kodiert als Events, und ein Test-fall basiert zusammen mit Input-Daten auf einer Sequenz von Events. Eine Heraus-forderung im GUI-Testing ist die uberlegte Wahl von Event-Sequenzen und Input-Daten, um so viele Bugs wie moglich zu finden.

Ein gangiger Ansatz fur die systematische Wahl von Event-Sequenzen ist die An-wendung eines Black-box-Modells. Das Black-box-Modell wird abgeleitet durch dasAufzeichnen moglicher Sequenzen von Benutzerinteraktionen eines laufenden GUI-Programms. Der Quelltext des GUI-Programms wird dabei ignoriert. Das Black-box-Modell informiert den Tester uber die Ausfuhrbarkeit von Event-Sequenzen, abernicht uber ihre Relevanz zum Finden von Bugs.

In einem White-box-Ansatz wird der Quelltext des GUI-Programms analysiert,um die Relevanz von Event-Sequenzen abzuleiten. Allerdings sind große Teile desGUI-Programms fur die Analyse generell nicht zuganglich, so dass keine Informa-tionen uber die Ausfuhrbarkeit von Event-Sequenzen erhalten werden konnen.

Diese Thesis zeigt, dass es moglich ist, Informationen uber die Relevanz und dieAusfuhrbarkeit von Event-Sequenzen durch die Integration von Programm-Analysein Black-box GUI Testing zu gewinnen.

Im ersten Teil dieser Arbeit prasentieren wir einen neuen Ansatz zur Wahl vonrelevanten Event-Sequenzen unter den Sequenzen, die von einem Black-box-Ansatzgewahlt werden. Ausgehend von einem Black-box-Modell wenden wir eine def-use-Analyse auf dem Quelltext des GUI-Programms an. Dies erlaubt uns einenAbhangigkeitsgraphen abzuleiten, den wir Event Dependency Graph (EDG) nennen.Wir verwenden den EDG zusammen mit dem Black-box-Modell, um eine Mengevon relevanten Event-Sequenzen zu konstruieren. Wir evaluieren den Ansatz auf ver-schiedenen Open-Source-Programmen. Die Ergebnisse zeigen, dass der Ansatz bzgl.großer Programme skaliert und zu einer informierten Wahl von Event-Sequenzenfuhrt. Mit dem Ansatz finden wir bisher unentdeckte Bugs.

5

Page 6: Program Analysis and Black-box GUI Testing

Im zweiten Teil dieser Arbeit prasentieren wir eine Generalisierung des vorheri-gen Ansatzes. Ausgehend von einem Black-box-Modell wenden wir Program Slic-ing auf dem Quelltext des GUI-Programms an. Die abgeleiteten Slices des GUI-Programms erlauben uns redundante Event-Sequenzen zu identifizieren und sie ausder Menge der Testfalle zu eliminieren. Die Evaluierung des Ansatzes auf ver-schiedenen Open-Source-Programmen zeigt eine signifikante Reduktion von Event-Sequenzen, die zu einer starken Reduktion der gesamten Ausfuhrungszeit fuhrt. Mitdem Ansatz finden wir weitere bisher unentdeckte Bugs.

Im dritten Teil dieser Arbeit gehen wir ein orthogonales Problem zur Wahl vonEvent-Sequenzen an, namlich die Wahl von geeigneten Input-Daten fur Testfalle.Zuerst wahlen wir Event-Sequenzen von einem Black-box-Modell und konvertierendiese in parametrisierte GUI-Tests (d.h., Event-Sequenzen parametrisiert durch Input-Daten). Dann instanziieren wir die Parameter durch die Anwendung von SymbolicExecution. Wir evaluieren den Ansatz auf verschiedenen Open-Source-Programmen.Die Ergebnisse zeigen, dass der Ansatz eine hohere Code Coverage erzielen kann.

Wir prasentieren ein neues Werkzeug namens Gazoo, das die drei oberen Ansatzeenthalt. Das Werkzeug ist verfugbar unter einer GPL-Lizenz.

6

Page 7: Program Analysis and Black-box GUI Testing

Acknowledgements

First and foremost I want to thank my supervisor Andreas Podelski for accepting meas a PhD student and for supporting me over all the years. I really appreciate thefreedom Andreas gave to me: Finding and working on my own research topic.

Furthermore, I want to thank Mauro Pezze for taking over the responsibility asthe second reviewer of this thesis.

A special thank goes to Martin Schaf, who paved the way for the collaborationwith the United Nations University. In this context, I would like to thank CristianoBertolini for his advice. I very much enjoyed the disputes with Martin and Cristiano,as well as the time in Macau when software engineering was off the cards.

I am thankful to my flatmates Evren Ermis and Martin Schaf for the marvelousjourneys. I thank Martin for establishing Karaoke as the essential part of the flat, andI thank Evren for being the perfect conversationalist in endless discussions on the“Blaue Couch”.

I would like to thank the members of the Software Engineering group at the Uni-versity of Freiburg, including the external members at Bosch and Daimler. In partic-ular, I am thankful to my office colleagues Martin Wehrle, Sergio Feo, and ChristianHerrera, who endured my taste of music. I thank the administrative staff Marlis Jost,Berit Brauer, and Martin Preen for their assistance in organizational and technicalmatters. They all provided me an excellent environment for my research and madethese years an unforgettable experience.

I am thankful to Ralf Karle, who taught me software engineering from the bottomup during my apprenticeship. Furthermore, I thank the Horst Klaes GmbH & Co. KGfor providing resources that supported the experiments of my research.

Finally, I would like to thank my family. In particular, I want to thank my parentsPetra and Ulrich, as well as my sister Manuela and my brother-in-law Michael fortheir love and faith over all the years. Last but not least, I want to thank Nicole forshowing me things in life that happen beyond my screen.

7

Page 8: Program Analysis and Black-box GUI Testing

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.1 Graphical User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.2 GUI Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.3 Program Analysis and Black-box GUI Testing . . . . . . . . . . . . . . . . . . . . 121.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Selection of Relevant Event Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.2 Approach and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3 Elimination of Redundant Event Sequences . . . . . . . . . . . . . . . . . . . . . . . . 333.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.2 Redundant Event Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383.3 Slicing-based Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 Selection of Suitable Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.2 Approach and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715.1 Threats to Internal Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715.2 Threats to External Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

8

Page 9: Program Analysis and Black-box GUI Testing

6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736.1 Selection of Event Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736.2 Selection of Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

9

Page 10: Program Analysis and Black-box GUI Testing

1

Introduction

1.1 Graphical User Interfaces

A Graphical User Interface (GUI) provides the most common way to interact with aprogram [21]. A GUI consists of a collection of widgets; for example buttons and textboxes. A user interacts with the program by manipulating those widgets; for exampleclicking a button or typing characters into a text box. Each user interaction with awidget triggers an event. An event carries information related to its user interaction;for example, the mouse position or the key entered. A program can respond to anevent and evaluate its information with the execution of an event handler, which is amethod implemented by the programmer.

1.2 GUI Testing

GUI Testing is an established method to assure the quality of a GUI program [7, 13,16, 17, 52, 73]. In GUI testing (a form of system testing [57]), a program is tested byperforming user interactions on its GUI. A test case is based on user interactions (en-coded by a sequence of events) together with input data. Given a test case, the eventsequence can be performed on the GUI program either manually by a test engineer,or automatically by a testing tool [31, 48, 51]. A set of test cases is called test suite.

Testing GUI programs is a challenging task [7, 6, 13, 16, 31, 48, 52, 73], since thenumber of possible user interactions with a GUI is prohibitively large. For example,while the numbers of buttons of a mobile phone is fixed (i. e., the buttons for thenumbers 0 to 9), the number of clicks to a button is unrestricted. That is, the numberof possible user interactions (and thus, the number of event sequences) is infinite.

A common way to test GUI programs is to create test cases manually throughcapture-replay tools [17]. In a first step, a test engineer performs a sequence of eventson the GUI program. While interacting with the program, all events (e. g., mousemovements and mouse clicks) are captured and embedded into a test case. In a second

10

Page 11: Program Analysis and Black-box GUI Testing

step, the test engineer uses the captured test case to replay the previously capturedevents on the GUI program and to observe, whether the response of the programmeets the expected outcome. Usually, the test engineer captures the event sequenceof a certain functionality (e. g., defined in a requirements document), replays it onthe GUI program, and then verifies, whether the requirement of this functionality isfulfilled or not.

However, capture-replay tools require a large amount of human interaction, sincethe entire test suite has to be created manually. Furthermore, a captured test case isvulnerable to style changes of the GUI. For example, if buttons are moved to anotherposition on the window (e. g., during the software development process), test caseshave to be maintained (i. e., to be captured and to be replayed again), which representsa tedious and time-consuming task.

In order to avoid the costly efforts of manual test case generation, one would pre-fer to generate test cases automatically by a testing tool [30, 47, 51]. However, theautomatic generation of GUI test cases bears two major challenges: Event sequencesshould be both executable and relevant.

Executable Event Sequences. In the example of a mobile phone, each button canbe clicked at any time. Thus, all sequences of events are executable. Let us now con-sider a more complex example: A text editor, which offers multiple windows (e. g., amain window and a print window). In contrast to the mobile phone, some windowsof the text editor might be closed and only be opened after preceding events. For ex-ample, the print window and its buttons may only be clickable if one clicks a buttonin the main window. Assuming that all events can be performed at any time (e. g., ason a mobile phone) would lead to the generation of non-executable event sequences.That is, for the automatic generation of event sequences one needs to know whichsequences are actually executable on the GUI program [53].

Relevant Event Sequences. Let us again consider the example of a text editor.Usually, a text editor offers the events Copy and Paste. The event Copy copiesa selected text to the clipboard, and the event Paste pastes the previously copiedtext from the clipboard to the text editor. In this sense, the event Paste depends onthe event Copy, because Copy “writes” data (i. e., text) to the clipboard that is later“read” by Paste. Thus, one would like to test the event sequence 〈Copy, Paste〉,which represents a relevant event sequence. In contrast, one would not like to test theevent sequence 〈Paste, Copy〉, because Paste does not write data that is read inCopy. Hence, such a sequence represents an irrelevant event sequence. That is, forthe automatic generation of event sequences one needs to know which sequences areactually relevant (and irrelevant) on the GUI program [80].

11

Page 12: Program Analysis and Black-box GUI Testing

A major effort to systematically generate executable event sequences is the use ofa black-box approach [6, 13, 16, 52, 73]. In a black-box approach, one ignores thesource code of the GUI program and creates a model that expresses a set of executableevent sequences. The black-box model can be a graph [13, 52], whose nodes representevents that can be triggered by a user interaction. An edge between two nodes in thisgraph indicates that the corresponding events can be executed consecutively. The ideais that a path in this graph forms an event sequence, which can be embedded into atest case.

The benefit of a black-box approach is its informedness of executable event se-quences. Furthermore, a black-box model is created without the source code of a GUIprogram, and is hence applicable to real-world programs. For example, a black-boxmodel can be created automatically by observing possible user interactions duringrun-time of the GUI program [53]. However, without having knowledge about theprogram internals (i. e., the source code), a black-box approach may generate a hugenumber of irrelevant event sequences [7].

Thus, an interesting direction is to develop a white-box approach [25, 60, 66]to systematically generate relevant event sequences. In a white-box approach, oneanalyzes the source code of the GUI program for an informed selection of event se-quences. Until now, however, a white-box approach was not used to generate eventsequences. Moreover, it is open whether a white-box approach can realistically inferrelevant event sequences that are, at the same time, executable on the GUI. The prob-lem is that parts of the program (e. g., the GUI toolkit) might not be amenable for theanalysis.

1.3 Program Analysis and Black-box GUI Testing

The question is whether there is an approach to GUI testing which inherits from thebest of two worlds: The informedness of executable event sequences from a black-box approach and the informedness of relevant event sequences from a white-boxapproach. Put into a slogan, the black-box tells us what we can test; the white-boxtells us what we should test. Phrasing the slogan negatively, the black-box may pro-duce irrelevant test cases; the white-box may produce non-executable test cases.

In this thesis we present three new approaches to GUI testing that establish aninterface between the black-box and the white-box approach. In particular, each ap-proach starts with a black-box approach, then moves on to a white-box approach, andfinally goes back to a black-box approach. As a result, the drawbacks of the black-and the white-box approach can often be eliminated, while the benefits of the black-and the white-box approach can be preserved. In the white-box part, we apply tech-niques for program analysis (e. g., [20, 67, 72, 83]) on the source code of a GUIprogram. In the black-box parts, we perform user interactions on the GUI of a pro-

12

Page 13: Program Analysis and Black-box GUI Testing

gram – typical for Black-box GUI Testing [50, 52]. The approaches presented in thisthesis are embodied in a new tool called Gazoo.

1.3.1 Selection of Relevant Event Sequences

In Chapter 2 of this thesis we propose a new approach [3, 4, 7] to select relevant eventsequences among the executable event sequences of a black-box model. In a step thatstarts with a black-box approach and departs from it, we extract the set of events ofthe GUI program and the corresponding event handlers. In a step typical for a white-box approach, we apply a def-use [83] analysis to the bytecode of the GUI programand its dependent libraries. This allows us to infer a dependency relation betweenevents. The relation gives rise to a graph, which we call Event Dependency Graph(EDG). The EDG can be used to infer the relevance of an event sequence; it saysnothing about its executability. In contrast, a black-box model can be used to inferthe executability of an event sequence. In a step that goes back towards the black-box approach, we use the EDG together with the black-box model to construct aninformed selection of executable event sequences, namely through the set of relevantevent sequences among the executable ones. We evaluate the approach on severalopen source GUI programs. The approach scales to large programs and, at the sametime, leads to an informed selection of event sequences. Using our approach we areable to find previously undetected bugs.

1.3.2 Elimination of Redundant Event Sequences

The previous approach for selecting relevant event sequences aims at combining thebest of two worlds by combining a black-box and a white-box approach to identifyboth executable and relevant sequences. While this approach shows promising resultscompared to black-box testing, it only considers pairs of events that represent a def-use dependency. However, longer event sequences are often desired to detect morecomplex bugs [10, 14, 76]. Furthermore, although only pairs of events have beenconsidered, the number of resulting event sequences is still huge, which limits itspractical applicability.

In Chapter 3 of this thesis we propose a generalization [8] of the previous ap-proach [7]. The generalization is not restricted to pairs, but can handle an arbitrarynumber of dependent events while guaranteeing that all relevant event sequences aregenerated based on these events. To make this generalization scalable in practice, wepresent a refined static analysis approach based on a variant of program slicing [72]to reduce the resulting number of event sequences. In more detail, we present an ap-proach that allows us to identify redundant event sequences that can safely be ignoredfor the test case generation, because non-redundant sequences already guarantee thesame code coverage. Our experiments on several open source GUI programs con-firm its practical potential. In particular, our experiments demonstrate that redundant

13

Page 14: Program Analysis and Black-box GUI Testing

event sequences occur in a huge number of various open source GUI programs. Thereduced number of generated test cases can strongly reduce the overall execution timeof the remaining test cases. We are able to find more previously undetected bugs.

1.3.3 Selection of Suitable Input Data

As outlined above, GUI test cases contain both event sequences and input data. Forselecting event sequences, one can use a black-box approach based on Event FlowGraphs (EFGs) [52]. For selecting input data, one can use a white-box approach basedon parameterized unit tests [67] and dynamic symbolic execution [20]. Motivated bythe success of the black-box approach to GUI testing [13, 16, 52, 74], we ask thequestion whether the black-box approach can be integrated with techniques from thewhite-box approach so that the resulting approach provides both, the selection ofexecutable event sequences and the selection of suitable input data.

In Chapter 4 of this thesis we propose an approach [5] to make the principle ofparameterized unit testing available to black-box GUI testing. In a first step, our ap-proach selects event sequences from the EFG of a GUI program and converts theminto a set of parameterized GUI tests. Parameterized GUI Tests are event sequencesthat are parameterized by possible input data. In a second step, we apply dynamicsymbolic execution [20] in order to instantiate the parameterized GUI tests. In a finalstep, instantiated GUI tests are replayed on the GUI program. We evaluate the ap-proach on several open source GUI programs. The results indicate that the approachcan achieve higher code coverage.

1.4 Outline

In Chapter 2 we present our approach for selecting relevant event sequences. Af-terwards, in Chapter 3 we present an approach for eliminating redundant event se-quences. In Chapter 4 we present our approach for selecting suitable input data forGUI test cases. The threats to validity of our experiments are listed in Chapter 5, fol-lowed by an overview of related work in Chapter 6. We conclude this thesis togetherwith future work in Chapter 7.

14

Page 15: Program Analysis and Black-box GUI Testing

2

Selection of Relevant Event Sequences

In this chapter we present our new approach to select relevant event sequences. First,we illustrate the approach using an example of a GUI program. Then, we present theapproach and its implementation. Finally, we formulate a set of research questionsand evaluate the approach.

2.1 Example

In this section we illustrate the application of our approach on an example of a GUIprogram; see Figure 2.1. The example contains a simplified version of a bug whichwe detected in an existing GUI program; see Section 2.3. Note that the GUI programmust not be confused with the GUI toolkit: We here concentrate on testing the eventhandlers of the GUI program instead of testing certain features of the GUI toolkit.

Fig. 2.1: Example of a GUI program. The arrows between the two screenshots of the GUI indicate thetransition between two views. Clicking the button for the event e3 leads to opening the Dialog window(and hiding the MainWindow; i. e., the buttons for the events e1, e2, and e3 are no longer enabled).Clicking the button for the event e4 closes the Dialog window and leads back to the state in the firstview; i. e., the buttons for the events e1, e2, and e3 are enabled again.

15

Page 16: Program Analysis and Black-box GUI Testing

The description of the possible user interactions with the GUI in Figure 2.1 revealsits event flow. Namely, each event can follow any other event except that e4 canfollow only after e3. The event flow (i. e., the can-follow relation between events)is expressed by the graph depicted in Figure 2.2. Such a graph, called Event FlowGraph (EFG), represents the black-box model that is used to generate test cases inMemon [52]. The idea is that a path in the EFG encodes a possible user interaction.The marking of e1, e2, and e3 as initial events encodes how a user interaction canstart. An EFG can be constructed automatically using reverse engineering [53].

e1

e3

e2

e4

Fig. 2.2: Event Flow Graph (EFG) for the example program. Each event can follow any other eventexcept that e4 can follow only after e3. A path in the EFG encodes a possible user interaction. Themarking of initial events encodes that a user interaction can start with e1, e2, e3, but not e4.

We will first explain how a “pure” black-box approach a la [52] would work onthe example program. It would first construct the EFG in Figure 2.2. It would thenuse the EFG to generate the ten event flow sequences in Figure 2.3, which it wouldtransform into the ten test sequences in Figure 2.4 (and embed those into test cases).

s1 = 〈 e1 , e1 〉 s6 = 〈 e2 , e3 〉s2 = 〈 e1 , e2 〉 s7 = 〈 e3 , e4 〉s3 = 〈 e1 , e3 〉 s8 = 〈 e4 , e1 〉s4 = 〈 e2 , e1 〉 s9 = 〈 e4 , e2 〉s5 = 〈 e2 , e2 〉 s10 = 〈 e4 , e3 〉

Fig. 2.3: The event flow sequences extracted from the EFG in Figure 2 (i. e., the sequences of events ona path in the EFG of length n = 2).

An event flow sequence consists of the events on one of the paths in the EFG. Thelength of the paths in the EFG is fixed as the parameter of the approach; here weset the parameter to n = 2. The transformation of an event flow sequence into a testsequence accounts for the fact that a user interaction must start in an initial event,

16

Page 17: Program Analysis and Black-box GUI Testing

namely by a suitable expansion. Specifically, since the event e4 is not an initial eventin the EFG, a test sequence cannot start with e4. Thus, the transformation expands s8,s9, and s10 by an initial event, here e3. The transformation does not expand s1, . . . , s7,since they already start with an initial event.

t1 = 〈 e1 , e1 〉 t6 = 〈 e2 , e3 〉t2 = 〈 e1 , e2 〉 t7 = 〈 e3 , e4 〉t3 = 〈 e1 , e3 〉 t8 = 〈 e3 , e4 , e1 〉t4 = 〈 e2 , e1 〉 t9 = 〈 e3 , e4 , e2 〉t5 = 〈 e2 , e2 〉 t10 = 〈 e3 , e4 , e3 〉

Fig. 2.4: The test sequences that would be generated in the black-box approach whose parameter isn = 2. They are obtained from the event flow sequences in Figure 2.3, possibly by expansion (a testsequence cannot start with e4).

Our approach

We will now explain how our approach works on the example program. In the firststep, we extract the event handlers of the events e1, . . . , e4 from the GUI programand its bytecode. The Java source code is shown in Figure 2.5. For the purpose of thisexample, we use the source code (and not, as in real, the bytecode).

In the second step of our approach, we apply a static analysis to the bytecodeand derive a dependency relation between the four events of the GUI: The event e1defines (writes) the field text which is used (read) in the event e4; so does the evente2, and so does the event e4 itself. The def-use (“writes-to/reads-from”) dependencybetween events is expressed by the graph depicted in Figure 2.6. We call it the EventDependency Graph (EDG).

In the third step, we first generate the event dependency sequences shown in Fig-ure 2.7 from the EDG in Figure 2.6. We then transform the event dependency se-quences into the test sequences shown in Figure 2.8, hereby using the EFG in Fig-ure 2.2 in an auxiliary procedure.

An event dependency sequence consists of the events on a path in the EDG. Now,it is the length of the paths in the EDG which is fixed by the parameter of our ap-proach; here we set the parameter to n = 2 (i. e., the same value as in the black-boxapproach).

The transformation of an event dependency sequence s into a test sequence t con-sists of expanding the sequence until it corresponds to a path in the EFG and thenfurther until it corresponds to an EFG path that starts in an initial event. For example,we transform s1 into t1 by inserting e3 between e1 and e4. In the case of the trans-formation from s3 into t3, we need to additionally add an initial event (since a test

17

Page 18: Program Analysis and Black-box GUI Testing

1 class MainWindow extends JFrame {2 String text = new String();3

4 void e1() { // handler for event e15 text = "Hello World";6 }7

8 void e2() { // handler for event e29 text = null;

10 }11

12 void e3() { // handler for event e313 Dialog d = new Dialog(this);14 d.setVisible(true);15 }16

17 class Dialog extends JDialog {18 void e4() { // handler for event e419 text = text.trim();20 Dialog.this.setVisible(false);21 }22 }23 }

Fig. 2.5: The event handlers extracted from the example program and the bytecode. The classMainWindow defines the event handlers for e1, e2, and e3, and the class Dialog for e4. e1 assignsa string value to the field text (line 5). e2 sets the field text to null (line 9). e3 opens the dialog(line 13-14). e4 assigns the trimmed string value of the field text to the field text, and closes thedialog (line 19-20).

e1 e2 e3 e4

Fig. 2.6: Event Dependency Graph (EDG) for the example program. Each edge expresses a def-usedependency found by the static analysis (applied to the source code shown in Figure 2.5): e1 defines thefield text which is used in e4; so does e2, and so does e4 itself. We observe that a path in the EDG maynot be matched by any user interaction (since e4 can follow only after e3). We further observer that e3does not contain incoming or outgoing edges, since e3 neither reads nor writes fields.

s1 = 〈 e1 , e4 〉 s2 = 〈 e2 , e4 〉 s3 = 〈 e4 , e4 〉

Fig. 2.7: The event dependency sequences extracted from the EDG in Figure 2.6 (i. e., the sequences ofevents on a path in the EDG of length n = 2).

t1 = 〈 e1 , e3 , e4 〉 t2 = 〈 e2 , e3 , e4 〉 t3 = 〈 e3 , e4 , e3 , e4 〉

Fig. 2.8: The test sequences that are generated in our approach when the parameter is set to n = 2.They are obtained from the event dependency sequences in Figure 2.7, using the EFG in Figure 2; e4can follow only after e3 (and a test sequence cannot start with e4).

18

Page 19: Program Analysis and Black-box GUI Testing

sequence cannot start in e4). Thus, the transformation proceeds in two steps wherethe second step is analogous to the transformation from the event flow sequences inFigure 2.3 into the test sequences in Figure 2.4. I.e., in the black-box approach onewould take all event flow sequences of the length fixed by the parameter n = 2. Inour approach, we take a few (three) selected event flow sequences whose length is apriori unbounded.

Our approach finds a bug in the example.

The execution of one of the three test sequences generated in our approach, namely t2,causes a NullPointerException. This is because the field text is set to nullin the event handler e2 and then de-referenced in event handler e4. The intermediateevent e3 in the test sequence t2 is not involved in the causal chain of the bug; it isneeded because without it, the event sequence would not be executable (i. e., couldnot be executed on the GUI program).

Would the black-box approach find the bug?

No. In our example, the set of the three test sequences generated from event de-pendency sequences of length n = 2 is disjoint from the set of the ten test sequencesgenerated from event flow sequences of length n = 2. Thus, the test sequence t2 can-not be generated from one of the ten event flow sequences of length n = 2. I.e., thebug cannot be detected in the black-box approach when the parameter is set to n = 2.

Yes. In our example, the set of the three test sequences generated from eventdependency sequences of length n = 2 is a subset of the (24) test sequences generatedfrom event dependency sequences of length n = 3. Thus, the test sequence t2 couldbe generated from one of the event flow sequences of length n = 3. I.e., the bug couldbe detected in the black-box approach when the parameter was set to n = 3, at leastin this example. However, setting n = 3 can be considerably expensive in terms ofexecution time/number of test sequences.

No. We can modify the example such that the bug can no longer be detected inthe black-box approach even when the parameter is set to n = 3 (the modificationextends to n = 4, n = 5, . . . ). Namely, we add an intermediate dialog box with theevent e1

4 such that the test sequence t2 becomes non-executable and the bug would bedetected only by the test sequence t ′2 below.

t ′2 = 〈 e2 , e3 , e14 , e4 〉

If the button for e14 can only be activated by e3 and the button for e4 can only be

activated by e14, then the edge (e3,e4) in the EFG of Figure 2.2 gets replaced by the

two edges (e3,e14) and (e1

4,e4). This means that the test sequence t ′2 can no longer

19

Page 20: Program Analysis and Black-box GUI Testing

be generated from an event flow sequence of length n = 3. I.e., the bug cannot bedetected in the black-box approach even when the parameter is set to n = 3.

One might argue that one can modify the example such the bug can no longerbe detected in our approach with the parameter n = 2. This, however, would involveconstructing a more complex bug, namely one whose causal chain involves more thantwo events.

2.2 Approach and Implementation

Our approach consists of three main steps as shown in Figure 2.9: (1) the extractionof event handlers; (2) the static analysis; and (3) the generation of test sequences.

2.2.1 Event Handler Extraction

Each event in a GUI (e. g., a click on a button OK) is encoded as an event handler(e. g., OnClickOK). The step Event Handler Extraction mixes aspects of black-boxand white-box approaches in order to extract the GUI’s event handlers. First, asin a black-box approach, we execute the GUI program and enumerate the GUI’swidgets (e. g., windows, buttons, and text fields). Then, leaving a black-box ap-proach, we apply Reflection [58] to obtain the Java object corresponding to eachwidget (e. g., a JButton object). We ask the Java object to invoke the methodgetActionListeners. The method invocation returns the widget’s event han-dlers (which are called action listeners in Java). The widget’s ID is used as aunique identifier for each event. If the Java object does not provide the methodgetActionListeners, then there exists no registered event handler to this wid-get. In this case, we can simply discard the corresponding event for the constructionof the EDG; it will still be used for the construction of the EFG.

2.2.2 Static Analysis

The input of the step Static Analysis is the bytecode of the program and the set ofevent handlers which has been extracted by the step Event Handler Extraction. Foreach event handler, we infer a sound approximation of what fields can be definedto resp. used from during the execution of the event handler. The inference is doneby a “lightweight” static analysis applied to the bytecode. Concretely, for each eventhandler, the static analysis collects all fields that are defined to resp. used by theevent handler directly or indirectly (indirectly means: By all methods that it maypossibly call, directly or via intermediate method calls). The static analysis returnstwo mappings that assign to each event e a set of fields, namely FieldsDefined(e)resp. FieldsUsed(e). We construct the EDG from the two mappings. Namely, theedge (e,e′) belongs to the EDG if the intersection between FieldsDefined(e) and

20

Page 21: Program Analysis and Black-box GUI Testing

FieldsUsed(e′) is not empty. The EDG is the output of the step Static Analysis. Anedge (e,e′) in the EDG indicates a def-use dependency between the events e and e′,meaning: The event handler of e′ possibly uses data defined by the event handler ofe.

Sta$c&Analysis&

Test&Sequence&Genera$on&

Event&Handler&Extrac$on&

GUI&Ripper&

Replayer&

GUI&P

rogram

&

GUI$

Bytecode$

Test$Cases$

EDG$

EFG$

Test$Sequences$

Bytecode$+$GUI$

Event$Handler$

pass$ fail$

GUI$

Fig. 2.9: Our approach, with three main steps. (1) The step Event Handler Extraction extracts a set ofevent handlers from the bytecode and the GUI of a program. (2) The step Static Analysis constructs anEDG from the def-use dependencies which are found by a static analysis applied to the event handlersand the bytecode. (3) The step Test Sequence Generation generates a set of test sequences from theEDG, with the help of the EFG. There are two more steps: the step GUI Ripper constructs an EFG, andthe step Replayer executes the generated test sequences on the GUI (reporting sets of passed and failedtest cases).

21

Page 22: Program Analysis and Black-box GUI Testing

Implementation details

We use the ASM framework [9] for the implementation of the static analysis de-scribed above. We apply the static analysis to the bytecode, and not on the Javasource code as shown for the example discussed in Section 2.1. For concreteness, weshow the bytecode for the event handler e2 and e4 in the example; see Figure 2.10.Our earlier implementation efforts, which used a static analysis on the source code(implemented in JDT1), failed; we did not succeed to obtain the required informa-tion. For each event handler, the static analysis collects all fields defined by analyz-ing the instruction PUTFIELD [46], and all fields used by analyzing the instructionGETFIELD. Since an event handler may directly or indirectly call methods, the staticanalysis follows all method calls by analyzing the instruction INVOKE.

1 void e2()V2 ICONST_03 PUTFIELD MainWindow.text : Z4

5 void e4()V6 GETFIELD MainWindow.text : Ljava/lang/String;7 INVOKEVIRTUAL java/lang/String.trim()Ljava/lang/String;8 PUTFIELD MainWindow.text : Ljava/lang/String;

Fig. 2.10: Bytecode for the event handlers for e2 and e4 in the example of Section 2.1. The static analysisadds the text field of MainWindow to the set of fields defined by e2, and it adds it to the set of fieldsused by e4. It thus infers the def-use dependency from e2 to e4. Thus, we add the edge (e2,e4) to the EDG.In bytecode, fields are used by the instruction GETFIELD; they are defined by PUTFIELD. Methodsare called using the INVOKE instruction (INVOKEVIRTUAL, INVOKESTATIC, INVOKESPECIAL).Line 2: the constant value 0 (which refers to NULL) is pushed to the stack. Line 3: the constant value 0is assigned to the field text. Line 6: the field text is used and pushed to the stack. Line 7: the methodtrim is called and the return value is pushed to the stack. Line 8: the return value is pop-ed from thestack and written to the field text.

2.2.3 Test Sequence Generation

The input to the step Test Sequence Generation is the EDG, which has been con-structed in the step Static Analysis, and the EFG, which has been constructed in anauxiliary step which we will describe later. As we have described above, the EDG en-codes the dependency relation between events, and the EFG encodes the flow relationbetween events. A path in the EDG, which we call event dependency sequence, is ingeneral not executable (i. e., does not encode a user interaction that can be executedon the GUI). In contrast, a path in the EFG that starts in an initial event, which wecall test sequence, is executable. Therefore, in order to obtain a test sequence from an

1 http://www.eclipse.org/jdt/

22

Page 23: Program Analysis and Black-box GUI Testing

event dependency sequence, we need to transform it until it forms a path in the EFGthat starts with an initial event.

The length of the event dependency sequences is fixed by the parameter n of theoverall approach. It is the step Test Sequence Generation, and only this step in theoverall approach, that depends on the parameter n. In the example of Section 2.1 andin the experiments which we will discuss in Section 2.3, the parameter is set to n = 2.

The step Test Sequence Generation consists of two substeps: (1) the extraction ofall event dependency sequences from the EDG, and (2) the transformation of eventdependency sequences to test sequences. The transformation consists of finding apath in the EFG that contains all events of the given event dependency sequence, inthe same order but not necessarily consecutive.

For two events in an event dependency sequence, say e and e′, we compute theshortest path in the EFG from e to an initial event, and the shortest path from e toe′. We then concatenate both computed paths (i. e., the prefix of e and the suffixstarting in e) in order to obtain an executable test sequence. In particular, we apply abreadth-first search on the EFG to find the shortest path between events. The searchfor the shortest path is motivated by the idea of having compact test cases (in termsof numbers of events in a test sequence). If there exist multiple shortest paths from eto e′ in the EFG, we consider the first shortest path found by our implementation.

There are two cases when an event dependency sequence cannot be transformedinto a test sequence: (a) there exists no path in the EFG from an initial event to thefirst event of the event dependency sequence (we then discard the event dependencysequence); and (b) for two events in the event dependency sequence, say e and e′,there exists no path in the EFG leading from e to e′. In this case, we first split theevent dependency sequence into two test sequences (i. e., we compute the shortestpath to an initial event for e and for e′). Then, we concatenate these two test sequenceswith a restart of the GUI program in between. The need of a restart arises from thefollowing observation: GUI programs tend to store user settings (e. g., the recentlyopened files in the File menu). For example, an event e writes a user setting which isused by an event e′, but there exists no event flow between these two events. In orderto test (e,e′), we generate two test sequences and concatenate them with a restart inbetween.

Auxiliary Steps

The step GUI Ripper constructs an EFG by executing the GUI program. For com-pleteness we describe it briefly; see [53] for details. The execution is directed toexplore the hierarchical structure of the GUI in a depth-first manner. For each widgetfound during the execution, say a button OK, the GUI Ripper triggers the assignedevent (i. e., a button click). By recording the history of triggered events, the GUI Rip-per detects the event flow (i. e., which pairs of events can be consecutive) and stores

23

Page 24: Program Analysis and Black-box GUI Testing

it in the EFG. The GUI Ripper uses a widget ID as a unique identifier for each event,just as the step Event Handler Extraction. In our implementation, the steps EventHandler Extraction and GUI Ripper are combined.

The step Replayer takes as input a set of test sequences and embeds each test se-quence into a test case. A test case consists of four components: (1) a preconditionthat must hold before executing a test sequence; (2) the test sequences to be executed;(3) possible input data to the GUI; and (4) a postcondition that must hold after exe-cuting the test sequence. In particular, the step Replayer consists of several substeps:First, it ensures that the precondition of the test case holds. Then, it launches theprogram. Afterwards, it executes the event sequence of the test case on the GUI andinserts input data where necessary. In the next steps, the Replayer closes the GUI pro-gram and finally checks, whether the postcondition of the test case holds. If it holds,the Replayer reports the test case as passed, and if not, as failed. In Section 2.3 weoutline the precondition and the postcondition (i. e., the oracle), as well as the inputdata used in our experiments.

We have implemented our approach in a new tool called Gazoo [26]. Our imple-mentation uses an adaptation of the GUI Ripper and the Replayer of GUITAR [55].

2.3 Experiments

In this section we evaluate our approach by a comparison with the black-box ap-proach in Memon et al. [52]. We first present the setup of the experiments. Then, wediscuss the results of the experiments. We define the following five research ques-tions:

• Q1: Does our approach scale to realistic GUI programs? A priori, this is not clear.The question is critical for a particular reason. In the black-box approach, one canbound the cost by fixing the parameter to, say, n = 2 (the approach will then gen-erate at most k2 test sequences where k is the number of events). In our approach,the parameter fixes the complexity of the bug to be found (the approach guaran-tees to find all bugs whose causal chain involves n = 2 events); a priori, there isno bound on the number of the generated test sequences (nor on their length).

• Q2: Is the test sequence generation in our approach effectively selective; i. e., doesit discard an interesting number of irrelevant event flow sequences?

• Q3: How much would we have to increase the parameter n for the black-boxapproach in order to generate all the test sequences that are generated by ourapproach with the parameter set to n = 2?

• Q4: Does our approach achieve the same coverage even in the cases when thenumber of generated test sequences is smaller than in the black-box approach?

• Q5: Is our approach effective for finding bugs?

24

Page 25: Program Analysis and Black-box GUI Testing

2.3.1 Setup of the Experiments

We evaluate our approach using four Java-based open source programs: JabRef2.7 [39] manages bibliographic references, FreeMind 0.9 [23] creates mind maps,TerpWord 4.0 is a word processor, and Rachota 2.3 [62] is a time recording system.For Rachota, we used the artifacts from Community Event-based Testing(COMET) [19]. We choose these programs to consider both small and large programs(in terms of # of classes), and to cover different code styles. Figure 2.11 shows somerelevant statistics for each Application Under Test (AUT).

JabRef FreeMind Rachota TerpWordLOC 68,468 40,922 13,750 6,842

Classes 4,027 1,362 468 215

Events 776 959 154 159

EDG edges 10,034 25,248 2,172 4,100

EFG edges 100,211 105,986 1,493 4,229

Fig. 2.11: Comparison between AUTs: number of lines of code (LOC), classes, events, and edges in theEDG and the EFG that are computed in the experiments. For Rachota, the size of the EDG is almost50% higher than the size of the EFG.

Our experiments consist of two configurations. The configuration EFG-2 standsfor the black-box approach [52] where the parameter is set to n = 2; this limits thelength of the considered paths in the EFG to n = 2. The configuration EDG-2 standsfor our approach where the parameter is set to n = 2 as well; this limits the lengthof the considered paths in the EDG to n = 2. The choice of the parameter n = 2 ismotivated by previous empirical studies on bugs in GUI programs [79].

As mentioned in Section 2.2, each test sequence is embedded into one test caseconsisting of (1) a precondition; (2) a test sequence; (3) input data; (4) a postcon-dition. For (1), as a precondition we define that all user settings of an AUT haveto be deleted before executing the test sequence. For (3), we generate random data(i. e., random string values for text boxes). The computation of suitable input data(see [25, 28, 69]) for widgets represents an orthogonal problem and is not in thescope of this chapter. For (4), we use a crash monitor as an oracle; this is a simple butreasonable oracle. In particular, we record any exception occurred during test caseexecution, and we automatically observe whether a test case is executable on a GUI.For a discussion of alternative oracles we refer to [22, 54, 56, 77].

The test cases are executed on 10 virtual Linux machines with 2.0 GHz CPU, 2GB RAM, 500 GB HDD. In order to mitigate the effect of randomness, the config-urations EFG-2 and EDG-2 are executed three times. The total number of executedtest cases amounts to 236,808.

25

Page 26: Program Analysis and Black-box GUI Testing

2.3.2 Results of the Experiments

AUT EFG-2 EDG-2JabRef# test sequences 43,017 5,860generation time (m) 93 21generation time per test sequences (s) 0.13 0.22execution time (h) 358 49line coverage (%) 54 54branch coverage (%) 26 26# detected bugs

FreeMind# test sequences 11,396 9,944generation time (m) 25 34generation time per test sequences (s) 0.13 0.21execution time (h) 98 88line coverage (%) 53 53branch coverage (%) 37 37# detected bugs - -

Rachota# test sequences 1,310 1,407generation time (m) 3 4generation time per test sequences (s) 0.14 0.17execution time (h) 6 6line coverage (%) 61 62branch coverage (%) 34 36# detected bugs -

TerpWord# test sequences 3,307 2,695generation time (m) 7 8generation time per test sequences (s) 0.13 0.18execution time (h) 12 10line coverage (%) 55 55branch coverage (%) 36 36# detected bugs - -

Fig. 2.12: The configuration EFG-2 stands for the black-box approach [52] where the parameter is setto n = 2; this limits the length of the considered paths in the EFG to n = 2. The configuration EDG-2 stands for our approach where the parameter is set to n = 2 as well; this limits the length of theconsidered paths in the EDG to n = 2.

Regarding research question Q1, we refer to Figure 2.12. Contrary to our initialworries, the number of test sequences does not explode. In the middle-sized pro-gram Rachota, it is slightly larger. In the somewhat larger program JabRef, it is sig-nificantly smaller. A potential explanation is that while our approach selects a test

26

Page 27: Program Analysis and Black-box GUI Testing

sequence for every pair of events with a def-use dependency, it selects only one. An-other major worry of ours (based on earlier experience) concerned the scalability ofthe static analysis which is an essential ingredient of our approach. It seems that withour choice of a lightweight static analysis, we have identified a sweetspot in the pre-cision/cost tradeoffs. Looking at the generation time and the generation time per testsequence, we conclude that the overhead incurred by the static analysis is acceptable.In summary, the answer to the research question Q1 is Yes: Our approach does scaleto realistic GUI programs.

(a) JabRef (b) FreeMind

(c) Rachota (d) TerpWord

Fig. 2.13: In each Venn diagram, the set marked EFG-2 consists of the test sequences generated withconfiguration EFG-2 (i. e., by the black-box approach). The set marked EDG-2 consists of the test se-quences generated with configuration EDG-2 (i. e., by our approach). Among the test sequences inEFG-2 our approach discards all those that are not in the intersection (they are not relevant test se-quences).

Regarding research question Q2, we refer to Figure 2.13. In each Venn diagram,the set marked EFG-2 consists of the test sequences generated with configurationEFG-2 (i. e., by the black-box approach). The set marked EDG-2 consists of the testsequences generated with configuration EDG-2 (i. e., by our approach). Among thetest sequences in EFG-2 our approach discards all those that are not in the intersec-tion. Among all test sequences in EFG-2, only the ones that are also in EDG-2 (i. e.,in the intersection) are justifiably relevant (because they are known to contain a pairof two events with a def-use dependency). All other test sequences in EFG-2 are irrel-evant, i. e., their selection is not based on the formal criterion of relevance and, hence,they are discarded by our approach. In JabRef only 540 out of the 43,017 event flowsequences in EFG-2 are relevant, i. e., 99% of the event flow sequences in EFG-2are irrelevant and discarded by our approach. For the other AUTs, the numbers aresimilar: 95% of the event flow sequences in FreeMind; 92% for Rachota; 97% for

27

Page 28: Program Analysis and Black-box GUI Testing

TerpWord. In summary, the answer to the research question Q2 is Yes: Our approachdiscards an interesting number of irrelevant event flow sequences.

(a) JabRef

607

1045

5387

2010

450

281

117

46

1

0 2000 4000 6000

2

3

4

5

6

7

8

9

10

(b) FreeMind

(c) Rachota

92

171

1383

908

141

0 500 1000 1500

2

3

4

5

6

7

8

9

10

(d) TerpWord

Fig. 2.14: The distribution of the test sequences obtained from the event dependency sequences of length2. The y-axis stands for the parameter which is required for the black-box approach in order to generatethe test sequences. For example, 2,656 test sequences out of the (5,860) test sequences generated byour approach with parameter n = 2 are generated by the black-box approach only if the parameter isincreased to n = 5.

Regarding research question Q3, we refer to Figure 2.14. The y-axis stands forthe parameter which is required for the black-box approach in order to generate thetest sequences. For example, 2,656 test sequences out of the (5,860) test sequencesgenerated by our approach with parameter n = 2 are generated by the black-boxapproach only if the parameter is increased to n = 5. A similar fact holds true foreach of the programs: For a rather large fraction of the test sequences generated by ourapproach, one has to increase the parameter to n = 4 or n = 5. To answer the researchquestion Q3, one has to increase the parameter to n = 7 (for JabRef), to n = 10(for FreeMind), to n = 6 (for Rachota and TerpWord) for the black-box approach inorder to generate all the test sequences that are generated by our approach with theparameter set to n = 2.

28

Page 29: Program Analysis and Black-box GUI Testing

0

10

20

30

40

50

60

0 10000 20000 30000 40000 50000

% lin

e c

overa

ge

# test sequences

(a) JabRef

0

10

20

30

40

50

60

0 2000 4000 6000 8000 10000 12000

% lin

e c

overa

ge

# test sequences

(b) FreeMind

0

10

20

30

40

50

60

70

0 200 400 600 800 1000 1200 1400 1600

% lin

e c

overa

ge

# test sequences

(c) Rachota

0

10

20

30

40

50

60

0 500 1000 1500 2000 2500 3000 3500

% lin

e c

overa

ge

# test sequences

(d) TerpWord

Fig. 2.15: The trend of the achieved coverage. The x-axis indicates the number of executed test se-quences. The y-axis indicates the line coverage. The black line represents the coverage of EDG-2 testsequences; the gray line represents the coverage of EFG-2 test sequences. We observe that the config-uration EDG-2 does not loose coverage. Moreover, the configuration EDG-2 can achieve coverage ofEFG-2 significantly faster.

Regarding research question Q4, we refer to Figure 2.15. The x-axis indicates thenumber of executed test sequences. The y-axis indicates the line coverage. The blackline represents the coverage of the EDG-2 test sequences, the gray line represents thecoverage of the EFG-2 test sequences. Since our static analysis recognizes event de-pendencies in depending libraries, we consider coverage as the sum of the coverageof the GUI program itself and its depending libraries. In order to plot a fair coveragetrend (the order of the executed test sequences matters) we followed this procedure:First, we executed each test sequence and measured its achieved coverage. Second,we put the coverage results in a coverage sequence. Third, we randomly modified theorder in the coverage sequence k times using different seeds, where k is the numberof all test sequences. Furthermore, we calculated the coverage trend of each modifiedcoverage sequence. Fourth, in Figure 2.15 we reported the average of all coveragetrends. We observe that the configuration EDG-2 does not loose coverage comparingto the configuration EFG-2. Moreover, we observe that EDG-2 can achieve the samecoverage of EFG-2 significantly faster (in terms of the number of executed test se-

29

Page 30: Program Analysis and Black-box GUI Testing

quences). For JabRef, the EDG-2 needs 5,860 test sequences to achieve the coverageof EFG-2, which needs 43,017 test sequences. Rachota makes an exception: EDG-2achieves a slightly higher coverage with a slightly higher number of test sequencesthan EFG-2. Note that the EDG is larger than the EFG; see Figure 2.11. In summary,the answer to the research question Q4 is Yes: Our approach does achieve the samecoverage even in the cases when the number of generated test sequences is smallerthan in the black-box approach.

The answer to the research question Q5 is Yes. Our approach is able to findbugs. In JabRef we detected two bugs only with the configuration EDG-2, andone bug with the configuration EFG-2 and EDG-2. In Rachota we detected onebug only with the configuration EDG-2. In the following we sketch one bug de-tected in JabRef using our approach. In JabRef the following test sequence causesan ArrayOutOfBoundsException:(1) In the main window, click Manage content selectors , which opens a new di-alog; (2) switch to the main window and choose Close database ; (3) switch backto the dialog and click OK . The error occurs, because the newly opened dialog startsmodeless which allows the user to close the database, although the dialog still allowsthe user to modify the database. Note that JabRef does not show an error message.Instead, the stack trace of the ArrayOutOfBoundsException is printed on thestandard output (stdout). We reported all detected bugs to the corresponding devel-opers. Furthermore, all bugs have been fixed in the following version of the programs.

Managecontent selectors

OK

Close Database

(a) EFG

Managecontent selectors

OK

Close Database

(b) EDG

Fig. 2.16: Snippets of the EFG and EDG of JabRef. The edges marked in red in the EFG and in theEDG are used to generate the test sequence that detects the bug.

Figure 2.16 shows a snippet of the EFG and the EDG of JabRef that correspondsto the detected bug. In the test sequence generation for event Close database ourapproach detects the def-use dependency to event OK . This dependency consists ofa field for JabRef’s metadata, which is defined in Close database and used inOK . Thus, the event dependency sequence 〈 Close database , OK 〉 is generated.

30

Page 31: Program Analysis and Black-box GUI Testing

This event dependency sequence is transformed into a test sequence, since there ex-ists no corresponding path in the EFG. The shortest path from an initial event toClose database , and from Close database to OK is computed and inserted

to the sequence. The resulting test sequence is: 〈 Manage content selectors ,Close database , Manage content selectors , OK 〉. The EFG-2 approach does

not detect this bug.

2.4 Discussion

In this chapter we have presented a new approach to select relevant event sequencesamong the event sequences selected by a black-box approach. The results of the eval-uation indicate that the approach scales to large programs and, at the same time, leadsto an informed selection of event sequences. In this section we discuss the steps EventHandler Extraction, Static Analysis, and Test Sequence Generation of our approach.

2.4.1 Event Handler Extraction

The step Event Handler Extraction represents a dynamic approach (i. e., it executes aGUI program in order to extract event handlers). In principle, it would be possible toextract these event handlers in a static approach (e. g., by analyzing the source code).However, since Java code is written in so many ways, a static analysis technique mustbe tailored to comprehend how a GUI is built. For example, event handlers might alsobe registered using callbacks, virtual function calls, or even external resource files.

2.4.2 Static Analysis

When constructing the EDG, the step Static Analysis does not consider potentialaliasing of fields or potentially infeasible control-flow. Furthermore, Java distin-guishes between instance fields and class fields, which are treated the same way inour static analysis. That is, both class fields and instance fields are mapped to theircorresponding class. Moreover, instance fields are not mapped to their objects. Thestatic analysis does not distinguish between calls of instance methods and class meth-ods, and thus, is not reliable regarding polymorphism. Hence, the resulting EDG isan approximation of the dependencies between fields. However, we are interested inprioritizing events, so a lightweight static analysis is sufficient while leaving roomfor further in-depth analyses.

2.4.3 Test Sequence Generation

The step Test Sequence Generation transforms each event dependency sequence intoa test sequence by finding the shortest paths in the EFG as explained in Section 2.2.

31

Page 32: Program Analysis and Black-box GUI Testing

Thus, for a set of test sequences, the same prefix and suffix path in the EFG could beused several times. However, we argue that this does not present a drawback for tworeasons: (1) If the prefix or the suffix path consists of a relevant event sequence, thenthis sequence will be separately considered as an event dependency sequence (andlater on as a test sequence). (2) If the prefix or the suffix path consists of an irrelevantevent sequence, then this sequence simply serves for the purposes of making an eventdependency sequence executable. However, a possible extension to the step Test Se-quence Generation is to consider previously “unused” pre- and suffix paths. A furtherextension to the step Test Sequence Generation is to prefer pre- and suffix paths thatdo not “overwrite” fields in the event dependency sequence of a test sequence.

The overall code coverage reported in our experiment is relatively low for sev-eral reasons. For instance, key strokes (KeyListener in Java) and mouse gestures(MouseListener in Java) are not yet considered, but frequently used in the pro-gram FreeMind in order to draw a mind-map via mouse interactions. Support forthese events in the GUI Ripper and Replayer is scheduled for future releases of GUI-TAR [55]. Moreover, the use of random input data in the GUI Ripper and Replayermay only lead to the execution of default branches [83].

It is not possible to combine an EFG and an EDG into one graph-based model:On the one hand it may be possible to label a directed edge (e1,e2) in the EFG witha weight of the dependency (e. g., zero in case of a non-dependency). On the otherhand, a directed weighted edge (e3,e4) has to be added to the EFG, if a dependencyis detected. However, the added edge may represent an event flow that is not allowedin the GUI.

32

Page 33: Program Analysis and Black-box GUI Testing

3

Elimination of Redundant Event Sequences

In Chapter 1 we emphasized that event sequences should be both executable andrelevant, in the sense that as much of the code as possible is covered. In this context,one challenge is the potentially unbounded space of possible user interactions, andhence, the potentially unbounded space of possible event sequences.

Recent approaches to tackle this challenge can be classified as iterative and non-iterative approaches. Iterative approaches (e. g., [30, 31, 47, 48]) generate test caseson-the-fly, and test cases can be executed after their generation. In order to keepthe approach practical, the generation and execution time is usually limited with atimeout. Complementary to iterative approaches, non-iterative approaches (e. g., [7,52]) generate the entire suite of test cases before their actual execution, which hasthe advantage that a complete set of test cases (e. g., all event sequences of a fixedlength) is generated. In this context, most of the work concentrated on black-boxapproaches is to obtain executable test cases. For example, Belli [13] proposed thenotion of Event Sequence Graphs (ESGs), and Memon [52] proposed the notion ofEvent Flow Graphs (EFGs) to approximate the (black-box) behavior of the GUI.In addition to black-box approaches, white-box approaches have been proposed forGUI testing (e. g., [25, 60, 66]). However, it is open whether a white-box approachcan realistically infer relevant event sequences that are executable on the GUI.

The approach presented in Chapter 2 aims at combining the best of two worlds bycombining black-box and white-box techniques to identify both executable and rele-vant sequences. The white-box part selects “skeletons” of event sequences based onpairs of events that are in a def-use relationship. Such pairs of events are representedby an Event Dependency Graph (EDG) that is computed statically. The black-boxpart “fills” this skeleton with events such that the overall sequence becomes exe-cutable. Roughly speaking, this approach combines what one “should test” (eventsthat depend on each other identified with the EDG) with what one “can test” (eventsthat are executable identified with the EFG). While this approach has shown promis-ing performance compared to black-box testing, it only considers pairs of def-use

33

Page 34: Program Analysis and Black-box GUI Testing

events for the white-box part. However, longer event sequences are often desired todetect more complex bugs [10, 14, 76]. Furthermore, although only pairs of eventshave been considered, the number of resulting test cases is still huge, which limits itspractical applicability.

In this chapter we provide a generalization of the approach presented in Chapter 2.The generalization is not restricted to pairs, but can handle an arbitrary number ofdependent events while guaranteeing that all relevant event sequences are generatedbased on these events. To make this generalization scalable in practice, we present arefined static analysis approach based on a variant of program slicing [72] to reducethe resulting number of test cases. In more detail, we present an approach that allowsus to identify redundant event sequences which can safely be ignored for the testcase generation, because non-redundant sequences already guarantee the same codecoverage. Our experiments confirm its practical potential on a number of real-worldopen source GUI programs. In particular, our experiments demonstrate that redundantevent sequences occur in a huge number of various open source GUI programs. Thereduced number of generated test cases can strongly reduce the overall execution timeof the remaining test cases.

The remainder of this chapter is organized as follows. Next, we motivate the ap-plicability of our approach using an example of a GUI program. Section 3.2 formallyintroduces the concept of redundant event sequences, and Section 3.3 presents itsactual implementation on a technical level based on program slicing. We evaluatethe new approach in Section 3.4, which is followed by a discussion of its results inSection 3.5.

3.1 Motivation

The approach presented in Chapter 2 has shown its potential on several GUI pro-grams. However, in practice, longer event sequences than those handled by this ap-proach are often desired [10, 14, 76]. Considering longer event sequences in turnleads to a huge number of overall test cases. In the following, we provide a simplemotivating example, which is inspired by a real-world GUI program to show theseproblems in more detail, and to outline our approach to safely reduce the resultingnumber of test cases.

Consider the example of a GUI program in Figure 3.1, which is inspired by theTerpPaint program [49]; see Section 3.4 for more details. The window offers a userto modify a recently opened image. To modify the image, the user may click thecheckbox in order to convert the image to grayscale, and may rotate the image bychoosing an angle from the slider control. Furthermore, an existing angle can beloaded from the user settings (using the button Load), and a new angle can be savedas a user setting (using the button Save). The button OK performs the modification to

34

Page 35: Program Analysis and Black-box GUI Testing

the image and closes the window; the button Cancel closes the window without anymodifications to the image.

Fig. 3.1: An example program inspired by TerpPaint.

1 class ModifyImageWindow extends JFrame {2

3 boolean convert = false;4 int angle = 0;5

6 void onCheckBox() {7 int cbValue = checkBox.getValue();8 convert = (1 == cbValue) ? true : false;9 }

10

11 void onSlider() {12 int sliderValue = slider.getValue();13 angle = sliderValue;14 print(convert, angle);15 }16

17 void onSave() {18 UserSettings.RotationAngle = angle;19 }20

21 void onOK() {22 if ( convert ) {23 image.convertToGrayscale();24 image = null;25 }26 if ( angle > 0 ) {27 // BUG: crashes if image was converted to grayscale28 image.rotate(angle);29 }30 }31 }

Fig. 3.2: The event handlers extracted from the example program. The class ModifyImageWindowdefines the event handlers onCheckBox, onSlider, onSave, and onOK.

35

Page 36: Program Analysis and Black-box GUI Testing

Figure 3.2 shows a snippet of the code that describes the GUI program. The pro-gram contains event handlers that define the behavior of the GUI in case a corre-sponding event (i. e., interaction with the user) occurs. We will use the terms eventand event handler interchangeably when the meaning is clear from the context. Inthis example, there are the four event handlers onCheckBox, onSlider, onSaveand onOK. onCheckBox reads the current value from the checkbox and assigns itsvalue to the field convert. onSlider reads the current angle from the slider con-trol and assigns the value to the field angle. It also prints the current values to alog. onSave saves the current angle as a user setting. onOK converts the image tograyscale and resets the image object; furthermore it rotates the image by the givenangle.

As indicated in Figure 3.2 (line 27-28), the GUI program contains a bug that canoccur if the event handler onOK is executed after the event handlers onCheckBoxand onSlider: If onOK is executed, and the field convert is true (set byonCheckBox), and also the field angle is greater than zero (set by onSlider),then the image object is set to null, causing a NullPointerException in line 28.

onCheckBox onSlider onSave onOK

Fig. 3.3: Event Dependency Graph (EDG) for the example program. Each edge expresses a def-use(a write/read) dependency: For example, the event onCheckBox defines (writes) the field convert,which is used (read) in the event onSlider and in the event onOK.

The static analysis approach presented in Chapter 2 does not detect this bug, be-cause it generates test cases as follows: For the given GUI program, an Event Depen-dency Graph (EDG) is computed that reflects the def-use dependencies of the events.The nodes of the EDG are defined as the events of the GUI program, and there is anedge between two events if these events are in def-use relationship. For our concreteexample, this graph is depicted in Figure 3.3.

Based on the EDG, for all pairs of events e and e′ such that there is an edgebetween e and e′, a test case “skeleton” that contains e and e′ is computed. Thissequence is considered relevant, because the events are dependent. In our example,the following five event sequences are generated.

s1 = 〈 onCheckBox , onOK 〉 s4 = 〈 onSlider , onSlider 〉

s2 = 〈 onCheckBox , onSlider 〉 s5 = 〈 onSlider , onOK 〉

s3 = 〈 onSlider , onSave 〉

36

Page 37: Program Analysis and Black-box GUI Testing

Since such skeleton sequences are abstract in the sense that they are not necessar-ily executable on the GUI, a black-box model of the GUI is finally applied (e. g., theapproach proposed in Memon [52]) in order to transform the event dependency se-quence in an executable test sequence. (In our example, all sequences are executable,so there is nothing more to do). At this point, the important observation is that the bugin line 28 is not detected by this approach, as three events (onOK, onCheckBox andonSlider) are needed to detect the bug. Hence, increasing the length from n = 2to n = 3 would clearly detect the bug. However, as the potential number of resultingabstract event sequences becomes huge, techniques to effectively reduce the num-ber of sequences are required. In the following, we outline our approach to identifyredundant event sequences that can be removed while still obtaining the same codecoverage.

3.1.1 Examples for Redundancy

Assume one would like to test the event onOK of the example program in Figure 3.1.Furthermore, assume that the following two event sequences s1 and s2 have beengenerated in order to achieve this.

s1 = 〈 onCheckBox , onSlider , onOK 〉

s2 = 〈 onSlider , onCheckBox , onOK 〉

Both s1 and s2 detect the bug in line 28. In fact, we observe that these se-quences are equivalent in the sense that the execution ordering of onCheckBoxand onSlider does not matter for the execution of onOK (even though there is anedge in the EDG from onCheckBox to onSlider). Hence, one of these sequencesis redundant and can be ignored. This is essentially a simple variant of partial orderreduction [27, 29] applied to GUI testing. We will make this precise in the next sec-tion.

As a further example of redundant event sequences, assume one would like to testthe event onSave of the example program. Further assume that the following eventsequence s3 has been generated in order to achieve this.

s3 = 〈 onCheckBox , onSlider , onSave 〉

Although there is an edge in the EDG from onCheckBox to onSlider,and from onSlider to onSave, there is no “causal” data-flow from the firstto the third of these events. This is because the variable that causes the data-flow from onCheckBox to onSlider (i. e., the variable convert is defined by

37

Page 38: Program Analysis and Black-box GUI Testing

onCheckBox and used by onSlider) is different from the variable that causes thedata-flow from onSlider to onSave (i. e., the variable angle). Hence, the globaleffect of s3 after executing onSave is completely independent of the onCheckBoxevent, and it suffices to consider the shorter sequence 〈 onSlider , onSave 〉 insteadof s3 to test the onSave event. In this sense, s3 is redundant. (As a side remark, thevariables cbValue and sliderValue are local and can hence be ignored for theseconsiderations). Informally speaking, this problem of “pseudo-dependency” arisesbecause the EDG is computed statically and syntactically, without a deeper analysisof the actual causal data-flow [72]. In the next sections, we present approaches tosystematically identify such redundant sequences, and present effective and efficientways for their implementation based on program slicing.

3.2 Redundant Event Sequences

In this section we first introduce a generalization of the test case generation algorithmpresented in Chapter 2. Then, we present two approaches to identify redundant testcases when applied in the context of our generalized algorithm.

3.2.1 Test Case Generation Algorithm

In this section we formalize the ideas of the previous motivation section, and providean algorithm for test case generation as a result. To introduce our approach, we iden-tify a GUI program through its corresponding finite set of events Evts. Let us startour considerations with a formal definition of the Event Dependency Graph, whichwas previously introduced in Chapter 2.

Definition 1 (Event Dependency Graph (EDG)). Let Evts be a finite set of events.The Event Dependency Graph for Evts is defined as the directed graph EDG= (V,E),where V =Evts is the set of nodes, and there is an edge (e,e′)∈E iff there is a def-usedependency between e and e′ (i. e., there is a variable that is defined by e and used bye′).

For a given GUI program, the Event Dependency Graph reflects the pairwise def-use dependencies of its events. EDGs reflect an important concept in identifying rele-vant test cases (i. e., test cases that “should be tested”). In the following, we formalizethis intuition based on the definition of relevant EDG-sequences.

Definition 2 (Relevant EDG-Sequence). Let Evts be a finite set of events, and letn ∈ N be a natural number. Then, EDG(n) is defined as the set of event sequences〈e1, . . . ,en〉 of length n such that for all i ∈ {1, . . . ,n−1}, there is a j ∈ {i+1, . . . ,n}and an edge from ei to e j in the EDG.

38

Page 39: Program Analysis and Black-box GUI Testing

Informally speaking, for n ∈ N, the set EDG(n) contains exactly those event se-quences that are causally linked in the sense that events occurring “later” in the se-quence can be influenced by events that occur “earlier”. This definition of relevantEDG-sequences actually describes the “heart” of the generalization of the test casegeneration algorithm; see Chapter 2. More precisely, for a given n ∈N, test cases aregenerated according to the following procedure GenerateRelevantTestCases:

1. For all i ∈ {1, . . . ,n}, compute the set EDG(i) that contains all relevant EDG-sequences of length i.

2. For all sequences s∈ EDG(i): If s is not executable, use a black-box model of theGUI (e. g., [13, 52]) to enhance s such that it becomes executable.

At this point, we observe that the previously proposed static analysis approachis a special case of this algorithm for n = 2. It generates all relevant test cases oflength ≤ n. However, generating all of these test cases results in a huge number thatexceeds the number that can be handled in reasonable resource limits even for smalln. To tackle this problem to become scalable in practice, we present two approachesto identify redundant event sequences.

3.2.2 Partial Order Redundancy

Partial order reduction (POR) [27, 29] is a well-established approach to tackle thestate explosion in the area of model checking. Most of the POR-based techniquesare state-dependent, which means that the actual pruning decisions depend on thecurrently encountered state.

In this section we apply a state-independent POR-based technique to identify aclass of redundant event sequences for GUI testing. This technique is based on thesimple observation that two events that are independent in the sense that they can beapplied in both possible orderings with the same global effect need not be consideredin both, but only in one of these orderings. This idea has been investigated underthe name commutativity pruning in the area of Artificial Intelligence [37]. In thefollowing, we formalize this intuition.

Definition 3 (POR-Redundancy). Let Evts be a finite set of events, let < be a totalordering on Evts. Let n ∈ N and s = 〈e1, . . . ,en〉 ∈ EDG(n). If

1. the values of the variables that are used by en are the same after executinge1, . . . ,en−1 in all possible orderings, and

2. the first n−1 events do not occur in the “correct” ordering according to < (i. e.,there are events e and e′ with {e,e′} ⊆ {e1, . . . ,en−1} such that e′ occurs before ein s, but e < e′)

then s is POR-redundant w.r.t. the ordering <.

39

Page 40: Program Analysis and Black-box GUI Testing

Informally speaking, an event sequence s = 〈e1, . . . ,en〉 ∈ EDG(n) can be POR-redundant if the execution ordering of the first n− 1 does not matter for the finalevent en. In such cases, it suffices to consider only one out of (n−1)! possible eventsequences. Definition 3 formalizes this idea based on the ordering < by implicitlydeclaring (n−1)!−1 sequences as POR-redundant, and retaining one representativesequence as relevant (i. e., as not POR-redundant). We will describe how to actuallyinstantiate and implement this definition on a technical level in the next section.

For now, to get a better idea of POR-redundancy, consider again the first exampleof the motivation section restricted to the events Evts= {onCheckBox, onSlider,onOK}. Suppose we choose the ordering onCheckBox < onSlider < onOK.According to <, and because onCheckBox and onSlider can be executed in bothpossible orderings to test onOK, the sequence 〈 onSlider , onCheckBox , onOK 〉 isPOR-redundant. Generally, we observe that POR-redundant event sequences havethe property that the test case generation algorithm given in the previous section canignore such sequences without reducing code coverage. More formally, for a finiteset of events Evts and for a total ordering < on Evts, the sets of generated test caseswith algorithm GenerateRelevantTestCases based on the event sequences

EDG(n) and EDG(n) \ {s | s is POR-redundant w.r.t. <}

are equivalent for all n ∈ N.

3.2.3 Causal Variable Redundancy

To introduce our second technique to identify redundant event sequences (and hence,to reduce the number of test cases), let us first re-consider the second example ofthe motivation section. On a more formal level, the example consists of the eventsequence s = 〈e1,e2,e3〉 ∈ EDG(3). We observe that s is “chain-shaped” in the sensethat there is an edge in the EDG between e1 and e2, there is an edge between e2 ande3, and there are no further edges. Recall that s is executable.

In this example we observe that the variables causing the def-use dependency ofe1 and e2 are disjoint from the variables causing the def-use dependency of e2 and e3.Hence, despite the edge from e1 to e2, the effect of executing e3 is independent of theeffect of e1.

Therefore, s does not need to be considered, because it suffices to consider theshorter sequences s′ = 〈e2,e3〉 and s′′ = 〈e1,e3〉 to test e3. To formalize the idea ofthis reduction technique, we need to talk about the variables that cause def-use de-pendencies.

Definition 4 (Labels of Edges). Let Evts be a finite set of events, let e,e′ ∈ Evts. Theset of associated labels labels(e,e′) to e and e′ is defined as the set of variables thatare defined by e and used by e′.

40

Page 41: Program Analysis and Black-box GUI Testing

Note that the above definition can be considered as the implicit labeling of theedges in the EDG, because edges are represented as pairs of events. Based on thisdefinition, we formalize the idea of redundant sequences discussed above.

Definition 5 (CVA-Redundancy). Let Evts be a finite set of events, and let EDG =(Evts,E) be the Event Dependency Graph for Evts. Furthermore, let n ∈ N, and lets= 〈e1, . . . ,en〉 ∈EDG(n) such that there are edges (ei,ei+1)∈E for i∈{1, . . . ,n−1}in the EDG, and there are no further edges between events in s. If there exist eventse,e′,e′′ with {e,e′,e′′} ⊆ {e1, . . . ,en} such that

1. (e,e′) ∈ E, (e′,e′′) ∈ E, and2. labels(e,e′)∩ labels(e′,e′′) = /0,

then s is CVA-redundant (i. e., redundant with respect to causal variable analysis).

In the above example, the sequence s is CVA-redundant, because labels(e1,e2)∩labels(e2,e3)= /0. Similarly to POR-redundant sequences, CVA-redundant sequencesneed not be considered w.r.t. code coverage. More formally, for a finite set of eventsEvts and the Event Dependency Graph EDG = (Evts,E) for Evts, the sets of gen-erated test cases with algorithm GenerateRelevantTestCases based on the event se-quences

EDG(n) and EDG(n) \ {s | s is CVA-redundant}

are equivalent for all n ∈ N.

3.3 Slicing-based Implementation

In the previous section we have provided the theoretical background to identify re-dundant event sequences. In this section we put these theoretical notions of redun-dancy into practice. A central concept of our implementation is the concept of pro-gram slicing [72]. We particularly address the questions of how to compute slices (inparticular, how to compute interprocedural slices), and how to effectively and effi-ciently implement the two notions of redundancy. The overall implementation and itscomponents are depicted in Figure 3.4 and integrated into the tool Gazoo [26]. In thefollowing, we provide a more detailed description.

3.3.1 Soot

In a first step, the Soot component performs a source-to-source transformation of theinput GUI program to an appropriate intermediate format that serves as input for thebody transformer. In particular, the intermediate format is supposed to appropriatelyreflect the required information about defs (definitions) and uses for the further anal-ysis (e. g., the bytecode analysis tool ASM [9] is not appropriate for our purpose,

41

Page 42: Program Analysis and Black-box GUI Testing

since we would have to build an interpreter, which analyzes the stack of bytecodeinstructions [46]). The defs and uses are necessary in order to compute the programslices in a later step.

Program'Slicer'

Sequence'Generator'

Body'Transformer'

GUI'Ripper'

Event'Handler'Extractor'

GUI'P

rogram

'

EDG$

EFG$

Bytecode$

GUI$

Test$Sequences$

$Bytecode,$GUI$

Event$Handler$

Test$Cases$pass$ fail$

Soot'

Jimple$Code$

Defs,$Uses$

Replayer'GUI$

Slices$

r0$=$@this;$

defs$=${x}$uses$=${y}$$$

Fig. 3.4: Our implementation is based on three main components: (1) The Body Transformer collects alldefs and all uses from each (method) body. (2) The Program Slicer constructs the EDG data structureand creates a def-use chain (i. e., a program slice) for each field definition (field-def). (3) The SequenceGenerator generates a set of event sequences from the EDG and uses the slices to eliminate redundantevent sequences.

42

Page 43: Program Analysis and Black-box GUI Testing

To compute such an intermediate format, our implementation applies the toolSoot [45]. Soot takes the bytecode [46] of the GUI program as input, and transformsit to Jimple code [70]. Jimple code is three-address code, which is an intermediaterepresentation of bytecode that simplifies the analysis. Most importantly, Soot pro-vides an API to extract the defs (definitions) and uses of each statement in a method.A statement is called Unit in Jimple code.

3.3.2 The Body Transformer

The body transformer takes as input the Jimple code, and returns the set of defs anduses of fields and (local) variables of the GUI program. This information is laterused in the program slicer in order to create backward slices [72] of fields. Roughlyspeaking, a backward slice of a def expresses which previous uses in the Jimple code(and thus, in the Java code) affect the definition of the field. To generate the necessaryinformation about defs and uses, the body transformer analyzes the units of the Jimplecode (which are comparable to statements in Java code). For each unit u, the defs anduses are extracted and maintained in a mapping Defs(u) and Uses(u), respectively.

As an example, Figure 3.5 shows the Jimple code for the event handler onSliderpresented in the motivating example; see Figure 3.1. Here, the body transformercollects the following defs (in the following depicted in red boxes): r0 (line 6),$r1 (line 7), i0 (line 8), r0.angle (line 9), $i1 (line 10), and $i2 (line 11).

Furthermore, the body transformer collects the following uses (depicted in greenboxes): @this (line 6), r0.slider (line 7), $r1 (line 8), i0 (line 9),r0.convert (line 10), r0.angle (line 11), $i1 and $i2 (both line 12).

1 void onSlider() {2 ModifyImageWindow r0;3 int i0, $i1, $i2;4 Slider $r1;5

6 r0 := @this;7 $r1 = r0.slider;8 i0 = $r1.getValue();9 r0.angle = i0;

10 $i1 = r0.convert;11 $i2 = r0.angle;12 r0.print($i1, $i2);13 }

Fig. 3.5: Jimple code of the event handler onSlider of the example program. The code snippet con-sists of 10 units (i. e., line 2-4 and line 6-12) used to extract defs and uses, and to compute backwardslices.

43

Page 44: Program Analysis and Black-box GUI Testing

3.3.3 The Program Slicer

The program slicer takes as input the event handlers (that come from an additionalcomponent that extracts them accordingly), and the set of defs and uses from the bodytransformer. Then, the slicer both constructs the EDG (according to the techniquepresented in Chapter 2) and computes the set of field slices of the GUI program. Forthe computation of the slices, we have to take into account that an edge in the EDGreflects a def-use dependency between two events, rather than between two fields inthe events. Unlike a (local) variable, the scope of a field is not restricted to only onemethod. In order to provide the dependency information of fields, we apply backwardslicing, i. e., we compute the backward slice for each field definition (field-def). Toeffectively and efficiently compute backward slices, we address the two followingadditional technicalities.

For example, for the Jimple code depicted in Figure 3.5, the slicer computes thebackward slice for the field-def r0.angle in line 9. By looking up the uses of thisfield-def in Uses(u), the slicer detects that this field-def depends on the use of thevariable i0 . Hence, i0 is added to the backward slice. Afterwards, the slicer con-tinues to look up a possible definition of the variable i0 by iterating over all unitsin the set Defs(u). It finds the definition in line 8 and detects, again by looking upthe set Uses(u), that this definition depends on the use of the variable r1 . Hence,r1 is added to the backward slice as well. The slicer continues to look for further

dependencies and only stops if a certain definition is not found, or the scope of theanalysis is reached (i. e., the classpath of the program). In our example, the backwardslice for the field-def r0.angle consists of i0 , r1 , and r0.slider .

Interprocedural Slicing

Slicing programs that call methods represents a challenging task [38]. In our imple-mentation, we follow both possibly called methods and possible callees of the currentmethod. For example, if a variable is defined by the return value of a method, then theslicer follows the method and performs a backward slice of the return statement ofthe called method. Moreover, the slicer follows the parameters of the called methodback to the callee of the method. For example, we follow the method getValuecalled in line 8 in Figure 3.5.

Noise Reduction

The program slicer additionally reduces possible “noise”, which can appear whenanalyzing code close to the bytecode level. For example, when computing backwardslices, the dedicated @this statement in the Jimple code can safely be ignored. Thisstatement maps each variable of the method to the current object used in the method.For our approach, this mapping is not needed, since we are not interested in slicing

44

Page 45: Program Analysis and Black-box GUI Testing

individual objects [1, 43] that may appear during run-time. Thus, our implementationdrops this information during the creation of backward slices.

3.3.4 The Sequence Generator

In the previous section we have discussed how the body transformer and the programslicer compute slices of events. In this section we will discuss how the slices areapplied by the sequence generator component in order to eliminate redundant eventsequences.

The sequence generator takes as input the EDG and the slices from the programslicer (as well the Event Flow Graph from the GUI Ripper; see the auxiliary sectionbelow). It returns the set of test sequences that can be executed by the replayer com-ponent. In a first step, the sequence generator uses the EDG in order to generate allsequences of a fixed length n as defined in Section 3.2. In a second step, the sequencegenerator uses the slices and performs the computation and elimination of redundantevent sequences. (In a future version of the implementation, we plan to combine thesetwo steps into one single step).

Computation of POR-redundant Sequences

In order to detect whether the ordering of two events e and e′ does not matter to testan event e′′ in a sequence, the sequence generator analyzes if the intersection of allbackward slices of field-defs from e and from e′ is empty. If this is the case, then eand e′ are independent, and thus, the sequence 〈e,e′,e′′〉 is equivalent to the sequence〈e′,e,e′′〉.

For example, in order to test the event onOK of the example program, the se-quence generator may first select the two sequences s1 = 〈 onCheckBox , onSlider ,onOK 〉 and s2 = 〈 onSlider , onCheckBox , onOK 〉. These two event sequences are

generated because the event onCheckBox defines the field convert and the eventonSlider defines the field angle, which are both used in the event onOK. Af-terwards, the backward slice of the field convert in the event onCheckBox andthe backward slice of the field angle in the event onSlider are analyzed. In thisexample we get the following result (for brevity, we use the Java code instead of theJimple code for presentation).

convert = { cbValue , checkBox }

angle = { sliderValue , slider }

We observe that the intersection of convert and angle is empty, which accu-rately reflects the independence of the corresponding fields convert and angle.

45

Page 46: Program Analysis and Black-box GUI Testing

Based on this information, we eliminate one of these sequences. In particular, weobserve once again that although the EDG expresses a def-use dependency be-tween onCheckBox and onSlider, there exists no causal data-flow between theseevents.

Computation of CVA-redundant Sequences

In order to detect whether there exists no causal data-flow between three events ina sequence, say 〈e,e′,e′′〉, the sequence generator analyzes if all backward slices offields in e′ do not contain any field-def of e. If this is the case, then e and e′ are not de-pendent, and thus, the effect of e does not affect e′′. Hence, the sequence is consideredas redundant, because it is sufficient to test the sequences 〈e,e′〉 and 〈e′,e′′〉.

For example, in order to test the event onSave of the example program, thesequence generator selects the sequence s3 = 〈 onCheckBox , onSlider , onSave 〉.This event sequence is generated because the event onCheckBox defines the fieldconvert which is used in onSlider, and the event onSlider defines the fieldangle which is used by onSave. Analyzing the backward slice of the field anglein onSlider provides the following information.

angle = { sliderValue , slider }

We observe that the field angle defined in onSlider does not depend on thefield convert defined in onCheckBox. Hence, the sequence is CVA-redundant,and thus, eliminated from the set of event sequences. In particular, we observeonce again that it suffices to test the event sequences 〈 onCheckBox , onSave 〉 and〈 onSlider , onSave 〉 instead.

Auxiliary Steps

For an explanation of the Auxiliary Steps (i. e., the steps Event Handler Extractor,GUI Ripper, and Replayer), we refer to Section 2.2.

3.4 Experiments

In this section we experimentally evaluate our approach. We first present the setup ofthe experiments. Then, we address a set of research questions and discuss them withthe help of the results of the experiments.

46

Page 47: Program Analysis and Black-box GUI Testing

3.4.1 Setup of the Experiments

We evaluate our approach using stable versions of six Java-based open source pro-grams: JabRef 2.7 [39] manages bibliographic references, FreeMind 0.9 [23] gen-erates mind maps, and Rachota 2.3 [62] is a time recording system. Furthermore,the programs TerpWord, TerpSpreadSheet and TerpPaint form a suite of office pro-grams developed by undergraduate students. For Rachota, TerpWord, TerpSpread-Sheet, and TerpPaint, we use the artifacts available from Community Event-basedTesting (COMET) [19]. In particular, we choose these programs to consider bothsmall (TerpWord: 6,842 LOC) and large programs (JabRef: 77,745 LOC). Figure 3.6shows some relevant statistics for each Application Under Test (AUT).

AUT Events Methods Units Defs UsesJabRef 776 7,930 123,760 5,980 18,206FreeMind 959 8,111 75,547 3,392 8,459

Rachota 154 1,838 24,635 1,399 3,189

TerpPaint 317 1,759 34,323 2,524 8,474

TerpSpreadSheet 312 1,295 15,206 722 2,013

TerpWord 159 965 10,821 368 1,538

Fig. 3.6: Statistical information about the applications under test (AUTs) used in the evaluation of ourapproach. The maximum defs and uses are highlighted in bold.

In our experiments, we compare the performance of our slicing-based approachfor n = 3 (in the following called EDG-3-Sliced) to the baseline EDG-approach with-out slicing (i. e., to the algorithm GenerateRelevantTestCases as described in Sec-tion 3.2 without slicing, called EDG-3 in the following). Moreover, we compare tothe approach proposed by Memon [52] (called EFG-3 in the following). The config-uration EFG-3 internally uses an Event Interaction Graph (EIG) [81], which repre-sents an optimized version of an EFG. Hence, both an EIG and an EDG only containevents that are associated to event handlers in the GUI program, whereas an EFGmay contain events that are not associated to event handlers (e. g., events that openpopup menus). For brevity we stick on the configuration name EFG-3 for the EventInteraction Graph.

In all three configurations, we set the parameter to n = 3. This choice is motivatedby the fact that the number of event sequences generated by a black-box approachalready becomes unmanageable for sequences of length greater than 3 [81]. For ex-ample, generating all sequences of length 4 for the program JabRef would alreadyresult in more than 175 million sequences to be executed, which is clearly too largeto be handled with reasonable resource limits. Still, by setting the parameter to n = 3,the total number of executed test cases in our experiments amounts to 49,378,177.

47

Page 48: Program Analysis and Black-box GUI Testing

Each event sequence is embedded into one test case as described in Section 3.3.The replayer generates random input data (e. g., random string values for text boxes).The computation of suitable input data, which can possibly achieve higher code cov-erage (see [5, 25, 28, 30, 31, 69]) represents an orthogonal problem and is not inthe scope of this chapter. The replayer employs a crash monitor as an oracle, whichis simple but reasonable. In particular, we record any exception occurred during testcase execution. For a discussion of alternative oracles, we refer to [22, 54, 56, 77].

The test cases are executed on a cluster with 50 virtual machines, which representsa rather common experimental setup both in the research community [50] as well asin industry.1 Each virtual machine utilizes one physical CPU-core with 2.0 GHz,1 GB RAM, and 20 GB HDD. In order to mitigate the effect of randomness, allconfigurations are executed three times.

3.4.2 Results of the Experiments

We evaluate our approach with respect to the following research questions.

Is our approach able to eliminate a non-trivial number of redundant eventsequences?

To answer this question, we refer to Figure 3.7, which provides an overview of ourresults. Apparently, for all AUTs, the configuration EDG-3-Sliced is able to elimi-nate a relatively high number of redundant sequences. In particular, for JabRef, theconfiguration EDG-3-Sliced eliminates 69% of the sequences generated by the con-figuration EDG-3. Similarly, EDG-3-Sliced eliminates 74% for FreeMind, 72% forRachota and TerpPaint, 46% for TerpSpreadSheet and 41% for TerpWord. Overall, inall of these GUI programs, the number of sequences is reduced to less than a half, in4 out of 6 programs even to roughly a quarter of the original sequences with EDG-3.

Does our approach scale to realistic GUI programs, particularly w.r.t. theoverhead for computing the slices?

To answer this question, we again refer to Figure 3.7, which particularly shows thenumber of generated event sequences, the overall generation time (i. e., the timeneeded for generating the entire set of event sequences), and the execution time (i. e.,the overall time needed to execute (“replay”) the generated event sequences, mea-sured on a cluster with 50 virtual machines).

First, comparing EDG-3-Sliced to EFG-3, we observe that the number of gen-erated sequences for EDG-3-Sliced is significantly lower for JabRef, FreeMind and

1 Personal communication (mid-sized software company and industrial partner)

48

Page 49: Program Analysis and Black-box GUI Testing

AUT EFG-3 EDG-3 EDG-3Sliced

JabRef# sequences 19,859,369 821,993 255,132generation time (m) 7,943 369 118gen. time per seq. (ms) 24 27 28execution time (d) 133.30 5.50 1.70line coverage (%) 56 58 58# detected bugs

FreeMind# sequences 17,830,612 6,093,201 1,600,638generation time (m) 7,132 2,667 754gen. time per seq. (ms) 24 26 28execution time (d) 123.82 42.30 11.10line coverage (%) 54 54 54# detected bugs - - -

Rachota# sequences 22,588 123,256 34,981generation time (m) 9 50 15gen. time per seq. (ms) 24 24 25execution time (d) 0.08 0.44 0.12line coverage (%) 62 64 64# detected bugs -

TerpPaint# sequences 1,098,639 495,467 138,603generation time (m) 439 213 62gen. time per seq. (ms) 24 26 27execution time (d) 4.06 1.82 0.50line coverage (%) 49 54 54# detected bugs -

TerpSpreadSheet# sequences 40,299 117,938 63,184generation time (m) 16 48 27gen. time per seq. (ms) 24 24 25execution time (d) 0.12 0.34 0.18line coverage (%) 35 38 38# detected bugs

TerpWord# sequences 122,991 415,888 243,398generation time (m) 49 173 105gen. time per seq. (ms) 24 25 26execution time (d) 0.36 1.24 0.72line coverage (%) 56 56 56# detected bugs - - -

Fig. 3.7: Results overview of the experiments. Abbreviations: #sequences: number of generated eventsequences. generation time: overall time (in minutes) needed to generate all sequences (including thetime needed for our slicing technique in EDG-3-Sliced). execution time: overall time (in days) neededto execute (“replay”) all sequences (on a cluster with 50 virtual machines).

49

Page 50: Program Analysis and Black-box GUI Testing

TerpPaint. For Rachota, TerpSpreadSheet and TerpWord, the configuration EDG-3-Sliced generates more sequences than EFG-3, because these programs apparentlycontain more depending events than consecutive (executable) events for sequencesof length n = 3. However, the execution time of EDG-3-Sliced compared to EFG-3is still acceptable, and significantly lower than EDG-3 without slicing in all theseprograms. In particular, compared to EDG-3, the reduced number of sequences withEDG-3-Sliced leads to a significant reduction of execution time. For example, theexecution time for the large AUTs JabRef and FreeMind is reduced by several days(3.8 days for JabRef and 31.2 days for FreeMind). Considering the overall resultingexecution time, FreeMind (where 11.1 days are needed to execute the 1,600,638 testcases) is an outlier. Note that the results confirm our theoretical results from the pre-vious section: With EDG-3-Sliced, we obtain the same code coverage as with EDG-3.We will discuss both execution time and code coverage in more detail in Section 3.5.

Second, considering the overhead to compute program slices, let us have a closerlook at the generation time. We observe that the generation time per sequence ofEDG-3-Sliced compared to EDG-3 and EFG-3 incurred by program slicing is not anissue in practice: For all AUTs, the occurred overhead consists of 2–4 milliseconds.This is probably best explained by Figure 3.8, which depicts the lengths of the back-ward slices occurred in the analysis: While there are some outliers (up to length 304for TerpPaint), the mean length of the slices is short (i. e., in a range of 5 to 7) andcan be computed efficiently.

AUT Min Median Mean MaxJabRef 1 3 7 191FreeMind 1 3 6 217Rachota 2 4 6 206TerpPaint 1 3 7 304TerpSpreadSheet 2 3 5 58

TerpWord 2 3 6 31

Fig. 3.8: Statistical information about the lengths (number of units) of the backward slices computedfor the AUTs in our experiments.

What is the proportion of POR-redundant and CVA-redundant sequencescompared to the overall number of redundant sequences?

To answer this question, we refer to Figure 3.9, which shows the corresponding pro-portions graphically for all AUTs. In this figure, sequences that are neither POR-redundant nor CVA-redundant are called relevant. We observe that CVA-redundantsequences occur in 36%-55% of the cases, whereas POR-redundant sequences range

50

Page 51: Program Analysis and Black-box GUI Testing

from 18%-25%. That is, causal event sequences of length 3 do not appear frequentlyin the GUI programs of our experiments. A potential explanation is that most of theevent handlers are coherent and not strongly coupled to other events. For example,it is likely that the events in a search-and-replace window do not affect the events inthe preference window, and vice versa. Overall, in the considered applications undertest, CVA-redundant sequences occur slightly more often than POR-redundant ones.

POR18 %

CVA52 %

Relevant31 %

(a) JabRef

POR26 %

CVA48 %

Relevant26 %

(b) FreeMind

POR24 %

CVA48 %

Relevant28 %

(c) Rachota

POR25 %

CVA47 %

Relevant28 %

(d) TerpPaint

POR16 %

CVA31 %

Relevant53 %

(e) TerpSpreadSheet

POR14 %

CVA28 %

Relevant58 %

(f) TerpWord

Fig. 3.9: The proportions of redundant sequences eliminated by the techniques POR and CVA, as wellas the proportion of remaining relevant sequences.

How do the lengths of generated event sequences with EDG-3-Sliced compareto those with EDG-2?

To answer this question in more detail, we refer to Figure 3.10, which shows thelength of the sequences (x-axis) and the number of generated event sequences (y-axis, in log scale). We present detailed data for the small program TerpWord and forthe two large programs JabRef and FreeMind (the results for the other three programsare similar). The black points indicate the number of sequences generated by EDG-3-Sliced; the gray points indicate the number of sequences generated by EDG-2.We observe that for most n, EDG-3-Sliced produces significantly more sequencesof length n. Furthermore, we observe that EDG-3-Sliced handles significantly longer

51

Page 52: Program Analysis and Black-box GUI Testing

sequences than EDG-2. For example, in FreeMind, the length of the longest generatedsequences increases from 10 (EDG-2) to 17 (EDG-3-Sliced). Note that, in order tocapture all these sequences with a black-box approach, the parameter n for EFG-nwould have to be increased even more, causing a much larger number of overall testcases.

110

1001,000

10,000100,000

3 4 5 6 7 8 9 10 11

(a) JabRef

110

1001,000

10,000100,000

3 4 5 6 7 8 9

(b) TerpWord

110

1001,000

10,000100,000

1,000,000

3 5 7 9 11 13 15 17

(c) FreeMind

Fig. 3.10: The distribution of sequence lengths. The x-axis stands for the length of the sequences. They-axis (in log scale) represents the number of generated sequences. The black points indicate the se-quences generated by EDG-3-Sliced; the gray points indicate the sequences generated by EDG-2.

Is our approach effective in detecting bugs?

To answer this question, we again refer to Figure 3.7. Overall, we detected four bugsin JabRef with the configuration EDG-3 and EDG-3-Sliced, respectively, compared toone detected bug with EFG-3. In Rachota, we detected one bug only with EDG-3 andEDG-3-Sliced. Furthermore, we detected two bugs in TerpPaint. In TerpSpreadSheetwe detected one bug with EFG-3, EDG-3 and EDG-3-Sliced. Again, note that theseresults testify our theoretical results from the previous section: As the code coverageof EDG-3 and EDG-3-Sliced remains the same, we also detect the same set of bugs

52

Page 53: Program Analysis and Black-box GUI Testing

given we use the same oracle in all configurations. We remark that all detected bugsare confirmed bugs and are reported to the corresponding developers (and have beenfixed in the meanwhile).

In the following, we sketch one bug that we have detected in JabRef using EDG-3-Sliced. Note that the bug was not detected in the evaluation of the approach pre-sented in Chapter 2, since an event dependency sequence of length 3 is needed. Thefollowing event sequence causes an ArrayOutOfBoundsException: (1) In themain window, click Edit strings , which opens a new dialog; (2) Switch back tothe main window and choose Close database ; (3) Again, in the main window,click Edit strings , which reopens the dialog; (4) Click New string in the dia-log; (5) Click OK in the dialog. The error occurs, because the newly opened dialogstarts modeless, which allows the user to close the database, although the dialog stillallows the user to modify the database. Note that JabRef does not show an error mes-sage. Instead, the stack trace of the ArrayOutOfBoundsException is printedon the standard output (stdout). Note that this bug is neither detected with EFG-3nor with EDG-2, because five consecutive events from the EFG, or three dependentevents from the EDG are needed.

Edit strings New string

OKClose Database

(a) EFG

Edit strings New string

OKClose Database

(b) EDG

Fig. 3.11: Snippets of the EFG and EDG of JabRef. The edges marked in red in the EFG and in theEDG are used to generate the test sequence that detects the bug.

Figure 3.11 shows a snippet of the EFG and the EDG of JabRef that correspondsto the detected bug. The EFG approach needs a sequence length of n = 5 to de-tect the bug: 〈 Edit strings , Close database , Edit strings , New string ,OK 〉. The EDG approach needs a sequence length of n = 3: 〈 Close database ,New string , OK 〉.

In the test sequence generation for event Close database our approach detectsa causal dependency between the three events Close database , New string ,and OK . The dependency consists of a field for JabRef’s metadata, which is de-fined in Close database , defined and used in New string , and used in OK .Thus, the event dependency sequence 〈 Close database , New string , OK 〉 is

53

Page 54: Program Analysis and Black-box GUI Testing

generated. This event dependency sequence is transformed into a test sequence,since there exists no corresponding path in the EFG. The shortest path from an ini-tial event to Close database , from Close database to New string , and fromNew string to OK is computed and inserted to the sequence. The resulting test

sequence is: 〈 Edit strings , Close database , Edit strings , New string ,OK 〉. The EFG-3 approach does not detect this bug, since a sequence length 5 is

needed using the EFG approach.

3.5 Discussion

In this chapter we have presented a new approach to identify and eliminate redundantevent sequences based on program slicing for GUI testing. The results of the evalua-tion indicate that the approach can greatly speed-up the overall execution time of theresulting test suite. In this section we discuss the metrics Execution Time and CodeCoverage reported in the evaluation of our approach.

3.5.1 Execution Time

The execution time of test cases generally is a resource critical factor in softwaretesting. In particular, the execution time is critical for GUI testing, because replayingthe test cases is particularly expensive in this context. For example, the replayer mustpause roughly 500 milliseconds after an event is triggered in order to wait until theGUI toolkit has “painted” the (possibly new) set of widgets.

Hence, one challenge in GUI testing is to assemble an efficient but yet effective testsuite. An efficient test suite should be as compact as possible (in terms of its numberof event sequences), whereas an effective test suite should be able to detect as manybugs as possible. For the former, our approach only adds relevant event sequencesto a test suite; for the latter, our approach safely discards redundant event sequencesfrom a test suite.

From the experiments, we have observed that the number of event sequenceswith EDG-3-Sliced could be significantly reduced, and the reduced number of testcases also translated to a reduction of overall execution time; see Figure 3.7 and3.12. In particular, we have seen that EDG-3-Sliced needs less time than EDG-3for all AUTs, and EDG-3 often occupies less time than EFG-3 for JabRef, Free-Mind, and TerpPaint. For Rachota, TerpSpreadSheet, and TerpWord, EDG-3 occu-pies more time than EFG-3, but is able to generate longer sequences at the sametime; see Figure 3.10. Apparently, FreeMind is an outlier, because it occupies morethan 10 days of overall execution time even for EDG-3-Sliced. However, FreeMindis a rather large real-world programs, and our reduction techniques considerably re-duce the number of 6,093,201 sequences obtained with EDG-3 – the execution time

54

Page 55: Program Analysis and Black-box GUI Testing

with EDG-3-Sliced for the resulting 1,600,638 test cases is still considerably lower(reduced by 31.2 days) compared to EDG-3.

0.01$

0.10$

1.00$

10.00$

100.00$

JabRef$

FreeMind$

Rachota$

TerpPaint$

TerpSpread.$

TerpW

ord$

EFG<3$

EDG<3$

EDG<3$Sliced$

Fig. 3.12: Execution Time of the test cases in our experiments. The y-axis (in log scale) states the timein days needed to execute all test cases.

3.5.2 Code Coverage

In Figure 3.7, we particularly report the line coverage, which apparently is relativelylow for all AUTs. The main reason for this is the selection of input data: In ourexperiments, we use a random input selection strategy (e. g., by randomly selectingstring values for text boxes) in order to avoid the computational overhead inducedby more informed strategies like symbolic execution [28]. Although the selection ofmore suitable input data improves the code coverage [5, 25, 28, 30, 31] (since GUIprograms typically employ many widgets that accept input data), we argue that this isan orthogonal research problem to the problem addressed in this chapter. Generally,our approach can be combined with arbitrary (and particularly, with more informed)strategies for selecting the input data, and still has the guarantee to retain the resulting(potentially higher) code coverage.

55

Page 56: Program Analysis and Black-box GUI Testing

4

Selection of Suitable Input Data

In Chapter 1 we emphasized that GUI tests are based on event sequences togetherwith input data. For selecting event sequences, one can use a black-box approachbased e. g., on Event Flow Graphs (EFGs) [52]. For selecting input data, one canuse a white-box approach based, e. g., on parameterized unit tests [67] and dynamicsymbolic execution [20]. Motivated by the established success of the black-box ap-proach to GUI testing [13, 16, 52, 74], we ask the question whether the black-boxapproach can be integrated with techniques from the white-box approach so that theresulting approach provides both, the selection of executable event sequences and theselection of suitable input data. Furthermore, given the established success of param-eterized unit testing [20, 22, 28, 67, 71], and given the apparent analogy betweenevent handlers called in a GUI test and methods called in a parameterized unit test,it seems natural to ask whether we can obtain the desired integration by replacingmethod calls with event handler calls. At first sight, this approach is not possible:The assignment of the input data (e. g., a string value filled in by the user in a textbox) cannot be found in any event handler called in a GUI test (the assignment isdone, letter by letter, in the message loop of the GUI toolkit). There are other, moretechnical obstacles (event handlers call native code of the GUI toolkit, event handlershold a private access modifier which makes them unavailable for symbolic execution,etc.). That is, the naive approach does not work.

In this chapter we make the principle of parameterized unit testing available toblack-box GUI testing. In a first step, we select event sequences from the EFG of aGUI program and generate a set of parameterized GUI tests. In a second step, weapply Pex [20] (i. e., dynamic symbolic execution) in order to instantiate the parame-terized GUI tests. In a final step, we replay instantiated GUI tests on the GUI program(as described in Chapter 2 and 3).

In the terminology of the black-box/white-box dichotomy, the approach startswith a black-box approach (using the EFG in order to select executable test se-quences), then moves on to a white-box approach (in order to generate parameterized

56

Page 57: Program Analysis and Black-box GUI Testing

GUI tests and instantiate them using Pex), and finally goes back to the black-box ap-proach (using a replayer in order to execute the (instantiated) GUI tests on the GUIprogram). To establish an interface between the black-box approach and the white-box approach, we need to overcome a number of technical hurdles. In particular, webuild an instrumented version of the GUI program in order to extract sequential pro-grams as used in the parameterized unit test. We replace GUI widgets by symbolicwidgets and inject symbolic events into the sequential programs in order to obtainwhat we call a parameterized GUI test (PGT). We evaluate the approach on fouropen source GUI programs. The results indicate that parameterized GUI tests havethe potential to achieve high code coverage.

The remainder of this chapter is organized as follows. Next, we motivate the ap-plicability of our approach using an example of a GUI program. Section 4.2 presentsthe approach and its implementation. We evaluate the new approach in Section 4.3,which is followed by a discussion of its results in Section 4.4.

4.1 Example

In this section we illustrate how our approach tests GUI programs using a simplifiedexample program given in Figure 4.1. The example program provides the function-ality of an address book. The main window consists of two buttons that can add orremove a contact. When clicking the Add Contact button, a dialog window ap-pears which provides two text boxes, for the first name and for the last name, and twobuttons to store (OK) or discard (Cancel) the contact.

Fig. 4.1: Screen shot of the example program. The AddressBook program consists of two windows, amain window and a dialog. Clicking on Add Contact opens a new dialog and disables the events ofthe main window. Clicking on OK or Cancel closes the dialog and re-enables the events of the mainwindow.

57

Page 58: Program Analysis and Black-box GUI Testing

4.1.1 Selecting Test Sequences

When testing programs through its GUI there exist different possibilities in whichorder to interact with widgets; see Section 2.1. In our example, one can first clickthe Remove Contact button and then the Add Contact button. However, thereverse order does not work as clicking the Add Contact button opens the dia-log window, so Remove Contact cannot be clicked until the dialog window isclosed. Thus, not all sequences of events are executable on the program. In order toavoid those non-executable event sequences, our approach incorporates a black-boxmodel of the GUI, the Event Flow Graph [52] depicted in Figure 4.2. In Chapter 2we described in detail the selection of event sequences from a black-box model. Forcompleteness we provide a brief rehash of the selection of event flow sequences andthe transformation to test sequences.

Add Contact

Remove Contact

OK

Cancel

Fig. 4.2: Event Flow Graph of the example program. The events Add Contact and RemoveContact represent initial events that can be executed immediately after the GUI program is launched.In contrast, the events OK and Cancel can be executed not until Add Contact is triggered.

Using the EFG one can generate a set of event flow sequences of the program. Anevent flow sequence is a walk of a fixed length in the EFG. Throughout this chapterwe generate event flow sequences of length 2, that is, pairs of events. Event flowsequences do not necessarily start in an initial event, and thus, are not executableon the program. We expand event flow sequences to test sequences by inserting theshortest path from the first event of the event flow sequence to an initial event of theEFG. Figure 4.3 shows all resulting test sequences of the example program obtainedby event flow sequences of length 2 from the EFG. For the example program, ourapproach generates 6 test sequences in total.

The benefit of the EFG is the possibility to generate test sequences which areexecutable on the program. However, test sequences do not account for input data towidgets. When executing a test sequence on a GUI, recent efforts [7, 52, 79, 81] insertrandom input data to widgets. We believe that the choice of input data is both vitalto the code coverage that can be achieved, and to the total number of executed tests.For example, choosing random values can result in low code coverage. Furthermore,

58

Page 59: Program Analysis and Black-box GUI Testing

multiple random values can result in a prohibitive large number of test cases (e. g.,one randomly chosen value is integrated in one test case).

t1 = 〈 Add Contact , OK 〉t2 = 〈 Add Contact , Cancel 〉t3 = 〈 Remove Contact , Add Contact 〉t4 = 〈 Add Contact , OK , Add Contact 〉t5 = 〈 Add Contact , OK , Remove Contact 〉t6 = 〈 Add Contact , Cancel , Add Contact 〉

Fig. 4.3: Test Sequences of the example program. The dark-colored events represent the events of theevent flow sequences of length 2. The light-colored events represent intermediate events that make theevent flow sequence executable on the GUI of the program.

4.1.2 Generating Parameterized GUI Tests

To enable the generation of input data for test sequences, we introduce ParameterizedGUI Tests that are test sequences parameterized by possible input data. In the follow-ing we first outline how parameterized GUI tests are generated. Then, we describehow input data to parameterized GUI tests is generated in our approach.

Figure 4.4 shows an excerpt of the underlying source code of the example pro-gram. The method OnAddContact represents the event handler which is executedonce the corresponding button in the main window is clicked. The method OnOKis executed if the OK button in the dialog window is clicked. The event handlerOnOK adds an element to the list of contacts (line 21), if the text of the text boxlastName is not empty and contains less than 256 characters. If the text is empty,the event handler returns without adding a contact (line 16). If the text is too long, itreturns with an exception (line 19).

Our approach detects that event handler OnOK evaluates the text of the text boxlastName in the conditions (line 15 and line 18). That is, the event handler OnOKmight need input data. We transform the test sequences from Figure 4.3 into param-eterized GUI tests depicted in Figure 4.5. In particular, we prefix the event OnOKwith the parameter lastName. The parameter lastName can adopt different val-ues which can lead to different execution paths in the event handler OnOK. In ourexample program, only one text box is evaluated. If further text boxes are evaluated,we add parameters to the test sequence for each of them. Our approach does not onlyconsider input data to text boxes, as described in Section 4.2.

Our approach instantiates each parameterized GUI test by an automatic computa-tion of suitable input data. In this chapter we incorporate Pex [20]. In general, inputdata can also be provided by alternative tools [24, 71]. Pex uses dynamic symbolic

59

Page 60: Program Analysis and Black-box GUI Testing

1 class AddressBookWindow {2 private ListView contacts;3

4 // handler for event "Add Contact"5 private void OnAddContact() {6 ContactDialog dialog = new ContactDialog();7 dialog.ShowDialog(this);8 }9

10 class ContactDialog {11 private TextBox lastName;12

13 // handler for event "OK"14 private void OnOK() {15 if ( 0 == lastName.Text.Length ) {16 return;17 }18 if ( lastName.Text.Length > 255 ) {19 throw new Exception("Text is too long.");20 }21 contacts.AddItem(lastName.Text);22 }23 }24 }

Fig. 4.4: Excerpt of the source of the example program. The source code consists of two classes(AddressBookWindow and ContactDialog) and three event handlers (OnAddContact,OnRemoveContact, and OnOK). The event handler OnOK evaluates the text of the text boxlastName. A new contact is only added, if the last name is not empty and contains less than 256characters.

p1 = 〈 Add Contact , lastName , OK 〉p2 = 〈 Add Contact , Cancel 〉p3 = 〈 Remove Contact , Add Contact 〉p4 = 〈 Add Contact , lastName , OK , Add Contact 〉p5 = 〈 Add Contact , lastName , OK , Remove Contact 〉p6 = 〈 Add Contact , Cancel , Add Contact 〉

Fig. 4.5: Parameterized GUI tests of the example program. In the PGT p1, p4, p5, the event OnOK isprefixed with the parameter lastName. The PGTs s2,s3,s6 do not need parameters.

execution to identify sets of input values that execute all control-flow paths of theprogram in the parameterized GUI test. For example, for the parameterized GUItest p1 = 〈 Add Contact , lastName , OK 〉, Pex identifies three distinct values oflastName that have to be tested as shown in Figure 4.6. With the valuations forlastName and the parameterized GUI test, we have all ingredients for a GUI test

60

Page 61: Program Analysis and Black-box GUI Testing

which can be executed using our replayer. The replayer accepts a set of GUI testsand mimics the events (user interactions) on the program. If a GUI test contains inputdata, this data is transferred to the corresponding widget. Furthermore, the replayerintegrates an oracle that determines whether a GUI test passed or failed.

t1a = 〈 AddContact , empty string , OK 〉t1b = 〈 AddContact , ’’a string with more than 255 characters’’ , OK 〉t1c = 〈 AddContact , ’’Last Name 3’’ , OK 〉

Fig. 4.6: Instantiated GUI tests of the parameterized GUI test p1. The GUI test t1a contains as inputdata an empty string; t1b contains a string with a length greater than 255; and t1c contains a string witha length lesser than 255.

4.2 Approach and Implementation

In this section we present details of our approach and its implementation. Our ap-proach depicted in Figure 4.7 consists of the following consecutive steps: (1) EventFlow Construction; (2) Symbolic Widget Injection; (3) Symbolic Event Injection; (4)Event Handler Elevation; (5) Generation of Parameterized GUI Tests, (6) SymbolicExecution, and (7) Replayer. We have implemented our approach into the tool Gazoo.

4.2.1 Event Flow Construction

The starting step of our approach is the Event Flow Construction. This steps incor-porates both a GUI Ripper and a Event Handler Extractor. That is, it infers an EventFlow Graph of a GUI program and it extracts for each event its corresponding eventhandlers. We refer to Section 2.2 for a more detailed description.

4.2.2 Symbolic Widget Injection

In our approach we want to generate suitable input data (e. g., we want to reasonabout string values of text boxes). However, in order to perform symbolic execution,we have to replace regular widgets by symbolic widgets. There are two main reasons:First, a change to a regular widget’s property leads to a native call to the GUI toolkit.Including code of the GUI toolkit in the analysis is usually not feasible, as it wouldsignificantly increase the size of the code that has to be analyzed. Furthermore, inmany cases, the code is in native format, and thus, not accessible by the analysis.Second, our approach focuses on validating the behavior of a program. In particular,we are not interested in validating the behavior of the GUI toolkit (i. e., validatingwhether a redraw of a widget was successful).

61

Page 62: Program Analysis and Black-box GUI Testing

The step Symbolic Widget Injection takes the CIL1 code of a program as input andreplaces widgets by symbolic widgets. Figure 4.8 shows an excerpt of the symbolicrepresentation of a text box. Gazoo uses CCI2 to modify the CIL code. By default, themain widgets included in the Windows Forms framework are considered (e. g., textboxes, check boxes, radio buttons etc.). Moreover, Gazoo is highly configurable: Onecan define further symbolic widgets for alternative GUI toolkits, such as Silverlight3.

GUI$Program$

EFG$

GUI$Test$Cases$

pass$ fail$

GUI$

1.#Event#Flow#Construc2on#

5.#PGT#Genera2on#

2.#Symbolic#Widget#Injec2on#

6.#Symbolic#Execu2on# 7.#Replayer#

PGTs$ Instrumented$$GUI$Program$

4.#Event#Handler#Eleva2on#

CIL$ GUI$

CIL$

3.#Symbolic#Event#Injec2on#

Fig. 4.7: Our approach, consisting of seven consecutive main steps. The input to our approach is aGUI program. Input data is generated on the Instrumented GUI program. GUI tests are replayed on theoriginal GUI program.

4.2.3 Symbolic Event Injection

In GUI programs, specific events do not have their own event handlers. For example,it is not likely to have an event handler which assigns a string value to a text box, oncea user presses a key on the keyboard. This behavior is implemented in the GUI toolkit

1 http://msdn.microsoft.com/en-us/netframework/aa569283.aspx2 http://ccimetadata.codeplex.com/3 http://www.microsoft.com/silverlight/

62

Page 63: Program Analysis and Black-box GUI Testing

and does not exist in the program itself. In order to assign a string value, generatedby symbolic execution, to a text box, our approach injects symbolic events to theprogram. That is, we partially “re-implement” the event handlers of the GUI toolkit.

1 class TextBox {2

3 public string Text {4 get {5 // native call6 return GetWindowText();7 }8 set {9 // native call

10 SetWindowText(value);11 }12 }13 }14

15

1 class SymbolicTextBox {2

3 string text;4

5 public string Text {6 get {7 // a "getter"8 return this.text;9 }

10 set {11 // a "setter"12 this.text = value;13 }14 }15 }

Fig. 4.8: Comparison of regular widgets (left) and symbolic widgets (right). In symbolic widgets, nativecalls to the GUI toolkit are pruned (line 6 and line 10). A symbolic representative of the widget property(line 3), i. e., text, is injected. This property can be read and written by the get and set operations.

The step Symbolic Event Injection takes CIL code with symbolic widgets as in-put and returns a modified version of the CIL code, including symbolic widgets andsymbolic events; see Figure 4.9. Gazoo visits the instructions of the CIL code. Ifit encounters an evaluation of a widget property (e. g., the Text property of a textbox is evaluated), a symbolic event is added to the CIL code. A symbolic event is asetter method that takes one parameter representing the value to be assigned to thecorresponding widget property. In our approach we separate the concerns of hav-ing both symbolic widgets and symbolic events: Symbolic widgets address the issuethat properties of GUI widgets imply native calls. In contrast, symbolic events pro-vide an interface that allows to assign a value to a widget property. Furthermore, inour setting we can assign values to widget properties. However, there exist widgetsthat prohibit the assignment of arbitrary property values. For example, the propertyCount which indicates the number of items in a list widget. In order to change theproperty Count, one has first to add an item to the list widget which increments theproperty Count. Those complex symbolic events are out the scope of this chapterand will be addressed in a future work.

63

Page 64: Program Analysis and Black-box GUI Testing

4.2.4 Event Handler Elevation

Usually, in programming languages like C#, event handlers are implemented as pri-vate methods within classes. They are only visible to their surrounding class, andthus, cannot be called directly. In order to allow an exploration of the event handlersby symbolic execution, we elevate event handlers. That is, for each method of theprogram we change their access modifiers from private to public; see Figure 4.9.

Gazoo visits the classes and methods of the executable. If it encounters a privateclass or a private method, it changes the access modifier and serializes all changesto the CIL code. Note that Gazoo also visits and modifies classes, in case they arenot visible. The output of these steps is a valid executable. In particular, elevatingclasses and methods do not raise conflicts. For example, the access modifier does notinfluence the unique signature of a method.

1 // regular text box2 private TextBox lastName;3

4 // regular event handler5 private void OnOK() {6 if (0 == lastName.Text7 .Length)8 {9 // ...

10 }11 }12

13

14

15

16

17

1 // symbolic text box2 public SymbolicTextBox lastName;3

4 // elevated event handler5 public void OnOK() {6 if (0 == lastName.Text7 .Length)8 {9 // ...

10 }11 }12

13 // symbolic event14 public void SetLastNameText(15 string text) {16 lastName.Text = text;17 }

Fig. 4.9: Comparison of code from the original program (left), and the instrumented program (right).The instrumented program contains symbolic widgets (line 2), symbolic events (line 14), and elevatedevent handlers (line 5).

4.2.5 Parameterized GUI Test Generation

Having obtained the EFG (and its event handlers of a GUI (step 1)), and having builtan instrumented version of the program (steps 2, 3, and 4), Gazoo generates a set ofparameterized GUI tests. This step consists of two sub-steps:

64

Page 65: Program Analysis and Black-box GUI Testing

First, Gazoo generates test sequences of a fixed length from the EFG. Each testsequence represents a program that sequentially calls the event handlers of the eventsin the sequence. Second, for each event in the test sequence, Gazoo analyzes whetherthe event handlers rely on input data. For example, an event handler evaluates theproperty of a widget. If so, Gazoo transforms the test sequence into a parameterizedGUI test. For each evaluated widget property, we add a new parameter to the param-eterized GUI test. Furthermore, we prefix the event handler (that relies on input data)with a call to the symbolic event that assigns the input data. The idea is that the sym-bolic event writes the input data, while the selected event handler evaluates the inputdata. Figure 4.10 shows the difference between a test sequence and a parameterizedGUI test.

1 // a test sequence2 void TestSequence()3 {4 OnAddContact();5 OnOK();6 }7

1 // a parameterized GUI test2 void PGT(string lastname)3 {4 OnAddContact();5 SetLastNameText(lastName);6 OnOK();7 }

Fig. 4.10: Comparison of a test sequence (left) and a parameterized GUI test (right). In the parameter-ized GUI test, the call of event handler OnOK is prefixed with the symbolic event SetLastNameText.This symbolic event sets the parameter value lastname of the PGT to a text box.

4.2.6 Symbolic Execution

Having generated a set of parameterized GUI tests, our approach instantiates eachparameterized GUI test by applying Pex [20]. Pex takes as input a parameterized testand performs dynamic symbolic execution on the instrumented program. The outputof Pex is a set of concrete values of the parameters in the parameterized test. Foreach element in this set, we create an instantiated GUI test. An instantiated GUI testconsists of the sequence of events from the parameterized GUI test, and the concreteparameter values for widget properties.

4.2.7 Replayer

The last step of our approach is the Replayer. That is, it executes the set of instantiatedGUI tests on the GUI program. We refer to Section 2.2 for a detailed description ofthe Replayer.

65

Page 66: Program Analysis and Black-box GUI Testing

4.3 Experiments

In this section we evaluate our approach. We compare how our approach performs, (a)when the computation of input data is replaced by the use of random values, and (b)when the Event Flow Graph is not considered for event sequence generation. We firstpresent the setup of the experiments. Then, we discuss the results of the experiments.We define the following two research questions:

• Q1: Is it reasonable to use Pex-generated values instead of random values for wid-gets? A priori, this is not clear, for two reasons: (1) In GUI programs, events thatevaluate input data might be simple, that is, they only check whether an input isentered or not. Then, one can achieve a reasonable coverage by providing arbi-trary input (or no input). (2) Events might evaluate input data in complex ways,that is, checking whether a specific string is entered or not. Then, one cannotachieve a reasonable coverage due to limitations of symbolic execution (w.r.t. tothe underlying constraint solver).

• Q2: Is it reasonable to incorporate the Event Flow Graph in order to generateparameterized GUI Tests? In principle, the idea of selecting event sequences of aprogram is related to the generation of method calls of a library. In libraries, onecan call each method at any time. Hence, there exists no order, in which librarymethods are allowed to call. In GUI programs one can call an event handler at anytime as well. However, a call of an event handler may not be allowed (e. g., whenthe window of an event handler is not yet displayed). This leads to GUI tests thatare not executable on the GUI.

4.3.1 Setup of the Experiments

We evaluate our approach on four C# open source programs: AddressBook managescontacts; OpenImage downloads images from websites; HandBrake Encoder con-verts video files; FareCalculator calculates ticket prices for trains. Except for Fare-Calculator [25], all other programs are fetched from CodePlex4. It is important toobserve that we use stable versions where bugs are rarely found. We choose variousprograms to cover different code styles. Figure 4.11 shows some statistics of eachApplication Under Test (AUT).

Our experiments consists of the three configurations A, B, C. The configuration Agenerates event sequences of length 2 from the EFG, and uses Pex to generate inputvalues for the event sequences. The choice of the parameter 2 is motivated by previousempirical studies on bugs in GUI programs [79]. The configuration B generates eventsequences of length 2 from the EFG, but uses random values as input data. In order tohave statistical confidence, we choose random values using 10 different seeds. Thus,

4 http://www.codeplex.com/

66

Page 67: Program Analysis and Black-box GUI Testing

each parameterized GUI tests is instantiated 10 times containing different input data.The configuration C generates all sequences of events of length 2. That is, it does notuse the EFG, and thus, might select non-executable event sequences. By comparingconfiguration A and B, we investigate the coverage that our approach can achieve.By comparing configuration A and C, we investigate the number of non-executableGUI tests that our approach discards.

As a precondition of all GUI tests we define that all user settings of an AUT haveto be deleted before executing the GUI test. As a postcondition of all GUI tests weuse a crash monitor. In particular, we record any exception occurred during test caseexecution, and we automatically observe if a test case is executable on the GUI. Fora discussion of alternative oracles we refer to [22, 54, 56, 77].

The GUI tests are executed on 10 virtual Windows machines with 2.0 GHz CPU,2 GB RAM, 500 GB HDD. In order to mitigate the effect of randomness, the config-urations A, B and C are executed three times. The total number of executed test casesamounts to 24,063.

AddressBook OpenImage Handbrake FareCalculatorLOC 2778 2347 520 298

Classes 98 87 30 22

Methods 163 109 19 14

Events 45 13 7 3

Fig. 4.11: Statistical data of the AUTs used in our experiments.

4.3.2 Results of the Experiments

Figure 4.12 shows the results of the experiments. We answer Q1 with Yes: We findthat it is reasonable to use Pex-generated input values instead of random input values.In all AUTs, the configuration A achieves a higher line and a higher branch cover-age than the configuration B. For OpenImage, the improvement of the line coverageamounts to 19%, for AddressBook 41%, and for HandBrake 45%. FareCalculator isan outlier; the line coverage improvement is 76%. The reason is that FareCalculatorconsists of event handlers that need specific input data. Pex is able to generate thisinput data, while random values do not suffice. It is unlikely to achieve 100% line andbranch coverage in a program, as the programs may also need input data that cannotbe generated automatically. For example, if a program requires a valid URL to down-load an image from the web, Pex cannot generate such a valid URL. In this case theprogram depends on external test data that must be specified by a test engineer.

67

Page 68: Program Analysis and Black-box GUI Testing

AUT / Configuration A B CAddressBookLine Coverage (%) 74 43 74Branch Coverage (%) 65 38 65# PGTs 319 319 2025# GUI Tests 349 3190 2352Generation Time (s) 407 255 2739Execution Time (m) 93 850 730# Non-executable GUI Tests - - 2003OpenImageLine Coverage (%) 63 51 63Branch Coverage (%) 59 29 59# PGTs 139 139 169# GUI Tests 148 1390 179Generation Time (s) 278 222 336Execution Time (m) 38 359 46# Non-executable GUI Tests - - 31HandBrakeLine Coverage (%) 88 48 88Branch Coverage (%) 84 44 84# PGTs 49 49 49# GUI Tests 73 490 73Generation Time (s) 71 42 71Execution Time (m) 17 116 17# Non-executable GUI Tests - - -FareCalculatorLine Coverage (%) 93 22 93Branch Coverage (%) 91 19 91# PGTs 9 9 9# GUI Tests 39 90 39Generation Time (s) 49 34 49Execution Time (m) 8 20 8# Non-executable GUI Tests - - -

Fig. 4.12: Results of the experiments.

We answer Q2 with Yes: We find that it is reasonable to incorporate the EventFlow Graph in order to generate parameterized GUI Tests. For AddressBook, theconfiguration A generates 319 PGTs which leads to 349 instantiated GUI tests. Incomparison, the configuration C generates 2,025 PGTs which leads to 2,352 instan-tiated GUI tests. Thus, 2,003 out of 2,352 GUI tests, that is 85%, are not executableon the program. For OpenImage, 17% of the GUI tests are not executable on its GUI.The reason is that in AddressBook and OpenImage it is not allowed to execute anarbitrary event at any time. For the AUTs HandBrake and FareCalculator, the config-uration C generates the identical set of PGTs as configuration A. In these programs,

68

Page 69: Program Analysis and Black-box GUI Testing

the EFG is fully-connected, and each event is also an initial event. We believe thatis reasonable to incorporate the EFG by default: For large programs, our approachgenerates a subset of event sequences of the GUI. The event sequences in this subsetare actually executable on the GUI. For small programs, our approach generates thesame set of event sequences which would be generated without considering the EFG.

4.4 Discussion

In this chapter we have presented a new approach to the selection of suitable inputdata for GUI tests. The results of the evaluation indicate that the approach can achievehigher code coverage. In this section we discuss the steps Event Flow Constructionand Replayer incorporated in our approach.

4.4.1 Event Flow Construction

In this chapter we have used a black-box model to represent events and their corre-sponding event flow. An EFG is constructed by executing the program and observingthe behavior of its GUI. In principle it is also possible to use a white-box modelof the program. For example, this white-box model might be constructed by tech-niques from static analysis. Since GUI code is written in many ways, a static analysistechnique must be tailored to comprehend how a GUI is built. The use of a black-box model is justified by the reasonable trade-off between applicability and preci-sion. The constructed EFG in our approach represents an approximation of the actualevent flow of the program. Thus, our approach cannot guarantee to find all events ofthe program. For example, the program itself might be hostile or even faulty.

4.4.2 Replayer

One can argue it is not necessary to replay instantiated GUI tests on the original pro-gram. For example, one can execute GUI tests in a fashion of unit testing by simplycalling the event handlers, and without mimic user interactions on the program. How-ever, we believe it is mandatory to replay instantiated GUI tests on the program inorder to comply with the idea of system testing. For example, timing problems mayonly be detected when executing the GUI test on the program itself. For example, thereplayer tries to execute an event on a window, but this window is not yet displayed.

The replayer assigns values (e. g., a string value to a text box) by reflection andmemory-mapped files. In principle, this may violate an invariant of the program.For example, it may not be allowed to access a certain text box, since the text boxis currently disabled. In our approach we use the EFG and its corresponding GUIstructure to “guess” that a widget is accessible. However, since the EFG representsan approximation, it cannot be guaranteed that a widget is actually accessible. A

69

Page 70: Program Analysis and Black-box GUI Testing

possible alternative is to add annotations to the source code, stating that a value to awidget may only be assigned under specific conditions.

70

Page 71: Program Analysis and Black-box GUI Testing

5

Threats to Validity

5.1 Threats to Internal Validity

The first threat to internal validity is the creation of the model of the GUI program,namely the Event Flow Graph (EFG). Since the EFG is created using a GUI Ripper, itis not guaranteed that all events of the program are found. For instance, the programitself might be hostile or even faulty (e. g., if the program opens a new window in thebackground, the GUI Ripper is not able to find it, and thus, it cannot be consideredduring EFG construction). Furthermore, the fact if a widget is enabled or disabledduring ripping may strongly depend on the environment (e. g., user settings). Theseproblems tend to be of technical nature and their severity might differ depending onthe used platform. Hence, the used EFG represents an approximation of the actualevent-flow of the GUI program. That is, a path in the EFG (an event sequence) mightnot be executable, since its events are pair-wise executable. However, there is empiri-cal evidence that even long event sequences can run without failures [79]. We remarkthat the evaluation of non-executable event sequences is not in the scope of this thesis,and refer to Bae et al. [11] for a discussion on the strengths of different approachesto the creation of GUI models.

The second threat to internal validity is the replication of the experiments, becausethe applications under test depend on influences from “outside”. For example, if theGUI program stores user settings (e. g., recently opened files) after the execution of atest case, then these user settings have to be deleted before running the next test case.Otherwise the test case may mistakenly fail (e. g., the GUI differs from the previoustest case). Moreover, some programs strongly depend on the date and time in themoment of their execution (e. g., calendar widgets). In order to decrease these threatsto internal validity, each virtual machine is assigned to the same date and time, andis reset to a common state before running a new test case. All experiments presentedin this thesis (see Sections 2.3, 3.4, 4.3) are executed three times, in order to mitigatethe effect of randomness.

71

Page 72: Program Analysis and Black-box GUI Testing

5.2 Threats to External Validity

One threat to external validity is the portability of the approaches presented in thisthesis. In Chapter 2 and 3, we evaluated Java programs that incorporate the Swingtoolkit1 as the GUI toolkit. Furthermore, in Chapter 4, we evaluated C# programsthat incorporate the Windows Forms toolkit as the GUI toolkit. Alternative toolkits(e. g., the Standard Widget Toolkit (SWT)2), follow different paradigms of buildinga user interface (e. g., using native widgets written in C, instead of “managed” wid-gets). Thus, the construction of the black-box model and the implementation of thestatic analyses must be adapted to the corresponding environment. Furthermore, theslicing-based approach presented in Chapter 3 works on Jimple code (which wastransformed from Java code). Other programming languages (e. g., C and C++) fol-low different language paradigms, which may limit the applicability of our approachfor these languages, because program slicing would have to be adapted to the cor-responding environment accordingly. Under the assumption that these adaptions arepossible, the approaches can be extended to those platforms.

1 http://docs.oracle.com/javase/tutorial/uiswing/index.html2 http://www.eclipse.org/swt/

72

Page 73: Program Analysis and Black-box GUI Testing

6

Related Work

GUI testing is an important and active research area, and various test case generationapproaches have been proposed [5, 7, 13, 15, 16, 25, 30, 31, 47, 48, 52, 60, 66, 73,74]. Furthermore, GUI testing approaches can be classified in many different ways,including iterative [30, 31, 47, 48] and non-iterative [5, 7, 13, 25, 52, 60, 66, 73, 74]as well as white-box [25, 60, 66] and black-box approaches [13, 15, 16, 30, 31, 47,48, 52, 73, 74]. In this chapter we present related work of the selection of eventsequences (see Chapter 2 and 3) and the selection of input data (see Chapter 4).

6.1 Selection of Event Sequences

Recently, two iterative approaches to the selection of event sequences have been pro-posed: AutoBlackTest [47, 48] uses reinforcement learning, and EXSYST [30, 31]uses search-based testing to obtain relevant event sequences. In contrast with thetwo cited approaches, our approaches execute the AUT only for constructing theblack-box model of the GUI, and not for generating event sequences. Moreover, ourapproaches are orthogonal to both iterative and non-iterative test case generation:Although used in a non-iterative setting, our techniques can be applied in iterativesettings as well. We leave a more detailed investigation for future work.

In a broad sense, the selection of event sequences is related to the generation ofsequences of method calls [18, 83]. The work by Zhang et al. [83] comes closestto our approaches. It uses a call sequence and a static analysis in order to generaterelevant sequences of method calls for Java objects. The work by Buy et al. [18] gen-erates test cases for a class under test. It uses several program analysis techniques(i. e., data-flow analysis, symbolic execution, and automated deduction) in order toproduce sequences of method calls for an object of the class to be tested. Our effortson event sequence generation differ in two main aspects: First, we consider depen-dencies of method calls across class boundaries (for the goal of system testing, asopposed to unit testing as in [83]). For example, we analyze the dependent libraries

73

Page 74: Program Analysis and Black-box GUI Testing

of a GUI program. Second, we accommodate the requirement of executability (andnot just the requirement of relevance, as in [18, 61, 67, 78, 83]), namely by incorpo-rating the use of a black-box model.

In Yuan et al. [81], GUI run-time feedback is obtained from the execution ofa “seed test suite”. The feedback is used to iteratively generate new improved testcases. However, a test suite with events of reasonable length has to be generated andexecuted before building new improved event sequences based on a black-box model.Our approaches do not need a “training set”, since they immediately select relevantevent sequences using a static analysis once a black-box model is provided. It is openwhether the approach proposed in Yuan et al. [81] can realistically be implementedin a fully-automatic tool.

The approach proposed in Jensen et al. [42] focuses on generating event sequencesthat reach complex parts of a program. Concolic execution of the event handlers isapplied in order to obtain summaries. Afterwards, those summaries and a model ofthe GUI are used together in order to generate event sequences. However, in ourapproach we additionally focus on the elimination of redundant event sequences.Furthermore, we believe that our approach can be integrated as an extension to theapproach proposed in Jensen et al. [42].

As already discussed, a particular bottleneck in GUI testing is the execution timeof the generated test cases. Hence, suitable techniques to minimize the resulting suiteare desired. In our approaches we specifically exploit the information gained fromgraphs that reflect the dependencies of GUI events – to the best of our knowledge,there are no further minimization approaches in this setting. In contrast, for classicaltesting, there has been major efforts on the minimization of existing test suites (e. g.,[36, 41, 64, 65, 75, 78]).

For example, Rostra [78] monitors the execution of methods in a unit test, in orderto detect redundant test cases. A test case is considered as redundant, if the methodexecutions of the test case are equivalent to the method executions of another testcase in the test suite. Our approaches consider testing on the system level. That is,unlike unit testing, it is not possible to invoke an event handler (which is comparableto a method) in an ad-hoc way.

Moreover, most of the works on test suite minimization focus on studies on therelationship to the effectiveness of fault detection [36, 41, 64, 75]. For example,Wong et al. [75] argues, that test suite minimization results in no loss in fault de-tection effectiveness; while Rothermel et al. [64] shows that test suite minimizationdoes result in loss of fault detection effectiveness. Furthermore, these works intu-itively argue, that the elimination of a test case from a test suite may result in a lowerbug detection rate, and thus, it is always a trade-off of test suite size and fault detec-tion effectiveness. The work proposed in [41] selectively keeps redundant test casesin a test suite in order to retain more fault detection effectiveness. In contrast, our ap-proach guarantees to obtain the same code coverage and to find the same bugs (using

74

Page 75: Program Analysis and Black-box GUI Testing

the same oracle) with the reduced test suite, as we only eliminate sequences that areprovably redundant.

In Chapter 3 we incorporate program slicing, which was first introduced byWeiser [72]. Slicing based approaches can be classified in static slicing [72] anddynamic slicing [1, 43]. Both static and dynamic slicing inherit their own pros andcons [35, 68] in terms of precision and efficiency. In our setting, static slicing is thesuitable choice as our current framework is non-iterative (as discussed above) anddoes not make assumptions on the program’s input. In general, the range of pro-gram input (e. g., characters for a text box) is prohibitively large in GUI testing. Weremark that an integration of our approaches into iterative test case generation algo-rithms will possibly require dynamic slicing as well; again, we leave a more detaileddiscussion for future work. Overall, program slicing is a well-established techniqueand has been successfully applied in various areas of computer science like programdebugging, software maintenance, reverse engineering, compiler optimization, andsoftware testing [12, 32, 34, 43].

In the context of software testing, program slicing has mainly been used for re-gression testing [2, 12, 32], e. g., to detect obsolete test cases in an existing test suite.To the best of our knowledge, slicing-based approaches have not been investigatedfor GUI testing so far. For example, Bates et al. [12] applies program slicing to iden-tify those parts of a modified program, which are covered by an existing test suite,and those parts, for which no test cases exist. The idea is to reduce the time that isneeded to set up new test cases. The work proposed by Gupta et al. [32] applies pro-gram slicing to identify def-use dependencies that are affected by modifications to aprogram. The approach eliminates the overhead of computing a complete data-flowof a program, and hence, the need of updating an entire test suite.

We remark that tool support for program slicing is available. For example, In-dus [40, 63] is a slicer for concurrent Java programs. However, Indus is limited toJava 1.4-compatible bytecode. Since today’s programs (e. g., the AUTs, the GUI Rip-per, and the Replayer) require newer Java versions, we have implemented a newprogram slicing tool in order to handle Java-7-compatible bytecode. The new toolrepresents a lightweight implementation and is tailored to the specific requirement ofevent sequence generation. This includes the efficient computation of backward slicesof fields while avoiding the overhead that occurs by slicing concurrent Java programs.JavaSlicer [33] supports traces of executions of Java programs and afterwards com-putes backward slices. We discarded JavaSlicer in our implementation, since we arespecifically interested in slicing programs in a static way, without executing a Javaprogram. Furthermore, JavaSlicer does not support the analysis of several standardclasses of the JVM such as java.lang.String.

75

Page 76: Program Analysis and Black-box GUI Testing

6.2 Selection of Input Data

The approach proposed in [25] generates test cases for GUI programs using symbolicexecution. The approach presented in Chapter 4 differs in three main aspects: First,we incorporate a black-box model (an Event Flow Graph) of the GUI in order toselect event sequences that are actually executable on the GUI. Second, we generateparameterized GUI tests which can also be used with other techniques than symbolicexecution. Third, our approach is able to replay instantiated GUI tests in a black-boxfashion on the program.

In [61], Symbolic Java PathFinder is used to generate test cases. The symbolicexecution is performed on unit level and combines concrete execution on systemlevel. The use of Pex on a parameterized GUI test can be seen as symbolic executionon unit level. However, in our approach concrete execution on system level takesplace when replaying instantiated GUI tests.

The work in [44] is related to this thesis on an abstract level in that it combinesblack-box and white-box testing. Concretely, however, the underlying technical is-sues to be solved are incomparable due to the different settings (unit testing vs. sys-tem testing, method calls vs. event handlers).

76

Page 77: Program Analysis and Black-box GUI Testing

7

Conclusion

7.1 Summary

In this thesis we have presented three new approaches to GUI testing that establish aninterface between a black-box approach and a white-box approach. In particular, eachapproach presented in this thesis starts with a black-box approach, then moves on toa white-box approach, and finally goes back to a black-box approach. As a result, thedrawbacks of the black- and the white-box approach can often be eliminated, whilethe benefits of the black- and the white-box approach can be preserved.

In the first part of this thesis we have proposed an approach to select relevant eventsequences among the event sequences selected by a black-box approach. Departingfrom a black-box model, we apply a def-use analysis to the source code of the GUIprogram. We infer a set of event dependency sequences and use the black-box modelto construct a set of executable test sequences. The evaluation of the approach onseveral open source programs has the following findings:

• The approach scales to realistic GUI programs.• The approach discards an interesting number of irrelevant event sequences.• The approach generates test sequences that would be generated in the black-box

approach only with very high values for the parameter (and thus with very highcost).

• The approach achieves the same code coverage as the black-box approach, evenin the cases when the number of test cases is smaller.

• The approach detects previously undetected bugs.

In the second part of this thesis we have proposed an approach to eliminate redun-dant event sequences. Departing from a black-box model, we apply program slicingto the source code of the GUI program. The inferred slices of the GUI program allowus to identify and to eliminate redundant event sequences from the set of test cases.The evaluation of the approach on several open source programs has the followingfindings:

77

Page 78: Program Analysis and Black-box GUI Testing

• The approach is able to eliminate a non-trivial number of redundant event se-quences. Hence, it shows a significant reduction of event sequences, leading to astrong reduction in overall execution time.

• The approach scales to realistic GUI programs.• The approach detects that CVA-redundant test cases occur in 36%-55% of the

cases, and CVA-redundant test cases occur in 18%-25% of the cases.• The approach produces significantly longer sequences than the previous approach.• The approach detects more previously undetected bugs.

In the third part of this thesis we have proposed an approach to the selection ofsuitable input data for test cases. First, we select event sequences from a black-boxmodel and convert them into parameterized GUI tests. Then, we apply dynamic sym-bolic execution in order to instantiate these parameterized GUI tests. The evaluationof the approach on several open source programs has the following findings:

• The approach achieves high code coverage by using dynamic symbolic executionfor the selection of suitable input data.

• The approach prevents the selection of non-executable test cases by incorporatinga black-box model of the GUI.

7.2 Future Work

The three new approaches to GUI testing presented in this thesis are used in a non-iterative setting (i. e., we create the entire test suite). However, instead of generatingand replaying all test cases at once, we plan to integrate our approaches into an iter-ative setting such as [30, 31] or [47, 48]. The benefit is that one can bound the costof the testing process by defining a specific timeout. In particular, using an iterativesetting we plan to investigate, whether the number of test cases can be reduced, andhow the code coverage performs during the testing process. Moreover, we plan toconduct a study that systematically evaluates the effectiveness for detecting bugs bycomparing the iterative and the non-iterative setting.

Throughout this thesis we consider an event as an interaction triggered by a user(i. e., a human). However, in GUI programs, events may also be triggered by thesystem (e. g., the operating system). A typical example is a timer, e. g., used in e-Mail programs: After a specific timeout, the e-Mail program automatically fetchesnew arrived e-Mails and, more importantly, modifies the GUI. For example, the e-Mail program adds a new entry to the list of mails (e. g., the inbox). In a future work,we plan to incorporate approaches to generate sequences that may contain systemevents. In this context, a promising line of research is to consider multi-threaded GUIprograms [82], where user-initiated threads may lead to invalid thread accesses withinthe GUI program.

78

Page 79: Program Analysis and Black-box GUI Testing

Usually one expects that higher code coverage translates to higher bug detectionrate. For future work, we need to evaluate that this holds true, especially in the settingused in Chapter 4. Such an evaluation requires its own series of experiments whereone applies statistical methods to fault-seeded versions of AUTs, following, e. g.,[54, 81]. In this context, we further plan to combine the approaches presented inChapter 2 and 3 (i. e., the selection of relevant event sequences) with the approachpresented in Chapter 4 (i. e., the selection of suitable input data).

The approaches presented in this thesis incorporate an Event Flow Graph as theblack-box model of a GUI program. However, the approaches open an interestingperspective for future research, because the general scheme behind our approach goeswell beyond one specific black-box model (i. e., the Event Flow Graph). We need toexplore different alternatives such as, e. g., [13, 16, 74], and, e. g., [59, 83], for goingback and forth between the black-box approach and the white-box approach in thesense described in the previous chapters of this thesis.

79

Page 80: Program Analysis and Black-box GUI Testing

References

1. H. Agrawal and J. R. Horgan. Dynamic Program Slicing. In PLDI, pages 246–256, 1990.2. H. Agrawal, J. R. Horgan, E. W. Krauser, and S. London. Incremental Regression Testing. In

ICSM, 1993.3. S. Arlt, C. Bertolini, S. Pahl, and M. Schaf. Trends in Model-based GUI Testing. Advances in

Computers, 86:183–222, 2012.4. S. Arlt, C. Bertolini, and M. Schaf. Behind the Scenes: An Approach to Incorporate Context in

GUI Test Case Generation. In ICST, pages 222–231, 2011.5. S. Arlt, P. Borromeo, M. Schaf, and A. Podelski. Parameterized GUI Tests. In ICTSS, pages

247–262, 2012.6. S. Arlt, E. Ermis, S. Feo-Arenis, and A. Podelski. Verification of GUI Applications: a Black-Box

Approach (submitted). 2013.7. S. Arlt, A. Podelski, C. Bertolini, M. Schaf, I. Banerjee, and A. M. Memon. Lightweight Static

Analysis for GUI Testing. In ISSRE, pages 301–310, 2012.8. S. Arlt, A. Podelski, and M. Wehrle. Program Slicing for GUI Testing (submitted). 2013.9. ASM. A Java bytecode manipulation and analysis framework. http://asm.ow2.org, Sept.

2013.10. R. A. Assi and W. Masri. Identifying Failure-Correlated Dependence Chains. In ICST Workshops,

pages 607–616, 2011.11. G. Bae, G. Rothermel, and D.-H. Bae. On the Relative Strengths of Model-Based and Dynamic

Event Extraction-Based GUI Testing Techniques: An Empirical Study. In ISSRE, pages 181–190,2012.

12. S. Bates and S. Horwitz. Incremental Program Testing Using Program Dependence Graphs. InPOPL, pages 384–396, 1993.

13. F. Belli. Finite-State Testing and Analysis of Graphical User Interfaces. In ISSRE, pages 34–43,2001.

14. F. Belli, M. Linschulte, C. J. Budnik, and H. A. Stieber. Fault Detection Likelihood of Test Se-quence Length. In ICST, pages 402–411, 2010.

15. C. Bertolini, A. Mota, and E. Aranha. Calibrating Probabilistic GUI Testing Models Based onExperiments and Survival Analysis. In ISSRE, pages 319–328, 2010.

16. C. Bertolini, G. Peres, M. d’Amorim, and A. Mota. An Empirical Evaluation of Automated BlackBox Testing Techniques for Crashing GUIs. In ICST, pages 21–30, 2009.

17. R. Black. Foundations of Software Testing ISTQB Certification. Cengage Learning EMEA, 2012.18. U. A. Buy, A. Orso, and M. Pezze. Automated testing of classes. In ISSTA, 2000.19. COMET. Community Event-based Testing. http://comet.unl.edu, Sept. 2013.20. J. de Halleux and N. Tillmann. Parameterized Unit Testing with Pex. In TAP, pages 171–181,

2008.

80

Page 81: Program Analysis and Black-box GUI Testing

21. A. Dix, J. Finlay, G. D. Abowd, and R. Beale. Human Computer Interaction. Prentice Hall, 2003.22. G. Fraser and A. Zeller. Generating parameterized unit tests. In ISSTA, pages 364–374, 2011.23. FreeMind. A mind mapping software. http://freemind.sourceforge.net, Sept. 2013.24. V. Ganesh, A. Kiezun, S. Artzi, P. J. Guo, P. Hooimeijer, and M. D. Ernst. HAMPI: A String Solver

for Testing, Analysis and Vulnerability Detection. In CAV, pages 1–19, 2011.25. S. R. Ganov, C. Killmar, S. Khurshid, and D. E. Perry. Event Listener Analysis and Symbolic

Execution for Testing GUI Applications. In ICFEM, pages 69–87, 2009.26. Gazoo. A tool that generates relevant event sequences for GUI test cases. http://gazoo.

informatik.uni-freiburg.de, Sept. 2013.27. P. Godefroid. Partial-Order Methods for the Verification of Concurrent Systems - An Approach to

the State-Explosion Problem, volume 1032 of Lecture Notes in Computer Science. Springer, 1996.28. P. Godefroid, N. Klarlund, and K. Sen. DART: directed automated random testing. In PLDI, pages

213–223, 2005.29. P. Godefroid, D. Peled, and M. G. Staskauskas. Using Partial-Order Methods in the Formal Vali-

dation of Industrial Concurrent Programs. In ISSTA, pages 261–269, 1996.30. F. Gross, G. Fraser, and A. Zeller. EXSYST: Search-based GUI testing. In ICSE, pages 1423–1426,

2012.31. F. Gross, G. Fraser, and A. Zeller. Search-based system testing: high coverage, no false alarms. In

ISSTA, pages 67–77, 2012.32. R. Gupta, M. J. Harrold, and M. L. Soffa. Program Slicing-Based Regression Testing Techniques.

Softw. Test., Verif. Reliab., 6(2):83–111, 1996.33. C. Hammacher, K. Streit, S. Hack, and A. Zeller. Profiling Java Programs for Parallelism. In

IWMSE, pages 49–55, 2009.34. M. Harman and S. Danicic. Using Program Slicing to Simplify Testing. Softw. Test., Verif. Reliab.,

5(3):143–162, 1995.35. M. Harman and R. M. Hierons. An overview of program slicing. Software Focus, 2(3):85–92,

2001.36. M. J. Harrold, R. Gupta, and M. L. Soffa. A Methodology for Controlling the Size of a Test Suite.

ACM Trans. Softw. Eng. Methodol., 2(3):270–285, 1993.37. P. Haslum and H. Geffner. Admissible Heuristics for Optimal Planning. In AIPS, pages 140–149,

2000.38. S. Horwitz, T. W. Reps, and D. Binkley. Interprocedural Slicing Using Dependence Graphs. In

PLDI, pages 35–46, 1988.39. JabRef. A bibliography reference manager. http://jabref.sourceforge.net, Sept.

2013.40. G. Jayaraman, V. P. Ranganath, and J. Hatcliff. Kaveri: Delivering the Indus Java Program Slicer

to Eclipse. In FASE, pages 269–272, 2005.41. D. Jeffrey and N. Gupta. Test Suite Reduction with Selective Redundancy. In ICSM, pages 549–

558, 2005.42. C. S. Jensen, M. R. Prasad, and A. Møller. Automated Testing with Targeted Event Sequence

Generation. In ISSTA, July 2013.43. M. Kamkar, P. Fritzson, and N. Shahmehri. Interprocedural Dynamic Slicing Applied to Interpro-

cedural Data Flow Testing. In ICSM, pages 386–395, 1993.44. N. Kicillof, W. Grieskamp, N. Tillmann, and V. A. Braberman. Achieving both model and code

coverage with automated gray-box testing. In A-MOST, pages 1–11, 2007.45. P. Lam, E. Bodden, O. Lhotak, and L. Hendren. The Soot framework for Java program analysis:

a retrospective. In Cetus Users and Compiler Infrastructure Workshop, Galveston Island, TX,October 2011.

46. T. Lindholm and F. Yellin. Java Virtual Machine Specification. Addison-Wesley Longman Pub-lishing Co., Inc., Boston, MA, USA, 2nd edition, 1999.

81

Page 82: Program Analysis and Black-box GUI Testing

47. L. Mariani, M. Pezze, O. Riganelli, and M. Santoro. AutoBlackTest: a tool for automatic black-boxtesting. In ICSE, pages 1013–1015, 2011.

48. L. Mariani, M. Pezze, O. Riganelli, and M. Santoro. AutoBlackTest: Automatic Black-Box Testingof Interactive Applications. In ICST, pages 81–90, 2012.

49. S. McMaster and A. M. Memon. Call Stack Coverage for GUI Test-Suite Reduction. In ISSRE,pages 33–44, 2006.

50. A. Memon. Large scale test suites running live with GUITAR. http://samwise.cs.umd.edu:8080/, Sept. 2013.

51. A. M. Memon. Advances in GUI Testing. In M. V. Zelkowitz, editor, Advances in Computers,volume 58, pages 149–201. Academic Press, 2003.

52. A. M. Memon. An event-flow model of GUI-based applications for testing. Softw. Test., Verif.Reliab., 17(3):137–157, 2007.

53. A. M. Memon, I. Banerjee, and A. Nagarajan. GUI Ripping: Reverse Engineering of GraphicalUser Interfaces for Testing. In WCRE, pages 260–269, 2003.

54. A. M. Memon, I. Banerjee, and A. Nagarajan. What Test Oracle Should I Use for Effective GUITesting? In ASE, pages 164–173, 2003.

55. A. M. Memon and M. B. Cohen. Automated testing of GUI applications: models, tools, and con-trolling flakiness. In ICSE, pages 1479–1480, 2013.

56. A. M. Memon, M. E. Pollack, and M. L. Soffa. Automated test oracles for GUIs. In SIGSOFTFSE, pages 30–39, 2000.

57. G. J. Myers. The Art of Software Testing. John Wiley & Sons, 2011.58. Oracle. Trail: The Reflection API. http://docs.oracle.com/javase/tutorial/

reflect/, Sept. 2013.59. C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball. Feedback-Directed Random Test Generation.

In ICSE, pages 75–84, 2007.60. A. C. R. Paiva, J. C. P. Faria, and P. M. C. Mendes. Reverse Engineered Formal Models for GUI

Testing. In FMICS, pages 218–233, 2007.61. C. S. Pasareanu, P. C. Mehlitz, D. H. Bushnell, K. Gundy-Burlet, M. R. Lowry, S. Person, and

M. Pape. Combining unit-level symbolic execution and system-level concrete execution for testingNASA software. In ISSTA, pages 15–26, 2008.

62. Rachota. An application for timetracking projects. http://rachota.sourceforge.net,Sept. 2013.

63. V. P. Ranganath and J. Hatcliff. Slicing concurrent Java programs using Indus and Kaveri. STTT,9(5-6):489–504, 2007.

64. G. Rothermel, M. J. Harrold, J. Ostrin, and C. Hong. An Empirical Study of the Effects of Mini-mization on the Fault Detection Capabilities of Test Suites. In ICSM, pages 34–43, 1998.

65. G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold. Prioritizing Test Cases For RegressionTesting. IEEE Trans. Software Eng., 27(10):929–948, 2001.

66. J. C. Silva, C. E. Silva, R. D. Goncalo, J. Saraiva, and J. C. Campos. The GUISurfer tool: towardsa language independent approach to reverse engineering GUI code. In EICS, pages 181–186, 2010.

67. N. Tillmann and W. Schulte. Parameterized unit tests. In SIGSOFT FSE, pages 253–262, 2005.68. F. Tip. A survey of program slicing techniques. J. Prog. Lang., 3(3), 1995.69. T. Tuglular, C. A. Muftuoglu, F. Belli, and M. Linschulte. Event-Based Input Validation Using

Design-by-Contract Patterns. In ISSRE, pages 195–204, 2009.70. R. Vallee-Rai, P. Co, E. Gagnon, L. J. Hendren, P. Lam, and V. Sundaresan. Soot - a Java bytecode

optimization framework. In CASCON, page 13, 1999.71. W. Visser, C. S. Pasareanu, and S. Khurshid. Test input generation with java PathFinder. In ISSTA,

pages 97–107, 2004.72. M. Weiser. Program Slicing. In ICSE, pages 439–449, 1981.73. L. J. White and H. Almezen. Generating Test Cases for GUI Responsibilities Using Complete

Interaction Sequences. In ISSRE, pages 110–123, 2000.

82

Page 83: Program Analysis and Black-box GUI Testing

74. L. J. White, H. Almezen, and N. Alzeidi. User-Based Testing of GUI Sequences and Their Inter-actions. In ISSRE, pages 54–65, 2001.

75. W. E. Wong, J. R. Horgan, S. London, and A. P. Mathur. Effect of Test Set Minimization on FaultDetection Effectiveness. In ICSE, pages 41–50, 1995.

76. Q. Xie and A. M. Memon. Studying the Characteristics of a ”Good” GUI Test Suite. In ISSRE,pages 159–168, 2006.

77. Q. Xie and A. M. Memon. Designing and comparing automated test oracles for GUI-based softwareapplications. ACM Trans. Softw. Eng. Methodol., 16(1), 2007.

78. T. Xie, D. Marinov, and D. Notkin. Rostra: A Framework for Detecting Redundant Object-OrientedUnit Tests. In ASE, pages 196–205, 2004.

79. X. Yuan, M. B. Cohen, and A. M. Memon. Covering array sampling of input event sequences forautomated gui testing. In ASE, pages 405–408, 2007.

80. X. Yuan, M. B. Cohen, and A. M. Memon. Gui interaction testing: Incorporating event context.IEEE Trans. Software Eng., 37(4):559–574, 2011.

81. X. Yuan and A. M. Memon. Using GUI Run-Time State as Feedback to Generate Test Cases. InICSE, pages 396–405, 2007.

82. S. Zhang, H. Lu, and M. D. Ernst. Finding errors in multithreaded GUI applications. In ISSTA,pages 243–253, 2012.

83. S. Zhang, D. Saff, Y. Bu, and M. D. Ernst. Combined static and dynamic automated test generation.In ISSTA, pages 353–363, 2011.

83