As web applications become increasingly complex, automated testing becomes difficult due to the sheer volume of information involved. State-of-the-art approaches to user-based testing generate test suites in two main steps: first, the tester generates a sequence representing a user's path through the application, and then she fills in parameter values. We focus on the second stage of generation, the data model, and propose a number of characteristics for classifying parameters. Parameters are categorized both by their use in the application and by how the user treats them. Based on a parameter's classification, we can then pick the combination of factors that yields the best or most realistic value for it. These factors comprise the other elements of the user's path through the application that may affect the parameter's value. By determining which factors are most relevant for each type of parameter, we can build a more accurate data model and thus more accurately model the desired test suites.
Summary
This paper's goal is to semi-automatically create realistic test suites for any given web application. The authors analyze the HTML code generated by the application and employ humans to determine the input values “which cover all relevant navigations.” They also use a reverse-engineering tool they previously developed to “automatical[ly] extract… [an] explicit-state model of the … web application.”
Based on the variables pre-specified by the user, which define equivalence classes, they supply the application with a particular list of inputs kept in a separate file. When the user has not identified equivalence classes of inputs, or when the web application is in a different state to which the user's specifications do not apply, the authors use a semi-automatic process called page-merging to “simplify the explicit state model” and determine equivalence classes for the inputs. They have three (decreasingly automatic) criteria for comparing two dynamically generated HTML pages (a code sketch follows the list):
a) pages that are literally identical are considered the same page
b) pages that have “identical structures but different texts, according to a comparison of the syntax trees of the pages, are considered the same page”
c) pages that “have similar structure, according to a similarity metric, such as the tree edit distance, computed on the syntax trees of the pages, are considered the same.”
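A rough sketch of how these three criteria might be applied, in decreasing order of strictness, is shown below. This is not the authors' tool: the page "structure" is approximated here by the sequence of HTML tags, the `SequenceMatcher` ratio stands in for the tree edit distance on syntax trees, and the 0.9 threshold is an arbitrary assumption.

```python
# Sketch of the three page-merging criteria (assumptions noted in the lead-in).
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Collects the opening/closing tags of a page, discarding its text."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")


def structure(html: str) -> list[str]:
    parser = TagExtractor()
    parser.feed(html)
    return parser.tags


def same_page(page_a: str, page_b: str, threshold: float = 0.9) -> bool:
    # (a) literally identical pages are considered the same page
    if page_a == page_b:
        return True
    struct_a, struct_b = structure(page_a), structure(page_b)
    # (b) identical structure but different text: same page
    if struct_a == struct_b:
        return True
    # (c) similar structure according to a similarity metric
    #     (a stand-in here for the paper's tree edit distance)
    similarity = SequenceMatcher(None, struct_a, struct_b).ratio()
    return similarity >= threshold
```

The similarity threshold controls how aggressively pages are merged into one state: a lower threshold merges more pages and yields a smaller explicit-state model.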
Essentially: if the pages are not identical, the syntax trees are compared with varying degrees of leniency. They also give a summary of the main phases of statistical testing:
1) Construction of a statistical testing model, based on available user data
2) Test case generation based on the statistics encoded in the testing model (modeled as a Markov chain, called the usage model, whose transition probabilities are derived from the observed user data)
3) Test case execution and analysis of the execution output for reliability estimation
Their main contribution is in the first step, the semi-automatic creation of equivalence classes. They also deal with state by exploiting the hidden variables that determine whether the state stays constant.
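To make phases 1 and 2 concrete, here is a minimal sketch, not the authors' implementation, of estimating a usage model from logged sessions and generating one test case by walking the chain. The session data and page names are invented for illustration.

```python
# Phase 1: estimate a Markov usage model from logged sessions.
# Phase 2: generate a test case by a random walk over the model.
import random
from collections import Counter, defaultdict

START, END = "<start>", "<end>"


def build_usage_model(sessions):
    """Count transitions between consecutive requests in each session
    and normalize the counts into transition probabilities."""
    counts = defaultdict(Counter)
    for session in sessions:
        path = [START] + list(session) + [END]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


def generate_test_case(model, rng=random):
    """Random walk through the usage model, yielding one request sequence."""
    state, test_case = START, []
    while True:
        successors = model[state]
        state = rng.choices(list(successors), weights=successors.values())[0]
        if state == END:
            return test_case
        test_case.append(state)


# Invented example sessions: each is a sequence of requested pages.
sessions = [["home", "search", "detail", "buy"],
            ["home", "search", "search", "detail"]]
model = build_usage_model(sessions)
print(generate_test_case(model))
```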
Limitations:
In general, this paper seems to have some excellent ideas, such as state parameters and equivalence classes of parameter values, but it relies far too heavily on humans to make decisions that should be automated.
Summary
- This paper explores the automatic creation of test suites from logged user sessions. The authors use two models to generate the test cases: a control model, which determines the sequence of URLs in a test case, and a data model, which supplies values for each parameter required by a given URL. Both the control model and the data model are statistical Markov models built from the user data.
- The paper also introduces the term important parameter. For them, an important parameter is one whose value, at one point in the user sessions, stayed constant over two consecutive requests.
- The authors test variants of two data models: Simple, which “captures only the probability that a set of values…is present on the given page” (unigram), and Advanced, which also conditions on the previous page (bigram) and, if there are any important parameters, automatically carries over their values. Ultimately, they use uni-, bi-, and tri-gram versions of Simple and Advanced to create test suites and find, surprisingly, that the unigram model most quickly achieves the highest percentage of statement coverage. (A sketch of the two data models follows this list.)
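The sketch below is a hedged illustration of how the Simple and Advanced data models might be realized; the class names, the value-set representation, and the observation interface are assumptions made here for clarity, not the paper's code.

```python
# Simple (unigram) vs. Advanced (bigram + important parameters) data models.
# A "value set" is a dict of parameter values observed for one request.
import random
from collections import Counter, defaultdict


class SimpleDataModel:
    """Unigram: P(value set | page), estimated from observed frequencies."""
    def __init__(self):
        self.observed = defaultdict(Counter)

    def observe(self, page, values):
        self.observed[page][tuple(sorted(values.items()))] += 1

    def choose(self, page):
        counter = self.observed[page]
        value_set = random.choices(list(counter), weights=counter.values())[0]
        return dict(value_set)


class AdvancedDataModel(SimpleDataModel):
    """Bigram variant: conditions on the previous page as well, and carries
    over the values of 'important' parameters from the previous request."""
    def __init__(self, important_params):
        super().__init__()
        self.important = set(important_params)

    def observe(self, page, values, prev_page=None):
        self.observed[(prev_page, page)][tuple(sorted(values.items()))] += 1

    def choose(self, page, prev_page=None, prev_values=None):
        counter = self.observed[(prev_page, page)]
        values = dict(random.choices(list(counter),
                                     weights=counter.values())[0])
        for name in self.important & set(prev_values or {}):
            values[name] = prev_values[name]  # important params stay constant
        return values


# Example: observe one logged request, then draw values for a new test case.
simple = SimpleDataModel()
simple.observe("search.php", {"query": "dickens", "page": "1"})
print(simple.choose("search.php"))
```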
Limitations
- They recognize that a major limitation of their study is that it includes only one application, an online bookstore. However, they do not acknowledge that their measure of success is also flawed: they rank test suites by the number of book purchases they contain. The paper states that “a valid user session is one that exercises an application in a meaningful way,” but it does not acknowledge that there are many uses of an online bookstore that do not involve purchasing books, such as browsing, checking information about a book (author, publishing date, etc.), or comparing prices with other bookstores. It is admittedly much harder to tell what information a user was looking for if she does not buy the book, but a user session without a purchase is only unsuccessful for the bookstore, not for the web developer. This measure of success also does not extend easily to other applications, which is another serious limitation. However, the authors redeem themselves by using the percentage of statements covered as a second comparison measure.
- The paper compares the rate of book purchases per user session in the real user sessions (1.5) to the rates produced by the test suites (which all fell between 0.4 and 0.8). This is a more valid comparison. However, generating the most realistic test suites is not necessarily the right goal: a suite should contain both the most and the least likely user sessions. In fact, this seems to be an open question: it is easier to determine which sequences of requests are most and least likely, but what should the test suites actually contain? As the results show, the unigram model (essentially random) quickly covered more statements than the other models, though statement coverage for all of them converged after about 40 sessions. The authors note that this is because the bi- and tri-gram models are, in a sense, too predictable.
- The authors write, “Our original motivation…was to be able to generate user sessions that combined subsequences from different original user sessions to create sequences that had not been seen previously. These novel user sessions would exercise parts of an application not exercised on the original test suite.” Despite this, and a related question in the introduction, the paper does not explore the idea that the unigram model does so well because it is random, or why randomness is good. One possible answer is that it finds unusual or unlikely user actions; another is that it simply has a better chance of generating more diverse sessions than the bi- and tri-gram models, which mimic the original user sessions.
- An unrelated limitation is their narrow view of important parameters. A parameter is important if its value stays constant over two consecutive requests. The idea of important parameters is very useful, but note that if a parameter's value happens to stay constant just once, it will be held constant in all the test suites. This should not happen. One simple way to address it would be to assign each parameter a probability measuring how likely its value is to stay constant, as sketched below.
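A minimal sketch of this suggestion, under the assumption that each request in a logged session is recorded as a dict of parameter values: estimate, for each parameter, how often its value was repeated across consecutive requests, and carry values over with that probability during generation. The data format and function names are illustrative, not taken from the paper.

```python
# Replace the binary "important" flag with an estimated constancy probability.
import random
from collections import defaultdict


def constancy_probabilities(sessions):
    """sessions: list of sessions, each a list of {param: value} dicts."""
    constant = defaultdict(int)   # times a parameter kept its value
    total = defaultdict(int)      # times it appeared in consecutive requests
    for session in sessions:
        for prev, curr in zip(session, session[1:]):
            for param in set(prev) & set(curr):
                total[param] += 1
                if prev[param] == curr[param]:
                    constant[param] += 1
    return {p: constant[p] / total[p] for p in total}


def carry_over(prev_values, new_values, probs, rng=random):
    """Keep a previous value with its estimated constancy probability."""
    result = dict(new_values)
    for param, value in prev_values.items():
        if param in result and rng.random() < probs.get(param, 0.0):
            result[param] = value
    return result
```

Estimating the probability over all consecutive pairs means a single coincidental repetition no longer pins a parameter's value for every generated test suite.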