My thoughts on "An Exploration of Statistical Models for Automated Test Case Generation" (Sant, Souter, Greenwald)
- This paper explores the automatic creation of test suites from logged user sessions. The authors use two models to generate the test cases: a control model, which selects the URLs in the test case, and a data model, which supplies values for each parameter required by a given URL. Both the control model and the data model are statistical Markov models built from the user data.
- The paper also introduces the term important parameter. For the authors, an important parameter is one whose value, at least once in the user sessions, stayed constant over two consecutive requests.
- The authors test variants of two data models. Simple “captures only the probability that a set of values…is present on the given page” (unigram). Advanced also considers the previous page (bigram) and, if there are any important parameters, automatically carries over their values. Ultimately, they use uni-, bi-, and trigram versions of the Simple and Advanced models to create test suites and find, surprisingly, that the unigram model quickly achieved the highest percentage of statement coverage.
- They recognize that a major limitation of their study is that it included only one application, an online bookstore. However, they do not acknowledge that their measure of success is flawed as well: they rank test suites by the number of book purchases in those suites. The paper states that “a valid user session is one that exercises an application in a meaningful way,” but it does not acknowledge that an online bookstore has many uses that do not involve purchasing books, for example browsing, checking information about a book (author, publication date, etc.), or comparing prices with other bookstores. It is admittedly much harder to tell what information a user was looking for if she does not buy the book, but a user session without a book purchase is unsuccessful only for the bookstore, not for the web developer. This measure of success also does not easily extend to other applications, which is a significant limitation in its own right. However, the authors redeem themselves by using the percentage of statements covered as a second comparison measure.
- The paper compares the rate of book purchases per user session in the real user sessions (1.5) to the rates in the generated test suites (which all fell between .4 and .8). This is a more valid comparison. However, generating the most realistic test suites is not necessarily the best goal; a test suite should contain both the most and the least likely user sessions. In fact, this seems to be an open question: it is relatively easy to determine the least and most likely sequences of requests, but what should the test suites actually contain? As the results show, the unigram model (essentially random) quickly covered more statements than the other models, though statement coverage for all of them converged after about 40 sessions. The authors note that this is because the bi- and trigram models are, in a sense, too predictable.
- The authors write, “Our original motivation…was to be able to generate user sessions that combined subsequences from different original user sessions to create sequences that had not been seen previously. These novel user sessions would exercise parts of an application not exercised on the original test suite.” Despite this motivation, and a related question in the introduction, the paper does not explore the idea that the unigram model does so well because it is random. But why is random good? One possible answer is that it finds unusual or unlikely user actions; another is that it simply has a better chance of generating diverse sessions than bi- and trigram models that mimic the user sessions.
- A separate limitation is their narrow definition of important parameters. A parameter is important if its value stays constant over two consecutive requests. The idea is very helpful, but note that if a parameter’s value stays constant just once by chance, it will be held constant in all the generated test suites. This should not happen. One simple way to address it would be to assign each parameter a probability that measures how likely its value is to stay constant.
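
To make the suggestion about important parameters concrete, here is a minimal sketch of how such a probability might be estimated from logged sessions. This is my own illustration, not anything from the paper. I assume each session is a list of requests and each request is a dict mapping parameter names to values; the names `persistence_probabilities`, `next_value`, and `sample_fresh` are hypothetical.

```python
import random
from collections import defaultdict


def persistence_probabilities(sessions):
    """Estimate, per parameter, how often its value is unchanged between
    two consecutive requests that both contain that parameter.

    `sessions` is assumed to be a list of sessions, each a list of
    dicts mapping parameter name -> value (an assumed log format).
    """
    kept = defaultdict(int)   # consecutive pairs where the value stayed the same
    seen = defaultdict(int)   # consecutive pairs where the parameter appeared twice
    for session in sessions:
        for prev, curr in zip(session, session[1:]):
            for name in prev.keys() & curr.keys():
                seen[name] += 1
                if prev[name] == curr[name]:
                    kept[name] += 1
    return {name: kept[name] / seen[name] for name in seen}


def next_value(name, prev_value, probs, sample_fresh, rng=random):
    """Carry the previous value over with the estimated probability;
    otherwise draw a new value from `sample_fresh` (a hypothetical
    stand-in for whatever data model supplies fresh values)."""
    if rng.random() < probs.get(name, 0.0):
        return prev_value
    return sample_fresh(name)


# Toy log: "user" never changes within a session, "isbn" usually does.
sessions = [
    [{"user": "a", "isbn": "1"}, {"user": "a", "isbn": "2"}, {"user": "a", "isbn": "2"}],
    [{"user": "b", "isbn": "3"}, {"user": "b", "isbn": "4"}],
]
probs = persistence_probabilities(sessions)
```

On this toy log, `user` is carried over in every consecutive pair, while `isbn` is carried over in only one pair out of three, so a generator using `next_value` would keep `user` fixed but would only occasionally repeat `isbn`. A parameter that stayed constant just once by chance would get a low probability instead of being frozen in every generated session.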