Sant et al. Paper Comments

Statement of the Problem/Goals:

Overall problem

Web applications are hard to test, and current automated testing methods don't work well for web apps.

Goals

“Our goal is to use web logs to produce models that can be used to generate user sessions.”

Use logged user data to create models of web applications that can be used to generate new user sessions which can be combined into “effective” test suites. Use statistical/probabilistic methods to generate user sessions that are more/less likely to come up in the real world.

Contribution to state-of-the-art

“The main contribution of this paper are the design of data and control models that statistically represent the dynamic behavior of users of a web application, the algorithms to automatically construct the models, an approach that utilizes the models for automated test case generation, and a preliminary study evaluating the test cases.”

Instead of using logged user data directly (which was the state-of-the-art), they propose generating test cases/test suites based on the logged user data. This hopefully will result in realistic test cases/test suites that will more effectively test web apps than previously proposed methods.

Technical Approach:

Key insights

Use logged user data to generate different test cases. Separate the process into a log parsing step and a model building step. When generating new test cases, take history into consideration: given the URLs that have already been visited (or the data values that have already been seen), which URLs (or data values) are more or less likely? Use separate data and control models for data values and URL sequences respectively. (This is barely mentioned by Sant et al., but it is significant to us because we use this insight in our work.)
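To make the log parsing step and the control/data split concrete, here is a minimal sketch. The log format (`<session_id> <request_url>` per line), the sample entries, and the function names are all assumptions made for illustration; they are not taken from Sant et al.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

# Hypothetical log format: "<session_id> <request_url>", one request per line.
SAMPLE_LOG = """\
s1 /login?user=ann
s2 /login?user=bob
s1 /search?q=python
s1 /buy?book=42&qty=1
s2 /search?q=testing
"""

def parse_log(text):
    """Group requests by session id, preserving request order within a session."""
    sessions = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        session_id, url = line.split(maxsplit=1)
        parts = urlsplit(url)
        # Control part: the page path. Data part: the name/value pairs.
        sessions[session_id].append((parts.path, dict(parse_qsl(parts.query))))
    return dict(sessions)

def split_model_inputs(sessions):
    """Separate sessions into URL sequences (control) and per-URL data values."""
    url_sequences = [[path for path, _ in reqs] for reqs in sessions.values()]
    data_values = defaultdict(list)
    for reqs in sessions.values():
        for path, params in reqs:
            data_values[path].append(params)
    return url_sequences, dict(data_values)

sessions = parse_log(SAMPLE_LOG)
url_sequences, data_values = split_model_inputs(sessions)
```

The URL sequences would feed a control model and the per-URL parameter values would feed a data model, so each can be built and varied independently.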

Overall approach/strategy

Collect user data for sample web applications. Parse user data. Create models (Markov models) based on the parsed user data that may include information about the history and possible pairings in data values. (Some models look farther back in history than others. What's the “right” amount?) Generate test cases by taking a “random walk” through the model.
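The model building and random walk steps can be sketched roughly as follows. This is an illustration of a generic Markov-style control model, not the authors' algorithm: the `history` parameter (how many previous URLs to condition on, assumed to be at least 1), the start/end markers, and the sample sessions are all assumptions.

```python
import random
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # hypothetical session boundary markers

def build_model(url_sequences, history=1):
    """Count which URL follows each window of `history` previous URLs."""
    if history < 1:
        raise ValueError("history must be at least 1 in this sketch")
    model = defaultdict(Counter)
    for seq in url_sequences:
        padded = [START] * history + list(seq) + [END]
        for i in range(history, len(padded)):
            context = tuple(padded[i - history:i])
            model[context][padded[i]] += 1
    return model

def random_walk(model, history=1, rng=random, max_len=50):
    """Generate one session, sampling each next URL in proportion to its count."""
    context = (START,) * history
    session = []
    while len(session) < max_len:
        counts = model[context]
        urls = list(counts)
        nxt = rng.choices(urls, weights=[counts[u] for u in urls])[0]
        if nxt == END:
            break
        session.append(nxt)
        # Slide the history window forward by one URL.
        context = (context + (nxt,))[-history:]
    return session

# Hypothetical logged sessions, reduced to URL sequences.
logged = [["/login", "/search", "/buy"], ["/login", "/search"]]
model = build_model(logged, history=1)
generated = random_walk(model, history=1, rng=random.Random(0))
```

Raising `history` makes generated sessions follow the logged orderings more closely, at the cost of less variety, which is one way to think about the question of how much history is the “right” amount.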

Discussion/Critique:

How did they evaluate their efforts?

They evaluated their efforts by looking at the test cases/test suites generated by their models. They investigated the effectiveness of these test suites based on the rate of coverage for different types of models and the accuracy of the generated tests.

They asked themselves (1) if their models could be used to generate valid user sessions and test suites, (2) how the ordering of page requests within a session affects test case coverage results, (3) how the quality of the data model affects test case coverage results, and (4) if the ordering of user sessions within a test suite matters for validity or coverage.

Conclusions from evaluation results

All of their test suite generation methods produce valid test cases (successful book purchases), but (maybe?) fewer valid test cases than the original logged user data (based on successful book purchases). All test suite generation methods reach equal coverage after 40 user sessions, but the 1-gram model (less history) achieves good coverage more quickly. Error checking is exercised less quickly when increasing amounts of history are considered, but more valid sequences of requests are produced by models that consider more history. Randomly reordering test cases within a test suite doesn't affect validity (again, measured by successful book purchases), but does result in quicker coverage. They hypothesize that this is, in part, due to the artificial way in which the application was used: everyone registered within a short period at the beginning of the application's use, so without random reordering the registration pages were covered multiple times before other pages began to be covered.
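A small sketch of why session order can change how quickly coverage grows without changing the final coverage. It uses distinct pages visited as a stand-in for whatever coverage measure the paper uses, and the suite below is invented to mimic the "everyone registers first" situation; both are assumptions.

```python
import random

def cumulative_coverage(sessions):
    """Return the number of distinct pages covered after each session."""
    seen = set()
    curve = []
    for session in sessions:
        seen.update(session)
        curve.append(len(seen))
    return curve

# Hypothetical suite in logged order: registration-only sessions come first.
suite = [["/register"], ["/register"], ["/register"],
         ["/login", "/search"], ["/login", "/search", "/buy"]]

in_order = cumulative_coverage(suite)  # [1, 1, 1, 3, 4]

# Randomly reorder the same sessions and recompute the coverage curve.
shuffled = suite[:]
random.Random(0).shuffle(shuffled)
reordered = cumulative_coverage(shuffled)
```

Both curves end at the same value because the same sessions are run; only the point at which new pages first appear can move.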

What application/useful benefit do the researchers/you see for this work?

They see this work as a step toward making web applications easier to test. They see their contributions as “the design of data and control models that statistically represent the dynamic behavior of users of a web application, the algorithms to automatically construct the models, an approach that utilizes the models for automated test case generation, and a preliminary study evaluating test cases.” I think the most useful benefit is simply that this method, using logged user data to generate new test cases that test the web application more effectively, seems promising. The ideas about considering history and splitting the test case generation process into a data model and a control model are also of particular interest to us.

Limitations mentioned

The study was only performed on one application (do the results generalize?). The application was not of very good quality. Users of their sample application (a bookstore application) were provided with a list of suggested activities (do people use real applications differently?).

Additional limitations

Could be more realistic. Factors other than history probably affect navigation through a web application and parameter values. Last summer we talked about different aspects of history, parameter interactions, and types of users/specific users. Looking at other ways in which navigation and parameter values are determined should result in more realistic test cases and, potentially, more diverse test cases. Is the number of successful book purchases really a good measure of the validity of test cases? Is good coverage necessarily an indicator of a good test suite? It seems that the main (important) parts of the application are covered by valid, realistic test cases, but the error pages (like for login without registration) are more quickly covered by invalid/less realistic test cases. Does the type of web application affect how it's used and, therefore, how best to test it? (I.e., the problem is not just that they only had one application, but also that it's only one type of application.) User sessions are currently run sequentially (i.e. userSession1 runs completely before userSession2 begins, etc.). Users interact directly in some applications, but even when they are not interacting directly, something one user does can affect another user. How much does this interaction matter (here vs. in the real world)?

Question …

Besides the rate of successful book purchases, how could we determine whether test cases were valid or not? (For some reason, using successful book purchases as the measure of validity is really bothering me right now, but maybe it's no big deal.)
