We spent quite a bit of last week working on abstracts for our posters for the Tapia Conference and W&L's SSA Conference. I think that figuring out how to compactly and clearly state our problem, method, and goals was really helpful for reminding me of the higher-level reasons for our research. It's really easy to get bogged down in the details. For example, I looked for patterns in parts of speech for a long time, and it was hard to admit that it just wasn't leading anywhere; when I changed directions, it was hard to get back to the bigger problem. We also applied for scholarships to the Tapia Conference last week, and we'll find out about that soon :)
For much of the semester, I have been working on separating the resource names into words and looking for patterns in the parts of speech of each of these words. There have been a few challenges. Not all of the resources are easy to separate into individual words automatically. We looked into AMAP, which would have done a good job of figuring out how to correctly split the resources, but decided that this was probably unnecessary for our purposes (the percentage of resources split incorrectly in our applications was very small).

Then I began to identify the words by their parts of speech, with the goal of finding patterns in the parts of speech in the resources. This proved difficult because, without context, a HUGE percentage of the words (almost all of them, in fact) could not be assigned a single part of speech (ex. grade can be a verb or a noun). Even in context, it was often hard to determine what part of speech certain words were (ex. login and logout). So I did my best to figure out the correct part of speech in context, and started to look for patterns in the parts of speech within the resources (ex. bookstore/verifyGrade's pattern would be noun verb noun).

Unfortunately, this didn't seem to be very useful. There were some patterns, but there wasn't much we could really learn from any of them. Many of the resources are exactly the same except for one word (ex. bookstore/verifyGrade and bookstore/verifyName differ only in the last word), so they have the same part-of-speech pattern. When we saw patterns, it was usually a case like this: very similar resources that differ by one word. For example, the pattern would come from a whole set of bookstore/verify«NOUN»s rather than from a bunch of different «NOUN»«VERB»«NOUN» combos like bookstore/verifyGrade, house/cleanSink, earth/saveTrees, etc., which would all have the same pattern but very different resources.
In these cases, we could learn almost as much (if not more) by just looking at the resources themselves, or by JUST splitting them into words without identifying parts of speech.
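To give a sense of what the splitting and tagging looks like, here's a minimal Python sketch. The camelCase-splitting regex and the tiny hand-labeled POS dictionary are illustrative stand-ins, not our actual data or tagger:

```python
import re

# Tiny hand-labeled POS dictionary. In the real project the tags came from
# context and manual judgment; these entries are just for illustration.
POS = {"bookstore": "NOUN", "verify": "VERB", "grade": "NOUN",
       "name": "NOUN", "house": "NOUN", "clean": "VERB", "sink": "NOUN"}

def split_resource(resource):
    """Split a resource path like 'bookstore/verifyGrade' into lowercase words."""
    words = []
    for segment in resource.split("/"):
        # break camelCase segments at lowercase-to-uppercase boundaries
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", segment)
        words.extend(p.lower() for p in parts)
    return words

def pos_pattern(resource):
    """Map each word of a resource to its part of speech, e.g. 'NOUN VERB NOUN'."""
    return " ".join(POS.get(w, "?") for w in split_resource(resource))

print(split_resource("bookstore/verifyGrade"))  # ['bookstore', 'verify', 'grade']
print(pos_pattern("bookstore/verifyGrade"))     # NOUN VERB NOUN
print(pos_pattern("house/cleanSink"))           # NOUN VERB NOUN
```

Note how bookstore/verifyGrade and house/cleanSink collapse to the same NOUN VERB NOUN pattern, which is exactly why the patterns turned out to be less informative than the words themselves.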
So now I'll be going in a new direction, looking just at what we can learn from the resource names and from the navigation from resource to resource.
Lately I've been working on looking for patterns in the words in resources for web applications. We're looking at what the most common words are in each application, what part of speech they are, and what patterns show up in the parts of speech across a whole resource. One problem I've been running into is that a lot of words are hard to classify as nouns, adjectives, or verbs without their context: for example, grade can be a noun or a verb, abstract can be an adjective or a noun, and so on. Usually, with the context of the application or the rest of the resource, it's pretty easy to figure out what part of speech the words actually are, but there are some we're still not sure about (login and logout are the most notable examples). So the next thing I'll be doing is going into our automatically generated files, manually updating the parts of speech based on the specific contexts of the resources, and keeping track of how many changes have to be made for each application (to determine how important it would be to automate that process). We also want to see whether this really gives us valuable insight into how people navigate through the applications and how we might be able to group similar sets of pages.
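As a sketch of the bookkeeping I have in mind for tracking corrections (the words and tags below are made up for illustration), comparing the automatically assigned tags against the manually corrected ones per application might look like:

```python
def count_corrections(auto_tags, manual_tags):
    """Count how many auto-assigned POS tags had to be manually changed.

    auto_tags / manual_tags: dicts mapping word -> POS tag for one application.
    Returns (number changed, fraction changed) to gauge whether automating
    the correction step would be worthwhile.
    """
    changed = [w for w in auto_tags if manual_tags.get(w, auto_tags[w]) != auto_tags[w]]
    return len(changed), len(changed) / len(auto_tags)

# illustrative data: 'grade' was auto-tagged as a verb but is a noun in context
auto = {"grade": "VERB", "verify": "VERB", "login": "NOUN", "bookstore": "NOUN"}
manual = {"grade": "NOUN", "verify": "VERB", "login": "NOUN", "bookstore": "NOUN"}

count, fraction = count_corrections(auto, manual)
print(count, fraction)  # 1 0.25
```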
My most recent “new” project has been reading a couple of papers written by another group from the Delaware lab about how to automatically split code into words: “Mining Source Code to Automatically Split Identifiers for Software Analysis” and “AMAP: Automatically Mining Abbreviation Expansions in Programs to Enhance Software Maintenance Tools”. I'm not going to write them up as formally as we have the other papers, because we're looking at them not as related work (i.e. they're not related to automatically testing web apps at all) but for solutions to, and help with, problems we've run across in the course of the research.
I've been working on splitting our resources (URLs) into words, but we're having some issues with abbreviations and with words not being split up correctly. The Delaware group's papers are about splitting code into words. Our problem is slightly different because they use the comments in the programs to figure out how to split the words, but we might be able to use their technique on our URLs without needing comments, or by including the code for the web app so that it does have comments. It's an interesting extension of their project … we think that their technique will be applicable and work as well for web apps as it does for “normal” programs.
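As a rough illustration of the kind of dictionary-driven splitting their work enables (this greedy longest-match sketch is far simpler than their actual mining technique, and the word list here is hand-made rather than mined from code and comments):

```python
# Stand-in word list. In the Delaware approach, the list would be mined from
# the program's own source code and comments; this one is made up.
WORDLIST = {"print", "file", "log", "in", "out", "login", "user", "name"}

def greedy_split(s, words=WORDLIST):
    """Greedily split a same-case string into the longest known words, left to right."""
    result, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest candidate first
            if s[i:j] in words:
                result.append(s[i:j])
                i = j
                break
        else:                           # no known word starts here: keep one char
            result.append(s[i])
            i += 1
    return result

print(greedy_split("printfile"))  # ['print', 'file']
print(greedy_split("username"))   # ['user', 'name']
```

Same-case strings like these are exactly the cases our simple camelCase splitting can't handle, which is why the mined-dictionary idea is appealing.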
Another interesting aspect of the paper that I think might be useful in our research … though it's not at all related to either problem: the group used square roots and natural logs to dampen the effects of certain data. In their problem, they had indicators for the likelihood that certain strings were actually distinct words, based both on how often a string came up in the code/comments of the particular program being split and on a dictionary built from lots of different programs. They wanted the info from the current program to have more weight in determining how words should be split, so they took the square root of the outside data to dampen it. I think this is interesting for us because it might help with some of the issues we've been running into when weighing the number of times something happens against the percentage of the time it happens.
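To make the dampening idea concrete, here's a toy version. The additive combination below is my own stand-in, not the paper's actual formula; the point is just that taking the square root of the external count keeps the local evidence dominant:

```python
import math

def combined_score(local_count, external_count):
    """Combine local evidence with square-root-damped external evidence.

    The square root keeps very large external counts from swamping the
    (smaller, but more trusted) counts from the current program.
    """
    return local_count + math.sqrt(external_count)

# a word seen 3 times locally but 100 times in the external dictionary:
print(combined_score(3, 100))    # 13.0  (not 103: the external count is damped)
print(combined_score(3, 10000))  # 103.0 (it takes 100x more external evidence)
```

The same trick might apply to our raw-count vs. percentage tension: damp whichever signal we trust less before combining them.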