Reading Papers on Splitting Words in Code
My most recent “new” project has been reading a couple of papers from another group at the Delaware lab about how to automatically split code into words: “Mining Source Code to Automatically Split Identifiers for Software Analysis” and “AMAP: Automatically Mining Abbreviation Expansions in Programs to Enhance Software Maintenance Tools”. I'm not going to write them up as formally as we have the other papers, because we're looking at them not as related work (they're not about automatically testing web apps at all) but for solutions to problems we've run across in the course of our research.
I've been working on splitting our resources (URLs) into words, but we're having issues with abbreviations and with words not being split correctly. The Delaware group's papers tackle the closely related problem of splitting code into words. Their setting is slightly different because they use the comments in the programs to figure out how to split the words up, but we might be able to apply their technique to our URLs either without comments or by including the web app's source code so that comments are available. It's an interesting extension of their project: we think their technique will work as well for web apps as it does for “normal” programs. To give a feel for the problem, there's a toy sketch of frequency-based splitting below.
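Here's a minimal sketch of the general idea, not the papers' actual algorithm: given a table of word frequencies (which, in their approach, would be mined from source code), pick the split of a URL segment whose words have the highest total score. The `word_freq` table and the example segment are made up for illustration.

```python
def best_split(s, word_freq):
    """Split s into the sequence of known words with the highest
    total frequency score, using dynamic programming."""
    n = len(s)
    # best[i] = (score, words) for the best split of the prefix s[:i]
    best = [(float("-inf"), [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            word = s[j:i]
            if word in word_freq and best[j][0] > float("-inf"):
                score = best[j][0] + word_freq[word]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1] or [s]  # fall back to the unsplit string

# Made-up frequency table, e.g. mined from the app's code or URLs.
word_freq = {"add": 50, "to": 120, "cart": 30, "toc": 2, "art": 8}

print(best_split("addtocart", word_freq))  # -> ['add', 'to', 'cart']
```

The interesting part, and the part the papers actually address, is where the scores come from; a naive dictionary lookup like this is exactly what falls over on abbreviations.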
Another interesting aspect of the paper, one that might be useful in our research but that isn't related to either problem above: the group used square roots and natural logs to dampen the effects of certain quantities. In their problem, they estimated the likelihood that a string was actually a distinct word based both on how often it appeared in the code/comments of the particular program currently being split and on a dictionary built from lots of different programs. They wanted the information from the current program to carry more weight in deciding how words should be split, so they took the square root of the outside data to dampen it. I think this is interesting for us because it might help with some of the issues we've been running into when analysing the number of times something happens vs. the percentage of the time it happens.
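A toy sketch of the dampening idea (the combining formula here is illustrative, not the paper's exact scoring function, and all the counts are made up): local evidence counts at full weight, while corpus-wide counts are pushed through a square root so they can't swamp it.

```python
import math

def word_score(word, local_counts, global_counts):
    """Local counts at full weight, global counts dampened by sqrt."""
    local = local_counts.get(word, 0)
    glob = global_counts.get(word, 0)
    return local + math.sqrt(glob)

local_counts = {"cart": 40, "crt": 0}        # made-up per-app counts
global_counts = {"cart": 900, "crt": 2500}   # made-up corpus counts

# Raw sums would favor "crt" (0 + 2500 = 2500 vs. 40 + 900 = 940);
# dampening lets the strong local evidence for "cart" win instead.
print(word_score("cart", local_counts, global_counts))  # 40 + 30 = 70.0
print(word_score("crt", local_counts, global_counts))   # 0 + 50 = 50.0
```

The same shape of trick might apply to our counts-vs-percentages problem: keep the measurement we trust at full weight and dampen the noisier one rather than dropping it entirely.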