Birthmarks for GPL

Apparently some folks at Saarland University in Germany have come up with an automated method to detect code theft, and they call it birthmarking. The method as demonstrated uses Java, but the poster at Slashdot claims that the method “could be used to detect GPL violations in particular.” For those interested, here is the link to the paper on which that claim is based. Perhaps of most interest to this readership, the paper explicitly links the method to evidence questions in disputes such as the IBM and SCO code fight.

Hat Tip: Slashdot.

2 thoughts on “Birthmarks for GPL”

James Grimmelmann August 26, 2007 at 10:25 pm

k-gram call sequence analysis can be surprisingly powerful; it does seem that a program’s sequence of API calls is a reasonably hard-to-change property. As an undergraduate, I saw a presentation of some of the “computer immunology” research cited by this paper: the reasoning goes that a novel sequence of API calls is evidence of a new program running (which could be a piece of malware). This paper just inverts that logic: a familiar sequence of API calls is evidence that the “new” program is an old program in disguise.
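
To make the k-gram idea concrete, here is a minimal sketch; it is not the paper's implementation. It assumes a program run has already been reduced to a list of API call names, and the traces are invented for illustration. The function names `kgram_birthmark` and `containment` are hypothetical.

```python
def kgram_birthmark(calls, k=3):
    """Return the set of distinct length-k windows over a call trace."""
    return {tuple(calls[i:i + k]) for i in range(len(calls) - k + 1)}


def containment(suspect, original):
    """Fraction of the original's k-grams that also occur in the suspect."""
    if not original:
        return 0.0
    return len(original & suspect) / len(original)


# Invented traces: the "disguised" program wraps the original's calls
# in extra logging calls; the unrelated one uses the calls in another order.
original_trace = ["open", "read", "parse", "close"]
disguised_trace = ["log", "open", "read", "parse", "close", "log"]
unrelated_trace = ["open", "seek", "read", "close"]

original_mark = kgram_birthmark(original_trace)
disguised_score = containment(kgram_birthmark(disguised_trace), original_mark)
unrelated_score = containment(kgram_birthmark(unrelated_trace), original_mark)
print(disguised_score, unrelated_score)
```

In this toy case every k-gram of the original survives inside the disguised trace, while the unrelated trace shares none, even though it uses mostly the same individual calls.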

I’m surprised that they don’t extend the binary nature of this “birthmark” to keep a statistical count of the sequences. At least for programs executing similar tasks, it would seem that the frequency of a k-gram would be even more revealing than whether that sequence was executed at all.
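
A rough sketch of what such a frequency-based variant might look like, under the same assumption that a run is available as a list of API call names. This is an illustration of the suggestion, not something the paper does. The traces and the names `kgram_counts` and `weighted_overlap` are made up.

```python
from collections import Counter


def kgram_counts(calls, k=2):
    """Count how often each length-k window occurs in a call trace."""
    return Counter(tuple(calls[i:i + k]) for i in range(len(calls) - k + 1))


def weighted_overlap(a, b):
    """Multiset Jaccard: shared occurrences over total occurrences."""
    union = sum((a | b).values())  # | keeps the larger count per k-gram
    return sum((a & b).values()) / union if union else 0.0  # & keeps the smaller


# Invented traces: one program loops over read/parse, the other does it once.
loop_heavy = ["read", "parse"] * 4 + ["close"]
loop_light = ["read", "parse", "close"]

heavy, light = kgram_counts(loop_heavy), kgram_counts(loop_light)
all_present = set(light) <= set(heavy)  # the binary, set-based view
count_score = weighted_overlap(heavy, light)
print(all_present, count_score)
```

Here the binary view reports that every k-gram of the short trace appears in the long one, while the count-based score is much lower because the two traces use those k-grams in very different proportions.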

Bruce Boyden August 27, 2007 at 9:03 am

I’m curious why this works. Wouldn’t two programs that do roughly the same thing, e.g., Word and WordPerfect, make a lot of the same API calls in the same order? But I notice there’s barely any overlap between the various PNG and XML readers tested in the paper.
