Like painters or authors, programmers tend to have a unique style in which they code. Across thousands of lines of code, they leave behind a sort of personal “signature”.
Now researchers have found that machine learning can be used to identify the authors of pieces of code, even when those programmers are anonymous.
How does it work?
The machine learning system they have developed can ‘de-anonymize’ programmers by analyzing the patterns in raw source code or compiled binaries.
As told to Wired, an algorithm is trained to identify a programmer’s coding patterns and then uses them to spot similar traits in other code samples.
The best part about this system is that it doesn’t necessarily require large portions of code – even short snippets are sufficient for identification.
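To make the idea of "coding patterns" concrete, here is a toy Python sketch. It is not the researchers' system, whose features and model are not described here. It assumes a handful of hand-picked style features (indentation, brace placement, naming convention, line length) and a simple nearest-profile comparison. The author names "alice" and "bob", the sample snippets, and the function names are all hypothetical.

```python
import math
import re
from collections import defaultdict


def style_features(code):
    """Turn a snippet into a small vector of layout and naming habits.

    These five features are illustrative assumptions, not the feature set
    used by the researchers.
    """
    lines = [ln for ln in code.splitlines() if ln.strip()]
    n = max(len(lines), 1)
    idents = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    m = max(len(idents), 1)
    return [
        sum(ln.startswith("\t") for ln in lines) / n,       # tab-indented lines
        sum(ln.strip() == "{" for ln in lines) / n,         # braces on their own line
        sum("_" in name.strip("_") for name in idents) / m,  # snake_case names
        sum(bool(re.search(r"[a-z][A-Z]", name)) for name in idents) / m,  # camelCase names
        sum(len(ln) for ln in lines) / n / 80.0,            # mean line length, scaled
    ]


def train(samples):
    """Average each author's feature vectors into one style profile."""
    grouped = defaultdict(list)
    for author, code in samples:
        grouped[author].append(style_features(code))
    return {
        author: [sum(col) / len(col) for col in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def predict(profiles, code):
    """Return the author whose profile is closest to the snippet's features."""
    vec = style_features(code)
    return min(profiles, key=lambda author: math.dist(profiles[author], vec))


# Hypothetical labelled samples: two short C snippets per made-up author.
SAMPLES = [
    ("alice", "int sumItems(int itemCount) {\n\tint runningTotal = 0;\n\treturn runningTotal;\n}\n"),
    ("alice", "void printName(char *userName) {\n\tputs(userName);\n}\n"),
    ("bob", "int sum_items(int item_count)\n{\n    int running_total = 0;\n    return running_total;\n}\n"),
    ("bob", "void print_name(char *user_name)\n{\n    puts(user_name);\n}\n"),
]

if __name__ == "__main__":
    profiles = train(SAMPLES)
    unknown = "int maxValue(int firstNum, int secondNum) {\n\treturn firstNum > secondNum ? firstNum : secondNum;\n}\n"
    print(predict(profiles, unknown))  # prints "alice"
```

Even this three-line unknown snippet carries enough layout and naming cues to match one profile over the other, which is the intuition behind identifying authors from short samples. A real system would need far richer features and many more authors than this sketch handles.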
In a presentation at Defcon, the researchers Rachel Greenstadt and Aylin Caliskan explained that this AI-based technology was relatively accurate, if not entirely foolproof.
They tested code submitted by 600 programmers, with eight samples each, and the system correctly identified the author 83% of the time.
The pros and cons
This technology has its own pros and cons. On one hand, it can prove useful to investigators, especially in identifying malware authors. It can even help resolve plagiarism disputes, as the machine-learning-based system can tell coincidental similarities from outright copying.
The downside is that it can create difficulties for coders who prefer to contribute code anonymously. Programmers sometimes have legitimate reasons to remain unknown, and being identified is not always a good thing.
Therefore, any future implementation of this technology will have to strike a careful balance between the need for security and the need for privacy.