Coding is popular in part because of its ability to improve people’s lives. However, the anonymity it offers also attracts unscrupulous programmers.
Such coders take advantage of that anonymity to lurk silently in the dark corners of the web, spreading malware or plagiarizing the work of other developers.
They have no name, no face, no records, and identifying them has been next to impossible … until now.
Two researchers have found that software developers leave their signatures on the code they write … and that machine learning can be used to identify the author even when the code was written anonymously.
Rachel Greenstadt, a computer science associate professor at Drexel University, and her former PhD student Aylin Caliskan, who is now an assistant professor at George Washington University, presented their studies at the DefCon hacking conference on August 10.
The two researchers conducted several studies on machine learning techniques and developed a system that can 'de-anonymize' developers from either compiled binaries or raw source code.
The system could be immensely helpful to investigators trying to identify malware creators, particularly when those creators attempt to frame someone else. The technology could also be useful in plagiarism disputes, where machine learning can help distinguish coincidental similarities from blatant copying.
The technology does, however, raise privacy concerns for programmers who contribute open source code anonymously. Even though it is possible to conceal the code’s origins, the technology may make it tough to contribute open source code anonymously. It can identify open source developers even if they switch accounts to avoid leaving a trail.
How does the technology work?
The algorithm the researchers developed first identifies all the features found in a collection of code samples. That’s hundreds of thousands of different characteristics! The list is then narrowed down to include only those features that actually differentiate programmers from each other, trimming it to roughly 50 features.
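To make the narrowing step concrete, here is a minimal Python sketch that ranks features by information gain, one common way to measure how well a feature separates authors. The article does not say which criterion the researchers' system uses, so the criterion, the function names, and the toy data below are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of author labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Drop in author-label entropy after grouping samples by a feature's value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_top_features(samples, labels, k):
    """Return the k feature names with the highest information gain.

    samples is a list of dicts mapping a feature name to a discrete value.
    """
    names = sorted({name for sample in samples for name in sample})
    scored = [
        (information_gain([s.get(name, 0) for s in samples], labels), name)
        for name in names
    ]
    # Highest gain first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Toy data, invented for illustration: two authors, three features.
samples = [
    {"uses_tabs": 1, "has_main": 1, "ternary_count": 0},
    {"uses_tabs": 1, "has_main": 1, "ternary_count": 1},
    {"uses_tabs": 0, "has_main": 1, "ternary_count": 0},
    {"uses_tabs": 0, "has_main": 1, "ternary_count": 1},
]
labels = ["alice", "alice", "bob", "bob"]
top = select_top_features(samples, labels, k=1)
print(top)
```

In this toy set only `uses_tabs` splits the two authors cleanly, so it is the feature that survives; a feature every sample shares, such as `has_main`, carries no information about who wrote the code.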
Instead of depending on low-level features, Greenstadt and Caliskan created ‘abstract syntax trees’ that reflect the underlying structure of the code.
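As a rough illustration of what an abstract syntax tree captures, the sketch below uses Python's built-in `ast` module to count node types and measure tree depth for two snippets that compute the same result in different styles. The researchers worked with other languages and tooling, so Python's parser is only a stand-in here, and the helper name `ast_features` is hypothetical.

```python
import ast
from collections import Counter


def ast_features(source):
    """Count AST node types and measure the tree's maximum depth."""
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))

    def depth(node):
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    return counts, depth(tree)


# Two ways of summing the same numbers, written in different styles.
loop_style = "total = 0\nfor x in [1, 2, 3]:\n    total += x\n"
builtin_style = "total = sum([1, 2, 3])\n"

loop_counts, loop_depth = ast_features(loop_style)
builtin_counts, builtin_depth = ast_features(builtin_style)

print(loop_counts["For"], loop_counts["AugAssign"])   # explicit loop constructs
print(builtin_counts["For"], builtin_counts["Call"])  # a function call instead
```

The structural counts differ even though both snippets do the same job, and they would stay the same if the variables were renamed or the whitespace changed. That is the kind of signal a layout-independent feature can carry.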
The technology also needs examples of a developer’s work so that the algorithm can be taught to recognize another code sample by the same developer when it comes across one. In a 2017 paper, the researchers, working with two other researchers, showed that even small pieces of code on the repository site GitHub are enough to distinguish one coder from another with a high level of precision.
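The train-then-recognize idea can be sketched in a few lines of Python. The example below builds one feature profile per author from labeled samples and attributes an unknown sample to the closest profile by cosine similarity. This nearest-profile approach is a simple stand-in chosen for illustration, not the classifier described in the researchers' papers, and the feature names and counts are invented.

```python
import math
from collections import Counter, defaultdict


def train(labeled_samples):
    """Sum feature counts per author to form one profile per author."""
    profiles = defaultdict(Counter)
    for author, features in labeled_samples:
        profiles[author].update(features)
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse feature-count mappings."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(profiles, features):
    """Return the author whose profile is most similar to the sample."""
    return max(sorted(profiles), key=lambda author: cosine(features, profiles[author]))


# Hypothetical training data: feature counts from code by known authors.
training = [
    ("alice", {"For": 3, "AugAssign": 2, "Call": 1}),
    ("alice", {"For": 2, "AugAssign": 2}),
    ("bob", {"Call": 4, "ListComp": 2}),
    ("bob", {"Call": 3, "ListComp": 1, "For": 1}),
]
profiles = train(training)

unknown = {"Call": 2, "ListComp": 1}
guess = attribute(profiles, unknown)
print(guess)
```

With more labeled samples per author, the profiles become more stable, which is why the approach depends on having examples of a developer's earlier work.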
In a separate paper, Caliskan and another team of researchers demonstrated that it is feasible to identify a developer using just their compiled binary code. After a programmer finishes writing a section of code, a program known as a compiler converts the code into a series of 0s and 1s, called binary, which can be read by a machine.
The researchers decompiled the binary back into the C++ programming language, while elements of the programmer’s unique style were preserved.
To carry out the binary experiment, the team of researchers used code samples from Google’s Code Jam. Using eight code samples from each developer, the algorithm correctly identified the authors in a group of 100 developers 96% of the time. When the pool was extended to 600 developers, the algorithm correctly identified individual developers 83% of the time.
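For context on those percentages, the short Python sketch below shows how such an accuracy figure is computed and what a uniform random guess would score on pools of the same size. It assumes every developer is equally likely to be the author; the predictions in the example are made up and are not the researchers' data.

```python
def accuracy(predicted, actual):
    """Fraction of samples attributed to the correct developer."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def chance_baseline(num_developers):
    """Expected accuracy of guessing uniformly at random among the developers."""
    return 1 / num_developers


# Made-up predictions for four samples: three of the four are correct.
example = accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"])
print(f"example accuracy: {example:.0%}")

# Random guessing on pools the size of those in the experiment.
print(f"chance with 100 developers: {chance_baseline(100):.2%}")
print(f"chance with 600 developers: {chance_baseline(600):.2%}")
```

Under that assumption, random guessing would be right about 1% of the time with 100 developers and well under 1% with 600, which puts the reported 96% and 83% figures in perspective.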
© 2019 THE TECHNOLOGY HEADLINES. ALL RIGHTS RESERVED.