Mystery of Duqu Programming Language Solved
wiredmikey writes "Earlier this month, researchers from Kaspersky Lab reached out to the security and programming community in an effort to help solve a mystery related to 'Duqu,' the Trojan often referred to as 'Son of Stuxnet,' which surfaced in October 2011. The mystery rested in a section of code written in an unknown programming language and used in the Duqu Framework, a portion of the Payload DLL used by the Trojan to interact with Command & Control (C&C) servers after the malware has infected a system. Less than two weeks later, Kaspersky Lab experts now say with a high degree of certainty that the Duqu framework was written using a custom object-oriented extension to C, generally called 'OO C' and compiled with Microsoft Visual Studio Compiler 2008 (MSVC 2008) with special options for optimizing code size and inline expansion."
Oh no, Allens do exist. Although he spells it Alan.
Here you go: http://www.securelist.com/en/blog/667/The_Mystery_of_the_Duqu_Framework
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
A well publicized article featuring Microsoft Development products of all things, I think they should use that PR in their Microsoft Visual Studio Ads...
"Enjoy what you're doing! If it becomes drudgery, you're doing it wrong!" - Jim Butterfield
Different languages compile down very differently. Indeed, different compilers compile the same source code differently (try comparing GCC output to Visual Studio output and you'll see some obvious differences in how the assembly/machine code is crafted). In this case, there were clear signs of an object-oriented approach (data and functions were located around each other in memory, which is not likely to happen in non-OO languages, etc).
It seems they recognized a sequence of instructions that are typical of a class constructor, just not like any class constructor they were familiar with.
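For anyone who hasn't run into the style, here is a minimal sketch of what "OO C" generally looks like: a struct that carries its own function pointers, plus a hand-written constructor that fills them in. The `Counter` type and every function name here are made up for illustration; this is not code from Duqu, and the real framework's object layout may well differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical "class": data and method pointers live in one struct. */
typedef struct Counter Counter;
struct Counter {
    void (*add)(Counter *self, int n);  /* "method" slot */
    int  (*get)(const Counter *self);   /* "method" slot */
    int value;                          /* data member */
};

static void counter_add(Counter *self, int n) { self->value += n; }
static int  counter_get(const Counter *self)  { return self->value; }

/* Hand-written constructor: allocate the object, wire up the method
 * pointers, then initialize the data. A compiler has no fixed template
 * for this, so the emitted code depends on how the author wrote it. */
Counter *counter_new(int start)
{
    Counter *c = malloc(sizeof *c);
    assert(c != NULL);
    c->add = counter_add;
    c->get = counter_get;
    c->value = start;
    return c;
}

/* Matching "destructor". */
void counter_free(Counter *c) { free(c); }
```

A caller writes `c->add(c, 3)` and passes the object explicitly, where C++ would pass `this` implicitly by a fixed convention. Because the constructor is ordinary hand-written C, its compiled form need not match the patterns a C++ compiler emits, which would fit the "constructor, but not one we recognize" observation.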
There are certain characteristics to the way C++ behaves (the manner in which you pass parameters, etc). Mainly, through having looked at lots and lots of code samples, they can say what they expect the compiled code to look like. If they know C++ compiled code looks like x, regular C looks like y, and this looked like z, it can't be either one. Essentially, the code did things you simply can't do in C++ or C (even Objective C) by itself. The problem is, that method only allows you to compare to known languages. More details here.
It's basically like identifying an animal by footprint. Once you know a deer leaves a certain kind of footprint, you can identify more deer by examining footprints. But you can't identify an unknown animal that way: if you haven't seen a given footprint before, you won't know what animal it is, only what general characteristics it has (weight, etc.)
Knowing the language and techniques used can speed up analysis of future variants found, because they'll know what patterns to look for first.
You do not have a moral or legal right to do absolutely anything you want.
FTFA:
Why did the authors of Duqu use OO C? While there is no easy explanation why OO C was used instead of C++ for the Duqu Framework, Kaspersky experts say there are two reasonable causes that support its use [More control over the code & Extreme portability]. These two reasons indicate that the code was written by a team of experienced ‘old-school’ developers
Why OO C? Because it worked, because they knew how to use it, because they knew it would throw Kaspersky for a loop, because they thought it was cool. There are many, many reasons and they do not all have to be logical.
Kaspersky experts might want to consider that the programming wheel of life may have turned and that what was once old-school is now new-school. Who's to say that the underestimated script-kiddies cannot grow up to be formidable adults with a whole new bag of tricks?
They did open the lines up for suggestions, and some community members suggested that it looked like OO C. How did they know? They probably had experience using and debugging OO C, if I had to guess. There were also plenty of people who said that it definitely wasn't compiler X or language Y from their own experiences. The article links to this discussion: http://www.securelist.com/en/blog/677/The_mystery_of_Duqu_Framework_solved
But about discovering the specifics of the truth? It's probably like you alluded to in your comment - fingerprinting the machine code. It would take a while, but you could come up with fingerprints for a great many various compilers and features. You could do that for Common Lisp, too. (In fact, someone DID suggest for them to look at various LISP dialects.) It has taken long enough that such a scenario - having a good library of fingerprints - is believable. Given a scanner with a dictionary of fingerprints, one could reasonably say that you either have hand-assembled machine code made to mimic another language, or that you have code generated by a very specific language and compiler. If nothing in your library of fingerprints matched, assuming you had a good handle on hand-assembling machine code, you could look and see if it smells like such a beast. It would be tremendously laborious to hand-assemble code to make it look like a specific compiler generated it, and why would you do that in the first place? I fail to see the benefit when you could just use that compiler. If you were trying to throw off the analysts with a false positive match, there would still be a ton of mysterious data that still needs examination.
Think about DNA analysis. We can look at our DNA and determine that some chunks of it came from viruses, and that some of it is "junk" that serves no purpose.
Also think about image analysis like OCR or various captcha-breaking software. You can map images to characters with a program, and detect anomalies and known signatures.
Then there is heuristic antivirus scanning. It knows enough to find some previously unthought-of malicious code, even if it does sometimes generate false positives.
So why not apply those techniques to machine code, and see what you get? If multiple methods give you similar results, you would be onto something, I imagine.
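As a rough sketch of the "dictionary of fingerprints" idea, a scanner can hold byte patterns with wildcard positions and report which known pattern, if any, shows up in a blob of machine code. Everything below (the `Fingerprint` struct, `WILD`, both function names) is hypothetical; real signature libraries are far richer than a list of byte strings.

```c
#include <assert.h>
#include <stddef.h>

#define WILD (-1)  /* pattern position that matches any byte */

/* Hypothetical fingerprint: a label plus a byte pattern with wildcards. */
typedef struct {
    const char *label;
    const int  *pattern;  /* each entry is 0..255 or WILD */
    size_t      length;
} Fingerprint;

/* Return the offset of the first match of fp in code, or -1 if none. */
long find_fingerprint(const unsigned char *code, size_t code_len,
                      const Fingerprint *fp)
{
    assert(code != NULL && fp != NULL);
    if (fp->length == 0 || fp->length > code_len)
        return -1;
    for (size_t i = 0; i + fp->length <= code_len; i++) {
        size_t j = 0;
        while (j < fp->length &&
               (fp->pattern[j] == WILD || fp->pattern[j] == code[i + j]))
            j++;
        if (j == fp->length)
            return (long)i;
    }
    return -1;
}

/* Return the label of the first fingerprint in lib that matches,
 * or NULL when nothing in the library matches. */
const char *identify(const unsigned char *code, size_t code_len,
                     const Fingerprint *lib, size_t lib_len)
{
    for (size_t k = 0; k < lib_len; k++)
        if (find_fingerprint(code, code_len, &lib[k]) >= 0)
            return lib[k].label;
    return NULL;
}
```

A toy library might hold `55 8B EC` (the encoding MSVC typically picks for `push ebp; mov ebp, esp`) next to `55 89 E5` (the encoding GCC typically picks for the same two instructions). A `NULL` result is the interesting case in this story: nothing in the library matched, so either the library is missing an entry or the code came from something unusual.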
For O'Reilly's "Mastering Duqu"?
To tag along - it's hard to tell data from code, and it helps the decompiling app to detect what is code vs. data if it knows which compiler created it.
It looks like the original blog used IDA Pro, which has library signatures for different compilers. It can identify functions and auto-comment the code, making disassembly easier. It can also auto-identify stack variables and keep track of them through lots of PUSH, POP, and RETURN X statements. It's quite powerful.
In this case, IDA probably gave a lot of erroneous warnings or disassembled data or refused to disassemble code, requiring lots of manual work. The classes apparently were done inconsistently, making it hard to even write a plug-in to automatically detect them (scripts exist to identify MSVC objects through their RTTI properties, and do a decent job identifying non-RTTI classes, but this would not work with this code).
http://www.hex-rays.com/products/ida/index.shtml
When you're reverse engineering and your tool basically says "WTF do I do with this?" it's one of those moments where you want to know how the attacker made it.
Is it hand-rolled? Or a new attack creation kit that script kiddies can use to cobble something together?
And "unknown language" was not a really good way to describe it. "Unrecognized output" would have been better. The assumption is that a language like C would compile to C-like output, while C++ would do things differently. But it could have been just C++ with an unknown compiler.