English Shell Code Could Make Security Harder
An anonymous reader writes to tell us that finding malicious code might have just become a little harder. Last week at the ACM Conference on Computer and Communications Security, security researchers Joshua Mason, Sam Small, Fabian Monrose, and Greg MacManus presented a method they developed to generate English shell code [PDF]. Using content from Wikipedia and other public works to train their engine, they convert arbitrary x86 shell code into sentences that read like spam, but are natively executable. "In this paper we revisit the assumption that shell code need be fundamentally different in structure than non-executable data. Specifically, we elucidate how one can use natural language generation techniques to produce shell code that is superficially similar to English prose. We argue that this new development poses significant challenges for in-line payload-based inspection (and emulation) as a defensive measure, and also highlights the need for designing more efficient techniques for preventing shell code injection attacks altogether."
quite terrifying :(
If hackers convert arbitrary x86 shell code into sentences that read like spam, but are natively executable .. we're all screwed :(
We'll either need to tighten up how architectures execute instructions to make it harder to execute shell code in the first place.. or come up with sophisticated AI to help filter out the shell code. Of course, as soon as we do that, hackers will develop AIs which can write convincing (and even compelling) shell code.. and THEN what the hell do we do.
Now where I live you can get a pretty decent hair cut for $17 (they even trim up the beard). You can't get anything fancy.. but a decent, professional-ish type haircut is definitely no problem.
My employer is giving us a pretty generous Christmas vacation.. really looking forward to that!!
Also this time of year is great cause CHRISTMAS is everywhere :D
just formatted my hard disk and installed Windows 7 - how low can you get :(
Been there, done that, paid for the T-shirt
and didn't get it
They don't mean shell commands. They mean code that exploits a vulnerability in order to start a shell.
"I assumed blithely that there were no elves out there in the darkness"
Good job not reading the article.
It's not that shellcode can be written in text and then compiled to an executable form. It's not that shellcode can be compiled to an intermediary form, translated or compiled into machine instructions by a piece of code (this is common in malware now, to pass input restrictions -- as the article says). It's that the executed machine instructions themselves -- the compiled binary data that can be run raw on an x86 processor -- looks like English text.
Why, this very comment prints a list of prime numbers less than one hundred!
org.slashdot.post.SignatureNotFoundException: ewg
Now your brain can catch a virus just by reading!!!1
And nothing in their article is helping with that. They assume they are exploiting a software vulnerability. If I know there is a software vulnerability, there are 1 million and 1 less complex ways for me to blow right by any inline scanner. (One stupid enough not to look and see what the actual bytes were anyway)
Nope, you're confusing assembly code and shell/machine code, which are two different things.
Assembly is text-based, and is readable for people who know the language. Each operation is a keyword, and some take arguments. It's basically the lightest-weight possible programming language (although it's not really considered a programming language, it's so light weight!) A computer cannot run assembly code directly.
Machine code is what you get if you take the assembly and run it through an assembler to produce code that the computer can understand. The computer can then execute it. It is not human readable unless you've memorized which opcodes correspond to which assembly keywords. Far easier to pipe it through a disassembler to get the assembly code back and read that.
To answer the GP's question this sounds like they mean shell code. It wouldn't be very useful as assembly code anyway. ("To claim your free iPod, run this sentence through masm and run the resulting EXE file.") Most people don't have an assembler and the ones who do aren't usually susceptible to malware anyway.
This is the sixth spam message this user has posted, will SLASHDOT please BAN this guy already? Come on.
An assembler/compiler doesn't necessarily use a high-level language input.
In this instance they (as you say) 'takes as input executable machine code and generates executable machine code with a very narrowly-defined statistical property' which tells me they have an assembler that reads executable code and assembles executable code that looks like English text, in other words an assembler.
--- Reality doesn't care about your opinions, it happens anyway and if you are in the way you'll get squished.
This is the sixth spam message this user has posted, will SLASHDOT please BAN this guy already? Come on.
He must be making new logins. I've seen him posting for a few weeks, he surely has more than 6 spams that I've seen alone. Going on that idea... lets see:
http://slashdot.org/~coolforsale117
http://slashdot.org/~coolforsale116
http://slashdot.org/~coolforsale115
http://slashdot.org/~coolforsale114
http://slashdot.org/~coolforsale112
http://slashdot.org/~coolforsale110
No doubt there is a TON of them. So I'd guess they are banning him, he just keeps making new uids (and siphoning a ton of moderation points to keep him marked at troll / offtopic). I know I've used many mod points keeping this bastard down.
Consume more trains, Elvis! He, and snorkels, drink elephant's sock puppet master. Steamed cabbage can reverse big piles of ducks. Additionally, cheese log cabin nightmare.
You're screwed now, x86 suckas!
It's latex with an ACM template. I'm pretty sure their workflow was latex (.dvi) to dvips (.ps) to Acrobat Distiller (.pdf).
It's a translator that takes any arbitrary x86 machine code as input, and produces as output functionally equivalent self-modifying machine code that starts off looking like English text. The same approach also works with other non-x86 machine codes, and other languages, such as Russian, French, etc... Very interesting work. It goes to show that for an OS to allow any code to self-modify can produce results that are very difficult to predict. Self-modifying code has an almost biological nature.
"Please type the following on your command-line:
rm -rf *
Thank you."
This talk was probably my favorite at CCS this year. Unlike MANY researchers, the lead author of this paper was quite entertaining. Regarding the work itself, there are a few details that the current discussion has missed.
First, I would not say that they can convert arbitrary shell code to English-like prose. Rather, the only instructions that can be used are the ones that are identical to the ASCII encoding of the alphabet. For instance, the ASCII encoding of the letter "r" is identical to the binary for the unconditional jmp instruction. Granted, the authors showed that you can do a lot with this limited set of instructions, but I still wouldn't call it arbitrary.
Second, he showed several examples of the sentences created. They make about as much sense as "Lorem ipsum dolor sit amet..." The tight constraints on the instructions that can be encoded into ASCII make crafting decent English syntax nearly impossible. Spam filters based on natural language processing could probably detect and flag them.
While disguising the binary as ASCII is cool, I don't see that it's all that different than other exploits. Once a sentence containing an exploit is detected, you'll have signatures just like any other type of virus/trojan. I highly doubt that contemporary anti-virus scanners stop working on data that looks like ASCII. Rather, they look for tell-tale signs of particular instructions that appear in particular orders, etc.
And, as many others have pointed out, this code is only harmful if it is executed in the right context (i.e., you have a vulnerability to exploit). Disguising the code as ASCII doesn't really make it different than any other type of zero-day attack.
This work was very sophisticated, and there's no way that script kiddies could build something like this. I don't know that more advanced attackers would bother, because I really don't see all that much of a payoff given the amount of work that this attack requires. It's a whole lot easier to take over a vulnerable web server and launch a XSS attack. The incentives simply do not seem to suggest that this technique will become widespread.
So, no, I don't think the sky is falling because of this attack. Having said that, though, this was a very cool piece of work.
TFA uses the security community's special term "(a) shellcode", which means something other than what it sounds like to ordinary programmers.
"A shellcode" is the infection head of an exploit - the thing you try to get to run on the target to make the rest of the exploit work. It's in the machine language of the target, not a shell language.
It's called "a shellcode" because it typically (but not necessarily) tries to sucker the system into launching a shell to run the rest of the exploit. The rest of the exploit may be in a shell language (depending on the shell to interpret it), a machine language executable, etc. Or "the shellcode" may do something else than launch a shell.
This is one of the latter cases. It's a chunk of self-modifying code (due to the limits of what instructions you can get out of English-looking text) that bootstraps its own internals into something that can act as an interpreter (or other executor) for the rest of the English-looking exploit code, then runs though that code and "makes it happen".
You can think of it as a binary executable program that depends on self-modification to get away with consisting only of combinations of bytes that look enough like English to fool spam filters which are trying to recognize executable code.
So it's a very goofy binary and there are no shells or shell languages involved. Instead (if I read this right) the researchers built a very screwy assembler that takes as input an assembler source program and produces as output some VERY screwy machine code that looks like English and ends up doing the same job in a roundabout way, rather than being the direct translation of the assembler code input.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
I for one is very impressed by what they've done, even if it is somewhat similar to what I did nearly 15 years ago:
At that time I wrote what's probably the "best" executable text encoder for MsDos, it uses the absolute minimum possible amount of self-modification (a single 2-byte Jcc opcode) while staying entirely within the MIME text character set, and survives all the most usual forms of reformatting/reflowing of the text. (Replacing CRLF with a single CR (Mac) or LF (unix) or turning each paragraph into a single line.)
The initial bootstrap looks like this:
ZRYPQIQDYLRQRQRRAQX,2,NPPa,R0Gc,.0Gd,PPu.F2,QX=0+r+E=0=tG0-Ju E=
EE(-(-GNEEEEEEEEEEEEEEEF 5BBEEYQEEEE=DU.COM=======(c)TMathisen95
(The uppercase 'E's are my NOP fillers, they execute as INC BP, a register I don't use.)
Terje
PS. Unlike the current guys, I wrote the code above by hand, on paper, during the evenings of a ski vacation. I had brought with me a listing of the ascii encoding of all instructions that would use MIME characters only. :-)
"almost all programming can be viewed as an exercise in caching"
I think this is interesting, but hardly break-through.
In the mid 80's, we did the same thing at a field Hewlett-Packard office, although not aimed at viruses. Our target was to enable users to key in x86 code in text form. In other words, sit down at a PC, open EDLIN (the DOS equivalent of Notepad), or some simple text editor, and key in human readable words (i.e. meaningful text that humans - HP Engineers - could easily transcribe from paper or a phone call). Then save the file as a .com file (which was a DOS executable), and then run it.
Think back to the days of stand-alone PC's, no USB, etc. If the field engineer was at a customer site, and needed to run a small diagnostic program on the PC, but didn't have the tool, then they'd simply call the office, and have the secretary ("coordinator") read them the human-readable sentences to key in. The engineer keys it in, and launches a diagnostic program. Our version even had a check-sum built into the words, so as long as you got the first few sentences exactly right (which were the boot part), then the rest of the "code" (sentences) were examined for check-sums, and would generate a location-specific error (e.g. "Checksum error in the sentence 'Many frightened capsules trigger captain mole".)
I remember this well, because I wrote the boot part, and the checksum algorithm. I made it fairly resilient to normal human typing habits (i.e. don't worry about capitalization, multiple spaces between words, apostrophes, etc). And I tried to choose some easy sentences (manually) for the boot part, since that had to be entered exactly right each time.
The system was made up of a "compiler" which would take a simple .com file (that is, an executable file, not a dot-com website), and convert it to "sentences" (which made little sense). We used a spelling dictionary English words, removing homonyms, as we wanted the words to be "read aloud". We tried to compile into short sentences of specific noun-phrase / verb-phrase formats, but they rarely made any sense. Some were outright silly, like: "The crazed orange melts to school."
It worked great! But it was only practical for very short utilities. Still, it was FAR easier to key in sentences of nonsense, rather than hex code. Our experience was that a typical engineer could key in nonsense sentences about 5 times faster than hex code (even considering that the words had to have extra boot code to analyze the text), although the results varied depending on the length of the overall program.
Then networks came along and rendered it fairly useless.
What is "shell code" supposed to be? Bourne shell scripts?
Someone had to ask it!
From the wikipedia: In computer security, a shellcode is a small piece of code used as the payload in the exploitation of a software vulnerability. It is called "shellcode" because it typically starts a command shell from which the attacker can control the compromised machine. Shellcode is commonly written in machine code, but any piece of code that performs a similar task can be called shellcode. Because the function of a payload is not limited to merely spawning a shell, some have suggested that the name shellcode is insufficient.[1] However, attempts at replacing the term have not gained wide acceptance.
So it's a poor piece of new terminology that has stuck, unfortunately.
Stick Men