Slashdot Mirror


HTML Encoded Captchas

rangeva writes to tell us about a twist he has developed on the common Captcha technique to discourage spam bots: HECs encode the Captcha image into HTML, thus presenting an unsolved challenge to the bots' programmers. From the writeup: "The Captcha is no longer an image and therefore not a resource they can download and process. The owner of the site can change the properties of the Captcha's HTML, making it unique,... add[ing] another layer of complication for the bot to crack." HECs are not exactly lightweight — the one on the linked page weighs in at 218K — but this GPL'd project seems like a nice advance on the state of the art.

28 of 177 comments

  1. I failed to see how this'll help by Rosco+P.+Coltrane · · Score: 5, Interesting

    At the end of the day, this captcha is displayed on the screen as a colorful harder-to-read mumbo-jumbo, just like jpeg captchas, so all a bot has to do is use a html renderer to turn it into a regular image that can be processed. So the added complication is linking one of the existing captcha decoders and the gecko engine for example, maybe a half day's work. Not exactly uncrackable...

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    1. Re:I failed to see how this'll help by Anonymous Coward · · Score: 3, Informative

      Well, considering that the sample captcha is just a large table where every pixel is set as a background color, I'd say it would probably be a ten line perl script you can write in a lot less than half a day work.
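A minimal sketch of the kind of script described above, in Python rather than Perl. It assumes each pixel is a `<td>` carrying an inline `background-color:#rrggbb` style, as in the sample markup quoted elsewhere in this thread; the function name `table_to_matrix` and the tiny sample table are made up for illustration.

```python
import re

# Assumed cell markup, modelled on the sample quoted in the thread:
# <td style='height:1px;width:1px;background-color:#rrggbb'></td>
ROW_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
COLOR_RE = re.compile(r"background-color:\s*#([0-9a-fA-F]{6})", re.IGNORECASE)


def table_to_matrix(html):
    """Return a list of rows, each a list of (r, g, b) tuples."""
    matrix = []
    for row in ROW_RE.findall(html):
        pixels = []
        for hex_color in COLOR_RE.findall(row):
            value = int(hex_color, 16)
            # Split the 24-bit colour into its red, green and blue bytes.
            pixels.append(((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF))
        if pixels:
            matrix.append(pixels)
    return matrix


# Hypothetical 2x2 "image" standing in for the real table.
sample = (
    "<table>"
    "<tr><td style='height:1px;width:1px;background-color:#ff0000'></td>"
    "<td style='height:1px;width:1px;background-color:#00ff00'></td></tr>"
    "<tr><td style='height:1px;width:1px;background-color:#0000ff'></td>"
    "<td style='height:1px;width:1px;background-color:#fcfbff'></td></tr>"
    "</table>"
)
matrix = table_to_matrix(sample)
print(matrix)
```

The result is an ordinary pixel matrix, which is the point being made: once the colours are in a list, the HTML wrapper has added nothing that an image download would not have provided.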

    2. Re:I failed to see how this'll help by rangeva · · Score: 5, Insightful

      "so all a bot has to do is use a html renderer to turn it into a regular image that can be processed"

      It's not that simple. Since the Captcha is no longer an image that you can download, the bot will first have to locate the position of the Captcha. The owner of the site can modify the layout of the page and Captcha making it unique. By rendering the image into HTML you practically modify the encoding of the image to a new and unique one - making it highly difficult to create a generic bot that will learn to decode all the HTML variations out there.

      The problem today is with automated software that downloads the Captcha images from a pre-defined location (URL) and cracks them. HECs make it much harder to locate this resource.

      Oh and everything is Crackable;)

    3. Re:I failed to see how this'll help by Aladrin · · Score: 3, Interesting

      Even worse, this captcha would be -easier- than a regular one. It lists every pixel as a TD, in rows... So easy to render that it's idiotic. And the image itself is simple as well... The background letters are much lighter in color and could easily be filtered.

      Add in the huge size of the html and the annoyance factor of captchas in general, and this is amazingly stupid.

      --
      "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
    4. Re:I failed to see how this'll help by Aladrin · · Score: 3, Insightful

      I should have added this disclaimer to the post:

      Yes, I see that they recommend adding in random divs and crap. If it's still a table, it's still very very easy to parse, even without a parser. If they intend for you to replace the table with 'random elements' ... Do you KNOW how hard it would be to get it to show up correctly on each different browser? Another nightmare.

      --
      "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
    5. Re:I failed to see how this'll help by Giorgio+Maone · · Score: 2, Informative

      Gecko is absolutely overkill there: the HTML "encoding" is pretty lame, as the image is entirely made of 1px table cells, each one carrying its color information inlined in the style attribute.

      Just one Perl line can extract the color matrix and pass it straight to your OCR algorithm.

      Maybe if they used JavaScript to render the table on the client side, that would require Gecko or something like that (SpiderMonkey or Rhino would likely suffice), but still the complexity of a captcha cracker is noise reduction and character recognition, rather than image decoding.

      That said, I've seen no "Content-encoding: gzip" in their response: gzip encoding cannot be remotely compared to jpeg compression, but it would nevertheless cut the weight of a very redundant HTML table by a 1:16 factor or more... (hurry up guys, you've been slashdotted!)
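As a rough illustration of the gzip point, the sketch below builds a synthetic table of 1px cells and compresses it with Python's standard `gzip` module. The table size, the eight-colour palette and the helper name `build_table` are assumptions chosen for the demo, not measurements of the real page, so the printed ratio says nothing definite about the 218K original.

```python
import gzip
import random

# Hypothetical palette; the real captcha's colours are not known here.
PALETTE = ["#fcfbff", "#abcdef", "#ffaabb", "#102030",
           "#405060", "#708090", "#a0b0c0", "#d0e0f0"]


def build_table(width, height, seed=0):
    """Build a pixel-per-cell HTML table with pseudo-random colours."""
    rng = random.Random(seed)  # seeded so the demo is repeatable
    rows = []
    for _ in range(height):
        cells = "".join(
            "<td style='height:1px;width:1px;background-color:%s'></td>"
            % rng.choice(PALETTE)
            for _ in range(width)
        )
        rows.append("<tr>" + cells + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


html = build_table(100, 30).encode("ascii")
compressed = gzip.compress(html)
print("raw: %d bytes, gzip: %d bytes, ratio: %.1f:1"
      % (len(html), len(compressed), len(html) / len(compressed)))
```

On a real server the same effect would normally come from enabling compression in the web server configuration rather than compressing by hand.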

      --
      There's a browser safer than Firefox, it is Firefox, with NoScript
    6. Re:I failed to see how this'll help by Jerf · · Score: 2, Insightful

      Oh, piffle. That's not hard either.

      The "HTML renderer" in question will be either Mozilla or IE, both of which offer through Javascript the ability to find the absolute position of an element, and its absolute width and height. So the only "hard" part left is identifying the HTML location of the test, probably with something like XPath, or Mozilla's DOM Inspector which already allows you to just click on the element (and maybe go up in the hierarchy a bit.)

      And I'm pretty sure the spammers already have programs to make it easy to have a human do just the hard parts, like identifying the location of the test, because I'm pretty sure that I've seen them have that sort of program to figure out the form field names easily. (Unique blogs, that is, blogs not based on any common software, have gotten blog spam too quickly and thoroughly before for any other explanation to make sense.)

      You can try to move the test around, but you're right back to an arms race (which is where we already were, so no progress), and it's one where the spammers have a system that automatically notifies them of when they need to make changes.

      The only spam solution is total moderation of the comment queue. If everyone did that there would be no spam anymore. (Somewhat ironically.)

  2. Render, PrintScr, OCR? by Frogular · · Score: 3, Interesting

    Can't the bot simply render and OCR it?

    A better solution might be the authentication system old 386 games had where you have to do some simple but human intelligence requiring task. "Find the word in the upper right of manual pg 4" -> "Enter the 3rd word from the following paragraph"

    1. Re:Render, PrintScr, OCR? by Geoffreyerffoeg · · Score: 2, Funny

      human intelligence requiring task

      "Prove or disprove P=NP. (You have 500 characters remaining.)"

  3. watermarking by dattaway · · Score: 2, Interesting

    How about watermarking the captcha with the site's address and a short message?

  4. Bad form by Zaph0dB · · Score: 5, Insightful

    I think using a captcha like this one (html-table rendered) is bad web-manners. The rendering of such a table, pixel by pixel, is a huge toll on browsers. Even on my (relatively) new and (relatively) powerful machine, it took Firefox a noticeable amount of time to render the image, and caused my hard drive to crunch a little. I don't even want to imagine less powerful machines or, random-fluctuation-of-time-and-space forbid, mobile devices. All in all, I think this method severely limits the users accessing this site.

    --
    When in danger or in doubt, run in circles, scream and shout [Robert Heinlein]
    1. Re:Bad form by the_womble · · Score: 3, Informative

      It did not take a noticeable time to either download or render: Firefox, linux and dialup.

  5. workaround... by zozzi · · Score: 5, Informative
    Spammers already have a workaround for captchas:

    1. Show the image in an alternate pornographic/warez/whatever website

    2. Ask the user to type it in to access the site

    3. Use the user's input to access the original protected site

    4. There is no step 4.

    --
    ---
    1. Re:workaround... by rjamestaylor · · Score: 2, Funny

      Brilliantly devious. Hundreds of pr0n-seeking addicts are itching at any given moment to get their fix. Only problem is that there probably aren't enough CAPTCHAs available on the web to meet the pr0n-seekers demand! Either free "inventory" will be given away for repeated CAPTCHA solving or, if repeats not used, CAPTCHA won't be available and will frustrate the frustrated seeker even more. So, PhpBB-admins do your part: enable CAPTCHAs to meet the demand!

      --
      -- @rjamestaylor on Ello
    2. Re:workaround... by Phillup · · Score: 5, Funny

      When it comes to porn, I'm no slouch and I can count the number of times I've seen sites that give you free access after entering a captcha on one hand.

      One hand eh?

      Guess we don't really need to ask how you know this...

      --

      --Phillip

      Can you say BIRTH TAX
  6. A captcha is still a captcha by Cee · · Score: 4, Interesting

    One of the main objections to a captcha is that an attacker could steal the image file and simply use it on their site (XXX sites...) to get it "cracked".
    An HTML generated captcha would prevent that, since there is no image file to copy.
    However, what prevents the attacker from simply copying the relevant HTML source and putting it on his or her site, just like the image? Sure, you can make it quite complicated by adding CSS layers and whatnot, but in the end that would merely be an extra annoyance.

    And stopping the attacker from using OCR on the captcha won't really work either. It's not that hard to render HTML code to an image, which you can feed to the OCR software.

    In short, this hack is just another step in the arms race, that just buys us some time.

  7. Re:What are the gotchas with these captchas by YrWrstNtmr · · Score: 4, Insightful

    Blind, color blind, text only browsers, more of a hassle, just to name a few.

  8. Do others use such spam-bot blockers? by msobkow · · Score: 2, Interesting

    I've had sessions that took an inordinately long time to initialize with various web service providers (it's very noticeable on dial-up.) I'm wondering whether similar techniques might be used to attack rather than defend, possibly including rogue AJAX code.

    --
    I do not fail; I succeed at finding out what does not work.
  9. Screen Captcha! by mrmeval · · Score: 2, Interesting

    It's easy no?

    The file size is what intrigues me. Just make a 'hidden' captcha that a bot would download. Now figure out how to make a jpeg decompressor uncompress that to 2 gigs or better.

    It's like the old "I'll compress 2gigs of the letter A with zip and upload it to that BBS and let the virus checker gag" gag.

    Or maybe a gif file. I wonder how solid black or white compress......

    --
    I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
  10. Lunacy by Stormx2 · · Score: 4, Interesting

    Lunacy! I've made apps which can do this sort of thing before, and this one is totally unoptimized! Take a look at this:

    With the limited amount of colours used, it would make much more sense to
    a) give the table an id, then:
    table.tabid td { width:1px; height:1px; }
    b) give some classes for each colour used
    td.colid { background-color: blah; }

    I'm sure that would halve the source code size... How can you trust a HTML solution that hasn't even been properly thought through?

  11. Processing by jones_supa · · Score: 2, Interesting

    The Captcha is no longer an image and therefore not a resource they can download and process.

    Err...but the HTML captcha is a resource they can download and process.

  12. Re:What are the gotchas with these captchas by Nyh · · Score: 2, Insightful

    Or just users who have the settings for Firefox on 'Always use my colors' because they don't like the angry fruit salads of most sites.

    Nyh

  13. Captcha's are annoying by tacocat · · Score: 4, Insightful

    While this has little to do with the original post I have a really annoying experience with captchas

    I have 20/20 vision and am not color blind. Captchas are becoming so complicated and garbled that I get the code wrong about 40% of the time. Another portion of the time I take too long trying to answer the code question and type in the right characters. I typically get screwed on the number Zero and the letter 'O' and lowercase 'L' and the number 1.

    It's becoming, for me, an entry barrier to signing up and gaining access to websites. It would be much easier to simply use email authentication. What do you do with the people who are color blind? I spent some years dealing with display design and this was a legitimate concern that we addressed at the time for a specialized group of people. In the common population there are a lot more occurrences of people who are color blind.

    Are captchas really worth the effort compared to other more human friendly processes? Is anyone working on what we will be doing next? Considering that there are decades of machine vision technology to pull from I think it will be fairly trivial for the bots to become better at reading captchas than humans.

    It might be effective to take the email authentication process and apply everything that mail servers do to authenticate the user. What I mean by this is apply all the mail server rules like FQDN requirements for HELO, fully resolvable email domains, valid email addresses, non-open relays. Much of this would eliminate either the bots or the ISP's who are too stupid to properly configure a mail server. Similarly it might be sufficient to code the HTML/HTTP to expect a properly responding client and not some hacked up bot that can't do most of it right.

  14. Broken by Kurayamino-X · · Score: 5, Interesting

    All text based captchas are broken, it doesn't matter how they're rendered, they're still a pre-defined set of characters that a bot can pick out eventually. Now, the "Click three kittens" captcha, that was fucking genius, no bot on the planet will be able to tell the difference between a kitten and a ham sandwich. Why isn't it being used? People seem to think obscuring text and making it harder for humans to read is a better idea than using something a computer will not be able to identify.

    --
    ...I got nothing.
  15. A matter of time by superbrose · · Score: 2, Interesting

    The advantage of this captcha is that it is not widespread yet and so the chances that a bot can crack it are lower.

    Funny that when OCR software is supposed to work it often fails, but when there is some effort to hinder recognition then bots can deal with that. Maybe general OCR software should try to crack input instead!

  16. 218k of junk by suv4x4 · · Score: 2, Informative

    This GPL-ed project can be reproduced by a junior coder in an hour so the fact it's GPL-ed I guess isn't of so much help.

    Also on the subject of it being 218k, each pixel looks like:

    ... tr... <td style='height:1px;width:1px;background-color:#fcfbff'></td> ... /tr...

    which is badly redundant, the very first thing is you can make all "td"-s in the table be 1px/1px with a simple: table.captcha td {width:1px; height:1px} rule, then background-color can be shortened to just "background" and still be valid.

    Furthermore you don't need a table with rows and columns: if you float the pixels to the left, then you only need a container of the right width and columns/rows will naturally form. To keep it down we can style a shorter tag for our purposes, like <b>

    So at this stage we arrive at the much simpler:

    <b style="background:#abcdef"></b>

    But this can be simplified even further by indexing the colors used as around 40-50 CSS classes (given the image has a lot more than 40-50 pixels and 40-50 colors are enough for it, it's still a net gain), for example: .cA {background:#abcdef} .cB {background:#ffaabb}, at which point we get not only more obfuscation for the captcha crackers to solve, but much lighter code:

    <b class="cA"></b>

    and again the original:

    ... tr... <td style='height:1px;width:1px;background-color:#fcfbff'></td> ... /tr...

    And this is before we start putting JavaScript in the picture...
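To put rough numbers on the size argument, here is a sketch in Python that emits the same pixel matrix both ways: once as inline-styled `<td>` cells and once as floated `<b>` elements with one CSS class per colour. The function names, the `cap` container class and the 20x10 three-colour test image are all invented for the example, and whether the floated version renders identically in every browser is not checked here.

```python
def inline_html(matrix):
    """Original style: one <td> per pixel with everything inlined."""
    rows = []
    for row in matrix:
        cells = "".join(
            "<td style='height:1px;width:1px;background-color:%s'></td>" % color
            for color in row
        )
        rows.append("<tr>" + cells + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def indexed_html(matrix):
    """Floated <b> pixels, with each distinct colour mapped to a CSS class."""
    palette = {}
    for row in matrix:
        for color in row:
            # Assign class names c0, c1, ... in order of first appearance.
            palette.setdefault(color, "c%d" % len(palette))
    width = len(matrix[0])
    css = ".cap{width:%dpx}.cap b{float:left;width:1px;height:1px}" % width
    css += "".join(".%s{background:%s}" % (name, color)
                   for color, name in palette.items())
    body = "".join('<b class="%s"></b>' % palette[color]
                   for row in matrix for color in row)
    return '<style>%s</style><div class="cap">%s</div>' % (css, body)


# Hypothetical 20x10 image using three colours.
colors = ["#abcdef", "#ffaabb", "#fcfbff"]
matrix = [[colors[(x + y) % 3] for x in range(20)] for y in range(10)]

big = inline_html(matrix)
small = indexed_html(matrix)
print("inline: %d bytes, indexed: %d bytes" % (len(big), len(small)))
```

The saving comes almost entirely from the per-pixel markup shrinking; the stylesheet is paid for once, so it matters less as the image grows.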

  17. No need to download the image by lintux · · Score: 5, Interesting

    There's no need to download the image. Look at the source. Somewhere it says:

    Now, just go to MD5Lookup.Com and convert that little "hidden" MD5Sum back to the original text:

    ad6ade8a0b6e2f748b80a390ff45cf31 - &NMTB

    Maybe the author should add some salt. :-)
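A small Python sketch of why an unsalted MD5 of a short answer gives so little protection, and of one possible fix. Note the swap: instead of a plain salt it uses a keyed HMAC-SHA256, which is a different technique from the one suggested above. The alphabet, the three-character answer "K7Q", the key and all function names are assumptions for the demo, not details of the HEC project.

```python
import hashlib
import hmac
import itertools
import string

# Assumed answer alphabet; the real captcha's character set may differ.
ALPHABET = string.ascii_uppercase + string.digits


def unsalted_token(answer):
    """What the page appears to ship: a bare MD5 of the answer."""
    return hashlib.md5(answer.encode("utf-8")).hexdigest()


def brute_force(token, length):
    """Try every string of the given length until the MD5 matches."""
    for candidate in itertools.product(ALPHABET, repeat=length):
        guess = "".join(candidate)
        if unsalted_token(guess) == token:
            return guess
    return None


def keyed_token(answer, secret_key):
    """Alternative: HMAC with a key that never leaves the server."""
    return hmac.new(secret_key, answer.encode("utf-8"), hashlib.sha256).hexdigest()


def check_answer(submitted, token, secret_key):
    """Constant-time comparison of the submitted answer against the token."""
    return hmac.compare_digest(keyed_token(submitted, secret_key), token)


leaked = unsalted_token("K7Q")          # hypothetical hidden form field
recovered = brute_force(leaked, 3)      # 36**3 candidates at most
print("recovered:", recovered)

SECRET_KEY = b"hypothetical server-side secret"
token = keyed_token("K7Q", SECRET_KEY)
print("valid:", check_answer("K7Q", token, SECRET_KEY))
```

Even a keyed token can be replayed if it is not tied to a session or an expiry time, so keeping the expected answer on the server is probably the simpler design.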

  18. Clever but no cigar. by MikeFM · · Score: 2, Insightful

    Locating the captcha in the rendered page can't take more than a couple seconds. You'd have to change it a lot to change that. It's a blocky, colorful, bit of screen near a form submit button. Even if you change it there are only so many ways you can change it without making it confusing to users. If a user can find it then I can write a script to find it.

    It's a useful tool to slow down script kiddies but it won't stop anyone that could actually write the code to grab the characters in the image in the first place.

    --
    At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.