Slashdot Mirror


Future Hack: New Cybersecurity Tool Predicts Breaches Before They Happen

An anonymous reader writes: A new research paper (PDF) outlines security software that scans and scrapes web sites (past and present) to identify patterns leading up to a security breach. It then accurately predicts what websites will be hacked in the future. The tool has an accuracy of up to 66%. Quoting: "The algorithm is designed to automatically detect whether a Web server is likely to become malicious in the future by analyzing a wide array of the site's characteristics: For example, what software does the server run? What keywords are present? How are the Web pages structured? If your website has a whole lot in common with another website that ended up hacked, the classifier will predict a gloomy future. The classifier itself always updates and evolves, the researchers wrote. It can 'quickly adapt to emerging threats.'"

33 comments

  1. Utter garbage by Anonymous Coward · · Score: 0

    Why is this on /., the article is absolute crap!

  2. WordPress? by Anonymous Coward · · Score: 0

    Let's take WordPress sites out of this equation and see how accurate this tool is.

    1. Re:WordPress? by Penguinisto · · Score: 1

      True - and how is it that they say they're not counting vulns when that is precisely what they're doing (albeit counting past vulns and extrapolating...)

      --
      Quo usque tandem abutere, Nimbus, patientia nostra?
  3. "Accurately" by Anonymous Coward · · Score: 0

    It then accurately predicts what websites will be hacked in the future. The tool has an accuracy of up to 66%.

    So... by "accurately," you mean "not really all that accurately at all."

    -- CanHasDIY, am I really this lazy? Yes, apparently.

  4. Nothing New Here by sehlat · · Score: 1

    Precrime Division has had it for years.

  5. Isn't the correct answer: by jmauro · · Score: 1

    Given enough time all of the sites on the Internet will eventually be hacked?

    1. Re:Isn't the correct answer: by mark-t · · Score: 1

      Not necessarily true.... some sites on the internet are not of general interest to enough people to ever draw the attention of somebody who might even want to hack them.

    2. Re:Isn't the correct answer: by Penguinisto · · Score: 1

      Exception:
      My ancient and long-dead first domain/site ever never got hacked, and it never will: I shuttered it in 2001 (-ish) when I sold the domain name (spark.org). ;)

      --
      Quo usque tandem abutere, Nimbus, patientia nostra?
    3. Re:Isn't the correct answer: by K.+S.+Kyosuke · · Score: 1

      You seem to be assuming that being an HTTP server implies having security holes.

      --
      Ezekiel 23:20
    4. Re:Isn't the correct answer: by bloodhawk · · Score: 1

      A large percentage of attacks are performed by automated tools searching for targets. They don't give a shit if the site is of huge interest or your Granny's blog talking about how cute her poodle is. Check your logs: even your home computers will be receiving regular port scans and knocks on various ports/protocols to see if there is anything to attack.

    5. Re:Isn't the correct answer: by vux984 · · Score: 3, Insightful

      The premise was "given enough time...".

      By taking the site down, you limited the time.

      That's not an "exception", that's violating the premise.

  6. Mostly Wordpress, then. 50% accurate: all sites by raymorris · · Score: 5, Informative

    I see that of the top "features" they identified, most are just various tags that mean Wordpress is in use. So they learned that Wordpress sites tend to get hacked. Duh. The Wordpress team isn't interested in security. I demonstrated an exploit for a serious vulnerability in Wordpress and submitted it to their bug tracker. For two years it sat, with one WP developer saying "it can't be exploited" - even though I attached an exploit directly to the tracker issue. Two years later, the vulnerability was added to a 'sploit kit and thousands of sites were compromised over the course of just a few days. That's when WP finally got around to patching the clear and significant vulnerability.

    I see TFA claims "66% accuracy". "All sites will be hacked at some point" is about 50% accurate. I bet we could have 66% accuracy simply by saying "sites running PHP 5.2 or below will be hacked."

    1. Re:Mostly Wordpress, then. 50% accurate: all sites by Anonymous Coward · · Score: 0

      sounds like pre-crime profiling for websites
      lucky software don't have human rights

    2. Re:Mostly Wordpress, then. 50% accurate: all sites by Anonymous Coward · · Score: 0

      When people use ROC curves in studies like this they are generally trying to hide inaccuracies. I mean, who plots two points--whether something was inaccurate or accurate--against an inferred third value (usually number of samples or time) rather than just putting out the accuracy data directly? The authors of this paper even go so far as to write a separate section justifying why they aren't doing this; shouldn't the graphs be enough to stand on their own without a separate section justifying their format?

  7. 16% Improvement! by mythosaz · · Score: 2

    That's like a 16% improvement over the quarter I flip...

    1. Re:16% Improvement! by dfsmith · · Score: 0

      No, it's up to 16% of a quarter flip.

  8. This was elementary analysis in the '90s and 2000s by Anonymous Coward · · Score: 0

    You're absolutely right. That is seriously one of the shittiest sites I've seen Slashdot link to.

    Come on, Slashdot editors. Please! Why are you doing this? Why are you systematically ruining Slashdot more than it already has been ruined by putting shitty submissions like this on the front page? Why?

    Even if we ignore that the linked article is complete garbage, why is this even on Slashdot?

    Five or ten years ago, before the software industry was ravaged by wave after wave of shitbrained PHPers, JavaScripters and Ruby on Railists, this sort of analysis was the first thing you'd do when setting up a new server!

    In the 1990s and early 2000s, the first thing you'd do when setting up a server was make sure it was running Solaris, FreeBSD or OpenBSD. All three are known to be very secure by default. Then a secure web server like Apache would be used, if necessary. If any custom web app software was used, it had to be written in a secure language like Java, Tcl, Python or even Perl. PHP was not allowed. Ruby was not even considered.

    Yeah, things are different today. People use shitty PHP and Ruby software, running on shitty Linux distributions like Ubuntu, using obscure web servers, and then wonder why their shit gets cracked and broken into. Come on. If you do stupid stuff, you're going to feel pain. If you use half-assed software, your server is going to get compromised! IT'S REALLY FUCKING OBVIOUS!

  9. 66%? Worthless trash... by gweihir · · Score: 3, Interesting

    I can predict for most sites that they will be hacked eventually, because they do not have anything resembling a secure set-up. But predicting when? That is impossible. Likely this tool gets even its pathetic 66% only due to cherry-picked test data (also known as "lying" in scientific circles).

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:66%? Worthless trash... by Anonymous Coward · · Score: 0

      Very possible. They just have to perform the attack.

    2. Re:66%? Worthless trash... by iiii · · Score: 1

      My algorithm does better than 66% and I'm open sourcing it right here...
      (Predicts whether site will be hacked between now and the destruction of earth)

      public boolean willSiteBeHacked(Vector whateverYouFeelLike) {
              return true;
      }

      You can't disprove my claim.

      --
      Light cup, beer drink, thin so chain, neck turtle fat, man I won't say it again
    3. Re:66%? Worthless trash... by ThatAblaze · · Score: 1

      I'm pretty sure your algorithm would be worse than 50%. It basically amounts to "which event comes first? A) site gets hacked or B) site gets taken down."

      I think more sites get taken down every day than get hacked.

  10. ... accurately predicts .. by CaptainDork · · Score: 1

    66% = "could happen."

    --
    It little behooves the best of us to comment on the rest of us.
  11. Runs PHP? by certain+death · · Score: 1

    100% chance it will be hacked and used as a launching point for EVARYTHANG!!!

    --
    "My immediate reaction is "WTF? What kind of moron doesn't make things 64-bit safe to begin with?" Linus
  12. Results? by manu0601 · · Score: 1

    Is there a page somewhere where I can query the results to see how my own site goes?

  13. What a coincidence. by Kazoo+the+Clown · · Score: 1

    66% of all websites get hacked. So if you predict EVERY website will get hacked, you'll be right 66% of the time.

    1. Re:What a coincidence. by aaronb1138 · · Score: 1

      Wouldn't it just be easier to aggregate information from social media sites using a weighted system. Just put 4Chan at the top of the weighting, with Facebook next and use separate weighting scales for positive versus negative mention counts. Both are valid predictors, so it should work and get closer.

      I'm glad one of my side jobs is setting up IPS / IDP and similar security on firewalls. I'll never be thirsting for work.

  14. In totally unrelated news by Mr.+Freeman · · Score: 1

    New cyber security tool doesn't work!

    --
    -1 disagree is not a modifier for a reason. -1 troll, flaimbait, redundant, overrated are NOT acceptable substitutes.
  15. Meaningless tautology by Anonymous Coward · · Score: 0

    Oh, it predicts hacks before they happen. Wow. That's so much better than predicting hacks after they happen.

    Illiterate fuckers.

  16. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  17. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  18. It's a confidence score. Normal for binary decisio by raymorris · · Score: 1

    The "inferred third value" is almost certainly the probability/score/confidence level, and it's normally included for machine-learning or any classifier algorithm, such as one that makes a yes/no decision based on a numeric value within a range. You'll see it a lot with spam filters. It's required because the USER chooses at which threshold they wish to take certain actions.

    I'm going to use the spam filter example because that's one many people are familiar with, specifically Spamassassin. It will score a message like this:
    Body includes the word "free": 2 points
    HTML and text parts are different: 1 point
    Sent through an open relay: 2 points
    Tiny font: 1 point
    From address default whitelist: -3 points

    Adding up the scores, the total score for that email is 3 points. The server admin can configure how many points are required before an email is placed in the spam box, and how many are required before the email is deleted outright. Note that the choice of how high the score needs to be to be considered spam is completely separate from the algorithm generating those scores. One admin might be very tough on spam and decide that anything over 2 points is treated as spam. Another admin might be more lenient and set it to 4, so anything 4 or higher is treated as spam. The ROC informs the admin as to the results of different settings. A threshold of 2 will obviously have more false positives than a threshold of 4.
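
    As a rough sketch of the scoring described above (not SpamAssassin's actual rule set or API), the point values from the example can be summed and compared against an admin-chosen threshold. The rule names and the helper functions `score_message` and `is_spam` are hypothetical, and "at or above the threshold" is an assumed reading of the cutoff:

```python
# Hypothetical rule names with the point values from the example above.
# This is an illustration only, not SpamAssassin's real rules or API.
RULE_SCORES = {
    "body_contains_free": 2,
    "html_text_parts_differ": 1,
    "sent_via_open_relay": 2,
    "tiny_font": 1,
    "from_default_whitelist": -3,
}


def score_message(matched_rules):
    """Sum the points of every rule that matched the message."""
    return sum(RULE_SCORES[rule] for rule in matched_rules)


def is_spam(score, threshold):
    """Assumed cutoff: spam when the score is at or above the threshold."""
    return score >= threshold


# In the example, every listed rule matched the message.
total = score_message(RULE_SCORES)

# The scoring is fixed; only the admin's threshold changes the verdict.
strict_verdict = is_spam(total, threshold=2)
lenient_verdict = is_spam(total, threshold=4)
print(total, strict_verdict, lenient_verdict)
```

    The same score of 3 is spam for the strict admin and not spam for the lenient one, which is the point: the threshold lives in the configuration, not in the scoring code.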

    Note again the choice of threshold to take some action is selected by the USER, not by the group who designed the algorithm. In the case of this predictive tool, a web hosting company might choose to have the following policies:

    No site with a risk score over 80 can be hosted on our servers.
    Any site with a score over 40 will be informed and our security team will offer assistance in making the site more secure.

    Those policies of what to do at different score thresholds are completely separate from the algorithm; the team who wrote the paper doesn't choose the thresholds for specific actions. Instead, the graph informs the web hosting company "at a risk score of 80, you can expect 5% false positives. At a risk score of 40, you can expect 15% false positives".
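
    To make that concrete, here is a minimal sketch of how one point on such a graph could be computed and how a host might apply its own policy on top. The sample data is made up purely for illustration (it is not from the paper and will not reproduce the 5% and 15% figures above), and `false_positive_rate`, `true_positive_rate` and `hosting_action` are hypothetical helpers:

```python
# Made-up (risk_score, was_later_hacked) pairs, purely for illustration.
# These are NOT results from the paper.
SAMPLES = [
    (95, True), (85, True), (82, False), (70, True), (60, False),
    (55, True), (45, False), (30, False), (20, False), (10, False),
]


def false_positive_rate(samples, threshold):
    """Fraction of never-hacked sites whose score is over the threshold."""
    negatives = [score for score, hacked in samples if not hacked]
    flagged = [score for score in negatives if score > threshold]
    return len(flagged) / len(negatives)


def true_positive_rate(samples, threshold):
    """Fraction of hacked sites whose score is over the threshold."""
    positives = [score for score, hacked in samples if hacked]
    flagged = [score for score in positives if score > threshold]
    return len(flagged) / len(positives)


def hosting_action(score, refuse_over=80, notify_over=40):
    """Hypothetical host policy; the thresholds are the host's choice."""
    if score > refuse_over:
        return "refuse"
    if score > notify_over:
        return "notify"
    return "host"


# Each threshold gives one (false positive rate, true positive rate) point;
# sweeping the threshold over all values traces out the ROC curve.
for threshold in (80, 40):
    fpr = false_positive_rate(SAMPLES, threshold)
    tpr = true_positive_rate(SAMPLES, threshold)
    print(f"threshold {threshold}: FPR {fpr:.2f}, TPR {tpr:.2f}")
```

    Lowering the threshold catches more of the sites that really do get hacked, at the cost of flagging more sites that never do. Which trade-off is acceptable is a business decision for the host, not something the classifier's authors can decide for them.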

  19. The tool has an accuracy of up to 66% by TemporalBeing · · Score: 1

    So in other words it could be 0% accurate...

    --
    Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)