Future Hack: New Cybersecurity Tool Predicts Breaches Before They Happen
An anonymous reader writes: A new research paper (PDF) outlines security software that scans and scrapes web sites (past and present) to identify patterns leading up to a security breach. It then accurately predicts which websites will be hacked in the future. The tool has an accuracy of up to 66%. Quoting: "The algorithm is designed to automatically detect whether a Web server is likely to become malicious in the future by analyzing a wide array of the site's characteristics: For example, what software does the server run? What keywords are present? How are the Web pages structured? If your website has a whole lot in common with another website that ended up hacked, the classifier will predict a gloomy future. The classifier itself always updates and evolves, the researchers wrote. It can 'quickly adapt to emerging threats.'"
Why is this on /., the article is absolute crap!
Let's take WordPress sites out of this equation and see how accurate this tool is.
It then accurately predicts what websites will be hacked in the future. The tool has an accuracy of up to 66%.
So... by "accurately," you mean "not really all that accurately at all."
-- CanHasDIY, am I really this lazy? Yes, apparently.
Precrime Division has had it for years.
Given enough time all of the sites on the Internet will eventually be hacked?
Looking at the top "features" they identified, most are just various tags that mean WordPress is in use. So they learned that WordPress sites tend to get hacked. Duh. The WordPress team isn't interested in security. I demonstrated an exploit for a serious vulnerability in WordPress and submitted it to their bug tracker. For two years it sat, with one WP developer saying "it can't be exploited" - even though I attached an exploit directly to the tracker issue. Two years later, the vulnerability was added to a 'sploit kit and thousands of sites were compromised over the course of just a few days. That's when WP finally got around to patching the clear and significant vulnerability.
I see TFA claims "66% accuracy". "All sites will be hacked at some point" is about 50% accurate. I bet we could have 66% accuracy simply by saying "sites running PHP 5.2 or below will be hacked."
That's like a 16% improvement over the quarter I flip...
You're absolutely right. That is seriously one of the shittiest sites I've seen Slashdot link to.
Come on, Slashdot editors. Please! Why are you doing this? Why are you systematically ruining Slashdot more than it already has been ruined by putting shitty submissions like this on the front page? Why?
Even if we ignore that the linked article is complete garbage, why is this even on Slashdot?
Five or ten years ago, before the software industry was ravaged by wave after wave of shitbrained PHPers, JavaScripters and Ruby on Railists, this sort of analysis was the first thing you'd do when setting up a new server!
In the 1990s and early 2000s, the first thing you'd do when setting up a server was make sure it was running Solaris, FreeBSD or OpenBSD. All three are known to be very secure by default. Then a secure web server like Apache would be used, if necessary. If any custom web app software was used, it had to be written in a secure language like Java, Tcl, Python or even Perl. PHP was not allowed. Ruby was not even considered.
Yeah, things are different today. People use shitty PHP and Ruby software, running on shitty Linux distributions like Ubuntu, using obscure web servers, and then wonder why their shit gets cracked and broken into. Come on. If you do stupid stuff, you're going to feel pain. If you use half-assed software, your server is going to get compromised! IT'S REALLY FUCKING OBVIOUS!
I can predict for most sites that they will be hacked eventually, because they do not have anything resembling a secure set-up. But predicting when? That is impossible. Likely this tool gets even its pathetic 66% only due to cherry-picked test data (also known as "lying" in scientific circles).
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
66% = "could happen."
It little behooves the best of us to comment on the rest of us.
100% chance it will be hacked and used as a launching point for EVARYTHANG!!!
"My immediate reaction is "WTF? What kind of moron doesn't make things 64-bit safe to begin with?" Linus
Is there a page somewhere where I can query the results to see how my own site goes?
66% of all websites get hacked. So if you predict EVERY website will get hacked, you'll be right 66% of the time.
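The arithmetic behind that point can be sketched as follows. The 66-out-of-100 split is the premise of the comment above, not measured data, and the `accuracy` helper is a hypothetical name used only for illustration.

```python
def accuracy(predictions, actual):
    """Fraction of predictions that match what actually happened."""
    matches = sum(p == a for p, a in zip(predictions, actual))
    return matches / len(actual)

# Assumed population: 66 of 100 sites end up hacked (the comment's premise).
actual = [True] * 66 + [False] * 34

# A "classifier" that ignores its input and predicts every site gets hacked.
always_hacked = [True] * len(actual)

print(accuracy(always_hacked, actual))  # 0.66
```

Under that premise, a predictor with no information at all already matches the headline figure, which is why accuracy alone says little without the base rate.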
New cyber security tool doesn't work!
-1 disagree is not a modifier for a reason. -1 troll, flamebait, redundant, overrated are NOT acceptable substitutes.
Oh, it predicts hacks before they happen. Wow. That's so much better than predicting hacks after they happen.
Illiterate fuckers.
The "inferred third value" is almost certainly the probability/score/confidence level, and it's normally included for machine-learning or any classifier algorithm, such as one that makes a yes/no decision based on a numeric value within a range. You'll see it a lot with spam filters. It's required because the USER chooses at which threshold they wish to take certain actions.
I'm going to use the spam filter example because that's one many people are familiar with, specifically SpamAssassin. It will score a message like this:
Body includes the word "free": 2 points
HTML and text parts are different: 1 point
Sent through an open relay: 2 points
Tiny font: 1 point
From address default whitelist: -3 points
Adding up the scores, the total score for that email is 3 points. The server admin can configure how many points are required before an email is placed in the spam box, and how many are required before the email is deleted outright. Note that the choice of how high the score needs to be to be considered spam is completely separate from the algorithm generating those scores. One admin might be very tough on spam and decide that anything over 2 points is treated as spam. Another admin might be more lenient and set it to 4, so anything 4 or higher is treated as spam. The ROC informs the admin as to the results of different settings. A threshold of 2 will obviously have more false positives than a threshold of 4.
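Here is a minimal sketch of that scoring, just to make the separation between score and threshold concrete. The rule descriptions and point values are copied from the example above; they are illustrative and are not SpamAssassin's real rule names or default scores. The helper names are made up, and treating "at or above the threshold" as spam is an assumption.

```python
# Illustrative rules from the example above (not real SpamAssassin rule IDs).
RULE_SCORES = {
    'body includes the word "free"': 2,
    "HTML and text parts are different": 1,
    "sent through an open relay": 2,
    "tiny font": 1,
    "From address on default whitelist": -3,
}

def total_score(matched_rules):
    """The algorithm's job: add up the points for every rule that matched."""
    return sum(RULE_SCORES[rule] for rule in matched_rules)

def is_spam(score, threshold):
    """The admin's job: pick the threshold. Assumes score >= threshold means spam."""
    return score >= threshold

score = total_score(RULE_SCORES)  # this message matched all five rules
print(score)                      # 3
print(is_spam(score, 2))          # True  (strict admin)
print(is_spam(score, 4))          # False (lenient admin)
```

The same message with the same score lands in the spam box on one server and the inbox on another, without the scoring code changing at all.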
Note again the choice of threshold to take some action is selected by the USER, not by the group who designed the algorithm. In the case of this predictive tool, a web hosting company might choose to have the following policies:
No site with a risk score over 80 can be hosted on our servers.
Any site with a score over 40 will be informed and our security team will offer assistance in making the site more secure.
Those policies of what to do at different score thresholds are completely separate from the algorithm; the team who wrote the paper doesn't choose the thresholds for specific actions. Instead, the graph informs the web hosting company "at a risk score of 80, you can expect 5% false positives. At a risk score of 40, you can expect 15% false positives".
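A rough sketch of how one point on that graph is computed, and how a host might apply the two policies above. The scores and labels are made-up data, not results from the paper, and the function names are hypothetical; the 80 and 40 cutoffs are the example policy thresholds from this comment.

```python
def false_positive_rate(scores, hacked, threshold):
    """Of the sites that were never hacked, what fraction score above the threshold?"""
    clean = [s for s, h in zip(scores, hacked) if not h]
    return sum(s > threshold for s in clean) / len(clean)

def true_positive_rate(scores, hacked, threshold):
    """Of the sites that were hacked, what fraction score above the threshold?"""
    bad = [s for s, h in zip(scores, hacked) if h]
    return sum(s > threshold for s in bad) / len(bad)

def hosting_policy(risk_score):
    """The host's policy, chosen by the host and independent of the classifier."""
    if risk_score > 80:
        return "refuse hosting"
    if risk_score > 40:
        return "notify and offer help"
    return "no action"

# Made-up risk scores and outcomes for ten sites (True = later hacked).
scores = [95, 85, 70, 55, 45, 30, 20, 10, 60, 90]
hacked = [True, True, True, False, True, False, False, False, False, False]

# Each threshold gives one (false positive rate, true positive rate) point on the ROC curve.
for threshold in (80, 40):
    fpr = false_positive_rate(scores, hacked, threshold)
    tpr = true_positive_rate(scores, hacked, threshold)
    print(threshold, round(fpr, 2), round(tpr, 2))
```

Lowering the threshold catches more of the sites that really get hacked but also flags more clean ones, which is the trade-off the host reads off the curve before setting its policy.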
So in other words it could be 0% accurate...
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)