Domain: sitetruth.com
Stories and comments across the archive that link to sitetruth.com.
Comments · 190
-
Other examples. Google still evil.
That's not a lone example. Search with Google for "craigslist auto posting software". These are all paid Google ads:
- "CL Posting Software www.adsoncraigs.com The worlds Best Selling CraigsIist software. Works with new CAPTCHA!"
- "Craigs Works Must Try Us webtrafficus.com We do the work no software To Buy Best Service All Ads Guaranteed Up"
- TopPost Inc. www.toppost.com The Leader in Posting Services 866-895-6888 -- info@toppost.com
- Buy Craiglist accounts Phone verified accounts, hassle-free, only 4.95$/account . www.craigsup.com
We track the "bottom feeders" in Google AdWords over at SiteTruth. We consider about 36% of Google's advertisers, out of a set of 20,000 ad domains, to be "bottom-feeders" - no visible business address, or we have other negative info. If you download AdRater, our Greasemonkey script for Firefox, we rate the advertiser behind every Google ad you see and display a rating icon on top of the ad. (Yes, the plugin "phones home". It tells us lots of stuff about the advertiser, which we're interested in, and very little about the user's browsing, which we don't care about. The plugin is open source, so you can check this.)
With the information we have, it's painfully obvious that Google isn't picky about their advertisers. The example in the article is one of many, not a unique exception.
Google CEO Eric Schmidt was quoted last month as saying "The Internet is fast becoming a cesspool." Was he complaining, or boasting? Much of that is Google's doing.
-
Some pain needs to be applied
If you're serious about blocking phishing sites, you have to accept some collateral damage. Blocking by URL stopped working last year; most attacks have unique URLs now. Many have unique subdomains. So you have to block at the second-level domain level to be effective.
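A rough sketch of what blocking at the second-level domain looks like, as opposed to blocking by full URL. The blocklist entries here are made up for illustration, and the "last two labels" rule is a simplification that ignores suffixes like "co.uk".

```python
from urllib.parse import urlsplit

# Hypothetical blocklist entries; real data would come from a phishing feed.
BLOCKED_DOMAINS = {"example-phish.com", "badhost.net"}


def second_level_domain(hostname: str) -> str:
    """Return the last two labels of a hostname (naive: ignores suffixes like co.uk)."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_blocked(url: str) -> bool:
    """True if the URL's second-level domain is on the blocklist."""
    host = urlsplit(url).hostname or ""
    return second_level_domain(host) in BLOCKED_DOMAINS


# A unique subdomain and a unique path no longer help the attacker.
print(is_blocked("http://a1b2c3.example-phish.com/login?session=42"))
print(is_blocked("http://example.org/"))
```

The collateral damage is visible in the code: everything under a listed domain is blocked, legitimate pages included.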
We publish a list of major domains being exploited by phishing scams. Today, there are 46 domains listed. eBay, for example, is on the list, because eBay has an open redirector exploit. Click on that URL. It says "ebay.com", right? It looks like eBay, right? It's not.
On the other hand, "tinyurl.com", which used to be popular with phishers, has been able to get off the blacklist by cracking down on misuse of their service. It's possible to do redirection competently.
When we started our list last year, it had about 175 exploited domains. After some serious nagging and an article in The Register, we're down to 46. And only 11 have been on the list for more than three months; the others come and go as exploits are reported and holes plugged. So this is a problem that can be solved.
I'm glad to see Google taking a hard line on this. It's necessary that sites that do redirection feel the pain when they accept redirects to hostile sites. Google can apply much more pain than we can. Few sites will want to be on Google's blacklist for long.
-
Wikipedia does reasonably well on this.
Wikipedia already does reasonably well at this. The Wikipedia verifiability and reliable source rules tend to force partisan articles to contain criticism sections, cites to critics, and verifiable negative information. Any cult that's had legal problems will have those prominently mentioned. It's hard to keep a hype article in Wikipedia, although some people keep trying.
Business reliability can be addressed. We do that at SiteTruth. That works because business cannot legally be anonymous. Businesses have a trail of records behind them - corporate filings, credit ratings, criminal records, regulatory filings. Legitimate business sites can be tied back to that information to find out who's behind the business. As for less legitimate business sites, we just move them to the bottom of search results.
Reputation on the Web is a difficult issue. Slashdot has "karma", which helps. The problem on the Web is that not only can one be anonymous, one can create a large number of anonymous identities. (Mostly this is used for spamming; on Wikipedia, it's called "sockpuppetry"). An inter-site karma system, where a single signon accumulated karma from multiple sites, might be useful. It helps if there's some consequence for being a jerk.
So a modest level of web reputation can easily be added to the Web as it exists. Some reasonable solutions are already working, and just need to be deployed more widely.
-
User recommendations would be gamed
If Google uses user comments to affect search, massive attempts would be made by the "search engine optimization" people to game the system. If you thought link farms were bad, phony user farms would be worse. Google won't be able to identify the phonies; they can't even
More fundamentally, there's a scaling problem. As I've pointed out before, the number of raters per site has to be large for rating to work. Rating for movies and TV shows works fine. Hotels might get enough ratings to be useful. Joe's Plumbing will be rated only by Joe, Joe's relatives, and Joe's employees.
CustomizeGoogle and GiveMeBackMyGoogle have some good ideas, although GiveMeBackMyGoogle is probably violating Google's terms of service by redistributing Google search results as a web site. Google lets you annotate their search results via their AJAX API, but you're not allowed to add or delete from their results list. If you want to delete items from Google search results, you have to do that via a browser plug-in. (Note, by the way, that Google's Chrome doesn't allow non-Google browser plug-ins. That's a form of DRM, when you think about it.)
With our SiteTruth system, we're addressing the problem by looking at off-web sources of legitimacy. The first question is always "can we find a name and address for the business behind the web site"? We have about four ways to do that. If none of them work, and they're selling something, they get moved down in our search results. If they do have an address, we look them up in various business databases. Considerable data is available about a business, once you can identify it. Ultimately, we want to make the business's credit rating affect their search results. It's necessary to reach out to those hard off-web data sources to separate the real companies from the bottom-feeders. Yes, the "affiliate" crowd will scream. Tough.
As for bottom-feeders, I really like this site, where someone in Brooklyn, NY, took pictures of the storefronts of every Brooklyn photo company he could find that advertised online. It's very funny. Now that's what Google should be doing with StreetView.
Here's our master plan for cleaning up the Web.
-
They're not all phony
I ran some of these through our SiteTruth system to get legitimacy ratings. None of them rate very high.
- boredatgustavus.net No website - not rated.
- contributegustav.org Redirector - not rated - redirects to "braf.org"
- braf.org Found in Open Directory, has business address, no ads. Turns out to be Baton Rouge Area Foundation, which has a 3-star rating in Charity Navigator and a writeup in Wikipedia, so they're legitimate.
- contributiongustav.org redirects to "braf.org"
- donategustav.org redirects to "braf.org".
- donationgustav.org redirects to "braf.org".
- gustav-hurricane.info No rating, contains frame of empty parking page.
- gustav-hurricane.net SiteTruth says: "Site ownership unknown or questionable. - No Location". It's just a parking page.
- gustav-hurricane.org Frame of Sedo parking page.
- gustav-hurricane.us No website.
- gustav-relief.org No rating - GoDaddy parking page with ads.
- gustavassistance.org redirects to "braf.org".
- gustavattorney.com No rating - GoDaddy parking page with ads.
- gustavcharities.com Rating: "Site ownership unknown or questionable. - No Location". We were too harsh there; the site does have a street address, but it wasn't enough like a mailing address to be picked up. This page was set up by Samaritan's Purse, which gets four stars from Charity Navigator.
- gustavcharity.com Rating: "Site ownership unknown or questionable. - No Location". Samaritan's Purse again.
- gustavclaims.net Rating: "Site ownership unknown or questionable. - No Location". Parked page with ads.
- gustavcontribution.org Redirect to "braf.org".
Thus far, I'm not seeing major scams; just aggressive marketing by existing charities.
(SiteTruth is really the wrong tool for this job, because it's focused on business legitimacy, for which we have databases.)
-
And most of them are webspam
But how many of those trillion pages have unique, useful content? E-mail is over 95% spam, and the web is getting there.
There were about 153 million registered domains at the beginning of the year. The ones from the spam-friendly registrars are mostly junk. Tim Berners-Lee said in 2006 that web junk was becoming a major problem, and it's become worse since then.
If you throw out all the anonymous but commercial domains (we call them "bottom-feeders"), as we do with SiteTruth, the Web looks a lot better. Search engines are getting stricter about this. You don't see that many "landing pages" in Google any more. Bad news for companies like Marchex, the publicly traded web spammer that cranks out all those junk "What you need, when you need it" sites.
"The mass trials are going well. There will be fewer Russians, but better ones." - Greta Garbo in Ninotchka.
-
Social networking site ads nearly worthless
The original article was about ads on social networking sites, which have very low value. This has been discussed over on Search Engine Watch. Google AdWords has "exclusion lists", lists of sites where you don't want your ads to appear. If you have expensive per-click ads, adding MySpace and Facebook to the exclusion list cuts your ad cost without impacting revenue much. You don't want your ad for expensive watches or mortgage refinancing on MySpace; Google will make money from you, but you won't make money.
Remember, 10% of the users produce 50% of the clicks but don't buy much. That 10% seem to be heavy social networking site users. You can buy their clicks, but they don't buy your product. This is a huge money drain on advertisers. There's much discussion in advertiser forums like Search Engine Watch about what to do about this. Tools have been developed to help advertisers filter out the non-producing ad sites. Google was fighting this; their terms of service prohibit AdWords advertisers from exchanging ad performance data. But ways have been developed to do it anyway.
The result of this has been 1) the price of ads on social networking sites is very low, and 2) the ad quality on them is terrible. We track this with SiteTruth AdRater. If you install AdRater (a Firefox plug-in, requires Greasemonkey), a rating icon appears atop each Google ad. Try this on Myspace, and almost all the ads will come up with a red "do-not-enter" sign. Then try, say, Bloomberg, and you'll see plenty of legit companies. The contrast is striking.
Myspace ads seem to be mostly links to ad farms, bottom feeder dating services (one is telling me "Three of your friends have a crush on you", even though I'm not logged into Myspace), and similar junk. Myspace just did a site redesign, and there are far fewer Google ad slots than before. This probably reflects unsold inventory.
Traffic alone is not enough. The users have to buy. Advertisers have now figured this out.
-
Levels of certification
There are already plenty of providers selling crap "domain control only validated" certs. We (as SiteTruth) regard those as having no value, and we encourage others to do the same. If it doesn't have an "L" (location) field, it's worthless. The introduction of those crap "quick SSL" certs poisoned the whole cert industry.
It's a problem that certificates which verify business name and address cost too much. They ought to cost maybe $25 per year. Validation isn't that expensive. That's what registered mail is for.
There used to be some enthusiasm for "web of trust" schemes of certification, but since the bad guys organized into criminal networks, domain farms became popular, and it became easy to get phony GMail accounts in bulk, that approach is obsolete.
-
Re:It's a relaunch of an old API with a new TOS
BOSS is not really new. Yahoo already had the Yahoo Search API, which does essentially the same thing. BOSS is essentially the Yahoo Search API with different terms of service. In particular, BOSS will, in future, allow "monetization". BOSS also allows users to intersperse their own search results with Yahoo's and run ads.
Google used to have a SOAP-based API, but they stopped allowing new users in 2006. It didn't force the caller to display ads. There's still a Google search API, but it's tied to their widgets and has restrictive terms of service.
We support both with SiteTruth: there's a Yahoo search API version and a Google AJAX search version. The interface code is quite different but the end results are similar.
It's not about technology. It's about what you're allowed to do with the data:
- The Yahoo search API terms of service have a rate limit, don't allow you to add ads, but do allow reordering of results.
- The Google AJAX API terms of service don't have a rate limit, restrict presentation to Google's format, and don't allow reordering of results.
- The first rule of the BOSS Terms of Use is that you don't talk about the BOSS terms of use. "You shall not issue a press release or other written public statement regarding this TOU without Yahoo!'s written approval."
- The second rule of the BOSS Terms of Use is that you don't talk about the BOSS terms of use.
- The third rule of the BOSS Terms of Use is that you don't talk about the BOSS terms of use.
-
It's a problem for client side threat scanners
We'd considered doing something like this for ad links. We offer the AdRater plug-in, which checks the legitimacy of advertised sites and puts a rating icon atop each ad. For some ad URLs, we can decode the URL and see what site is being advertised, so we don't have to follow the link. But there are cases where that's not enough. Sometimes the advertised site is just a redirector, and we'd like to follow the redirection chain and rate the ultimate target. Sometimes, the ad links are obfuscated. (Google doesn't do that; DoubleClick does.) For those cases, we'd have to pre-read the ad site from the plug-in in the user's browser, but not render the ad into a window.
If we do that, every advertiser sees a false click-through for every ad displayed. The AdWords advertiser community would not be happy.
This is the same problem AVG hit.
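For the easy case, where the destination is carried in the ad link itself, decoding needs no network access and so generates no false click. A minimal sketch, assuming the target sits in a query parameter; the parameter name "adurl" and the sample click URL are assumptions for illustration, not a documented format.

```python
from urllib.parse import parse_qs, urlsplit


def advertised_host(ad_url, param="adurl"):
    """Return the hostname of the advertised site embedded in an ad link.

    The parameter name is an assumption; returns None when the link
    does not carry a readable target.
    """
    query = parse_qs(urlsplit(ad_url).query)
    targets = query.get(param)
    if not targets:
        return None
    return urlsplit(targets[0]).hostname


# Hypothetical click URL with a percent-encoded destination.
click = "http://ads.example/click?id=7&adurl=http%3A%2F%2Fwww.example.com%2Flanding"
print(advertised_host(click))
```

When this returns None, the link is either obfuscated or a plain redirector, which is exactly the case where the plug-in would have to fetch the page and trigger the click-through problem.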
-
Re:Contact info is better found on the web site.
I don't want my real physical address listed on my domain for the world to see, and I don't have a P.O. box.
We get that a lot. Now go read California Business and Professions Code Section 17538, which applies if you sell to California, and the European Electronic Commerce Directive (2000/31/EC), which applies if you sell in Europe. Anonymous businesses are illegal in most of the developed world. Deal with it.
California prosecutors have used B&P code section 17538 when dealing with complaints against online businesses. If the business didn't comply with the address disclosure requirements, but accepted credit cards, the maximum penalty is six months in jail for that alone. Do anything that brings your anonymous business to the attention of prosecutors, and they have that hammer to hold over you.
-
Contact info is better found on the web site.
There's been a formal study of bad WHOIS data by the Government Accountability Office, the investigative arm of Congress, titled "Prevalence of False Contact Information for Registered Domain Names". They found at least 8% of contact info in WHOIS to be totally bogus. They also, as a test of ICANN, submitted 45 "WHOIS information problem reports", of which 11 resulted in correction and 33 did not. But GAO didn't break down the data by registrar.
We've been interested in this issue at SiteTruth for some time. We take a broader view of "bad" web sites than most; we consider any commercial site that lacks valid business name and address information to be bogus. Over 35% of Google AdWords advertisers fail that test. For advertisers whose ads appear on Myspace, the ratio is much higher.
Originally, we tried to get contact information from WHOIS data, but the data quality was so appallingly bad that we had to develop another approach. We have a system that looks for contact info the way a user would, looking at pages with names like "About", "Contact", and such, trying to find a user-readable street address. We also have some big databases of business addresses to check against. This turns out to work much better than looking at WHOIS data when the goal is to find the business behind the web site.
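To give a feel for the "look where a user would look" approach, here is a toy sketch that scans page text for something shaped like a US street address. The regular expression is a crude stand-in of my own for illustration, not the matcher SiteTruth uses, and it will miss plenty of real addresses.

```python
import re

# Crude pattern: house number, one to four capitalized words, street suffix.
STREET_ADDRESS = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][A-Za-z.]*\s+){1,4}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Way)\b"
)


def find_street_address(page_text):
    """Return the first address-like string in the text, or None."""
    match = STREET_ADDRESS.search(page_text)
    return match.group(0) if match else None


# Hypothetical "Contact" page text.
print(find_street_address("Visit us at 123 Main Street, Springfield."))
print(find_street_address("Email us any time."))
```

A hit from something like this is only a candidate; it still has to be checked against a database of known business addresses before it counts for anything.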
(You can see this info using our AdRater plug-in for Firefox. Download our plug-in to see the ratings for each Google advertiser as the ads go by. Unless you're already blocking all such ads, of course.)
-
Statistics for phishing domains are different.
SiteAdvisor is basically an anti-virus program connected to a web spider; it downloads pages and looks for hostile code. This is valuable as a firewall feature, but it doesn't say much about whether a domain is worth visiting.
PhishTank has a list of sites currently involved in phishing scams. Let's take a look at that. At SiteTruth, we have historical PhishTank data in a database, with 40997 phishing attacks recorded. So when we ask the right question (which is "SELECT SUBSTRING_INDEX(domain,".",-1) AS tld, COUNT(*) as cnt FROM domainnegatives GROUP BY SUBSTRING_INDEX(domain,".",-1) ORDER BY cnt DESC LIMIT 20;"), we get
- "com",16284
- "cn",3787
- "net",2866
- "tw",2715
- "hk",2398
- "ru",1065
- "org",844
- "fr",797
- "uk",720
- "ph",599
- "kg",599
- "info",497
- "it",495
- "de",463
- "br",310
- "ch",303
- "us",282
- "pl",282
- "jp",279
- "at",270
Here, "com" is by far the most popular TLD with phishers. This reflects the desire of phishers to have a plausible-looking domain name. Some phishers, the ones who register domains in bulk, do pick rather bogus-looking domains (like "0001fyg0.com" "00039cscsgrjc.com" "0003ureb.com" "0003s6tw0wqf70l.com" "0004ssen.com" "0004y1x9.com" "00062lku1ekaj.com"). Others have more plausible choices (like "americaonllinebank.com").
Top-level domain statistics are more of a curiosity than anything else. They don't help you avoid or deal with attacks. We could generate many other similar statistics, and we've posted some on the SiteTruth blog.
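For anyone who wants the same tally without a database, the query above amounts to a few lines of Python. This is only a sketch of the equivalent grouping; the sample list mixes two domains quoted above with two made-up ones, and is not our data.

```python
from collections import Counter


def tld_counts(domains, limit=20):
    """Count domains per top-level domain, most frequent first."""
    tlds = (d.rstrip(".").rsplit(".", 1)[-1].lower() for d in domains)
    return Counter(tlds).most_common(limit)


# Two domains from the examples above plus two hypothetical ones.
sample = ["0001fyg0.com", "americaonllinebank.com", "example.cn", "example.net"]
for tld, count in tld_counts(sample):
    print(tld, count)
```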
-
Web page redirection may have to go
We're seeing the need for some limits on web page redirection. Most of these attacks involve putting something on a trusted place which redirects to an untrusted place. Google, with incredible sloppiness, allows Blogspot accounts to do this, and as a result, they are heavily exploited by spammers. (Try, for example, "nikaluti21040.blogspot.com", which will redirect, via some iframes and other tricks, to "selissia.com", which is hosted on "secureserver.net").
Exploitation of legitimate sites to get through spam filters is a problem, but it can be dealt with if you're willing to take a hard line. Our first step in that direction was our list of major domains being exploited by active phishing scams. Our position is that one phishing attack from within a domain blacklists the whole domain. But within three hours after the problem is fixed, they're off the list. Major sites make the list now and then; Google, Dell, MSN, and Yahoo have all been on the list at one time or another. But they now know to take steps to get themselves off within hours. The Anti-Phishing Working Group and PhishTank have been helpful with this effort. We're down to 47 such domains today. It was about 175 when we started last fall. Most of the remaining entries are free web hosting services or DSL providers.
We and others have observed that there's an inverse relationship between the number of redirects and the legitimacy of a web page. We've been looking at this at SiteTruth. For things like AdWords ads, where some sites use redirection as part of a tracking system, it's typically the bottom-feeders who are using redirection. An advertiser promoting their own product or service doesn't need it; it's brokers, intermediaries, and made-for-AdWords sites that use redirection. Anything with more than one redirect is almost always bad. We expect to use redirection as part of our legitimacy metric in the future.
It's thus time for browsers to limit their acceptance of redirection. One HTTP-level redirect, OK. Beyond that, put up a popup warning of suspicious redirection behavior. Redirects via META tags and Javascript should produce a popup. Sure, some site operators will look bad, but they will adapt.
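The "one redirect OK, more is suspicious" rule is simple enough to sketch. To keep the example self-contained, the redirects are simulated with a plain dictionary rather than live HTTP requests; the hostnames are placeholders, and this covers only hop counting, not META or Javascript redirects.

```python
def redirect_chain(start, redirects, max_hops=10):
    """Follow simulated redirects from start; stop at a loop or after max_hops."""
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) <= max_hops:
        nxt = redirects[chain[-1]]
        if nxt in seen:
            break  # redirect loop: stop following
        chain.append(nxt)
        seen.add(nxt)
    return chain


def is_suspicious(chain, allowed=1):
    """True when the chain contains more redirects than the allowed number."""
    return len(chain) - 1 > allowed


# Hypothetical redirect table: key redirects to value.
redirects = {"blog.example": "tracker.example", "tracker.example": "shop.example"}
chain = redirect_chain("blog.example", redirects)
print(chain, is_suspicious(chain))
```

A browser applying this policy would show its warning at the point where is_suspicious first becomes true, before loading the final page.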
-
Web page redirection may have to go
We're seeing the need for some limits on web page redirection. Most of these attacks involve putting something on a trusted place which redirects to an untrusted place. Google, with incredible sloppyness, allows Blogspot accounts to do this, and as a result, they are heavily exploited by spammers. (Try, for example, "nikaluti21040.blogspot.com", which will redirect, via some iframes and other tricks, to "selissia.com", which is hosted on "secureserver.net").
Exploitation of legitimate sites to get through spam filters is a problem, but it can be dealt with if you're willing to take a hard line. Our first step in that direction was our list of major domains being exploited by active phishing scams. Our position is that one phishing attack from within a domain blacklists the whole domain. But within three hours after the problem is fixed, they're off the list. Major sites make the list now and then; Google, Dell, MSN, and Yahoo have all been on the list at one time or another. But they now know to take steps to get themselves off within hours. The Anti-Phishing Working Group and PhishTank have been helpful with this effort. We're down to 47 such domains today. It was about 175 when we started last fall. Most of the remaining entries are free web hosting services or DSL providers.
We and others have observed that there's an inverse relationship between the number of redirects and the legitimacy of a web page. We've been looking at this at SiteTruth. For things like AdWords ads, where some sites use redirection as part of a tracking systems, it's typically the bottom-feeders who are using redirection. An advertiser promoting their own product or service doesn't need it; it's brokers, intermediaries, and made-for-Adwords sites that use redirection. Anything with more than one redirect is almost bad. We expect to use redirection as part of our legitimacy metric in the future.
It's thus time for browsers to limit their acceptance of redirection. One HTTP-level redirect, OK. Beyond that, put up a popup warning of suspicious redirection behavior. Redirects via META tags and Javascript should produce a popup. Sure, some site operators will look bad, but they will adapt.
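A rough sketch of the "one redirect OK, more is suspicious" rule, for anyone who wants to experiment. The `redirects` table is a stand-in for real HTTP responses, the hostnames are made up, and the function names are hypothetical; this is not SiteTruth's actual code.

```python
from typing import Dict, List

MAX_TRUSTED_REDIRECTS = 1  # one HTTP-level redirect is tolerated


def redirect_chain(start_url: str, redirects: Dict[str, str], limit: int = 20) -> List[str]:
    """Return every URL visited, starting at start_url.

    `redirects` maps a URL to the URL it redirects to. It stands in for
    following Location headers, META refreshes, or script redirects.
    """
    chain = [start_url]
    seen = {start_url}
    while chain[-1] in redirects and len(chain) <= limit:
        nxt = redirects[chain[-1]]
        if nxt in seen:  # redirect loop: stop rather than spin
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


def is_suspicious(start_url: str, redirects: Dict[str, str]) -> bool:
    """True when reaching the final page takes more than one redirect."""
    hops = len(redirect_chain(start_url, redirects)) - 1
    return hops > MAX_TRUSTED_REDIRECTS
```

A browser would apply the same count to the live navigation and put up the warning once `is_suspicious` would be true.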
-
Re:Buy a real SSL cert, with location info
If you add the address to the contact page, SiteTruth should pick it up in 30 days or so. The whole point of SiteTruth is to associate a business name and address with a web site. Any site that's even vaguely commercial should have a clearly visible business name and physical address. In some jurisdictions that's required by law. We're trying to make a dent in the "on the Internet, no one knows if you're a dog" problem. Which, after all, was what SSL certificates were originally supposed to be for - validation of the identity of the remote party.
The "commercial/non-commercial" distinction is hard. Yahoo R&D tried training a Bayesian spam filter to make that distinction, but it didn't work out too well and that was only deployed on the R&D site. We initially presume ".com", ".net", and ".biz", plus their country domain counterparts like ".co.uk", to be commercial, while ".org" and ".edu" are presumed noncommercial. An Open Directory listing in a suitable category can override this. Presence of ad links makes a site commercial.
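To make the initial presumption concrete, here is a minimal sketch. The function name and arguments are hypothetical, the two-label check for ".co.uk"-style names is a crude heuristic, and letting ad links win over a directory listing is my assumption about precedence, not a statement of how the production system orders the rules.

```python
from typing import Optional

COMMERCIAL_LABELS = {"com", "net", "biz", "co"}  # "co" covers ".co.uk"-style names
NONCOMMERCIAL_LABELS = {"org", "edu"}


def presumed_commercial(hostname: str,
                        directory_says_commercial: Optional[bool] = None,
                        has_ad_links: bool = False) -> Optional[bool]:
    """Return True/False for the presumption, or None when the name gives no hint."""
    if has_ad_links:
        return True  # presence of ad links makes a site commercial
    if directory_says_commercial is not None:
        return directory_says_commercial  # a directory listing overrides the TLD guess
    labels = hostname.lower().rstrip(".").split(".")
    # Check the last label first, then the one before it, so that
    # "example.co.uk" is judged by "co" rather than by "uk".
    for label in reversed(labels[-2:]):
        if label in COMMERCIAL_LABELS:
            return True
        if label in NONCOMMERCIAL_LABELS:
            return False
    return None
```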
The main use of SiteTruth is not the search engine front end; it's AdRater, which rates Google ads as they go by. SiteTruth is a technology demo, an alpha test, and a means for gathering information about Google advertisers (not users). So we like to get comments from knowledgeable people. More uses of the data are coming.
We're one of the few operations out there seriously trying to do something about all the junk sites on the web.
-
Buy a real SSL cert, with location info
Buy a real SSL cert, one with "Location" (L field) information and a real business name (not a domain name) in the "Organization" (O field). Avoid those cheap "Instant SSL" "Domain Control Only Validated" certs.
At SiteTruth, we consider the low-end certs worthless. They don't provide any information about who you're dealing with. We encourage other developers of certificate-validation software to take a similar position. You don't want to input a credit card number to a site with a "domain control only validated" certificate. "Domain control only" validated certs are enough for logging into a blog, perhaps, but not more than that.
-
"Quick SSL" certs have no value
At SiteTruth, we divide certificates into three categories, rather than the usual two:
1. "Extended Validation" certificates.
2. "Organization validated" certificates, which must have an L (location) field and must not have a domain name in the O (organization) field.
3. "Domain control only validated" or "Quick SSL" certificates, which say nothing about who's at the other end of the connection.
Browsers normally lump categories 2 and 3 together. This is not a good thing.
Category 3 certs, the "Instant SSL" certs, have no value in identifying the business. A category 1 or 2 cert increases the site's SiteTruth legitimacy rating, since we have a third party which has vouched for the ownership of the site. A category 3 cert does not.
Browsers should make this distinction. You never want to enter a credit card number into a site that only has a category 3 cert. You have no idea where your money is going.
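The three-way split is easy to express in code. This is only a sketch: `subject` is assumed to be a plain dict of the certificate's subject fields, deciding whether a cert is Extended Validation is left to the caller, and the "looks like a domain name" pattern is my own rough guess at what a hostname in the O field looks like.

```python
import re
from typing import Dict

# Loose pattern for "this O field is really just a hostname".
DOMAIN_LIKE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def certificate_category(subject: Dict[str, str],
                         is_extended_validation: bool = False) -> int:
    """Return 1 (EV), 2 (organization validated) or 3 (domain control only)."""
    if is_extended_validation:
        return 1
    organization = subject.get("O", "").strip()
    location = subject.get("L", "").strip()
    # Category 2 needs a location and a real business name, not a domain name.
    if organization and location and not DOMAIN_LIKE.match(organization):
        return 2
    return 3
```

A rating system would then add to the legitimacy score for categories 1 and 2 and ignore category 3.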
-
The uses of publicity
Public embarrassment can be useful. We publish a list of major domains being exploited by active phishing scams. These are major domains where an attacker has found a security hole allowing them to exploit the site for phishing purposes. There are 65 sites on the list. There used to be about 140, but by nagging and publicity, we've been able to get most big-name sites to tighten up. Now and then some big site makes the list, but it often disappears within hours as the hole is plugged.
So it actually is possible to get big companies to tighten up security, if you do it right.
-
Site seems to be violating California law
I took a look through the site, and got to the "Enter Credit Card Number" point without seeing the name and address of the business.
That's a criminal offense if selling into California:
"Before accepting any payment or processing any debit or credit charge or funds transfer, the vendor shall disclose to the buyer in writing or by electronic means of communication, such as e-mail or an on-screen notice, the vendor's return and refund policy, the legal name under which the business is conducted and, except as provided in paragraph (3), the complete street address from which the business is actually conducted.
... (g) Any violation of the provisions of this section is a misdemeanor punishable by imprisonment in the county jail not exceeding six months, by a fine not exceeding one thousand dollars ($1,000), or by both that imprisonment and fine."
Let's see what we can find out.
WHOIS gives us:
Buzzelli, David
Memsen
3604 SE Powell Valley Road
#267
Gresham, Oregon 97080
United States
(503) 667-3136
That's a start. More info is available if you dig.
-
NebuAd info, and a request for info
I just checked NebuAd's Privacy policy:
NebuAd products do collect and use the following kinds of anonymous information:
- Web pages viewed and links clicked on
- Web search terms
- The amount of time spent at some Web sites
- Response to advertisements
- System settings, such as the browser used and speed of the connection
- ZIP code or postal code
Now that's way out of line for an ISP to collect, let alone send to an ad agency.
We may be able to do something about this.
We run SiteTruth AdRater, which rates advertisers. We have a Firefox extension which displays a rating icon for each ad served. When an ad link goes by, and it's not in the browser cache, the extension contacts our server for a rating of the advertiser. So we collect, over time, a list of advertisers for various ad systems. We're not collecting data about users; we're interested in advertiser behavior. (You can read the source code for the plug-in, so there's no mystery about what we're doing.)
We're not currently tracking NebuAd, Front Porch, or Phorm ads; we've been focusing on the bigger players. It looks like we need to be tracking this behavior. If anyone can find ad links from those services, please post the ad link here, or mail it to "info@sitetruth.com". We need some examples so we can modify the plug-in to recognize them.
If we can collect sufficient information about this class of advertisers, we may publish their customer list, which would be useful for boycott purposes. Thanks.
-
It's not that hard to get rid of the crap
We're back to the Yahoo! model because people have figured out how to game the system, namely Google, without adding content that's important to the searcher.
It's not hard to throw out most of the bottom-feeders. We do it. The crowd at Search Engine Watch (which, despite the name, is all about advertising, not search quality) is writing me angry messages for doing that. Now that we've demonstrated that 36% of Google AdSense advertisers are bottom-feeders, they know they're being watched. Some feel they're being targeted.
Bear in mind that most search requests are really, really dumb. That's what Google has to answer. In fact, most Google search requests don't hit the search engine at all; there's a cache of common queries and answers in all the front end machines, and a sizable fraction of requests are answered from cache.
-
This will go on your Permanent Record
The key feature seems to be "In other words, the Ringside platform allows business owners to gain insight into the social graph of users, relationships, groups, interactions, and sharing that is occurring on their Web site". Right. More targeted ads.
I have a browser extension that monitors advertiser (not user) behavior and reports it to a server. I mentioned this over on Search Engine Watch, where the Adwords crowd hangs out. Anger, threats, intimidation... The idea that someone is tracking advertisers, instead of users, just drives some of them nuts.
-
More ads to rate and filter
Ah, yet another class of ads to locate, rate, and filter. Now Adblock and CustomizeGoogle need to be updated.
We probably should look into rating the advertisers with AdRater. Outright ad blocking seems overkill for this class of ad, but rating doesn't interfere with user searches.
The revolt against excessive advertising is growing. Sao Paulo, Brazil eliminated outdoor advertising last year. All of it.
-
Re:Tracking the advertiser, not the user
I'd be interested in seeing the criteria, and sample data, for determining the quality of advertisers before I view your report as having any legitimacy.
Sure. See these documents.
-
Tracking the advertiser, not the user
We've been doing some tracking recently, but aimed at the advertiser side. We have a plug-in for Firefox which rates ads. A little icon is displayed next to each ad, showing what our system knows about the advertiser. As we tell users of the plug-in, "AdRater 'phones home', but tells us as little as possible. AdRater sends the domain name associated with each advertisment you see to SiteTruth." SiteTruth then sends back advertiser information, in XML, which the plug-in turns into icons.
We use this to find out what the advertisers are doing. Individuals are entitled to privacy; advertisers are not. We're building up a picture of the on-line advertising market. We now have, for example, a list of Google's AdSense advertisers.
Soon we'll be issuing reports on advertiser quality. (Ads on Bloomberg: mostly legit. Ads on LinkedIn: quality varies, mostly OK. Ads on MySpace: mostly bottom-feeders.) More on this in coming weeks.
It's not just advertisers tracking users any more. Sometimes it's the other way round.
-
It remains an endpoint problem. In Windows.
It's worth realizing that we've solved most of the problems with hostile sites on the Internet other than ones that involve Windows zombies. Nobody is spamming from an identifiable source any more; that gets spammers turned off fast, or arrested. Spamming is now done using Windows zombies.
Hosting of scams tends to involve Windows zombies or server break-ins. We track this on our "Major domains being exploited by active phishing scams" list. Notice that almost all the sites with multiple exploits listed are services that provide DSL connectivity. The single-exploit sites are usually break-ins. Most of the open redirectors have been fixed, so that hole has mostly been closed.
The malware problem is, again, an endpoint problem, with programs given all the privileges of the user running them. Again, that's mostly a Windows problem. (Not that Linux is fundamentally better. Installs still typically have to be run as root. Few will run under a restrictive Secure Linux profile.) Of course, when Microsoft tightens things up, as they did minimally in Vista, people scream that their insecure apps won't run. Fixing the problem requires a clean start, like the OLPC. If the OLPC technology gets some traction at the high school, college, and road warrior level, we might have a way out of the current mess.
Once we get past outright criminality, we're faced with the "bottom-feeders" - the Made for Adwords sites, the "landing pages", the directory sites, the typosquatting sites, the domain parks, and similar annoying dreck. We're doing our bit to choke that off. If you're willing to lump the bottom-feeders together with the crooks, it's easier to separate them from the sites with some degree of legitimacy.
Most of the bottom-feeders get their revenue from Google's advertisers, via Google. Google is starting to do something about this with "landing page quality measurement". Their standards are very low, though, judging by what's still showing up in AdWords ads. (We have a free Firefox browser extension that rates AdWords advertisers, so we have a way to look at this. Advertiser quality varies drastically by site: advertisers on Bloomberg look legit; LinkedIn, mostly OK; MySpace, mostly bottom-feeders.)
There's a basic question here - how much of Google's revenue comes from bottom-feeders? Google recently tightened up their landing page standards, and Google's revenue dropped for the first time ever. Can Google still afford "don't be evil"? We'll find out this year.
All of these things are endpoint problems. Down at the IP level, we're doing OK.
-
The Microsoft ad model
It sounds so Microsoft. They control the OS and the browser, so they could keep detailed history information about what you've been looking at. But they don't seem to actually be doing that. The Atlas Media Console, which is what this is all about, is just a tool for managing multiple types of ads and reducing the data that comes back as they're viewed.
Microsoft has a point, though. "Advertising doesn't jerk, it pulls" - John Wanamaker. The ad that was clicked on may not have been the primary influence on the buying decision. For advertisers who have brands with some value, an online presence helps to market the brand. Then when an ad for something a consumer actually wants is displayed, a sale is more likely. Advertisers can't currently measure that effect.
Many Google ads are, of course, from "bottom-feeders", with no brand of value. They just want the click-through. Anything that improves measurement of return on investment for the actual seller will reduce the value of all those "bottom feeder" ads - the "made for adwords" sites, the spam blogs, and such. It's unclear how much of Google's revenue is generated that way.
Some Google text ads have a form of mouse-over tracking. When you mouse over some Google text ads, nothing appears to happen, but in fact, some Javascript executes and the URL you can click on changes. I'm not yet sure just what they're sending back to the mothership, or if they send anything on mouse-overs without a click.
As for the bottom-feeder problem, we've recently developed some tools for SiteTruth that tell us some things about Google AdWords. SiteTruth rates web site legitimacy, and we have a browser plug-in which displays those ratings alongside each ad. It's striking to see the difference between the quality of ads served on different sites. Slashdot and LinkedIn advertisers aren't too bad; MySpace advertiser quality is very low. Remember that Slashdot article about the people who will click on anything? That's the MySpace crowd.
-
It's a problem, but the size is limited.
We have a list of major sites being exploited by active phishing scams, which we update every three hours. There are 56 sites on the list right now. Most sites don't stay on the list too long, but we still have 14 that have been on the list since last year. Most of them are DSL service providers with compromised machines they haven't kicked off. Some providers are proactive about this, and some aren't. Then there are a few compromised sites that just have no clue about how to fix their problem. One such site is the teacher web space for a school district.
By, well, nagging, we've been able to get the big players to fix their problems. Google, Yahoo, MSN, and Dell were all on the list at one point, but they've all tightened up their systems.
The points we make with this list are that 1) the number of major sites involved is small, and 2) blacklisting at the second level domain level causes acceptable levels of collateral damage. So go ahead, blacklist the whole second level domain in your phishing filters. Think of it as a way to encourage sites to clean up their act. Or as a way to find out where to apply the clue stick.
This list is about "major" sites, ones in Open Directory (1.7 million sites.) The issue there is with attackers trying to steal the credibility of the major site. At the other end of the scale, any domain less than a few weeks old probably isn't worth connecting to. Or at least it should be read with all executable content disabled, including HTML email. Also, any link with more than one redirect probably shouldn't be followed.
It's easier to filter out the attackers if you're willing to filter out the bottom-feeders as well. But that's another story.
-
This can be fixed, but impacts ad revenue model
The paper points out that most of the attacks involve redirection of some portion of page content. That's a useful piece of information, because, other than for advertising purposes, redirection of IFRAME items and images is quite rare. A useful blocking strategy would be to block all redirects below the top level page. Many ads will disappear; no great loss.
Checking for hostile full web pages is already being done. McAfee SiteAdvisor was the first to do that, then Google copied them. Our "bottom feeder filter", SiteTruth, does some of that too, although it throws out far more sites than McAfee or Google do, just by insisting that some identifiable business stand behind any page that looks commercial.
Google's revenue model depends, to some extent, on those "bottom feeder" sites: all those anonymous "landing pages", "directory pages", "made for AdWords pages", and similar junk. Those things bring in substantial AdWords revenue, although they don't usually generate much in the way of sales for advertisers. Throwing them out of the "Google Content Network" would cut Google's ad income. This is where "don't be evil" collides with Google's profitability.
This looks like a solvable problem, but the solution will come from the security companies, not the search companies. The search companies can't afford to fix it.
-
Making others fix their problems
I can't tell you how many people send us bad data and flat out ignore the response.
Sometimes you can get things fixed at other sites. We have a list of major sites being exploited by phishing sites, which is updated every three hours by matching PhishTank (10,000 entries) against OpenDirectory (1.7 million entries), and looking for domains in both. We blacklist sites on a per-domain basis, and needed to measure and minimize the collateral damage.
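The matching step is basically a set intersection on base domains. A minimal sketch, with made-up sample data in the usage; the function names are hypothetical, and taking the last two labels as the "base domain" is a simplification (a real system needs the public suffix list to handle names like "example.co.uk").

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def base_domain(url_or_host: str) -> str:
    """Naive base domain: the last two labels of the hostname."""
    # urlsplit only finds a hostname after "//", so add it for bare hosts.
    target = url_or_host if "//" in url_or_host else "//" + url_or_host
    host = (urlsplit(target).hostname or "").rstrip(".")
    return ".".join(host.split(".")[-2:])


def exploited_major_domains(phishing_urls: Iterable[str],
                            directory_hosts: Iterable[str]) -> Set[str]:
    """Base domains that appear both in the phishing feed and in the directory."""
    phishing = {base_domain(u) for u in phishing_urls}
    major = {base_domain(h) for h in directory_hosts}
    return (phishing & major) - {""}
```

Rerunning this against fresh copies of both lists every few hours is what lets a site drop off the list soon after its problem is fixed.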
When we started that list last November, it had 174 domains on it. After reports to abuse addresses, two articles in The Register, and help from PhishTank and the Anti-Phishing Working Group, we're down to 45 domains. Only eight of those domains have been on the list for more than 60 days. The remaining long term problem domains are five DSL providers, a free web hosting service, and two ordinary web sites that had break-ins they've never cleaned up. The rest of the list changes frequently, as sites are added to the list due to some problem, then removed from the list as the problem is fixed.
When we started, Google, Yahoo, MSN, and Dell were all on the list. They've all cleaned up their act. They just needed a little nudging.
With the legit sites tightened up, phishing blacklists become much more effective. It's now safe to blacklist entire base domains, not just URLs or subdomains. Anti-phishing tools just became more effective.
So, yes, you really can get such problems fixed.
-
Bottom-feeder filtering from the advertiser side
It's possible to filter out the bottom-feeders, as we do at SiteTruth. We're looking at this mostly from the user side. But there are also serious complaints about "domaining" from the advertiser side.
Clicks on "typosquatting" sites don't lead to many sales. Basically, they're targeting users who click on random stuff. That doesn't mean those users actually buy based on their mis-aimed clicks. More likely, some real company that advertised via Google AdWords is getting money sucked out of their ad budget without much return. The analytics people are skeptical of the claims of domainers.
The Direct Marketing Association has a white paper for advertisers which recommends that advertisers filter those sites out of their campaigns. "The traffic produced by sites utilizing the practices described above is almost always absolutely worthless. To ensure contextual advertising effectiveness, advertisers should eliminate these sites from their campaigns." Google, however, makes this difficult, because Google doesn't tell the advertiser where their ads are running, and requires excluding each individual domainer site by name, from Google's user interface. There's no "disable all bottom feeders" option. This is a problem.
The DMA's white paper suggests ways an advertiser can defend their ad costs against domainers, automatically accumulating a list of domainers feeding them clicks, discovering which sites generate poor returns, and excluding them. But with clicks coming in randomly from hundreds of thousands (maybe millions) of constantly changing bottom-feeder sites, blacklisting the bogus sites is like spam filtering by source address - it's a losing battle.
The advertiser community is getting wise to this. We may see some pushback from that side.
-
The real problem is phony "registrars"
Most of the "ICANN accredited registrars" are fronts for domain tasting. There are only a few real registrars; the rest are dummies for picking up dropped domains. Enom has a huge number of dummy fronts - "Enom1, Inc" through "Enom469, Inc".
One step needed is for ICANN to enforce the provision of the registrar agreement which allows ICANN to prohibit registrars from owning or speculating in domains. And the provision which requires that a registrar have assurance of payment before activating a domain. With that, the end of the "grace period", and Google refusing to monetize domains for the first five days, we should see this problem decrease. The .org TLD recently got rid of their grace period, and domain transactions dropped 90%.
We're working on this from the browser end. The general idea of our SiteTruth system is to filter out the bottom-feeders. It's the next step after ad-blocking - make the link pages, directory pages, typosquatters, and similar junk far less visible.
It's not even clear that advertisers benefit from all those junk pages. If you advertise with Google ads, and get clicks from junk pages, do they really result in sales? Or is this just a way to take money from the real advertiser and divert it to some bottom-feeder?
-
Monetizing the bottom feeders
I hope Google really does this. They need to, to restore their "don't be evil" reputation. Arguably, Google went over to the dark side when they started offering domain parking. "Maximize revenue on your parked pages with Google AdSense for domains", they advertise. (Insert Darth Vader quote here.)
"Domain tasting" is a drain on the anti-fraud systems of the Internet. All those domain changes help conceal phishing attacks, many of which involve buying domains with stolen credit cards and exploiting them before the credit card transaction is reversed. Blacklist systems like McAfee SiteAdvisor and PhishTank are always running behind the domain changes.
We rate sites at SiteTruth, and all those domain changes are a headache for us. I'm considering taking the position that all domains less than 30 days old are junk, unless they have a good SSL certificate. Is that too severe, or a good idea? Comments?
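For concreteness, the rule being floated would look something like this. It's a sketch of the proposal only, with hypothetical names; what counts as a "good" SSL certificate is left to the caller, and the registration date is assumed to come from WHOIS or a similar source.

```python
from datetime import date, timedelta

MIN_DOMAIN_AGE = timedelta(days=30)


def is_presumed_junk(registered_on: date, today: date, has_good_ssl_cert: bool) -> bool:
    """Young domains are presumed junk unless a good certificate vouches for them."""
    if has_good_ssl_cert:
        return False
    return today - registered_on < MIN_DOMAIN_AGE
```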
-
Bottom-feeders, crooks, and all that.
Since we run a system for filtering bottom-feeders out of search results, I've had to look at this issue.
One of the basic requirements of SiteTruth is that a web site that's selling or promoting something must have an identifiable name and address on the web site. A "contact us" form isn't good enough. Legitimate sites selling something usually have a valid name and address on the site. Commercial sites without business names and addresses are generally "bottom-feeders". They may or may not be fraudulent, but there's no way to tell, so we down-rate them and move them down in our search results. It's illegal in many jurisdictions to run a business without disclosing an address (California and EU law are quite explicit on this), and so that's a good first filter.
This filters out the bottom-feeders who aren't willing to go all the way to using a phony address. That's a felony (wire fraud or identity theft), so most sites with even a pretense of legitimacy don't go there. Those guys are crooks; no question about that. We have some blacklists to check for that sort of thing; it's usually phishing-related.
So there are three general categories - legitimate, bottom-feeder, and felony crook. The bottom-feeders are the ones Cutts is talking about. If they hadn't done some "search engine optimization", they wouldn't rank high enough in a search engine that anyone would see them. Some of the bottom-feeders are annoying, but not illegal; those are the ones that are page farms, but at least on-topic page farms. Then there are those who just have pages of irrelevant links and ads. Their natural habitat is celebrity name searches. Since they're probably violating false advertising laws, they are misdemeanor-level crooks.
When bottom-feeders go bad, it's usually via downloading hostile software as an "affiliate". See, for example, Zango. That's an ongoing problem, and McAfee's SiteAdvisor filters out those sites. Even Google is finally checking for most of the usual suspects there.
Amusingly, the bottom-feeders can't go legitimate and give a name and address without losing search engine positioning. If the same name and address shows up on a huge number of sites, Google picks that up and down-rates the sites for duplicate content. One large bottom-feeder actually has a link to a common "about" page on each of their several hundred thousand sites, but uses the "robots.txt" file to keep Google from finding it. Our SiteTruth system won't read the page in violation of the "robots.txt" file, so we downrate them for lacking a business address. They just can't win.
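The robots.txt check is the easy part; Python's standard library already has a parser for it. A minimal sketch, where the robots.txt text, the user agent name, and the URLs in the usage are all invented for illustration:

```python
from urllib.robotparser import RobotFileParser


def may_read(robots_txt: str, user_agent: str, url: str) -> bool:
    """True if robots.txt allows this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A rater that honors the file skips the blocked "about" page, finds no business address anywhere it is allowed to look, and downrates the site accordingly.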
This is starting to look like the history of spam. In the early days of spam, as some may remember, it was viewed as a bottom-feeder marketing medium, and reasonably legitimate companies used it. The CAN-SPAM act was enacted in a form that pleased the Direct Marketing Association, but had an effect unexpected by both the DMA and anti-spam workers. The CAN-SPAM act allows spam, but only if the sender and subject are identified properly. So any "legitimate" spam is easily filtered out by spam filters. As a result, today, spam is entirely a criminal activity. We never hear about the DMA in spam discussions any more. Now it's about putting people in jail.
The same thing is happening on the web. As the filters get better, the marginal bottom-feeders don't get through, and only the out and out crooks are left. As with spam, in time we'll get rid of most of the bottom-feeders, leaving only the crooks. As the ambiguity goes away, the job of law enforcement becomes easier. That's happened with spam. There's a high-profile arrest every month or two now. Alan Ralsky just went down.
-
Searched for "Tampa hotels", got bottom-feeders
Wales was quoted recently complaining about Google's results for "Tampa hotels", and talking about how Wikia was going to be better. So I searched Wikia for "Tampa hotels".
The first three results from Wikia search are all from the domain "visit-tampa-bay.com". That's one of those bottom-feeder ad link sites. The site is supposed to redirect traffic to Orbitz, but doesn't even do that right. Very disappointing result. Could they have been spammed already?
Trying "Tampa hotels" in Google gets us "travel.yahoo.com" for the top two results, which indicates that Google isn't biasing their search against their biggest competitor. Next is "traveladvisor.com". Those are OK results; you'd be able to get a hotel room that way.
Trying "Tampa hotels" in Yahoo search gets us a page from one of Yahoo's special cases. Yahoo knows about "hotels", so we get a list of hotels and prices from Yahoo, and three sponsored results. The top organic result is "tripadvisor.com", which is at least a big-name travel site, followed by "visittampabay.com" (not to be confused with "visit-tampa-bay.com"), the site for the local Convention and Visitor's Bureau. Yahoo certainly tries hard for hotel searches, and seems to be doing OK.
Trying "Tampa hotels" in MSN search gets results that look much like Yahoo's, but with lower result quality. MSN understands hotels as a special case. There are three sponsored results, and addresses and phone numbers for three real hotels. The first three organic search results are Yahoo Travel, "tampa-hotels.net" (an ad-laden landing page), and "tampa-hotels-discounts.net" (a bottom-feeder generic landing page that isn't even on topic.) Poor results.
Trying our own SiteTruth, the top result is "all-hotels.com", which has a list of hotels with pictures and a reservation interface. The second result is Yahoo Travel, and the third is Expedia. We're sorting Yahoo results on business legitimacy, so that's not surprising. OK here.
So there's where Wikia is today, on their recommended demo search.
-
Wikia, the place to go for furry fan fiction
Wikia has been something of a dud. What Wikia really does is monetize fancruft. Their big wikis are for Star [Trek|Wars|Gate|Craft], Everquest, Marvel comics, Yu-Gi-Oh, and similar subjects. They're the resting place for fan articles thrown out of Wikipedia.
Wikia's search engine, based on the user demographic they have now, is going to have great coverage of furry fan fiction.
There's already a good manually-updated search engine. It's called Open Directory. It's quite useful as a data source for answering the question "what is this web site about"? It tends to run months behind changes to the web, since it's manually updated. While not many people query DMOZ manually, it's used by Yahoo, Google, etc. to get some basic information about a web site.
As an example of how great Wikia search is going to be, Wales suggested searching for "Tampa hotels". The major search engines return too many bottom-feeder reseller and directory sites for searches like that. As I point out occasionally, we've already solved that problem over at SiteTruth, which looks for business legitimacy. Type in "Tampa hotels" there and watch it push the marginal sites to the bottom of the search results. We have that one handled.
Wikipedia works because people are willing to do substantial work for free for a non-profit organization. That doesn't work for a commercial business. You can get people to write about themselves (Myspace, Facebook, etc.) but beyond that, "crowdsourcing" doesn't go very far.
-
Re:The "ad-supported Internet"
A truly relevant shared agent would filter out all ads and click-through trap sites, and totally mess up the dynamic of the ad-supported Internet.
That's a feature, not a bug. We're working on the problem. So are others.
"Adblock" is just the beginning. There's Customize Google, which will remove Google text ads. It's a Firefox extension. Also removes Google ad tracking.
We have SiteTruth, which is a form of "intelligent agent" that rates sites for legitimacy, digging in various data sources and reading through the site for business addresses to find out who's behind the site. (No clear business location on a commercial site yields a bad rating.) We mostly use Yahoo search, but we also have a front end for Google which leaves the ads in, then rates both the organic search results and the ads for legitimacy.
As a general rule, advertised sites rate lower than organic search results. We see that with our system, and systems that rate by other criteria (user ratings, hostile code scanning, etc.) see similar results. This makes sense; if you're getting good positioning in organic search results, why run ads in the search engine? There's a clear "bottom-feeder effect" in search engine ads.
-
The major sites contributing to the problem
From the article:
Gartner sees no easy way out of this dilemma unless e-mail providers have incentives to invest in solutions to keep phishing e-mails from reaching consumers in the first place, and unless advertising networks and other "infection point" providers (which theoretically can be any legitimate Web site or service) have incentives to keep malware from being planted on their Web sites to reach unsuspecting consumers.
In practice, only a small minority of "legitimate Web sites or services" are "infection point providers". We have a little list. Right now, there are 166 major sites known to be providing material support to phishing attacks. There were 171 when The Register covered this last week, so publicity is having some effect. Most sites on the list only stay there for a few days, until somebody fixes the problem. A few sites stay on the list, and may need a clue stick applied.
These are exploits of open redirectors, DSL lines with zombies, sites that let hostile content be uploaded (uploading a hostile ".swf" file to Photobucket, for example), and out and out break-ins. These aren't sites that are cooperating with phishers; they're innocent, but often clueless, victims.
We blacklist the entire second-level domain if there's any phishing activity anywhere in the domain. This is far more effective than blacklisting by URL. Phishing sites change URLs and subdomains constantly now, so blacklisting by URL is as useless as virus scanning by signature. Yes, there's some collateral damage, but all of it falls on sites on that list. We make the list public, and provide links to the actual phishing information (which is from PhishTank), so major sites can fix their problems.
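To make the difference between URL blacklisting and domain blacklisting concrete, here is a rough sketch. It assumes the second-level domain is simply the last two labels of the hostname, which is wrong for suffixes like "co.uk" and meaningless for bare IP addresses; a real system would need a public suffix list. The function names and the example.com entries are hypothetical, not our production code.

```python
from urllib.parse import urlsplit


def second_level_domain(url: str) -> str:
    """Naive assumption: the last two hostname labels are the second-level domain."""
    host = urlsplit(url).hostname or ""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_blocked(url: str, blacklist: set) -> bool:
    """Block the URL if its second-level domain is blacklisted, whatever the subdomain or path."""
    return second_level_domain(url) in blacklist


# Hypothetical blacklist with a single exploited domain.
BLACKLIST = {"example.com"}

if __name__ == "__main__":
    for url in (
        "http://example.com/redirect?to=evil",
        "http://login-x7f3.secure.example.com/signin.html",  # fresh subdomain, still caught
        "http://example.org/",
    ):
        print(url, "BLOCKED" if is_blocked(url, BLACKLIST) else "ok")
```

A per-URL list would miss the second entry as soon as the phisher rotated to a new subdomain or path; the domain-level check doesn't care.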
This part of the problem can be fixed. It just takes a hard-line approach.
-
Google still hasn't fixed their open redirector
After reading this, I immediately checked to see if Google had fixed their open redirector. No, they haven't, and there are six exploits of it listed in PhishTank. Google needs to turn that off. If they absolutely insist on having an open redirector, it needs its own subdomain, which is what Yahoo does. Then the subdomain can be blacklisted without collateral damage.
Phishing via exploits of major sites is a big problem, but involves a small number of major sites. 168 major sites today. The usual exploits are:
- Phishing site web servers on DSL lines. Some ISPs are good at kicking these off, and some aren't as good. "bellsouth.net" has more entries in PhishTank than any other domain.
- "Open redirectors", URLs that can be exploited to redirect to another site, like the Google URL above.
- Web hosting services, especially free ones, sometimes find themselves hosting phishing sites.
- "Web 2.0" sites which allow uploading of user content but don't check it for exploits. Photobucket is used by some phishers, who upload hostile ".swf" files.
- Break-ins on legitimate sites, where, typically, some obscure page is hosting hostile content. When an ".edu" site shows up in our list, that's usually what happened.
Out of 1.6 million domains in DMOZ, and over 10,000 phishes in PhishTank, only 168 domains are in both. So the number of sites that need to be fixed is small. In fact, some of those sites are already fixed, but the entries haven't been removed from PhishTank yet. (Hint: if you kill a hostile page on your domain, make it a 404 error; that gets the page out of PhishTank's "active and online" list automatically. Don't just change the content or redirect it somewhere else, or it stays in the tank until somebody rechecks it manually, which can take weeks.)
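The "in both" computation is just a set intersection on base domains. A minimal sketch, assuming the directory is available as a set of domain names and the phish list as a list of URLs; the data below is invented, and `base_domain` uses the naive last-two-labels rule, so it is an illustration rather than the actual pipeline.

```python
from urllib.parse import urlsplit


def base_domain(url: str) -> str:
    """Naive assumption: the last two hostname labels identify the domain."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def exploited_domains(directory_domains, phish_urls):
    """Return, sorted, the directory domains that also host a reported phish."""
    phish_domains = {base_domain(u) for u in phish_urls}
    phish_domains.discard("")  # URLs with no parsable hostname
    return sorted(set(directory_domains) & phish_domains)


# Invented sample data standing in for a directory dump and a phish feed.
DIRECTORY = {"example.com", "example.org", "example.net"}
PHISH_URLS = [
    "http://signin.example.org/ebay/login.html",
    "http://203.0.113.7/paypal/",           # raw IP host, not a listed domain
    "http://a.b.example.invalid/update/",   # throwaway domain, not in the directory
]

if __name__ == "__main__":
    print(exploited_domains(DIRECTORY, PHISH_URLS))
```

Most phishing hosts never appear in the directory at all, which is why the overlap stays small.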
For every site in the list, there's some competitor in the same business who isn't on the list. "Everybody has this problem" isn't a valid excuse any more. This is a useful point to make with management if you find your own company on the list.
This list of 168 exploited sites is updated automatically every three hours. There's also a list of sites recently removed from PhishTank. "n-insanity.com", "tropmet.res.in", "wsjob.com" were dropped from the list today; they no longer have active, online entries in PhishTank. "gentlesource.com", "t35.com" (an eBay phish), "tilapia.com" (another eBay phish), and "uic.edu" (already fixed) were added; they just appeared in PhishTank. If you have any responsibility for a site on the list, please take steps to fix the problem. If you're not part of the solution, you're part of the problem.