Domain: sitetruth.com
Stories and comments across the archive that link to sitetruth.com.
Comments · 190
-
Minor improvements
(Read the "print" version of the article, instead of the "tiny blocks of text spread over many pages of ads" version.)
I have misgivings about HTML5. It gives the page more control, and the user less. That's been a trend in HTML for years, and it's getting worse.
I'm dreading "canvas". Ad blockers need to get smarter. Noticed that popups are winning over Firefox's popup blocking? We're also going to see pages that use 100% of the CPU just for display. We're going to need a browser option for "don't run canvas code for windows that aren't on top".
The "input type" mechanism for forms is lame. There are a number of standard types like "tel", but it's just text with no line breaks. They should have provided for either regular expressions or syntax like the COBOL Picture clause ("CREDIT_CARD_NUMBER PIC 9999-9999-9999-9999").
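A minimal sketch of the picture-clause idea, assuming a tiny invented mini-syntax rather than real COBOL PIC semantics: "9" is one digit, "A" is one letter, "X" is any character, and everything else is a literal. The function names are hypothetical.

```python
import re

def picture_to_regex(picture: str) -> re.Pattern:
    """Translate a simplified picture string into a compiled regex.

    Assumed mini-syntax (illustrative only, not real COBOL):
    '9' = one digit, 'A' = one letter, 'X' = any single character,
    anything else = a literal character.
    """
    classes = {"9": r"\d", "A": r"[A-Za-z]", "X": r"."}
    parts = [classes.get(ch, re.escape(ch)) for ch in picture]
    return re.compile("".join(parts))

def matches_picture(picture: str, value: str) -> bool:
    """True if the whole value fits the picture."""
    return picture_to_regex(picture).fullmatch(value) is not None

print(matches_picture("9999-9999-9999-9999", "1234-5678-9012-3456"))  # True
print(matches_picture("9999-9999-9999-9999", "1234 5678 9012 3456"))  # False
```

A picture string is easier for a form author to read than the equivalent regular expression, which is the appeal of that style of declaration.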
Dynamically-loaded fonts have been working for some time now in all the mainstream browsers. (IE6 and Firefox 3.5 were the last mainstream browsers not to have it.) We've been playing with that for our steampunk site. Downloadable fonts without anti-aliasing turn out to look ugly for small font sizes, because most of the display-type fonts have too much detail and not enough hinting for small font sizes. (In an annoying piece of Apple incompatibility, the iPad requires fonts in SVG, of all things. Everybody else, including Microsoft, is going to Web Open Font Format.) I'd recommend against using this feature much unless you have a good sense of typography. (Bad example: our steampunk search engine.)
-
Our system says "don't go there"
The actual site mentioned is thenerdsupport.com
I ran them through our SiteTruth system. Here's what comes out. Rating: "Site ownership unknown or questionable. No Location.
... This certificate identifies the domain only, not the actual business. No street address found on the site."
Compare the SiteTruth results for Geek Squad: street addresses found, found in the US business directory, found in Open Directory.
It's not that hard to sort out the phony business sites from the real ones. You have to check business databases, not just the Web, for business legitimacy. If you just look at the web, you get bogus results like this: McAfee SiteAdvisor: "We tested this site and didn't find any significant problems." The site itself doesn't try to attack the user, so McAfee says it's good to go.
-
"London" is a heavily spammed term
"London", as a keyword, is a heavy spam target. I used to use "London Hotels" as a test case for SiteTruth's web spam detector. Google used to do badly on that search. (Since they started handling travel destinations as a special case, the first 10 Google results are now either paid ads or results from the business search engine.)
-
Most delay is ad-related.
Most real-world page load delay today seems to be associated with advertising. Merely loading the initial content usually isn't too bad, although "content-management systems" can make it much worse, as overloaded databases struggle to "customize" the content. "Web 2.0" wasn't a win; pulling in all those big CSS and JavaScript libraries doesn't help load times.
We do some measurement in this area, as SiteTruth reads through sites trying to find a street address on each site rated. We never read more than 21 pages from a site, and for most sites, we can find a street address within 45 seconds, following links likely to lead to contact information. Only a few percent of sites go over 45 seconds for all those pages. Excessively slow sites tried recently include "directserv.org" (a link farm full of ads), "www.w3.org" (embarrassing), and "religioustolerance.org" (an underfunded nonprofit). We're not loading images, ads, JavaScript, or CSS; that's pure page load delay. It's not that much of a problem, and we're seeing less of it than we did two years ago.
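One way to picture "following links likely to lead to contact information" under a fixed page budget is a simple link-scoring pass. This is only a sketch: the keyword weights and function names are hypothetical, and only the 21-page limit comes from the comment above.

```python
MAX_PAGES = 21  # page budget per site, taken from the comment above

# Hypothetical keyword weights; the real heuristics are not described.
CONTACT_HINTS = {"contact": 3, "about": 2, "help": 1, "legal": 1}

def link_score(url: str, anchor_text: str) -> int:
    """Score a link by how strongly its URL and anchor text hint at contact info."""
    text = (url + " " + anchor_text).lower()
    return sum(weight for hint, weight in CONTACT_HINTS.items() if hint in text)

def pages_to_fetch(links: list[tuple[str, str]], budget: int = MAX_PAGES) -> list[str]:
    """Return at most `budget` URLs, most promising first (ties keep page order)."""
    ranked = sorted(links, key=lambda link: link_score(*link), reverse=True)
    return [url for url, _ in ranked[:budget]]

links = [("/products", "Products"), ("/contact", "Contact us"), ("/about", "About")]
print(pages_to_fetch(links))  # ['/contact', '/about', '/products']
```

Capping the number of pages bounds the worst-case time spent on any one site, which is what makes a per-site time figure meaningful.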
-
He doesn't mention fonts
The current versions of all the major browsers can now dynamically download fonts. We can finally stop putting display text in images. Opera, Safari, Chrome, Firefox (3.6 or greater) and IE are all on board with this. By IE 9, they'll even be using the same font format, Web Open Font Format. (Except for the iPad, which, for some weird reason, currently requires fonts in SVG format. But even the iPad understands "@font-face".)
Few sites are using this capability yet. We are, as a demo. Try our steampunk search engine with authentic Victorian fonts.
-
No, Google doesn't have a real search API.
Google once had a real search API. It was SOAP-based. But they discontinued it years ago.
Google's AJAX search API is, by design, very limited. All you can really do is create a little search widget, and perhaps add some fields of your own. The terms prohibit doing much beyond that. "You are allowed to use the API only to display, and to make such uses as are necessary for You to display, Google Search Results on your Property. The API does not provide You with the ability to access, and You are not allowed to access, other underlying Google Services or data. Subject to the limitations and conditions described below, "
... "You agree that You will not, and You will not permit your users or other third parties to: (a) modify or replace the text, images, or other content of the Google Search Results, including by (i) changing the order in which the Google Search Results appear, (ii) intermixing Search Results from sources other than Google, or (iii) intermixing other content such that it appears to be part of the Google Search Results; or (b) modify, replace, obscure, or otherwise hinder the functioning of links to Google or third party websites provided in the Google Search Results. "
Given those restrictions, you can't write Scroogle using that API.
We have a SiteTruth search page which uses the Google AJAX API. We're prohibited from re-ordering the entries or removing any of them. Since the whole point of SiteTruth is to re-order search results by business legitimacy, and we don't do that for the Google results, the Google results are inferior to the ones from other search engines. So our primary search page uses Yahoo/Bing.
-
Re:The problem: low standards in search engines.
Re SiteTruth complaints: (We have a blog for that.)
Non-commercial web sites aren't rated at all. However, the presence of an ad link marks a site as "commercial", as does being in ".com". Our "commercial intent" detection is rather simplistic. We really should have a classifier system doing that. Yahoo search R&D, back when they had search R&D, built one of those, but never did much with it. We've been reluctant to use machine learning techniques, though, because they reduce the transparency of the system. At present, SiteTruth doesn't rely on "security by obscurity". Adding a classifier system would change that.
Credit rating information is useful because, for businesses, you can get business size information. Annual sales and number of employees are worth knowing, and displaying to the user in search results. (We'll be doing something in that area soon.) There's a guy in Brooklyn, NY, who took pictures of camera stores that advertise on line or for mail order. There are companies with giant warehouses and loading docks, and there are, well, "marginal locations". It's very funny. Search engines need info like that.
As for specific sites:
- GlaxoSmithKline: We give them a yellow "?", which means we think they're legit, but don't have third-party verification that the domain is tied to the company. In our hard-ass view, that's an OK rating. SSL certs and BBBonline links provide such third-party verification. They did match our database. We weren't able to parse "Registered office: 980 Great West Road, Brentford, Middlesex, TW8 9GS, United Kingdom.", unfortunately; we only recognize multi-line postal addresses, usable on an envelope, at present.
- Vodafone: All the country sites have SSL certs, but the main ".com" site does not. It does have the address "Vodafone Group Plc / Vodafone House / The Connection / Newbury / Berkshire / RG14 2FN / England" on multiple lines, which we pulled out of the source HTML as a possible address, but did not parse successfully. Still, they got a yellow "?", and were matched to the UK business database.
- Oxfam gets a green checkmark, and the system was able to pull four business addresses from their web site.
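The two failed parses above are both UK addresses written on one line or joined with slashes. The following is a rough sketch of how such text could be normalized into envelope-style lines before parsing. It is not SiteTruth's parser: the function names are hypothetical, and the postcode pattern is a simplified assumption that covers shapes like "TW8 9GS" and "RG14 2FN" but not every valid postcode.

```python
import re

# Simplified UK postcode shape (assumption: common forms only).
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b")

def to_envelope_lines(text: str) -> list[str]:
    """Split a one-line address on commas or slashes into envelope-style lines."""
    _, _, body = text.rpartition(":")  # drop a label such as "Registered office:"
    parts = re.split(r"\s*[,/]\s*", body.strip().rstrip("."))
    return [part for part in parts if part]

def looks_like_uk_address(lines: list[str]) -> bool:
    """True if there are at least three lines and one of them is a bare postcode."""
    return len(lines) >= 3 and any(UK_POSTCODE.fullmatch(line) for line in lines)

gsk = "Registered office: 980 Great West Road, Brentford, Middlesex, TW8 9GS, United Kingdom."
print(to_envelope_lines(gsk))
print(looks_like_uk_address(to_envelope_lines(gsk)))  # True
```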
-
Attacks against hosting providers
We noticed another attack against a hosting provider recently, but it wasn't GoDaddy; it was ThePlanet, or at least someone who uses their IP block. A number of phishing sites suddenly appeared on our list, and we noticed they all mapped to the same server. Multiple domains on the same server were all hosting the same phishing attack.
Annoyingly, the domain registration for the server's main domain ("websitewelcome.com") was "private". That's actually part of HostGator's system; there's no reason it should have "private registration". It just makes it harder to find the responsible party.
-
The problem: low standards in search engines.
These guys are doing good work, but really, all they're doing is checking for some specific types of black-hat SEO. This is inherently a losing battle, because there's active opposition. It's a "negative file" approach - making a list of the bad guys. Credit cards once worked that way; merchants were sent daily lists of canceled or stolen credit cards. Back then, getting a credit card was tough; the customer had to be a good customer of the bank. Not until credit card transactions were validated remotely against a "positive file" that checked the actual account could everyone have one. Web search is still in the "negative file" era.
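The difference between the two approaches can be sketched in a few lines. The domain names and function names here are made up for illustration.

```python
def passes_negative_file(domain: str, known_bad: set[str]) -> bool:
    """Negative file: accept anything not already listed as bad."""
    return domain not in known_bad

def passes_positive_file(domain: str, verified: set[str]) -> bool:
    """Positive file: accept only what has been positively verified."""
    return domain in verified

known_bad = {"old-scam.example"}
verified = {"real-shop.example"}
newcomer = "fresh-scam.example"  # a bad site nobody has reported yet

print(passes_negative_file(newcomer, known_bad))  # True
print(passes_positive_file(newcomer, verified))   # False
```

A brand-new bad actor passes the negative check until someone lists it, which is why that approach is always a step behind.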
As I point out occasionally, the main search engines have very low standards for business legitimacy. It's an ongoing, and losing, battle to filter out the totally bogus sites. But if you insist on some minimal standard of business legitimacy for a commercial web site, you kick out most of the "bottom feeders" with no business address, and along with them, most of the total phonies. We do this at SiteTruth, which exists to demonstrate that it's possible. SiteTruth tries to find some indication that a domain maps to a real-world business. If it can't, the site is moved down in search engine position. That's enough to move most "bottom feeders" downward, below the legit ones. It's not always successful in finding the business behind the site, but it looks harder than the average user would, looking through the site's "About", "Help", "Contact", etc. pages for a mailing address. If a search engine takes a hard line on this, the junk sites can be kicked out.
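Moving unverified sites down while otherwise keeping the search engine's order amounts to a stable sort on a single flag. This is a sketch under that assumption; the names are hypothetical and the actual rating has more levels than a yes/no flag.

```python
def rerank(results: list[str], verified: set[str]) -> list[str]:
    """Keep the engine's order, but move domains with no verified business below the rest.

    Python's sort is stable, so results with the same flag stay in their original order.
    """
    return sorted(results, key=lambda domain: domain not in verified)

engine_order = ["junk-ads.example", "real-shop.example", "other-shop.example"]
verified = {"real-shop.example", "other-shop.example"}
print(rerank(engine_order, verified))
# ['real-shop.example', 'other-shop.example', 'junk-ads.example']
```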
Once you have a business address for a web site, there are extensive resources for finding out more about the business. It's easy to get annual sales and number of employees if you know what database to buy. Corporate registration information and D/B/A name information is available. Business credit rating info is available in bulk for a fee. Crank that info into search engine positioning and you've got hard data driving search. Rating web sites by looking only at the web is a process easy to manipulate. Use info from the real world, and it's much harder.
Phony mailing addresses do show up, but that's usually associated with phishing sites. Not showing a business address is a misdemeanor in some jurisdictions, but common. Using the address of another business is felony fraud and identity theft. That gets law enforcement attention. So only outright criminals try that. To catch that, we fetch the entire PhishTank database every few hours and blacklist the entire domain for a single phishing entry. That's draconian, but if you're running a site that lets users upload entire pages, it's your job to kick the phishers off. Most of the innocent victims there are free hosting services with weak abuse departments. If you're in the free hosting business or the URL redirection business, you need a strong abuse department, or you will be pwned. Right now, "t35.com" is getting hit hard. By now, most free hosting sites with a clue automatically check PhishTank and the APWG list to see if they're on it. "t35.com" is still doing it by hand, and they're losing the battle.
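Blacklisting the entire domain for a single phishing entry can be sketched as follows. Taking the last two host labels is a naive assumption that is wrong for suffixes such as "co.uk", where a real system would need a public-suffix list. The function names and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit

def second_level_domain(url: str) -> str:
    """Naive guess at the registrable domain: the last two labels of the host.

    Assumption: good enough for ".com"-style names only.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])

def build_blacklist(phishing_urls: list[str]) -> set[str]:
    """One reported phishing URL is enough to list its whole domain."""
    domains = (second_level_domain(url) for url in phishing_urls)
    return {domain for domain in domains if domain}

def is_blacklisted(url: str, blacklist: set[str]) -> bool:
    return second_level_domain(url) in blacklist

blacklist = build_blacklist(["http://user123.t35.com/paypal/login.html"])
print(is_blacklisted("http://someone-else.t35.com/", blacklist))  # True
print(is_blacklisted("http://example.org/", blacklist))           # False
```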
So why doesn't Google do this? Google's business model depends on those ad-heavy "bottom feeder" sites. About 36% of Google's "content network" domains are "bottom feeders". When organic search takes you to the right place on the first try, Google doesn't make any money. But if you're led through an ad-heavy site, the Google cash register clicks. Google's business model thus takes them to the dark side. Google would take a big financial hit if they did even some basic legitimacy checking on their advertisers. Search Google for "craigslist auto posting tool", which brings up five Google ads for companies offering to spam Craigs
-
Getting their attention
It's hard getting the attention of some vendors. I see vulnerabilities in a slightly different context - hacked web sites hosting phishing pages. We distribute a list of major domains being exploited by active phishing scams. This is obtained by processing PhishTank data, and we do this because we want to reduce the collateral damage from a tough blacklist system. At any given time, there are about 30 to 80 domains on the list.
Some sites get themselves off the list quickly. By now, most of the better free hosting services and short-URL services are automatically checking PhishTank and the APWG blacklist to see when they've been hit. Today, if you run a service where anybody can put up a page that could be used for phishing (i.e. it's not full of your own headers and banners), you need automation to deal with attacks. I've been in contact with the abuse guy at "t35.com", which is a free hosting service. They've recently been hit by a flood of phishing attacks, with several hundred new reports in PhishTank per day. The attacks were coming in faster than the abuse guy could clean them out. They're now gaining on the problem, but haven't squashed it yet. Take-away lesson: automate this.
The ones near the top of the list have been there for a while. Note the dates, which are the date that the oldest phishing report still online and active appeared in PhishTank. Some just need help. Typically, these are small organizations like churches and nonprofits that have had a break-in and were partially taken over by a phishing site. I send them the Anti-Phishing Working Group's "What To Do if your Site Has Been Hacked". Sometimes I give them a phone call. They deserve sympathy.
Then there are the hard cases. These are sites with no visible contact address, or a clueless abuse department. At the moment, Google Sites and Google Spreadsheets are being used for phishing. Google is new to the free hosting business, and the phishers have discovered some tricks that Google can't yet handle. While Google puts a "report abuse" link on their site pages, it's possible to set up a file for downloading on Google Sites, and an HTML page can be served that way, without Google's abuse checking. There's also an exploit of Google Spreadsheets. That one is an example of Habbo Hotel phishing. We've reported these to Google several times, but they haven't been fixed yet.
We've been seeing a new type of attack recently - a phishing operation breaks into a shared hosting server and plants phishing pages on multiple domains on a single server. One of these hit one of the mysterious "*.websitewelcome.com" servers, which has "cloaked domain registration" and no useful default web page. These seem to be associated with "ThePlanet.com", but whether ThePlanet operates them, is providing wholesale hosting, is providing colocation, or is just the upstream connectivity provider is not clear.
Hiding the contact information of a hosting provider is legally unwise. The hosting provider may lose the "safe harbor" protection of the DMCA. The "safe harbor" provision for "Information Residing on Systems or Networks At Direction of Users" only applies if "the service provider has designated an agent to receive notifications of claimed infringement... by making available through its service, including on its website in a location accessible to the public, and by providing to the Copyright Office, substantially the following information: the name, address, phone number, and electronic mail address of the agent." So when the RIAA or the MPAA come calling, a likely event for a hosting service, they get
-
Businesses are not entitled to anonymity
Neither WHOIS information nor IP address block allocation (ARIN's remit) should be private. Neither businesses nor their web sites are entitled to anonymity in most of the developed world. Europe, in fact, is tougher on this than the US. Europe has the European Privacy Directive, but that's for individuals acting in their private capacity. Businesses come under the European Directive on Electronic Commerce.
1. In addition to other information requirements established by Community law, Member States shall ensure that the service provider shall render easily, directly and permanently accessible to the recipients of the service and competent authorities, at least the following information:
(a) the name of the service provider;
(b) the geographic address at which the service provider is established;
(c) the details of the service provider, including his electronic mail address, which allow him to be contacted rapidly and communicated with in a direct and effective manner;
"Service provider" here means web site owner/operator. So even in an area with strong privacy laws, businesses don't have the right to run anonymous web sites.
California has a similar law for sites that accept credit cards. It's a criminal offense in California to accept credit cards from an anonymous web site.
At SiteTruth, our demo search site, we use this requirement to filter out "bottom-feeder" sites from search results. If a site looks commercial, and we can't figure out who owns it after trying about five different approaches, it's down-rated, and we move it down in search results. This puts teeth into fighting "search engine spam".
Sites can put up phony address info, of course, but that's a felony in many jurisdictions. It's generally treated as fraud, and if it's someone else's address, identity theft. That's a line most "bottom feeders" don't want to cross. Also, much such fraud is reported to sites like PhishTank, so there are red flags to check.
If you want to put up a personal site to express your political opinions, fine. But if it's selling something, it can't be anonymous. Deal with it.
-
Major domains being exploited
We've been doing something like this at SiteTruth for two years. We have the list of major domains being exploited by active phishing scams. This is simply a list of domains that are both in PhishTank (about 100,000 entries) and Open Directory (about 1.5 million entries). Today, 84 domains are in both. There's been a surge; it was 54 two days ago.
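The list described above is a set intersection. A minimal sketch, assuming phishing reports arrive as (domain, report date) pairs and the directory is a set of domain names. The data shapes, the oldest-first ordering, and the sample values are all assumptions made for illustration.

```python
from datetime import date

def exploited_major_domains(reports, directory_domains):
    """Domains present in both sources, with the date of their oldest report.

    reports: iterable of (domain, report_date) pairs from the phishing feed.
    directory_domains: set of domains listed in the directory.
    Returns (domain, oldest_date) pairs, longest-standing first.
    """
    oldest = {}
    for domain, reported in reports:
        if domain in directory_domains:
            if domain not in oldest or reported < oldest[domain]:
                oldest[domain] = reported
    return sorted(oldest.items(), key=lambda item: (item[1], item[0]))

reports = [
    ("freehost.example", date(2010, 3, 2)),
    ("church.example", date(2010, 1, 15)),
    ("freehost.example", date(2010, 2, 20)),
    ("throwaway.example", date(2010, 3, 1)),  # not in the directory, so ignored
]
directory = {"freehost.example", "church.example", "clean.example"}
for domain, since in exploited_major_domains(reports, directory):
    print(domain, since)
```

Domains that appear only in the phishing feed are dropped, which keeps the list focused on established sites that are being exploited rather than throwaway domains.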
Domains are on this list for one of several reasons.
- They had a break-in, and didn't clean it up. Generally, the sites with this problem for long periods are ones without effective contact information, so there's no easy way to tell them about their problem.
- They have an open redirector. Those are rare now, but were common two years ago. Yahoo, eBay, and Microsoft Live all used to have open redirectors. After much nagging, and some press coverage, the big players have plugged that hole.
- They're a hosting service, especially a free hosting service. Free hosting services need to be very aggressive about checking themselves for exploits. The smarter players now read the PhishTank and APWG feeds automatically, to detect abuses of their own systems. Right now, "t35.com" is suffering from a massive attack, with 227 pages in PhishTank. Their problem is that they're being attacked by a program, but are cleaning up by hand. Every day they kick off hundreds of phishing pages, but they can't keep up. The previous site with the worst problems was "piczo.com" (some kind of social network/hosting service for teenage girls), but they've been gaining on the problem.
- They're an ISP. There are a few ISPs with phishing sites they just never seem to kick off. Most of the active ones were kicked off long ago. In fact, other than ISPs which are also hosting services, we show only one entry in this category, and it's a DSL line on RoadRunner that redirects to a dead page.
- They're a "short URL" service. These are popular as a way to get phishing URLs past spam filters. The "short URL" services have become much more aggressive about kicking off phishing URLs over the last year.
While this is to some extent a "blame the victim" approach, it's more effective than "phishing education" aimed at end users. Hundreds of webmasters have to be educated, not hundreds of millions of end users.
-
Taking a harder line on phishing-friendly sites
On the phishing front, it's useful to stop blaming the end user, and blame the site that hosted the phishing page.
For some time, I've encouraged taking a harder line on phishing-friendly sites, sites that host phishing pages. I had a paper on this at the 2008 MIT Spam Conference. At SiteTruth, we take the position that one phishing page blacklists the whole second-level domain. Here's the current list of major domains being exploited by active phishing scams.
The free hosting sites and the "short URL" sites show up on the blacklist regularly. After much nagging and some press coverage, most of them are now very aggressive about kicking off phishing pages, and they don't stay on for long. The better ones now read PhishTank and the APWG blacklist automatically and kick off anything that shows up. Currently, Google is in the doghouse, because they've recently entered the "free hosting business" without adequate phishing defenses. See this abuse of Google Spreadsheets.
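For a hosting service, reading a blacklist automatically comes down to filtering the reported URLs for ones on your own domain. A rough sketch, assuming the feed has already been fetched and parsed into a list of URL strings (the feed format, the function name, and the sample URLs are all assumptions for illustration):

```python
from urllib.parse import urlsplit


def pages_to_remove(feed_urls, own_domain):
    """Return the reported URLs hosted on own_domain or one of its subdomains."""
    own = own_domain.lower()
    hits = []
    for url in feed_urls:
        host = (urlsplit(url).hostname or "").lower()
        # Match the domain itself or any subdomain, but not look-alikes
        # such as "nott35.com".
        if host == own or host.endswith("." + own):
            hits.append(url)
    return hits


# Made-up sample of reported phishing URLs.
feed = [
    "http://user1.t35.com/paypal/login.html",
    "http://example.org/bank",
    "http://T35.com/x",
    "http://nott35.com/y",
]
print(pages_to_remove(feed, "t35.com"))
```

Run on a schedule, the output becomes a takedown queue, which is the only way to keep up when the attacker is a program.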
At the moment, "t35.com", a free hosting service, is the site most abused in this way, by a large margin. I've contacted their people. The problem is that they're being attacked by a program, and they're cleaning up by hand. Right now, they're hosting 545 known phishing pages. Nobody else is even in double digits. "piczo.com" (a social network/free hosting service for teenage girls) was the last big victim, but they're gradually getting the problem under control.
A Draconian blacklisting policy may seem harsh, but it encourages site operators of easily-exploited sites to be very aggressive about dealing with the problem. We're seeing more free hosting sites with a "click here if this is abuse" button on every page. The number of people who have to be educated to deal with the problem in this way is in the hundreds, not the hundreds of millions. So it's a solvable problem.
If you're going to blame the victim, this is the way to go at it.
-
Reasonable idea
That's a good idea. We do something like that at SiteTruth, where we down-rate commercial sites that don't have a real-world contact address on the site. We're looking at user-visible pages, though, not WHOIS. WHOIS data quality is too low.
I'm all in favor of this sort of thing. But don't drop the messages silently; reject them during the SMTP session if you can, or send a mail bounce if you can't. There's much to be said for having a hard-ass attitude about this, but you have to handle the false positives properly.
Anything that sends mail bounces needs to check SPF records. This makes it possible to stop joe-job mail bounce problems. (EXIM mailer people: please finish the implementation of SPF checking and advance it from "experimental", so large ISPs can use it.)
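One way to read the policy above, as a sketch rather than any mailer's real configuration: reject inside the SMTP session when that's still possible, and otherwise bounce only if SPF doesn't say the envelope sender was forged. The result strings are the standard SPF results; the function and the choice to suppress bounces only on "fail" are my assumptions.

```python
from enum import Enum


class Action(Enum):
    REJECT_IN_SESSION = "reply 5xx during the SMTP session"
    BOUNCE = "send a bounce to the envelope sender"
    SUPPRESS_BOUNCE = "no bounce; envelope sender looks forged"


def handle_refused_message(session_open, spf_result):
    """Decide how to refuse a message without creating joe-job backscatter.

    spf_result is one of the SPF results: "none", "neutral", "pass",
    "fail", "softfail", "temperror", "permerror".
    """
    if session_open:
        # The sending server is still connected, so it gets the error
        # directly and no bounce message is ever generated.
        return Action.REJECT_IN_SESSION
    if spf_result == "fail":
        # Assumption: a hard SPF fail means the sender address was forged,
        # so a bounce would land on an innocent third party.
        return Action.SUPPRESS_BOUNCE
    return Action.BOUNCE


print(handle_refused_message(True, "none"))
print(handle_refused_message(False, "fail"))
print(handle_refused_message(False, "pass"))
```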
Also, quit whining that putting your real name on your WHOIS registration will get you annoying phone calls, threats, or whatever. I've had my real name and contact info on all my web sites and WHOIS information for a decade, and that's just not happening.
-
34% "bottom feeder" sites in AdWords.
Our own data, at SiteTruth, indicates that about 34% of Google Content Network advertisers, by domain name, are "bottom feeder" sites which we can't associate with a real-world business. This is disappointing, but not surprising. When you see a Google ad, it's not usually from a Fortune 1000 company, after all.
Our data comes from our AdRater plug-in, which rates the advertiser behind each Google ad as it appears on the user's web page. If someone goes to an ad-heavy typosquatting site, we'll see the domains advertised there. (We don't see the typosquatting domain, though; we don't monitor what pages the user views, just the ad domains. We're interested in advertiser behavior, not user behavior.) We collect the domain names of the advertisers, so we have a sizable fraction of Google's customer list, and this is hard data. We're not extrapolating.
(Collecting Google's customer list is a "long tail" kind of thing. The first 25,000 Google advertisers were seen in the first two months; the next 25,000 showed up over about four months. We'll never see them all, but we've probably seen most of them by now. Google probably has somewhere between 50,000 and 100,000 active advertisers, by domain name.)
The numbers indicate that a significant portion of Google's revenue comes from those "bottom feeders". That's why Google can't be very tough on "web spam". They have Matt Cutts claiming that Google tries to stop web spam, but, realistically, they don't try very hard. They can't. It's essential to their business model.
Search Google for "craigslist auto posting tool". Not only are there paid ads for software to put ads on Craigslist using phony accounts, some of them use Google Checkout, so Google gets a cut of what's basically a fraud scheme. ("Automatic CAPTCHA bypass available with integrated Image-to-Text support!") Google's advertiser validation standards are very low.
-
Low-quality phishing software
I've seen that, too. Recently, Stanford University came up on our short list of major sites being exploited by phishers. I was surprised, because Stanford is usually good about stopping that. It was a weird subdomain under "stanford.edu", and at first I thought someone had compromised Stanford's DNS to get their site under the "stanford.edu" domain. But no, it was just some minor machine that had had a break-in.
The directory with the phishing page was readable as a web page and contained the log of captured passwords, so I sent those to Stanford security and Bank of America security. Haven't heard back from either. After the end of the weekend, the site was taken down, and that took Stanford off the blacklist.
We've been reasonably successful at cleaning up that list. We're trying to popularize the idea that one verified phishing URL blacklists the whole domain until the problem is fixed. (The idea behind SiteTruth is to take a hard-line approach and measure the collateral damage so it can be minimized.) The oldest sites on that list are ones which won't respond to complaints by e-mail or phone. In some cases we've sent faxes.
The worst offenders are Piczo and FortuneCity. Piczo is some kind of social network/hosting service for teenage girls, and it's full of phishing pages, mostly for Habbo logins. PhishTank counts 15, and there are probably more. The phony pages are often not in English, and the Piczo abuse department may not recognize a French Habbo phishing page. This may be the next trend in phishing - put your page on a site run by someone unlikely to understand the page. I've seen a phishing page in Greek on an Indian site.
It's getting harder to run a phishing site. Since the end of "domain tasting", the business of high-volume bogus domain registration has tapered off. We haven't seen an "open redirector" on a major site in a while; eBay, Yahoo, and Microsoft Live all used to have at least one. The "url shorteners" are getting very aggressive about killing links to phishing sites. This might be winnable.
-
I'm looking at you, Slashdot
I've mentioned the ad bottleneck before. Slashdot is an especially bad offender. Pages use several ad servers, and they use "document.write" to stall the page load until the ad comes up. Even if you have the ad images blocked, some of the junk JavaScript still needs to run.
Some sites are just slow at serving pages. Behind my SiteTruth system there is a specialized web crawler which looks for a business name and address on each web site. It never looks at more than 20 pages, and it's looking for pages like "About", "Contact", and about 40 other words which might plausibly lead to contact info. This process runs about 5-15 seconds for a well-implemented site. I log sites where it takes more than 45 seconds. About 5-10% of sites run overtime. In the last hour, the slowest site is "www.airsmaxkey.com", at 159 seconds to read 10 pages. (Yes, they're a bottom-feeder. Not only is there no business address on the site (a criminal offense in the European Union), they have logos from Verisign, PayPal, Verified by Visa, and MasterCard SecureCode, none of which are actually clickable to do the claimed verification. Nor does their shopping cart checkout use SSL. The whole site may be a scam. SiteTruth gives them a "Do Not Enter" rating.)
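The page-selection part of that crawl can be sketched roughly like this. It is not the SiteTruth crawler's code; the hint words beyond "about" and "contact" are placeholders I made up, and the function names are hypothetical. Only the 20-page cap and the 45-second logging threshold come from the description above.

```python
import re

# Hypothetical subset; only "about" and "contact" are from the description.
CONTACT_HINTS = {"about", "contact", "imprint", "locations", "company"}
MAX_PAGES = 20          # never read more than this many pages per site
SLOW_SITE_SECONDS = 45  # log sites that take longer than this


def pick_pages(links, limit=MAX_PAGES):
    """links is a list of (anchor_text, url) pairs from the home page.

    Return at most `limit` distinct URLs, with links whose text or URL
    contains a hint word placed first.
    """
    def score(link):
        text, url = link
        words = set(re.findall(r"[a-z]+", (text + " " + url).lower()))
        return 0 if words & CONTACT_HINTS else 1

    chosen, seen = [], set()
    # sorted() is stable, so page order is kept within each group.
    for _, url in sorted(links, key=score):
        if url not in seen:
            seen.add(url)
            chosen.append(url)
        if len(chosen) == limit:
            break
    return chosen


def is_slow(elapsed_seconds):
    """True if the site should go in the slow-site log."""
    return elapsed_seconds > SLOW_SITE_SECONDS


sample_links = [
    ("Products", "/products"),
    ("Contact Us", "/contact"),
    ("About", "/about-us"),
]
print(pick_pages(sample_links, limit=2))
print(is_slow(159), is_slow(15))
```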
Some of the social networking sites have so much JavaScript that Firefox will time out. (Facebook had that problem for a while. They fixed it.)
-
Filtering out the bottom-feeders.
The big search engines remain too "soft" on bottom-feeders. Google once took a harder line. In 2004 and 2005, Google sponsored the Web Spam Summit. Then they had a down quarter and turned to the dark side. Since then, from 2006 to 2009, they've sponsored the Search Engine Strategies conference, the web spammer's convention.
Google has to do this to remain profitable. 35% of AdWords advertisers, by domain, are "bottom-feeders" - sites with no identifiable legitimate business behind them. A significant portion of Google's revenue comes from those bottom-feeders, and the AdWords ads on their sites. If Google filtered out all spam blogs, their revenue would decline.
We, of course, run SiteTruth, as a demo to show that search can have less evil. Try putting some of those "bad" sites into SiteTruth and see how it rates them.
(We get some whining, of course. "I wanna run ads on my blog and I don't wanna say who I am." Tough. You're operating a business, and businesses, by law, don't get to be anonymous. Even in the EU. Deal with it.)
-
Selective ad-blocking for Facebook?
So most of these scam networks block Northern California, to prevent Facebook HQ from seeing them? So that's why I don't see them. I'm a few miles from Facebook HQ. I've completely missed this phenomenon.
I'd applied SiteTruth to Google ads, trying to warn users about the "bottom feeders" with no identifiable legitimate business behind the ad. Myspace is mostly Google ads, so that's covered. Google ads in general are about 35% "bottom feeders" (we track this), but on Myspace, the percentage is much higher. From the article, Facebook has a similar problem, but it's mostly in the form of Facebook-specific ads, games, etc. We're not catching those.
Maybe it's time to do that.
-
Popular with phishers
Geocities was very popular with phishers who needed hosting on a domain too popular to blacklist. We maintain a list of major domains being exploited by active phishing scams, and Geocities is in the #2 position for length of time on the list. Over the last few months, the number of phishing sites hosted on Geocities has slowly declined. Today, on Geocities' last day, there is only one left.
With Geocities out of action, Piczo.com (hosting/social networking for teens) and Fortunecity.com (general-purpose free hosting) become the top hosting services favored by phishers. Most of the Piczo phishing sites seem to be aimed at getting Habbo login credentials. There is apparently a whole racket which breaks into Habbo accounts to steal virtual furniture.
(We finally have all the big players off that list. When we started, Yahoo, Microsoft, Google, and eBay were all on that list. They've all been fixed. The "short URL" sites are now all very aggressive about killing off phishing links; they don't want to get on spam blacklists. Most of the remaining sites on the list are modest sites run by people who have no idea what's going on with their site. The oldest entry on that list, hoseo.ac.kr, is a Korean university. Someone broke into their email system last year and put a phishing site on port 8080. Their webmaster mailbox is full, but we've tried to reach them by other means and may eventually reach someone with a clue.)
-
Analyzing online anonymity.
There are three issues with "online anonymity". One is anonymous businesses, the second is the ability to create an unlimited number of new identities at very low cost, and the third is actual identification of end users.
Anonymous businesses, that is, web sites with commercial intent which don't identify their ownership, are already illegal in many jurisdictions. At SiteTruth, we treat anonymous businesses (where there's no postal mailing address on the web site) as "bottom feeders", and move them to the bottom of search results. Google has a bias against "private registration" domains, but that only kicks in if the site otherwise looks like a junk site. There's not much controversy about this; it's accepted law that a business has to identify itself properly.
The ability to create an unlimited number of new identities causes various forms of trouble. The ability to get vast numbers of free Gmail accounts ("automatically create Gmail Accounts in seconds flat without breaking a sweat") is a windfall for spammers and has destroyed vast sections of Craigslist. The ability to register large numbers of domains with phony domain registration has created a well-known range of problems. Gradually, that's being tightened down. "Domain Tasting" is now dead, now that registrars have to eat the loss if they register and release a domain within 5 days. Phony WHOIS information remains a problem, but could be fixed. When you register a domain, you should get a postal mail piece with the code that enables the domain.
End user identification is the controversial issue. The music industry would like it, but, after all, the music industry is a dinky business compared to the Internet. IBM, HP, Dell, Microsoft, Yahoo, and Google are each bigger than the entire music industry. Other than for email sending, there's no other big interest behind end user identification.
-
Google needs to clean up their own act first
Google has a malware hosting problem of their own.
Google Spreadsheets can be abused to create phony login pages. Here's one for "Free Habbo credits", designed to collect Habbo logins. It's been reported via the usual "Google abuse" mechanism, repeatedly, and it's still up. It's been up since October 28, 2008.
We track major domains being exploited by active phishing scams. ("Major" here means only that it's in Open Directory, with about 1.5 million domains.) There are 39 exploited domains today. Only 7 have been on that list since 2008. The most abused site is Piczo.com, which is a hosting service/social network/shopping site for teenagers.
Just about everybody else has cleaned up their act. 18 months ago, that list had 174 entries, including Yahoo, eBay, Microsoft Live, and TinyURL. All those companies have become more aggressive about checking for phishing scams that were injected into their domain. Google's cluelessness in this area ought to be embarrassing to someone.
-
Some kinds of phishing down a bit
This may be having an effect. I'm seeing a small decline in major domains being exploited by phishing scams. That monitors phishing attacks which use major domains to give themselves convincing-looking URLs.
In the year and a half we've been monitoring this, the number of sites being exploited has dropped from 174 to today's value of 37. We nag sites that have problems to tighten up their security. It's working. eBay used to have a security hole which allowed creating URLs under "ebay.com" that redirected elsewhere. That's been fixed. The "short URL" companies are now much more aggressive in detecting phishing and kicking off those URLs. Bugs at Yahoo and Microsoft Live have been fixed. Geocities had problems, but they're shutting down at the end of the month.
Now if Google would just kick off this phony Habbo login page implemented using Google Spreadsheets, all the biggest names would be OK. If anyone from Google is reading this, please pass that along to someone with a clue. (Yes, it's been reported via the usual "Google abuse" mechanism.)
-
Reasons not to use WHOIS "privacy" services
Reality check:
- In the European Union and in California, anonymous businesses are illegal.
- The listed registrant owns the domain. If you're using a "privacy service", you don't own the domain; you're just leasing it from the privacy service. Customers of RegisterFly, the domain registrar that collapsed, found this out the hard way. Many customers lost domains in that collapse.
- Google considers "private registration" as a factor in determining whether a site meets their "quality guidelines". Google can't be as tough on this as they should be, though, because Google's revenue model, AdWords, requires a large number of ad-heavy sites. Bing could be tougher; it's too soon to tell.
We take an even harder line on anonymous businesses at SiteTruth, considering them "bottom feeders".
Realistically, putting your real name and address in WHOIS info doesn't hurt you unless you're a crook. My real name and address are on all my domains, and I get maybe one phone call every two years, perhaps a letter or two a year, that seem to come from WHOIS data. I had one threat, back in the 1990s; he's out of business and I'm still here. Any e-mail spam is being filtered out by the usual filters. If you're paranoid, get a P.O. box; that's legal.
-
Rent our botnet!
This looks like an attempt to monetize a botnet. What, exactly, do the people running their "client" get out of this? Do they know they're sucking bandwidth, and possibly being billed for it, on behalf of someone else?
I run a web spider of sorts. And I know the people who run a big search engine. Reading the web sites isn't the bottleneck. Analyzing the results and building the database is. Outsourcing the reading part doesn't buy you much. If this just did a crawl, it would be of very limited value. That's not what it does.
What they're really doing is offering a service that lets their customers run the customer's Java code on other people's machines in the botnet. That's worrisome. There are some security limits, which might even work. Supposedly, all the Java apps can do is look at crawled pages and phone results home. Right.
This thing uses the Plura botnet. "Plura® is a grid computing system. We contract with affiliates, who are owners of web pages, software, and other services, to distribute our grid computing code. We utilize the excess resources of peripheral computers that are browsing the internet when such browsing leads to a web page of one of our affiliates. That web page has imbedded code that allows the visitor to participate in the grid computing process. We also utilize embedded code in software and other services to allow such participation." Not good.
The main infection vector is apparently the Digsby chat client, which comes bundled with various crapware. The Digsby feature list does not mention that Plura is in their package.
This thing needs to be treated as hostile code by firewalls and virus scanners.
-
Phishing vs. blacklists vs. whitelists
The trouble with phishing blacklists is that if you take a hard enough line to make them work, there's collateral damage. Blacklisting by URL is useless; most attackers with a clue use a different URL in each email. Even blacklisting by full domain is no longer enough; many attackers use a bogus subdomain for each phishing e-mail.
If you take a hard line and blacklist at the second-level domain, blacklists are more effective. We measure the collateral damage of doing that. We (as SiteTruth) maintain an updated list of major domains being exploited by phishing scams. This is a list of domains that are both in PhishTank with a hostile URL, and OpenDirectory, as "major". Today, there are only 37 domains on the list, which is about as low as it's ever been. The high was around 175, back in 2008. This matters because the big-name sites are likely to be whitelisted, and phishers look for exploits that will let them use a big-name domain to evade filters.
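To make the difference between blacklisting by URL and by second-level domain concrete, here is a small sketch. The suffix table is a tiny hypothetical stand-in for a real public-suffix list, and the function names and sample URLs are made up.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical and incomplete; a real system needs a full public-suffix list.
TWO_LABEL_SUFFIXES = {"co.uk", "ac.kr", "com.au"}


def registrable_domain(url):
    """Reduce a URL to the domain that would be blacklisted."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    try:
        ipaddress.ip_address(host)
        return host  # raw IP address: keep it whole
    except ValueError:
        pass
    labels = host.split(".")
    if len(labels) < 2:
        return host
    keep = 2
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES:
        keep = 3
    return ".".join(labels[-keep:])


def build_blacklist(phishing_urls):
    """One reported URL blacklists its whole registrable domain."""
    return {registrable_domain(u) for u in phishing_urls} - {""}


def is_blacklisted(url, blacklist):
    return registrable_domain(url) in blacklist


blacklist = build_blacklist(["http://a1b2.login.example.com/x?id=1"])
# A fresh subdomain and path per e-mail no longer gets past the list.
print(is_blacklisted("http://zz99.example.com/other", blacklist))
print(is_blacklisted("http://example.org/", blacklist))
```

The collateral damage is whatever legitimate content shares the blacklisted domain, which is why the number of big-name domains on the list matters.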
We nag sites into fixing security holes which allowed some phishing site to exploit them. Microsoft, Yahoo, and eBay have cleaned up their act. Only a few major sites are still on the list. Google is on the list because someone figured out a way to use a Google Docs spreadsheet to host a phishing site. Piczo.com, a free hosting service now hosting 103 phishing URLs, just doesn't seem to care. The other sites with more than one entry tend to be dying hosting services: Geocities, FortuneCity, RoadRunner.
The problem of big-name sites being exploited by phishers is coming under control. It's probably safe to blacklist by second-level domain now. (If only Google gets their act together and deals with that spreadsheet exploit.)
-
Google needs web spam to profit.
Google can't solve this problem because their business model requires web spam.
Google is in the advertising business, not the search business. Search is a traffic builder for the ads. Google's customers are their advertisers, not their search users. They have to maximize ad revenue. The problem is that more than a third of Google's advertisers are web spammers, broadly defined. All those "landing pages", typosquatters, spam blogs, and similar junk full of Google ads are revenue generators for Google. Every time someone clicks on an AdWords ad, Google makes money, no matter what slimeball is running the ad. Google can't crack down too hard, or their revenue will drop substantially. Google does have some standards, but they're low.
Google went over to the dark side around 2006. In 2004 and 2005, Google sponsored the Web Spam Summit, devoted to killing off web spammers. From 2006, Google sponsored the Search Engine Strategies conference, where the "search engine optimization" people meet. That was a big switch in direction, and a sad one.
As we demonstrate with SiteTruth, it's not that hard to get rid of most web spam if you're willing to be a hardass about requiring a legit business behind each commercial web site. Google can't afford to do that. It would hurt their bottom line.
However, cleaning up web search results with browser plug-ins is a viable option. Stay tuned.
-
Re:If it wouldn't pop up everywhere it shouldn't
There's so much certificate misuse. A typical mistake is getting a cert for, say, "*.slashdot.org", and then serving it for "slashdot.org". That will cause a reject. Then there are U.S. Government certificate authorities, too many of them. Try, for example, USMC Doctrine Division. The CA is "DOD CA-13". DoD alone has root CAs "CA-5" through "CA-18", and not all browsers know all of them.
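The wildcard mistake is easy to show. This is a simplified sketch of certificate name matching, handling only a whole leftmost "*" label, which must stand in for exactly one label; real TLS libraries do more checking than this, and the function name is my own.

```python
def cert_name_matches(pattern, hostname):
    """Check a certificate name, possibly a "*." wildcard, against a host."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if not pattern.startswith("*."):
        return pattern == hostname
    suffix = pattern[1:]  # e.g. ".slashdot.org"
    if not hostname.endswith(suffix):
        return False
    # The wildcard must cover exactly one non-empty label.
    leftmost = hostname[: -len(suffix)]
    return leftmost != "" and "." not in leftmost


print(cert_name_matches("*.slashdot.org", "www.slashdot.org"))
print(cert_name_matches("*.slashdot.org", "slashdot.org"))
print(cert_name_matches("*.slashdot.org", "a.b.slashdot.org"))
```

The bare domain fails because there is no label for the "*" to match, so the cert needs "slashdot.org" listed as a separate name.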
This is a headache for SiteTruth, which uses certificates as an indication of web site validity and a source of business names and addresses. Only certs that are valid, using the Firefox cert file as authority, are accepted. There are more rejects than there should be.
-
Re:It's funny, and a bit disturbing...
I kept hearing things about some site called "Google", so I tried running it through SiteTruth. Turns out it's some shady, fly-by-night company.
Yes, Google is in the doghouse again. Google is hosting some phishing sites, which were reported to PhishTank. SiteTruth blacklists any domain with a hit in PhishTank. On any given day, about 50 to 100 well-known domains (out of the 1.5 million in OpenDirectory) are on the blacklist, generally because of sloppy security. Microsoft, Yahoo, and eBay used to be on the phishing blacklist, but after some nagging by us and The Register, they've mostly plugged the security holes involved. The blacklist is updated every 3 hours, so companies that clean up their act quickly don't stay on the list for long.
Domains on the blacklist are usually 1) free hosting services, 2) URL redirectors like TinyURL, 3) DSL providers with weak abuse departments, and 4) sites with a software bug that lets other sites use them as a redirector. Some companies in those categories are good at quickly cleaning out such abuses; others just don't seem to care. In each category, there are plenty of companies who don't have such problems, so there's no reason to give anybody a free pass.
It says something about a company's abuse department if they're on that list for more than a day or two.
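A rough sketch of the blacklist behavior described above: a set of domains reloaded once the copy in memory is three hours old. This is not SiteTruth's actual code. The class name, the `fetch_domains` callable, and the injectable clock are all hypothetical, and how the domain list is obtained from PhishTank is deliberately left out.

```python
import time

# The comment above says the list is updated every 3 hours.
REFRESH_SECONDS = 3 * 60 * 60


class PhishBlacklist:
    """In-memory domain blacklist that reloads itself when stale."""

    def __init__(self, fetch_domains, clock=time.time):
        # fetch_domains: hypothetical callable returning an iterable of
        # domain names that currently have a phishing report.
        self._fetch = fetch_domains
        self._clock = clock
        self._domains = frozenset()
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= REFRESH_SECONDS:
            self._domains = frozenset(d.lower() for d in self._fetch())
            self._loaded_at = now

    def is_blacklisted(self, domain):
        self._refresh_if_stale()
        return domain.lower() in self._domains
```

Because the whole set is replaced on each reload, a domain that has been cleaned up drops off at the next refresh, which matches the point that quick responders don't stay listed for long.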
-
Re:It's funny, and a bit disturbing...
Why doesn't an attempt to legally incorporate a new business include a "do any of your officers have a background in crime, particularly white collar crime?" check?
That's a real problem. People barred from involvement in the securities industry keep slipping back in. Bar owners barred from holding a liquor license often end up doing some deal as a "silent partner".
I get complaints from "web businesses" who want to operate anonymously, because SiteTruth down-rates them for that. (It's a criminal offense to run a business anonymously in many jurisdictions.)
-
Well, duh.
only to find their page delivery delayed by slow-loading ads.
Well, duh. I've been complaining about this for the past year. Too much ad code is using "document.write()", often for no really good reason. Browsers can load content from multiple sites in parallel, and not wait for ad content, unless JavaScript is used to prevent that. All too often, JavaScript is used in just that way. (As on, well, Slashdot. Earth to Slashdot: your JavaScript is embarrassingly slow. Get someone with a clue.)
One of the more painful things I have to do for AdRater is to recognize dynamically loaded ad content. Google ads are loaded using at least five completely different code styles. So I actually have to look at other people's ad-serving code in some detail. It's not fun. Fortunately, one generic mechanism handles most of the cases; I don't have to track their code changes in detail.
Most of this doesn't seem to be intended to get around ad-blocking software, and isn't successful at that. It's usually either tracking-related, concerned with displaying the ad in a different CSS context than that of the surrounding content, or just the result of ineptly cutting and pasting JavaScript from multiple sources.
-
Mod parent up - blind search test is quite useful.
When you try the blind search test, the results look very similar. All the mainstream search engines are doing about equally well. There was a period in 2007 when Yahoo was substantially ahead of the others, because they had about fifty special-case recognizers for things like celebrities and movies, but now everybody has that. (And nobody noticed that Yahoo was better for the six months they had a technical edge, anyway.)
Try heavily-spammed searches like "London hotels". All the big guys are still being fooled by ad-heavy redirector sites. It's possible to do better against link spammers, but the big guys aren't trying very hard to do so. Google used to be against "search engine optimization", but some time in 2007 they went over to the dark side and started sponsoring SEO conferences. It's inevitable; Google makes their money from AdWords. Search is just a traffic builder.
-
They just re-invented Greasemonkey
I think they just re-invented Greasemonkey. But not well.
At least with Greasemonkey, there's a well-defined language. It's all JavaScript. This thing seems to have some horrible mess of intermixed JavaScript, CSS, and HTML. Plus it has jQuery built in, and a special symbol ("$") for it. (For a moment, I thought I was reading Perl.)
Having done some non-trivial work with Greasemonkey, I'm not sure this thing is a step up.
-
Notes on blocking Google ads
blocks AdSense ads
Now that's an interesting competitive tactic for Microsoft, which doesn't make much of its money from online advertising. Blocking as many ad sites as possible would be a useful and popular browser feature. Not only would the user not have to look at the ads, web browsing would be two or three times faster. Notice how often your browser stalls because the page renderer is waiting for some ad site. Perhaps "family filter" is Microsoft's foray into ad-blocking.
Our AdRater plug-in evaluates AdSense ads and labels them, but doesn't block them. We collect statistics on AdSense advertisers. Over a third of AdSense advertisers are sites that don't clearly identify who owns them. Google's validation of their advertisers is very weak. One could make a good argument for blocking a significant fraction of them on quality grounds alone.
-
This could backfire
Much as I like Adbusters, this is a headache.
Right now, Google ad URLs are relatively straightforward to recognize and decode. If Google sees this as a real threat, they may start obfuscating them and using elaborate gimmickry with JavaScript, like the stuff one sees in hostile web pages. Then they'll be much tougher to deal with. The easy approaches to ad blocking will stop working.
We recognize Google ad URLs in AdRater, which is a Firefox plug-in, and we put a translucent rating icon atop each ad. Google ad links are currently rather straightforward to decode, so we don't have to follow them, just examine them. For some of Google's competitors, you can't tell where the ad link is going without clicking on it. We've considered a plug-in which follows encoded ad links in the browser, but it would look like click fraud, even though it has a legitimate purpose. So far, we've refrained from doing that. If Google tries obfuscating their ad URLs, we'll have to actually traverse them to find the advertiser site for rating purposes. That increases everyone's overhead.
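To show what "examine, don't follow" can look like, here is a minimal sketch that pulls the advertiser's host out of a click-through link without making any network request. It assumes the destination is carried in a query parameter; the parameter name `adurl` and the function name are assumptions for illustration, not a description of AdRater's code or of Google's current link format.

```python
from urllib.parse import urlparse, parse_qs


def advertiser_domain(ad_link, param="adurl"):
    """Return the destination host encoded in an ad link, or None.

    Assumption: the landing URL is stored, percent-encoded, in the query
    parameter named by `param`. Nothing is fetched; the link is only parsed.
    """
    query = parse_qs(urlparse(ad_link).query)
    targets = query.get(param)
    if not targets:
        return None  # destination not visible without following the link
    host = urlparse(targets[0]).hostname
    return host.lower() if host else None
```

A link that returns None here is the case described above for some competitors: the destination can't be determined without actually traversing the link.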
-
Taking a harder line on certs.
There are really three tiers of SSL certs being sold:
- "Domain control only validated" certs. This means the cert issuer got an answer from an e-mail sent to the domain. This is the "QuickSSL" tier.
- "Location and business identity validated" certs. What SSL certs were supposed to mean. The cert issuer actually checked out the business for existence. At this tier, there's often a "relying party" guarantee.
- "Extended validation" certs. The cert issuer had to meet some audited standards to issue the cert. Mostly used by banks.
Current browsers don't distinguish between #1 and #2. They should. "Domain control only validated" certs are enough to secure some social networking site or blog, but not good enough to send someone a credit card number. If they're taking your money, the cert should contain enough info to allow you to find and sue them.
Our SiteTruth system distinguishes between #1 and #2, because we're looking for business identity. It's a useful way to filter out the "bottom feeders".
The problems with bogus SSL cert issuance seem to be, so far, confined to the "Domain control only validated" certs. This is an additional good reason to distinguish between them and the better tiers.
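One way to separate tier #1 from tier #2, sketched under stated assumptions: look at whether the cert's subject names an organization and a location, not just a domain. The input is the dictionary shape returned by Python's `ssl.SSLSocket.getpeercert()`. The function names are made up, the rule is a heuristic and not SiteTruth's actual test, and spotting the "Extended validation" tier would need certificate policy data that this sketch does not examine.

```python
def subject_fields(cert):
    """Flatten the nested subject tuple from getpeercert() into a dict."""
    return {key: value for rdn in cert.get("subject", ()) for key, value in rdn}


def has_business_identity(cert):
    """Heuristic: True if the subject names an organization and a location."""
    fields = subject_fields(cert)
    org = fields.get("organizationName", "").strip()
    location = fields.get("localityName") or fields.get("stateOrProvinceName")
    common_name = fields.get("commonName", "").strip()
    # Assumption: some domain-only certs repeat the domain name in the
    # organization field, so that case is not counted as a business name.
    if org.lower() == common_name.lower():
        return False
    return bool(org) and bool(location)
```

A cert whose subject holds only a common name comes out as domain-only, which is the tier that shouldn't be trusted with a credit card number.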
-
It's not that hard to write a clear privacy policy
Our AdRater plug-in has similar privacy issues. It's a plug-in that "phones home" to get information about the advertisers whose ads appear on a site. Here's what we tell users:
AdRater "phones home", but tells us as little as possible. AdRater sends the domain name associated with each advertisement you see to SiteTruth. Thus, we can tell what advertisers have reached you, but cannot tell what web pages you have been viewing. We can't tell if you click on an ad. AdRater does not use "cookies" or any other user identifiable information other than your current IP address.
If we change any of this, the changes will not take effect until you download and install a new version of AdRater.
AdRater does not rate ads on secure pages, so no information about a secure page is ever sent to our servers.
Now that wasn't hard, was it?
For really technical users, we publish the API AdRater uses, so you can check to see that we're telling the truth about what data goes back and forth.
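The policy above boils down to two rules: send only advertiser domain names, and send nothing at all for secure pages. A minimal sketch of that filtering step follows. The function name is hypothetical and this is not the published AdRater API, only an illustration of how little needs to leave the browser.

```python
from urllib.parse import urlparse


def domains_to_report(page_url, ad_target_urls):
    """Return the advertiser domains that would be sent for one page.

    The page URL is used only to decide whether to report at all;
    it is never included in the result.
    """
    if urlparse(page_url).scheme == "https":
        return []  # secure page: report nothing
    domains = set()
    for url in ad_target_urls:
        host = urlparse(url).hostname
        if host:
            domains.add(host)  # bare host only: no path, no query string
    return sorted(domains)
```

Paths and query strings are dropped before anything is reported, so the receiving side learns which advertisers appeared but not which page showed them.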
-
Other examples. Google still evil.
That's not a lone example. Search with Google for "craigslist auto posting software". These are all paid Google ads:
- "CL Posting Software www.adsoncraigs.com The worlds Best Selling CraigsIist software. Works with new CAPTCHA!"
- "Craigs Works Must Try Us webtrafficus.com We do the work no software To Buy Best Service All Ads Guaranteed Up"
- TopPost Inc. www.toppost.com The Leader in Posting Services 866-895-6888 -- info@toppost.com
- Buy Craiglist accounts Phone verified accounts, hassle-free, only 4.95$/account . www.craigsup.com
We track the "bottom feeders" in Google AdWords over at SiteTruth. We consider about 36% of Google's advertisers, out of a set of 20,000 ad domains, to be "bottom-feeders" - no visible business address, or we have other negative info. If you download AdRater, our Greasemonkey script for Firefox, we rate the advertiser behind every Google ad you see and display a rating icon on top of the ad. (Yes, the plugin "phones home". It tells us lots of stuff about the advertiser, which we're interested in, and very little about the user's browsing, which we don't care about. The plugin is open source, so you can check this.)
With the information we have, it's painfully obvious that Google isn't picky about their advertisers. The example in the article is one of many, not a unique exception.
Google CEO Eric Schmidt was quoted last month as saying "The Internet is fast becoming a cesspool." Was he complaining, or boasting? Much of that is Google's doing.