sitetruth.com · Domains · Slashdot Mirror

Minor improvements by Animats · 2010-07-12 04:36 · Score: 4, Interesting · on How To Use HTML5 Today

(Read the "print" version of the article, instead of the "tiny blocks of text spread over many pages of ads" version.)

I have misgivings about HTML5. It gives the page more control, and the user less. That's been a trend in HTML for years, and it's getting worse.

I'm dreading "canvas". Ad blockers need to get smarter. Noticed that popups are winning over Firefox's popup blocking? We're also going to see pages that use 100% of the CPU just for display. We're going to need a browser option for "don't run canvas code for windows that aren't on top".

The "input type" mechanism for forms is lame. There are a number of standard types like "tel", but it's just text with no line breaks. They should have provided for either regular expressions or syntax like the COBOL Picture clause ("CREDIT_CARD_NUMBER PIC 9999-9999-9999-9999").
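As a rough illustration of the picture-clause idea, here is a minimal Python sketch that turns a simplified template into a regular expression. The template alphabet ('9' for a digit, 'A' for a letter, 'X' for any character) and the function names are assumptions made for this example, not part of HTML5 or of real COBOL semantics.

```python
import re

# Assumed, simplified template alphabet (not full COBOL PICTURE semantics):
#   '9' = one decimal digit, 'A' = one ASCII letter, 'X' = any one character.
# Every other character in the template must match literally.
TEMPLATE_CLASSES = {"9": "[0-9]", "A": "[A-Za-z]", "X": "."}

def picture_to_regex(picture: str) -> re.Pattern:
    """Translate a picture-style template into a compiled regular expression."""
    parts = [TEMPLATE_CLASSES.get(ch, re.escape(ch)) for ch in picture]
    return re.compile("".join(parts))

def matches_picture(picture: str, value: str) -> bool:
    """Return True only if the whole value fits the template."""
    return picture_to_regex(picture).fullmatch(value) is not None

CARD_PICTURE = "9999-9999-9999-9999"
print(matches_picture(CARD_PICTURE, "1234-5678-9012-3456"))  # True
print(matches_picture(CARD_PICTURE, "1234 5678 9012 3456"))  # False
```

A form field declared this way would carry its own validation rule instead of being plain text with a type label.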

Dynamically-loaded fonts have been working for some time now in all the mainstream browsers. (IE6 and Firefox 3.5 were the last mainstream browsers not to have it.) We've been playing with that for our steampunk site. Downloadable fonts without anti-aliasing turn out to look ugly for small font sizes, because most of the display-type fonts have too much detail and not enough hinting for small font sizes. (In an annoying piece of Apple incompatibility, the iPad requires fonts in SVG, of all things. Everybody else, including Microsoft, is going to Web Open Font Format.) I'd recommend against using this feature much unless you have a good sense of typography. (Bad example: our steampunk search engine.)

Our system says "don't go there" by Animats · 2010-07-06 05:15 · Score: 3, Interesting · on The Unstoppable 'Tech Support' Scam

The actual site mentioned is thenerdsupport.com

I ran them through our SiteTruth system. Here's what comes out. Rating: "Site ownership unknown or questionable. No Location. ... This certificate identifies the domain only, not the actual business. No street address found on the site."

Compare the SiteTruth results for Geek Squad. Street addresses found, found in the US business directory, found in Open Directory.

It's not that hard to sort out the phony business sites from the real ones. You have to check business databases, not just the Web, for business legitimacy. If you just look at the web, you get bogus results like this: McAfee SiteAdvisor: "We tested this site and didn't find any significant problems." The site itself doesn't try to attack the user, so McAfee says it's good to go.

"London" is a heavily spammed term by Animats · 2010-06-30 05:10 · Score: 4, Informative · on Regular Domains Have More Malware Than Porn Sites

"London", as a keyword, is a heavy spam target. I used to use "London Hotels" as a test case for SiteTruth's web spam detector. Google used to do badly on that search. (Since they started handling travel destinations as a special case, the first 10 Google results are now either paid ads or results from the business search engine.)

Most delay is ad-related. by Animats · 2010-06-23 17:26 · Score: 4, Informative · on Google Shares Insights On Accelerating Web Sites

Most real-world page load delay today seems to be associated with advertising. Merely loading the initial content usually isn't too bad, although "content-management systems" can make it much worse, as overloaded databases struggle to "customize" the content. "Web 2.0" wasn't a win; pulling in all those big CSS and JavaScript libraries doesn't help load times.

We do some measurement in this area, as SiteTruth reads through sites trying to find a street address on each site rated. We never read more than 21 pages from a site, and for most sites, we can find a street address within 45 seconds, following links likely to lead to contact information. Only a few percent of sites go over 45 seconds for all those pages. Excessively slow sites tried recently include "directserv.org" (a link farm full of ads), "www.w3.org" (embarrassing), and "religioustolerance.org" (an underfunded nonprofit). We're not loading images, ads, JavaScript, or CSS; that's pure page load delay. It's not that much of a problem, and we're seeing less of it than we did two years ago.
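The bounded crawl described above could look roughly like the following Python sketch. It is not SiteTruth's code: the keyword list, the `fetch` and `find_in_page` callables, and the use of 45 seconds as a hard cap are all assumptions made for illustration. Only the 21-page limit and the idea of following contact-like links first come from the comment.

```python
import time

MAX_PAGES = 21         # page limit stated in the comment
TIME_BUDGET_S = 45.0   # figure from the comment, used here as a hard cap (assumption)
CONTACT_HINTS = ("contact", "about", "help", "legal")  # hypothetical keyword list

def link_priority(url):
    """Lower sorts first: links that look like contact pages get priority 0."""
    lowered = url.lower()
    return 0 if any(hint in lowered for hint in CONTACT_HINTS) else 1

def find_address(start_url, fetch, find_in_page, clock=time.monotonic):
    """Crawl from start_url until an address is found or a budget runs out.

    fetch(url) -> (page_text, list_of_links) and find_in_page(text) -> address
    or None are supplied by the caller; both are hypothetical interfaces.
    """
    deadline = clock() + TIME_BUDGET_S
    queue = [start_url]
    seen = {start_url}
    pages_read = 0
    while queue and pages_read < MAX_PAGES and clock() < deadline:
        queue.sort(key=link_priority)   # stable sort keeps discovery order within a priority
        url = queue.pop(0)
        text, links = fetch(url)
        pages_read += 1
        address = find_in_page(text)
        if address is not None:
            return address
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return None
```

Passing `fetch` in as a parameter keeps the sketch testable without a network; a real version would also need per-request timeouts.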

He doesn't mention fonts by Animats · 2010-06-23 05:39 · Score: 4, Informative · on How HTML5 Will Change the Web

The current versions of all the major browsers can now dynamically download fonts. We can finally stop putting display text in images. Opera, Safari, Chrome, Firefox (3.6 or greater) and IE are all on board with this. By IE 9, they'll even be using the same font format, Web Open Font Format. (Except for the iPad, which, for some weird reason, currently requires fonts in SVG format. But even the iPad understands "@font-face")

Few sites are using this capability yet. We are, as a demo. Try our steampunk search engine with authentic Victorian fonts.

No, Google doesn't have a real search API. by Animats · 2010-05-11 04:17 · Score: 2, Informative · on Scroogle Has Been Blocked

Google once had a real search API. It was SOAP-based. But they discontinued it years ago.

Google's AJAX search API is, by design, very limited. All you can really do is create a little search widget, and perhaps add some fields of your own. The terms prohibit doing much beyond that. "You are allowed to use the API only to display, and to make such uses as are necessary for You to display, Google Search Results on your Property. The API does not provide You with the ability to access, and You are not allowed to access, other underlying Google Services or data. Subject to the limitations and conditions described below," ... "You agree that You will not, and You will not permit your users or other third parties to: (a) modify or replace the text, images, or other content of the Google Search Results, including by (i) changing the order in which the Google Search Results appear, (ii) intermixing Search Results from sources other than Google, or (iii) intermixing other content such that it appears to be part of the Google Search Results; or (b) modify, replace, obscure, or otherwise hinder the functioning of links to Google or third party websites provided in the Google Search Results." Given those restrictions, you can't write Scroogle using that API.

We have a SiteTruth search page which uses the Google AJAX API. We're prohibited from re-ordering the entries or removing any of them. Since the whole point of SiteTruth is to re-order search results by business legitimacy, and we don't do that for the Google results, the Google results are inferior to the ones from other search engines. So our primary search page uses Yahoo/Bing.

Re:The problem: low standards in search engines. by Animats · 2010-04-26 07:18 · Score: 1 · on Several Link-Spam Architectures Revealed

Re SiteTruth complaints: (We have a blog for that.)

Non-commercial web sites aren't rated at all. However, the presence of an ad link marks a site as "commercial", as does being in ".com". Our "commercial intent" detection is rather simplistic. We really should have a classifier system doing that. Yahoo search R&D, back when they had search R&D, built one of those, but never did much with it. We've been reluctant to use machine learning techniques, though, because they reduce the transparency of the system. At present, SiteTruth doesn't rely on "security by obscurity". Adding a classifier system would change that.
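The simplistic rule described above (an ad link or a ".com" domain marks a site as commercial) can be sketched in a few lines of Python. The ad-host list and function names below are illustrative assumptions, not SiteTruth's actual list or code.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately short list of ad-serving hosts for illustration.
AD_HOST_SUFFIXES = ("doubleclick.net", "googlesyndication.com")

def host_matches(host, suffix):
    """True if host is the suffix itself or a subdomain of it."""
    return host == suffix or host.endswith("." + suffix)

def looks_commercial(site_url, outbound_links):
    """Apply the two transparent rules: a '.com' domain, or any ad link."""
    host = (urlparse(site_url).hostname or "").lower()
    if host.endswith(".com"):
        return True
    for link in outbound_links:
        link_host = (urlparse(link).hostname or "").lower()
        if any(host_matches(link_host, suffix) for suffix in AD_HOST_SUFFIXES):
            return True
    return False
```

Rules like these are easy to explain to a site owner, which is the transparency a trained classifier would give up.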

Credit rating information is useful because, for businesses, you can get business size information. Annual sales and number of employees are worth knowing, and displaying to the user in search results. (We'll be doing something in that area soon.) There's a guy in Brooklyn, NY, who took pictures of camera stores that advertise on line or for mail order. There are companies with giant warehouses and loading docks, and there are, well, "marginal locations". It's very funny. Search engines need info like that.

As for specific sites:

GlaxoSmithKline: We give them a yellow "?", which means we think they're legit, but don't have third-party verification that the domain is tied to the company. In our hard-ass view, that's an OK rating. SSL certs and BBBonline links provide such third-party verification. They did match our database. We weren't able to parse "Registered office: 980 Great West Road, Brentford, Middlesex, TW8 9GS, United Kingdom.", unfortunately; we only recognize multi-line postal addresses, usable on an envelope, at present.
Vodafone: All the country sites have SSL certs, but the main ".com" site does not. It does have the address "Vodafone Group Plc / Vodafone House / The Connection / Newbury / Berkshire / RG14 2FN / England" on multiple lines, which we pulled out of the source HTML as a possible address, but did not parse successfully. Still, they got a yellow "?", and were matched to the UK business database.
Oxfam: They get a green checkmark, and the system was able to pull four business addresses from their web site.

Attacks against hosting providers by Animats · 2010-04-26 05:24 · Score: 1 · on Massive Number of GoDaddy WordPress Blogs Hacked

We noticed another attack against a hosting provider recently, but it wasn't GoDaddy; it was ThePlanet, or at least someone who uses their IP block. A number of phishing sites suddenly appeared on our list, and we noticed they all mapped to the same server. Multiple domains on the same server were all hosting the same phishing attack.
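Spotting that pattern amounts to grouping the listed domains by the address they resolve to. A minimal Python sketch follows; the `resolve` callable is a hypothetical stand-in for a DNS lookup (in practice it could wrap something like `socket.gethostbyname`), and the threshold is an assumption.

```python
from collections import defaultdict

def shared_servers(domains, resolve, min_domains=2):
    """Group domains by resolved IP and keep IPs hosting several of them.

    resolve(domain) -> IP string or None is supplied by the caller
    (a hypothetical interface, injected so the sketch runs offline).
    """
    by_ip = defaultdict(set)
    for domain in domains:
        ip = resolve(domain)
        if ip is not None:
            by_ip[ip].add(domain)
    return {ip: sorted(names) for ip, names in by_ip.items() if len(names) >= min_domains}

# Usage with a fake lookup table (documentation-range addresses):
fake_dns = {"a.example": "192.0.2.10", "b.example": "192.0.2.10", "c.example": "192.0.2.99"}
print(shared_servers(fake_dns, fake_dns.get))
```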

Annoyingly, the domain registration for the server's main domain ("websitewelcome.com") was "private". That's actually part of HostGator's system; there's no reason it should have "private registration". It just makes it harder to find the responsible party.

The problem: low standards in search engines. by Animats · 2010-04-25 05:21 · Score: 1 · on Several Link-Spam Architectures Revealed

These guys are doing good work, but really, all they're doing is checking for some specific types of black-hat SEO. This is inherently a losing battle, because there's active opposition. It's a "negative file" approach - making a list of the bad guys. Credit cards once worked that way; merchants were sent daily lists of canceled or stolen credit cards. Back then, getting a credit card was tough; the customer had to be a good customer of the bank. Not until credit card transactions were validated remotely against a "positive file" that checked the actual account could everyone have one. Web search is still in the "negative file" era.

As I point out occasionally, the main search engines have very low standards for business legitimacy. It's an ongoing, and losing, battle to filter out the totally bogus sites. But if you insist on some minimal standard of business legitimacy for a commercial web site, you kick out most of the "bottom feeders" with no business address, and along with them, most of the total phonies. We do this at SiteTruth, which exists to demonstrate that it's possible. SiteTruth tries to find some indication that a domain maps to a real-world business. If it can't, the site is moved down in search engine position. That's enough to move most "bottom feeders" downward, below the legit ones. It's not always successful in finding the business behind the site, but it looks harder than the average user would, looking through the site's "About", "Help", "Contact", etc. pages for a mailing address. If a search engine takes a hard line on this, the junk sites can be kicked out.
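The "move it down" step can be sketched as a stable re-sort of the engine's results. This is only an illustration under assumed names (`Result`, `rerank`), not SiteTruth's ranking code; it demotes commercial sites with no identified business and otherwise preserves the original order.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    url: str
    commercial: bool                  # assumed output of a commercial-intent check
    business_address: Optional[str]   # None if no real-world business was found

def rerank(results):
    """Demote commercial results with no known business; keep engine order otherwise."""
    def penalty(result):
        return 1 if (result.commercial and result.business_address is None) else 0
    # sorted() is stable, so results with equal penalty stay in their original order.
    return sorted(results, key=penalty)
```

Non-commercial sites are left where the search engine put them, matching the point that only commercial sites are rated.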

Once you have a business address for a web site, there are extensive resources for finding out more about the business. It's easy to get annual sales and number of employees if you know what database to buy. Corporate registration information and D/B/A name information is available. Business credit rating info is available in bulk for a fee. Crank that info into search engine positioning and you've got hard data driving search. Rating web sites by looking only at the web is a process easy to manipulate. Use info from the real world, and it's much harder.

Phony mailing addresses do show up, but that's usually associated with phishing sites. Not showing a business address is a misdemeanor in some jurisdictions, but common. Using the address of another business is felony fraud and identity theft. That gets law enforcement attention. So only outright criminals try that. To catch that, we fetch the entire PhishTank database every few hours and blacklist the entire domain for a single phishing entry. That's draconian, but if you're running a site that lets users upload entire pages, it's your job to kick the phishers off. Most of the innocent victims there are free hosting services with weak abuse departments. If you're in the free hosting business or the URL redirection business, you need a strong abuse department, or you will be pwned. Right now, "t35.com" is getting hit hard. By now, most free hosting sites with a clue automatically check PhishTank and the APWG list to see if they're on it. "t35.com" is still doing it by hand, and they're losing the battle.
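The whole-domain blacklist rule can be sketched as below. The feed is assumed to be a plain list of reported URLs (the real PhishTank download format is not reproduced here), and the "last two labels" rule for finding the base domain is a naive assumption that breaks on suffixes like ".co.uk".

```python
from urllib.parse import urlparse

def base_domain(url):
    """Naive base domain: the last two labels of the hostname (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def build_blacklist(phishing_urls):
    """One reported URL is enough to blacklist the entire base domain."""
    return {base_domain(url) for url in phishing_urls if base_domain(url)}

def is_blacklisted(url, blacklist):
    return base_domain(url) in blacklist

blacklist = build_blacklist(["http://badpage.t35.com/login.html"])
print(is_blacklisted("http://innocent.t35.com/", blacklist))  # True
```

The last line shows the draconian part: an unrelated page on the same hosting domain is blocked too.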

So why doesn't Google do this? Google's business model depends on those ad-heavy "bottom feeder" sites. About 36% of Google's "content network" domains are "bottom feeders". When organic search takes you to the right place on the first try, Google doesn't make any money. But if you're led through an ad-heavy site, the Google cash register clicks. Google's business model thus takes them to the dark side. Google would take a big financial hit if they did even some basic legitimacy checking on their advertisers. Search Google for "craigslist auto posting tool", which brings up five Google ads for companies offering to spam Craigs

Getting their attention by Animats · 2010-04-11 05:18 · Score: 4, Interesting · on Why Responsible Vulnerability Disclosure Is Painful and Inefficient

It's hard getting the attention of some vendors. I see vulnerabilities in a slightly different context - hacked web sites hosting phishing pages. We distribute a list of major domains being exploited by active phishing scams. This is obtained by processing PhishTank data, and we do this because we want to reduce the collateral damage from a tough blacklist system. At any given time, there are about 30 to 80 domains on the list.

Some sites get themselves off the list quickly. By now, most of the better free hosting services and short-URL services are automatically checking PhishTank and the APWG blacklist to see when they've been hit. Today, if you run a service where anybody can put up a page that could be used for phishing (i.e. it's not full of your own headers and banners), you need automation to deal with attacks. I've been in contact with the abuse guy at "t35.com", which is a free hosting service. They've recently been hit by a flood of phishing attacks, with several hundred new reports in PhishTank per day. The attacks were coming in faster than the abuse guy could clean them out. They're now gaining on the problem, but haven't squashed it yet. Take-away lesson: automate this.
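For a hosting service, "automate this" might start with something like the Python sketch below: filter a phishing feed down to the pages hosted on your own domain that you have not already dealt with. The feed is assumed to be a list of URLs, and the function names are hypothetical.

```python
from urllib.parse import urlparse

def hosted_on(url, service_domain):
    """True if the URL's host is the service domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    service_domain = service_domain.lower()
    return host == service_domain or host.endswith("." + service_domain)

def pages_to_remove(feed_urls, service_domain, already_removed):
    """Reported pages on our own service that still need action."""
    return sorted(
        url for url in set(feed_urls)
        if hosted_on(url, service_domain) and url not in already_removed
    )
```

Run on a schedule against each new copy of the feed, the output is a work queue that could feed an automatic takedown step rather than a person.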

The ones near the top of the list have been there for a while. Note the dates, which are the date that the oldest phishing report still online and active appeared in PhishTank. Some just need help. Typically, these are small organizations like churches and nonprofits that have had a break-in and were partially taken over by a phishing site. I send them the Anti-Phishing Working Group's "What To Do if your Site Has Been Hacked". Sometimes I give them a phone call. They deserve sympathy.

Then there are the hard cases. These are sites with no visible contact address, or a clueless abuse department. At the moment, Google Sites and Google Spreadsheets are being used for phishing. Google is new to the free hosting business, and the phishers have discovered some tricks that Google can't yet handle. While Google puts a "report abuse" link on their site pages, it's possible to set up a file for downloading on Google Sites, and an HTML page can be served that way, without Google's abuse checking. There's also an exploit of Google Spreadsheets. That one is an example of Habbo Hotel phishing. We've reported these to Google several times, but they haven't been fixed yet.

We've been seeing a new type of attack recently - a phishing operation breaks into a shared hosting server and plants phishing pages on multiple domains on a single server. One of these hit one of the mysterious "*.websitewelcome.com" servers, which has "cloaked domain registration" and no useful default web page. These seem to be associated with "ThePlanet.com", but whether ThePlanet operates them, is providing wholesale hosting, is providing colocation, or is just the upstream connectivity provider is not clear.

Hiding the contact information of a hosting provider is legally unwise. The hosting provider may lose the "safe harbor" protection of the DMCA. The "safe harbor" provision for "Information Residing on Systems or Networks At Direction of Users" only applies if "the service provider has designated an agent to receive notifications of claimed infringement... by making available through its service, including on its website in a location accessible to the public, and by providing to the Copyright Office, substantially the following information: the name, address, phone number, and electronic mail address of the agent." So when the RIAA or the MPAA come calling, a likely event for a hosting service, they get

Businesses are not entitled to anonymity by Animats · 2010-04-08 12:04 · Score: 1 · on Proposal To Limit ISP Contact Data Draws Fire

Neither WHOIS information nor IP address block allocation (ARIN's remit) should be private. Neither businesses nor anonymous web sites are entitled to anonymity in most of the developed world. Europe, in fact, is tougher on this than the US. Europe has the European Privacy Directive, but that's for individuals acting in their private capacity. Businesses come under the European Directive on Electronic Commerce.

1. In addition to other information requirements established by Community law, Member States shall ensure that the service provider shall render easily, directly and permanently accessible to the recipients of the service and competent authorities, at least the following information:
(a) the name of the service provider;
(b) the geographic address at which the service provider is established;
(c) the details of the service provider, including his electronic mail address, which allow him to be contacted rapidly and communicated with in a direct and effective manner;

"Service provider" here means web site owner/operator. So even in an area with strong privacy laws, businesses don't have the right to run anonymous web sites.

California has a similar law for sites that accept credit cards. It's a criminal offense in California to accept credit cards from an anonymous web site.

At SiteTruth, our demo search site, we use this requirement to filter out "bottom-feeder" sites from search results. If a site looks commercial, and we can't figure out who owns it after trying about five different approaches, it's down-rated, and we move it down in search results. This puts teeth into fighting "search engine spam".

Sites can put up phony address info, of course, but that's a felony in many jurisdictions. It's generally treated as fraud, and if it's someone else's address, identity theft. That's a line most "bottom feeders" don't want to cross. Also, much such fraud is reported to sites like PhishTank, so there are red flags to check.

If you want to put up a personal site to express your political opinions, fine. But if it's selling something, it can't be anonymous. Deal with it.

Major domains being exploited by Animats · 2010-03-20 04:29 · Score: 4, Informative · on Naming and Shaming "Bad" ISPs

We've been doing something like this at SiteTruth for two years. We have the list of major domains being exploited by active phishing scams. This is simply a list of domains that are both in PhishTank (about 100,000 entries) and Open Directory (about 1.5 million entries). Today, 84 domains are in both. There's been a surge; it was 54 two days ago.
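Since the list is just the overlap of two sets of domains, the core of it can be sketched in a few lines of Python. The input shapes here (one domain string per phishing report, and a collection of directory domains) and the ordering by report count are assumptions for illustration.

```python
from collections import Counter

def exploited_major_domains(phishing_domains, directory_domains):
    """Domains present in both sources, with the number of phishing reports each.

    phishing_domains: one entry per phishing report (repeats allowed).
    directory_domains: domains listed in the directory.
    """
    counts = Counter(domain.lower() for domain in phishing_domains)
    listed = {domain.lower() for domain in directory_domains}
    both = [(domain, n) for domain, n in counts.items() if domain in listed]
    # Most-reported domains first; ties broken alphabetically.
    return sorted(both, key=lambda row: (-row[1], row[0]))
```

Requiring a directory listing is what limits the output to established domains that are being exploited, rather than throwaway phishing domains.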

Domains are on this list for one of several reasons.

They had a break-in, and didn't clean it up. Generally, the sites with this problem for long periods are ones without effective contact information, so there's no easy way to tell them about their problem.
They have an open redirector. Those are rare now, but were common two years ago. Yahoo, eBay, and Microsoft Live all used to have open redirectors. After much nagging, and some press coverage, the big players have plugged that hole.
They're a hosting service, especially a free hosting service. Free hosting services need to be very aggressive about checking themselves for exploits. The smarter players now read the PhishTank and APWG feeds automatically, to detect abuses of their own systems. Right now, "t35.com" is suffering from a massive attack, with 227 pages in PhishTank. Their problem is that they're being attacked by a program, but are cleaning up by hand. Every day they kick off hundreds of phishing pages, but they can't keep up. The previous site with the worst problems was "piczo.com" (some kind of social network/hosting service for teenage girls), but they've been gaining on the problem.
They're an ISP. There are a few ISPs with phishing sites they just never seem to kick off. Most of the active ones were kicked off long ago. In fact, other than ISPs which are also hosting services, we show only one entry in this category, and it's a DSL line on RoadRunner that redirects to a dead page.
They're a "short URL" service. These are popular as a way to get phishing URLs past spam filters. The "short URL" services have become much more aggressive about kicking off phishing URLs over the last year.

While this is to some extent a "blame the victim" approach, it's more effective than "phishing education" aimed at end users. Hundreds of webmasters have to be educated, not hundreds of millions of end users.

Taking a harder line on phishing-friendly sites by Animats · 2010-03-16 14:50 · Score: 2, Interesting · on Users Rejecting Security Advice Considered Rational

On the phishing front, it's useful to stop blaming the end user, and blame the site that hosted the phishing page.

For some time, I've encouraged taking a harder line on phishing-friendly sites, sites that host phishing pages. I had a paper on this at the 2008 MIT Spam Conference. At SiteTruth, we take the position that one phishing page blacklists the whole second-level domain. Here's the current list of major domains being exploited by active phishing scams.
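As an illustration of "one phishing page blacklists the whole second-level domain", here is a deliberately naive Python sketch. Taking the last two labels of the hostname is an assumption that breaks for suffixes like ".co.uk" or ".ac.kr" and for bare IP addresses; a real implementation would need something like the Public Suffix List. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def second_level_domain(url):
    """Naive guess at the registered domain: the last two labels of the hostname."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def build_blacklist(phishing_urls):
    """One reported phishing URL is enough to list its whole domain."""
    return {second_level_domain(u) for u in phishing_urls} - {""}


def is_blacklisted(url, blacklist):
    """True if the URL sits anywhere under a blacklisted domain."""
    return second_level_domain(url) in blacklist
```

So a single report against "http://user1.t35.com/login.html" blacklists every other page under "t35.com" until the report is cleared.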

The free hosting sites and the "short URL" sites show up on the blacklist regularly. After much nagging and some press coverage, most of them are now very aggressive about kicking off phishing pages, and they don't stay on for long. The better ones now read PhishTank and the APWG blacklist automatically and kick off anything that shows up. Currently, Google is in the doghouse, because they've recently entered the "free hosting business" without adequate phishing defenses. See this abuse of Google Spreadsheets.

At the moment, "t35.com", a free hosting service, is the site most abused in this way, by a large margin. I've contacted their people. The problem is that they're being attacked by a program, and they're cleaning up by hand. Right now, they're hosting 545 known phishing pages. Nobody else is even in double digits. "piczo.com" (a social network/free hosting service for teenage girls) was the last big victim, but they're gradually getting the problem under control.

A Draconian blacklisting policy may seem harsh, but it encourages site operators of easily-exploited sites to be very aggressive about dealing with the problem. We're seeing more free hosting sites with a "click here if this is abuse" button on every page. The number of people who have to be educated to deal with the problem in this way is in the hundreds, not the hundreds of millions. So it's a solvable problem.

If you're going to blame the victim, this is the way to go at it.

Reasonable idea by Animats · 2010-03-02 07:51 · Score: 0, Flamebait · on Detecting Anonymously Registered Domains

That's a good idea. We do something like that at SiteTruth, where we down-rate commercial sites that don't have a real-world contact address on the site. We're looking at user-visible pages, though, not WHOIS. WHOIS data quality is too low.

I'm all in favor of this sort of thing. But don't drop the messages silently; reject them during the SMTP session if you can, or send a mail bounce if you can't. There's much to be said for having a hard-ass attitude about this, but you have to handle the false positives properly.

Anything that sends mail bounces needs to check SPF records. This makes it possible to stop joe-job mail bounce problems. (EXIM mailer people: please finish the implementation of SPF checking and advance it from "experimental", so large ISPs can use it.)

Also, quit whining that putting your real name on your WHOIS registration will get you annoying phone calls, threats, or whatever. I've had my real name and contact info on all my web sites and WHOIS information for a decade, and that's just not happening.

34% "bottom feeder" sites in AdWords. by Animats · 2010-02-18 05:20 · Score: 1 · on Google Makes $500M a Year On Typos

Our own data, at SiteTruth, indicates that about 34% of Google Content Network advertisers, by domain name, are "bottom feeder" sites which we can't associate with a real-world business. This is disappointing, but not surprising. When you see a Google ad, it's not usually from a Fortune 1000 company, after all.

Our data comes from our AdRater plug-in, which rates the advertiser behind each Google ad as it appears on the user's web page. If someone goes to an ad-heavy typosquatting site, we'll see the domains advertised there. (We don't see the typosquatting domain, though; we don't monitor what pages the user views, just the ad domains. We're interested in advertiser behavior, not user behavior.) We collect the domain names of the advertisers, so we have a sizable fraction of Google's customer list, and this is hard data. We're not extrapolating.

(Collecting Google's customer list is a "long tail" kind of thing. The first 25,000 Google advertisers were seen in the first two months; the next 25,000 showed up over about four months. We'll never see them all, but we've probably seen most of them by now. Google probably has somewhere between 50,000 and 100,000 active advertisers, by domain name.)

The numbers indicate that a significant portion of Google's revenue comes from those "bottom feeders". That's why Google can't be very tough on "web spam". They have Matt Cutts claiming that Google tries to stop web spam, but, realistically, they don't try very hard. They can't. It's essential to their business model.

Search Google for "craigslist auto posting tool". Not only are there paid ads for software to put ads on Craigslist using phony accounts, some of them use Google Checkout, so Google gets a cut of what's basically a fraud scheme. ("Automatic CAPTCHA bypass available with integrated Image-to-Text support!") Google's advertiser validation standards are very low.

Low-quality phishing software by Animats · 2009-12-07 08:19 · Score: 3, Interesting · on Hackers vs. Phishers

I've seen that, too. Recently, Stanford University came up on our short list of major sites being exploited by phishers. I was surprised, because Stanford is usually good about stopping that. It was a weird subdomain under "stanford.edu", and at first I thought someone had compromised Stanford's DNS to get their site under the "stanford.edu" domain. But no, it was just some minor machine that had had a break-in.

The directory with the phishing page was readable as a web page and contained the log of captured passwords, so I sent those to Stanford security and Bank of America security. Haven't heard back from either. After the end of the weekend, the site was taken down, and that took Stanford off the blacklist.

We've been reasonably successful at cleaning up that list. We're trying to popularize the idea that one verified phishing URL blacklists the whole domain until the problem is fixed. (The idea behind SiteTruth is to take a hard-line approach and measure the collateral damage so it can be minimized.) The oldest sites on that list are ones which won't respond to complaints by e-mail or phone. In some cases we've sent faxes.

The worst offenders are Piczo and FortuneCity. Piczo is some kind of social network/hosting service for teenage girls, and it's full of phishing pages, mostly for Habbo logins. PhishTank counts 15, and there are probably more. The phony pages are often not in English, and the Piczo abuse department may not recognize a French Habbo phishing page. This may be the next trend in phishing - put your page on a site run by someone unlikely to understand the page. I've seen a phishing page in Greek on an Indian site.

It's getting harder to run a phishing site. Since the end of "domain tasting", the business of high-volume bogus domain registration has tapered off. We haven't seen an "open redirector" on a major site in a while; eBay, Yahoo, and Microsoft Live all used to have at least one. The "url shorteners" are getting very aggressive about killing links to phishing sites. This might be winnable.

I'm looking at you, Slashdot by Animats · 2009-11-30 05:47 · Score: 2, Interesting · on Are Ad Servers Bogging Down the Web?

I've mentioned the ad bottleneck before. Slashdot is an especially bad offender. Pages use several ad servers, and they use "document.write" to stall the page load until the ad comes up. Even if you have the ad images blocked, some of the junk JavaScript still needs to run.

Some sites are just slow at serving pages. Behind my SiteTruth system there is a specialized web crawler which looks for a business name and address on each web site. It never looks at more than 20 pages, and it's looking for pages like "About", "Contact", and about 40 other words which might plausibly lead to contact info. This process runs about 5-15 seconds for a well-implemented site. I log sites where it takes more than 45 seconds. About 5-10% of sites run overtime. In the last hour, the slowest site is "www.airsmaxkey.com", at 159 seconds to read 10 pages. (Yes, they're a bottom-feeder. Not only is there no business address on the site (a criminal offense in the European Union), they have logos from Verisign, PayPal, Verified by Visa, and MasterCard SecureCode, none of which are actually clickable to do the claimed verification. Nor does their shopping cart checkout use SSL. The whole site may be a scam. SiteTruth gives them a "Do Not Enter" rating.)
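The page-selection part of that crawler can be sketched roughly as follows. This is an illustration in Python, not the real thing: the keyword tuple is a small made-up subset of the real word list, and following only links whose anchor text matches a keyword is an assumption. The 20-page cap and the 45-second logging threshold are the figures given above.

```python
MAX_PAGES = 20     # the crawler never reads more pages than this
SLOW_SECONDS = 45  # sites slower than this get logged

# Illustrative subset only; the real list has about 40 more words.
KEYWORDS = ("about", "contact", "imprint", "legal")


def pick_pages(links, max_pages=MAX_PAGES):
    """links: list of (anchor_text, url) pairs found on the site.

    Returns up to max_pages distinct URLs whose anchor text suggests contact info,
    in the order they were first seen.
    """
    seen = set()
    chosen = []
    for text, url in links:
        lowered = text.lower()
        if url not in seen and any(word in lowered for word in KEYWORDS):
            seen.add(url)
            chosen.append(url)
    return chosen[:max_pages]


def is_slow(elapsed_seconds):
    """True if a crawl ran long enough to be worth logging."""
    return elapsed_seconds > SLOW_SECONDS
```

The hard page cap is what keeps the per-site cost bounded, so a slow site costs time but never an unbounded number of requests.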

Some of the social networking sites have so much Javascript that Firefox will time out. (Facebook had that problem for a while. They fixed it.)

Filtering out the bottom-feeders. by Animats · 2009-11-27 06:21 · Score: 4, Informative · on Massive Badware Campaign Targets Google's "Long Tail"

The big search engines remain too "soft" on bottom-feeders. Google once took a harder line. In 2004 and 2005, Google sponsored the Web Spam Summit. Then they had a down quarter and turned to the dark side. Since then, from 2006 to 2009, they've sponsored the Search Engine Strategies conference, the web spammer's convention.

Google has to do this to remain profitable. 35% of AdWords advertisers, by domain, are "bottom-feeders" - sites with no identifiable legitimate business behind them. A significant portion of Google's revenue comes from those bottom-feeders, and the AdWords ads on their sites. If Google filtered out all spam blogs, their revenue would decline.

We, of course, run SiteTruth, as a demo to show that search can have less evil. Try putting some of those "bad" sites into SiteTruth and see how it rates them.

(We get some whining, of course. "I wanna run ads on my blog and I don't wanna say who I am." Tough. You're operating a business, and businesses, by law, don't get to be anonymous. Even in the EU. Deal with it.)

Selective ad-blocking for Facebook? by Animats · 2009-11-12 06:12 · Score: 2, Interesting · on Mafia Wars CEO Brags About Scamming Users

So most of these scam networks block Northern California, to prevent Facebook HQ from seeing them? So that's why I don't see them. I'm a few miles from Facebook HQ. I've completely missed this phenomenon.

I'd applied SiteTruth to Google ads, trying to warn users about the "bottom feeders" with no identifiable legitimate business behind the ad. Myspace is mostly Google ads, so that's covered. Google ads in general are about 35% "bottom feeders" (we track this), but on Myspace, the percentage is much higher. From the article, Facebook has a similar problem, but it's mostly in the form of Facebook-specific ads, games, etc. We're not catching those.

Maybe it's time to do that.

Popular with phishers by Animats · 2009-10-26 06:59 · Score: 1 · on Geocities Shutting Down Today

Geocities was very popular with phishers who needed hosting on a domain too popular to blacklist. We maintain a list of major domains being exploited by active phishing scams, and Geocities is in the #2 position for length of time on the list. Over the last few months, the number of phishing sites hosted on Geocities has slowly declined. Today, on Geocities' last day, there is only one left.

With Geocities out of action, Piczo.com (hosting/social networking for teens) and Fortunecity.com (general-purpose free hosting) become the top hosting services favored by phishers. Most of the Piczo phishing sites seem to be aimed at getting Habbo login credentials. There is apparently a whole racket which breaks into Habbo accounts to steal virtual furniture.

(We finally have all the big players off that list. When we started, Yahoo, Microsoft, Google, and eBay were all on that list. They've all been fixed. The "short URL" sites are now all very aggressive about killing off phishing links; they don't want to get on spam blacklists. Most of the remaining sites on the list are modest sites run by people who have no idea what's going on with their site. The oldest entry on that list, hoseo.ac.kr, is a Korean university. Someone broke into their email system last year and put a phishing site on port 8080. Their webmaster mailbox is full, but we've tried to reach them by other means and may eventually reach someone with a clue.)

Analyzing online anonymity. by Animats · 2009-10-17 05:15 · Score: 1 · on Kaspersky CEO Wants End To Online Anonymity

There are three issues with "online anonymity". One is anonymous businesses, the second is the ability to create an unlimited number of new identities at very low cost, and the third is actual identification of end users.

Anonymous businesses, that is, web sites with commercial intent which don't identify their ownership, are already illegal in many jurisdictions. At SiteTruth, we treat anonymous businesses (where there's no postal mailing address on the web site) as "bottom feeders", and move them to the bottom of search results. Google has a bias against "private registration" domains, but that only kicks in if the site otherwise looks like a junk site. There's not much controversy about this; it's accepted law that a business has to identify itself properly.

The ability to create an unlimited number of new identities causes various forms of trouble. The ability to get vast numbers of free Gmail accounts ("automatically create Gmail Accounts in seconds flat without breaking a sweat") is a windfall for spammers and has destroyed vast sections of Craigslist. The ability to register large numbers of domains with phony domain registration has created a well-known range of problems. Gradually, that's being tightened down. "Domain Tasting" is now dead, now that registrars have to eat the loss if they register and release a domain within 5 days. Phony WHOIS information remains a problem, but could be fixed. When you register a domain, you should get a postal mail piece with the code that enables the domain.

End user identification is the controversial issue. The music industry would like it, but, after all, the music industry is a dinky business compared to the Internet. IBM, HP, Dell, Microsoft, Yahoo, and Google are each bigger than the entire music industry. Other than for email sending, there's no other big interest behind end user identification.