sitetruth.net · Domains · Slashdot Mirror

Re:Plug-ins by Animats · 2013-12-10 08:28 · Score: 1 · on Firefox 26 Arrives With Click-To-Play For Java Plugins

A third-party web application our company uses encountered JavaScript problems in Firefox 24. Waiting for five minutes until Firefox 25 showed up fixed the problem again.

That's reality. I had to post this for one of my Firefox add-ons:

"Due to Firefox Bug 886329, "drop-down list in Jetpack add-on breaks entire UI", the preferences menu in Ad Limiter is not working in Firefox version 23 only. It worked in Firefox 22, and is fixed in Firefox 24, which is now available. We suggest not using Firefox 23."

Re:Why Google search needs to suck by Anonymous Coward · 2011-11-04 22:54 · Score: 0 · on Google Tweaks Algorithm As Concern Over Bing Grows

Search market share seems not to be affected much by search quality. In early 2008, Yahoo was the first search engine to add a group of special purpose subengines for weather, stocks, celebrities, and such. Nobody noticed. Yahoo's market share did not improve. After about six months, Google copied that idea.

Your timeline is wrong; Google was already doing this in 2006:
http://googlesystem.blogspot.com/2006/07/google-onebox-results.html

Also, first does not imply better anyway, as the iPhone showed vs. what existed when it launched in 2007. One would not say "Phone market share is not affected by quality, as the BlackBerry launched years ahead of the iPhone yet regular consumers never noticed."

Because users aren't that sensitive to search quality, Google can optimize search results for revenue.

Faulty premise => unsupported claim

Google has a monkey on their back: the bottom-feeder sites that exist for AdSense traffic. 94% of Google revenue is ads. 30% of that is AdSense. We measure 36% of AdSense domains as "bottom feeders". If Google fixed their search quality problems, their revenue would drop maybe 10%.

You are making a huge unsupported assumption that 36% of sites == 36% of search results == 36% of revenue. Considering mega-sites such as Wikipedia and Facebook come up very often as results, I doubt the first equality is anywhere near accurate. Given the margins at the bottom of the market, I'd also doubt the second part holds.

Your sitetruth site appears to have a pretty broad definition of "bottom feeder":
"We look for third-party ads, and if we find any, the site is evaluated as "commercial", which means we expect to find a real-world company behind the site. We then check out the real-world company, which is what SiteTruth does. This makes most of the anonymous junk sites, spam blogs, and other "bottom feeders" move down in our results."

IOW anything not in your whitelist of 11K URLs but with ads must therefore be a "bottom feeder". Some inevitably are bad, but others certainly are not. Today I found two useful mathematical references to some web searches; one was a 1996-looking page and the other was a blog. Both had the answer I was looking for. Both had ads. Both are rated by your system with a big red circle.

Bing doesn't have that problem.

Bing has rather significant partnerships with major display advertising networks; namely Yahoo and Facebook. Of course they, like Google, realize that people really do leave if you consistently give them crappy results (except for the fraction of the population which is incapable of changing any default). Total traffic is the big multiplier on the left of any revenue calculation, and all of the major engines know that and won't willingly jeopardize that.

Why Google search needs to suck by Animats · 2011-11-04 05:52 · Score: 1 · on Google Tweaks Algorithm As Concern Over Bing Grows

Search market share seems not to be affected much by search quality. In early 2008, Yahoo was the first search engine to add a group of special purpose subengines for weather, stocks, celebrities, and such. Nobody noticed. Yahoo's market share did not improve. After about six months, Google copied that idea. Now all the search engines have similar "verticals", often offering their own in-house content.

Because users aren't that sensitive to search quality, Google can optimize search results for revenue. Google has a monkey on their back: the bottom-feeder sites that exist for AdSense traffic. 94% of Google revenue is ads. 30% of that is AdSense. We measure 36% of AdSense domains as "bottom feeders". If Google fixed their search quality problems, their revenue would drop maybe 10%.
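
The "maybe 10%" figure is what you get by multiplying the three percentages quoted above. A minimal sketch of that arithmetic follows. It assumes that bottom-feeder domains earn AdSense revenue in proportion to their share of domains, which the comment does not establish; the function and constant names are illustrative only.

```python
# Figures quoted in the comment, expressed as fractions.
ADS_SHARE_OF_REVENUE = 0.94   # share of Google revenue that comes from ads
ADSENSE_SHARE_OF_ADS = 0.30   # share of that ad revenue that is AdSense
BOTTOM_FEEDER_SHARE = 0.36    # share of AdSense domains rated "bottom feeders"


def estimated_revenue_drop(ads_share, adsense_share, bottom_share):
    """Upper-level estimate of revenue lost if bottom feeders vanished.

    Assumption (not from the source): revenue is spread evenly across
    AdSense domains, so domain share equals revenue share.
    """
    return ads_share * adsense_share * bottom_share


drop = estimated_revenue_drop(
    ADS_SHARE_OF_REVENUE, ADSENSE_SHARE_OF_ADS, BOTTOM_FEEDER_SHARE
)
print(f"Estimated revenue drop: {drop:.1%}")  # prints "Estimated revenue drop: 10.2%"
```

If the junk domains carry less traffic than average, the real number would be lower than this product.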

Bing doesn't have that problem. They run ads on search result pages, but their third-party program only started recently and is little used. Bing is probably driving more revenue to Google AdSense sites than to their own third-party ads. Bing could get much tougher on web spam if they chose. Until recently, they've mostly tried to match Google's results, but lately they've been going beyond that.

Incidentally, adding "social" inputs makes search worse, not better. Social inputs are too heavily spammed.

We run a "scraper". by Animats · 2011-04-13 07:26 · Score: 1 · on 'Scrapers' Dig Deep For Data On Web

Our SiteTruth system does some "scraping". We're looking for the name and address of the company behind the web site, so we can check the business out. We also look for ad links and a few other things, like BBBonline seals, which we check. We use a user agent name of "SiteTruth.com site rating system". We don't look very deeply into a site; if, after examining the most likely 20 pages, we haven't found out who runs the site, we figure they're not going to tell us. The site is down-rated accordingly.

Our experience is that 0.1% of sites have a "robots.txt" file that tells us to not look at any pages at all. We don't look at those sites, and their SiteTruth rating information says "Blocked". Total exclusion of crawlers is rare. Most sites want some visibility.
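
A minimal sketch of the "Blocked" check, using Python's standard `urllib.robotparser`. This is not SiteTruth's actual code: the function name is hypothetical, and treating "home page disallowed" as "whole site blocked" is a simplifying assumption.

```python
from urllib.robotparser import RobotFileParser

# User agent name quoted in the comment above.
USER_AGENT = "SiteTruth.com site rating system"


def is_fully_blocked(robots_txt: str, home_url: str) -> bool:
    """Return True if robots.txt forbids this crawler from fetching the home page.

    Assumption: a site that disallows its own home page is treated as
    having excluded the crawler entirely.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(USER_AGENT, home_url)


keep_out = "User-agent: *\nDisallow: /"
partial = "User-agent: *\nDisallow: /private/"

print(is_fully_blocked(keep_out, "http://example.com/"))
print(is_fully_blocked(partial, "http://example.com/"))
```

The first call reports the site as blocked; the second does not, because only one directory is excluded.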

One of the more amusing uses of a "robots.txt" file used to be seen on Marchex (the "What you need, when you need it" domainer) pages. The site wasn't blocked from crawling, but the link to the page that told you about Marchex was. That, we suspect, was to keep search engines from noticing that all those domains were really one business. That didn't help Marchex much. Marchex (NASDAQ: MCHX) is still around, stock way down from the peak and reporting a slight loss this quarter.

We do have one exception to obeying the "robots.txt" file. We look at the home page of the site to see if it's a redirect before looking at the "robots.txt" file. Some sites have both a redirect and a "keep out" robots.txt file on the same domain. This is like posting signs that say "Keep Out" and "Please Use Other Door" on the same entrance. That contradiction was apparently a workaround for an old Google crawler bug. Google would index both "example.com" and "www.example.com" separately, then consider them duplicates, which caused some SEO problems.

Actually logging into sites from a crawler is just wrong. I'm amazed that a deep pocket like Nielsen would do that.

Can Google afford to stop spam? by Animats · 2011-01-21 08:48 · Score: 5, Informative · on Google Fires Back About Search Engine Spam

Google has a dilemma. If their search engine takes you directly to the place you want to go, they don't make any money. For a good analysis of this, see "Google Sucks All the Way to the Bank", by Jill Whalen. She is, unfortunately, right. It's essential for Google's success that some of their own ads be more relevant than their search results. Part of their revenue comes from sending users on a side-trip to AdWords-heavy pages. We've measured this, using a browser plug-in which reports AdWords appearances to us. About 36% of domains with AdWords (counting domain names, not traffic) are what we consider "bottom feeders", junk sites with a commercial purpose but no identifiable business behind them.

On the local search front, spam in Google Places is even worse than in their main search results. This, though, appears to be due to ineptitude, not malice. Google added a business search system to Google Maps a year or two ago; that's what Google Places really is. You've been able to go to a Google Maps page and search for businesses for some time now. Few people knew this.

Then, in October 2010, Google merged the map search results into their main search results. "Places" results suddenly got top billing in Google. The "search engine optimization" (SEO) industry swung into action, and began spamming Google Places on a massive scale. (We have a paper on this, which has been mentioned by Techdirt, the New York Observer, etc. It's an amusing read.) Recommendation spamming, which had been going on for a while at a low level, grew substantially once recommendations started affecting Google search results.

This, incidentally, is why Blekko won't work. If they get enough market share to matter, techniques will be developed to spam them into meaninglessness.

Stopping web spam is technically quite possible. We do it by finding the business behind the web site, and doing some automated due diligence. We check business records, SEC filings, BBB ratings, and Dun and Bradstreet to verify business legitimacy. We down-rate most of the junk. We try to err in the down-rating direction, taking the position that it's the job of a company to demonstrate their legitimacy by using their real name and address on their web site, which has to match real-world business records. Our demo site for this shows what search is like if you take a hard line on spam.

Our approach requires more of a hard-ass attitude than Google's business model can perhaps afford. With Blekko making Google look foolish, though, and Bing slowly improving, Google may have to actually do something that works, even if it cuts into revenue from the spam.

Re:The problem: low standards in search engines. by Animats · 2010-04-26 07:18 · Score: 1 · on Several Link-Spam Architectures Revealed

Re SiteTruth complaints: (We have a blog for that.)

Non-commercial web sites aren't rated at all. However, the presence of an ad link marks a site as "commercial", as does being in ".com". Our "commercial intent" detection is rather simplistic. We really should have a classifier system doing that. Yahoo search R&D, back when they had search R&D, built one of those, but never did much with it. We've been reluctant to use machine learning techniques, though, because they reduce the transparency of the system. At present, SiteTruth doesn't rely on "security by obscurity". Adding a classifier system would change that.

Credit rating information is useful because, for businesses, you can get business size information. Annual sales and number of employees are worth knowing, and displaying to the user in search results. (We'll be doing something in that area soon.) There's a guy in Brooklyn, NY, who took pictures of camera stores that advertise on line or for mail order. There are companies with giant warehouses and loading docks, and there are, well, "marginal locations". It's very funny. Search engines need info like that.

As for specific sites:

GlaxoSmithKline: We give them a yellow "?", which means we think they're legit, but don't have third-party verification that the domain is tied to the company. In our hard-ass view, that's an OK rating. SSL certs and BBBonline links provide such third-party verification. They did match our database. We weren't able to parse "Registered office: 980 Great West Road, Brentford, Middlesex, TW8 9GS, United Kingdom.", unfortunately; we only recognize multi-line postal addresses, usable on an envelope, at present.
Vodafone: All the country sites have SSL certs, but the main ".com" site does not. It does have the address "Vodafone Group Plc / Vodafone House / The Connection / Newbury / Berkshire / RG14 2FN / England" on multiple lines, which we pulled out of the source HTML as a possible address, but did not parse successfully. Still, they got a yellow "?", and were matched to the UK business database.
Oxfam gets a green checkmark, and the system was able to pull four business addresses from their web site.

The problem: low standards in search engines. by Animats · 2010-04-25 05:21 · Score: 1 · on Several Link-Spam Architectures Revealed

These guys are doing good work, but really, all they're doing is checking for some specific types of black-hat SEO. This is inherently a losing battle, because there's active opposition. It's a "negative file" approach - making a list of the bad guys. Credit cards once worked that way; merchants were sent daily lists of canceled or stolen credit cards. Back then, getting a credit card was tough; the customer had to be a good customer of the bank. Not until credit card transactions were validated remotely against a "positive file" that checked the actual account could everyone have one. Web search is still in the "negative file" era.

As I point out occasionally, the main search engines have very low standards for business legitimacy. It's an ongoing, and losing, battle to filter out the totally bogus sites. But if you insist on some minimal standard of business legitimacy for a commercial web site, you kick out most of the "bottom feeders" with no business address, and along with them, most of the total phonies. We do this at SiteTruth, which exists to demonstrate that it's possible. SiteTruth tries to find some indication that a domain maps to a real-world business. If it can't, the site is moved down in search engine position. That's enough to move most "bottom feeders" downward, below the legit ones. It's not always successful in finding the business behind the site, but it looks harder than the average user would, looking through the site's "About", "Help", "Contact", etc. pages for a mailing address. If a search engine takes a hard line on this, the junk sites can be kicked out.

Once you have a business address for a web site, there are extensive resources for finding out more about the business. It's easy to get annual sales and number of employees if you know what database to buy. Corporate registration information and D/B/A name information is available. Business credit rating info is available in bulk for a fee. Crank that info into search engine positioning and you've got hard data driving search. Rating web sites by looking only at the web is a process easy to manipulate. Use info from the real world, and it's much harder.

Phony mailing addresses do show up, but that's usually associated with phishing sites. Not showing a business address is a misdemeanor in some jurisdictions, but common. Using the address of another business is felony fraud and identity theft. That gets law enforcement attention. So only outright criminals try that. To catch that, we fetch the entire PhishTank database every few hours and blacklist the entire domain for a single phishing entry. That's draconian, but if you're running a site that lets users upload entire pages, it's your job to kick the phishers off. Most of the innocent victims there are free hosting services with weak abuse departments. If you're in the free hosting business or the URL redirection business, you need a strong abuse department, or you will be pwned. Right now, "t35.com" is getting hit hard. By now, most free hosting sites with a clue automatically check PhishTank and the APWG list to see if they're on it. "t35.com" is still doing it by hand, and they're losing the battle.
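
A minimal sketch of the "one phishing entry blacklists the whole domain" rule. This is an illustration, not SiteTruth's code: the function names are hypothetical, the sample URLs are made up, and the domain is guessed naively as the last two labels of the hostname. A real system would need the Public Suffix List to handle names like "example.co.uk".

```python
from urllib.parse import urlsplit


def site_domain(url: str) -> str:
    """Naive guess at the registered domain: last two hostname labels.

    Assumption: good enough for ".com"-style names; wrong for
    multi-label suffixes such as ".co.uk".
    """
    host = urlsplit(url).hostname or ""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def build_blacklist(phishing_urls):
    """Collapse a list of phishing URLs into a set of blacklisted domains."""
    return {site_domain(u) for u in phishing_urls}


def is_blacklisted(url: str, blacklist) -> bool:
    """True if any page on the URL's domain has been reported for phishing."""
    return site_domain(url) in blacklist


# One reported page on a free hosting service (hypothetical URL)...
blacklist = build_blacklist(["http://someuser.t35.com/login.html"])

# ...takes every other user of that service down with it.
print(is_blacklisted("http://innocent.t35.com/index.html", blacklist))
print(is_blacklisted("http://example.org/", blacklist))
```

That collateral damage is the "draconian" part: the innocent page on the same host is rated down until the hosting service removes the phishing entry.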

So why doesn't Google do this? Google's business model depends on those ad-heavy "bottom feeder" sites. About 36% of Google's "content network" domains are "bottom feeders". When organic search takes you to the right place on the first try, Google doesn't make any money. But if you're led through an ad-heavy site, the Google cash register clicks. Google's business model thus takes them to the dark side. Google would take a big financial hit if they did even some basic legitimacy checking on their advertisers. Search Google for "craigslist auto posting tool", which brings up five Google ads for companies offering to spam Craigs

34% "bottom feeder" sites in AdWords. by Animats · 2010-02-18 05:20 · Score: 1 · on Google Makes $500M a Year On Typos

Our own data, at SiteTruth, indicates that about 34% of Google Content Network advertisers, by domain name, are "bottom feeder" sites which we can't associate with a real-world business. This is disappointing, but not surprising. When you see a Google ad, it's not usually from a Fortune 1000 company, after all.

Our data comes from our AdRater plug-in, which rates the advertiser behind each Google ad as it appears on the user's web page. If someone goes to an ad-heavy typosquatting site, we'll see the domains advertised there. (We don't see the typosquatting domain, though; we don't monitor what pages the user views, just the ad domains. We're interested in advertiser behavior, not user behavior.) We collect the domain names of the advertisers, so we have a sizable fraction of Google's customer list, and this is hard data. We're not extrapolating.

(Collecting Google's customer list is a "long tail" kind of thing. The first 25,000 Google advertisers were seen in the first two months; the next 25,000 showed up over about four months. We'll never see them all, but we've probably seen most of them by now. Google probably has somewhere between 50,000 and 100,000 active advertisers, by domain name.)

The numbers indicate that a significant portion of Google's revenue comes from those "bottom feeders". That's why Google can't be very tough on "web spam". They have Matt Cutts claiming that Google tries to stop web spam, but, realistically, they don't try very hard. They can't. It's essential to their business model.

Search Google for "craigslist auto posting tool". Not only are there paid ads for software to put ads on Craigslist using phony accounts, some of them use Google Checkout, so Google gets a cut of what's basically a fraud scheme. ("Automatic CAPTCHA bypass available with integrated Image-to-Text support!") Google's advertiser validation standards are very low.

Re:wrong assumption by Animats · 2009-12-17 05:48 · Score: 4, Interesting · on Google Says Ad Blockers Will Save Online Ads

If I need "product information", I will find it - ironically - on Google. The difference is that I'll be looking for it, instead of getting it shoved down my throat, willingly or otherwise.

Even from an advertiser perspective, Google's system sucks. On the forums for "search engine optimization", one discovers that ad clicks from Google search results tend to result in sales, while ad clicks from Google ads on non-Google sites (what Google euphemistically calls the "Google Content Network") don't. 50% of ad clicks come from 10% of the user base, and that 10% doesn't buy anything.

Google ads on non-search pages aren't that valuable to advertisers. So why are there so many of them? Because they're opt-out for the advertiser. Many Google advertisers have ads on the "content network" only because they haven't found the hidden button on Google's screens for opting out, as an unhappy Google advertiser reports: "I am running many Google ads and their CTR is around 10%-15% for search page impressions; However the CTR on the content network is 0.02%! I can exclude my ads appearing on certain sites however at the bottom of the URL list it states "Other Domains" which have a total CTR of 0.01% with well over 300,000 impressions in a month! This is driving my overall CTR down massively! If I can not view these sites and choose to exclude them...I need to opt out of all content based placements immediately. How can I do this?"

Also see "Good Reasons to Avoid Content Targeting": "The AdWords user interface misleads new advertisers. Industry consensus suggests that content targeting ought to be used selectively and one should bid lower on content than on search inventory. This is because ads on content inventory tend to convert at a lower rate than ads on search inventory. But when you walk through Google's campaign setup, you find that you've been automatically opted into the content network at the same high bid as your search campaigns."

Much of the "bottom feeder" problem on the Web comes from this one trick of Google's.

We measure some of this at SiteTruth, and some of the results are here.

Google's customer list - public information? by Animats · 2009-12-08 04:55 · Score: 4, Informative · on Google CEO Says Privacy Worries Are For Wrongdoers

What would Google think if someone released their customer list?

We have it. A sample of Google AdWords advertisers:

saarc.autodesk.com
safeguarddd.com
safestepproducts.com
safetyawarenessposters.com
safetyproductsllc.com
safetyrailsource.com
sagemas.com
sagepayservices.com
sagonet.com
saideigama.com

There are about 22,000 Google AdWords customers known to us. Every time Google puts up an AdWords ad, it exposes the identity of the advertiser. Our AdRater browser plug-in rates on-line advertisers as their ads are presented to users. Unlike most plug-ins, we don't monitor user behavior. Instead, we monitor advertiser behavior, which is in some ways more interesting. This doesn't violate Google's terms of service. Every request made of Google was made by a user, not us, during ordinary browsing. We're just watching the ads go by. It's like clipping ads from newspapers to see what your competitors are doing.

As we point out occasionally, about 35% of Google's advertisers are "bottom feeders". Google needs to raise the bar on who can run ads with them. Search Google for "Craigslist auto posting tool" and look at the paid ads. You can buy "Easy Ad Poster Deluxe", a program for spamming Craigslist, through Google Checkout, so Google isn't just advertising it, they're taking a cut of the revenue as well. That's embarrassing for Google, or should be.

Filtering out the bottom-feeders. by Animats · 2009-11-27 06:21 · Score: 4, Informative · on Massive Badware Campaign Targets Google's "Long Tail"

The big search engines remain too "soft" on bottom-feeders. Google once took a harder line. In 2004 and 2005, Google sponsored the Web Spam Summit. Then they had a down quarter and turned to the dark side. Since then, from 2006 to 2009, they've sponsored the Search Engine Strategies conference, the web spammers' convention.

Google has to do this to remain profitable. 35% of AdWords advertisers, by domain, are "bottom-feeders" - sites with no identifiable legitimate business behind them. A significant portion of Google's revenue comes from those bottom-feeders, and the AdWords ads on their sites. If Google filtered out all spam blogs, their revenue would decline.

We, of course, run SiteTruth, as a demo to show that search can have less evil. Try putting some of those "bad" sites into SiteTruth and see how it rates them.

(We get some whining, of course. "I wanna run ads on my blog and I don't wanna say who I am." Tough. You're operating a business, and businesses, by law, don't get to be anonymous. Even in the EU. Deal with it.)

Dumb way to attack Google. by Animats · 2009-11-16 06:08 · Score: 1 · on Mark Cuban's Plan To Kill Google

What a dumb idea.

There are ways in which Google is vulnerable, but that isn't one of them.

Google's real vulnerability is that if organic search is good enough, nobody ever need click on the ads. When organic search takes you to the right place on the first try, Google makes no money. So the organic search results have to suck, just a little, to make the ads look more attractive. Google needs for some of the traffic to go to ad-heavy pages. That's how Google gets much of their revenue. That's where they're vulnerable.

Google advertisers are about 36% "bottom-feeders", sites that don't have an identifiable, real-world business behind them. Most of those are ad sites.

Google needs web spam to profit. by Animats · 2009-07-30 04:38 · Score: 3, Informative · on Google Warns About Search-Spammer Site Hacking

Google can't solve this problem because their business model requires web spam.

Google is in the advertising business, not the search business. Search is a traffic builder for the ads. Google's customers are their advertisers, not their search users. They have to maximize ad revenue. The problem is that more than a third of Google's advertisers are web spammers, broadly defined. All those "landing pages", typosquatters, spam blogs, and similar junk full of Google ads are revenue generators for Google. Every time someone clicks on an AdWords ad, Google makes money, no matter what slimeball is running the ad. Google can't crack down too hard, or their revenue will drop substantially. Google does have some standards, but they're low.

Google went over to the dark side around 2006. In 2004 and 2005, Google sponsored the Web Spam Summit, devoted to killing off web spammers. From 2006, Google sponsored the Search Engine Strategies conference, where the "search engine optimization" people meet. That was a big switch in direction, and a sad one.

As we demonstrate with SiteTruth, it's not that hard to get rid of most web spam if you're willing to be a hardass about requiring a legit business behind each commercial web site. Google can't afford to do that. It would hurt their bottom line.

However, cleaning up web search results with browser plug-ins is a viable option. Stay tuned.

Notes on blocking Google ads by Animats · 2009-04-17 04:48 · Score: 1 · on Microsoft Family Safety Filter Blocks Google

blocks AdSense ads

Now that's an interesting competitive tactic for Microsoft, which doesn't make much of its money from online advertising. Blocking as many ad sites as possible would be a useful and popular browser feature. Not only would the user not have to look at the ads, web browsing would be two or three times faster. Notice how often your browser stalls because the page renderer is waiting for some ad site. Perhaps "family filter" is Microsoft's foray into ad-blocking.

Our AdRater plug-in evaluates AdSense ads and labels them, but doesn't block them. We collect statistics on AdSense advertisers. Over a third of AdSense advertisers are sites that don't clearly identify who owns them. Google's validation of their advertisers is very weak. One could make a good argument for blocking a significant fraction of them on quality grounds alone.

The trouble with "targeted advertising" by Animats · 2009-01-17 04:19 · Score: 4, Interesting · on Technologies To Watch Fail In 2009

"Targeted advertising" has real problems. Ads on search results pages are valuable, because they're presented at the point that the user is actively looking for something. Vaguely relevant ads on other pages (the "Google Content Network" comes to mind) are a distraction, and far less valuable. Clicks on such ads are mostly from the 10% of web users who make 50% of the clicks, but don't buy much. Many advertisers have opted out of the Google Content Network (read Search Engine Watch). As we point out, about 36% of Google Content Network advertisers are "bottom-feeders", junk sites with no verifiable business behind them. There's been a slow decline in contextual advertising, and I expect that to continue, and maybe accelerate. Ad-supported sites will feel the squeeze.

Targeted advertising is effective if the advertiser has the user's buying history. Amazon exploits this successfully; they know exactly what you've bought. But spreading that information around creates privacy problems and loud objections. Merchants aren't keen about letting their competitors know who their best customers are. Payment companies like Visa and PayPal could in theory take that role, but they've been reluctant to do so for fear of regulatory backlash. Payment companies don't currently know what you bought, just who you bought it from. They'd need merchant cooperation to profile their customer base.

What this may mean is a network effect for broad-based online merchants like Amazon. The bigger they get, the better their targeted advertising becomes. Customers don't object, because they're dealing with one company which legitimately knows what they've bought. Amazon may take up the slack as brick-and-mortar stores go under. In consumer electronics, Circuit City, The Good Guys, CompUSA, etc. have all gone under, and Amazon is taking up much of the slack.

Other than search ads, online is doomed by Animats · 2009-01-05 04:48 · Score: 2, Interesting · on How Web Advertising May Go

There are ads that appear with search results, which are valuable to both advertiser and reader. And then there's everything else, which is merely annoying.

Search ads are valuable because they're presented when the user is looking for something and are relevant to the search. At that one moment in time, an ad isn't an interruption of other activity. That's why Google is so successful.

Google ads on other sites, though, are mostly noise. The overall quality of Google contextual advertisers is low. For most serious advertisers, opting out of the Google Content Network, but keeping the search ads, is a good move. Especially since the discovery that 10% of users generate 50% of the clicks, but don't buy much.

Online ads may bring in enough revenue to keep your blog running, but they won't keep your car dealership afloat.

"Don't be evil" by Animats · 2008-09-05 14:35 · Score: 5, Interesting · on Google Turns 10

The problem with Google is that their "don't be evil" claim is hard to take seriously any more. Ads at the right of search results weren't too bad, but then it went downhill. They created the "content-related ad" industry, which resulted in a vast number of "made for AdWords" junk sites and blogs, the "domaining" industry, and a vast amount of crap. Even real advertisers don't like it; the smarter ones opt out of the Google Content Network and stick with the search result ads.

From there it went downhill. Google doesn't do much to qualify their advertisers, and as we point out occasionally, about 35% of them are "bottom feeders", where you can't even identify the real business behind the ad.

Then there's Google Checkout. They accept very marginal businesses. They ought to be doing the kind of validation a bank does of its clients, but clearly, they don't.

Google's real problem is that they went public at the top of their game. Google was #1 in search when they went public, so they couldn't grow in their main business area. They had to expand to justify their high P/E ratio, and none of their expansion areas (YouTube, GMail, etc.) made money. So they had to figure out how to get more revenue per search result. At that point they started to turn to the dark side.

Contact info is better found on the web site. by Animats · 2008-06-22 17:26 · Score: 4, Interesting · on ICANN Asked To Shut Down "Worst" Chinese Registrar

There's been a formal study of bad WHOIS data by the Government Accountability Office, the investigative arm of Congress, titled "Prevalence of False Contact Information for Registered Domain Names". They found at least 8% of contact info in WHOIS to be totally bogus. They also, as a test of ICANN, submitted 45 "WHOIS information problem reports", of which 11 resulted in correction and 33 did not. But GAO didn't break down the data by registrar.

We've been interested in this issue at SiteTruth for some time. We take a broader view of "bad" web sites than most; we consider any commercial site that lacks valid business name and address information to be bogus. Over 35% of Google AdWords advertisers fail that test. For advertisers whose ads appear on Myspace, the ratio is much higher.

Originally, we tried to get contact information from WHOIS data, but the data quality was so appallingly bad that we had to develop another approach. We have a system that looks for contact info the way a user would, looking at pages with names like "About", "Contact", and such, trying to find a user-readable street address. We also have some big databases of business addresses to check against. This turns out to work much better than looking at WHOIS data when the goal is to find the business behind the web site.
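
A minimal sketch of the first step of "looking the way a user would": picking which links on a site are worth examining for a street address. This is an illustration under stated assumptions, not SiteTruth's implementation. The keyword list and function name are hypothetical, and the actual address extraction and database matching are not shown.

```python
# Page names mentioned in the comment; the real list is presumably longer.
CONTACT_KEYWORDS = ("about", "contact")


def candidate_pages(links):
    """Pick links whose text or URL suggests a contact-information page.

    `links` is an iterable of (link_text, href) pairs taken from a page.
    Returns matching hrefs in the order first seen, without duplicates.
    """
    matches = []
    for text, href in links:
        haystack = f"{text} {href}".lower()
        if any(keyword in haystack for keyword in CONTACT_KEYWORDS):
            matches.append(href)
    # dict.fromkeys() removes duplicates while preserving order.
    return list(dict.fromkeys(matches))


links = [
    ("About Us", "/about.html"),
    ("Products", "/products"),
    ("Contact", "/contact"),
    ("about", "/about.html"),
]
print(candidate_pages(links))  # prints ['/about.html', '/contact']
```

Plain substring matching is crude and will produce false positives, but extra candidates only cost a few additional page fetches.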

(You can see this info using our AdRater plug-in for Firefox. Download our plug-in to see the ratings for each Google advertiser as the ads go by. Unless you're already blocking all such ads, of course.)

Statistics for phishing domains are different. by Animats · 2008-06-04 05:26 · Score: 1 · on McAfee Picks the Most Dangerous TLDs

SiteAdvisor is basically an anti-virus program connected to a web spider; it downloads pages and looks for hostile code. This is valuable as a firewall feature, but it doesn't say much about whether a domain is worth visiting.

PhishTank has a list of sites currently involved in phishing scams. Let's take a look at that. At SiteTruth, we have historical PhishTank data in a database, with 40997 phishing attacks recorded. So when we ask the right question (which is "SELECT SUBSTRING_INDEX(domain,".",-1) AS tld, COUNT(*) as cnt FROM domainnegatives GROUP BY SUBSTRING_INDEX(domain,".",-1) ORDER BY cnt DESC LIMIT 20;"), we get

"com",16284
"cn",3787
"net",2866
"tw",2715
"hk",2398
"ru",1065
"org",844
"fr",797
"uk",720
"ph",599
"kg",599
"info",497
"it",495
"de",463
"br",310
"ch",303
"us",282
"pl",282
"jp",279
"at",270

Here, "com" is by far the most popular TLD with phishers. This reflects the desire of phishers to have a plausible-looking domain name. Some phishers, the ones who register domains in bulk, do pick rather bogus-looking domains (like "0001fyg0.com" "00039cscsgrjc.com" "0003s6tw0wqf70l.com" "0003ureb.com" "0004ssen.com" "0004y1x9.com" "00062lku1ekaj.com"). Others have more plausible choices (like "americaonllinebank.com").

Top-level domain statistics are more of a curiosity than anything else. They don't help you avoid or deal with attacks. We could generate many other similar statistics, and we've posted some on the SiteTruth blog.
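
For readers without a MySQL database handy, the query quoted above can be approximated in a few lines of Python. This is a sketch, not the code behind the numbers shown: `tld_counts` is a hypothetical name, and the sample list mixes two domains from the comment with made-up ones.

```python
from collections import Counter


def tld_counts(domains, top=20):
    """Count domains per top-level domain, most common first.

    Mirrors SUBSTRING_INDEX(domain, ".", -1): take everything after the
    last dot, group by it, and keep the `top` largest groups.
    """
    counts = Counter(domain.rsplit(".", 1)[-1] for domain in domains)
    return counts.most_common(top)


sample = [
    "0001fyg0.com",            # from the comment above
    "americaonllinebank.com",  # from the comment above
    "example.cn",              # made up for illustration
]
for tld, count in tld_counts(sample):
    print(f'"{tld}",{count}')
```

The output uses the same `"tld",count` layout as the table above.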

There's more interest in this on the ad side by Animats · 2008-04-16 08:00 · Score: 3, Informative · on How Social Networks May Kill Search as We Know It

The use of "social networking" data for search has been discussed before in the search technology community, where it's not well thought of. "Inertia" in search, where your search history affects your later results, turns out to be a pain. Search becomes nonrepeatable, both for the individual and for others. This adds more hassle than the gain provided by "inertia".

Reading both the article and the interview with the Google VP, it's clear that the article exaggerates Google's interest in this area.

Social networking data is taken seriously on the advertising side, where using social networking data for ad selection is already being done by Myspace and their ilk. Amazon and Netflix already have rather good systems for deciding what to recommend to their customers. That's where this really works, where the seller has a big product selection and the user is already prepped to buy something. Myspace isn't doing as well, but then, as we've pointed out before, their advertisers are mostly bottom feeders. Ad rates on Myspace are very low, and it shows.

A key question is who controls the use of the social networking data for ad selection. Not the user, of course; the disagreement is between the social networking sites and the search engines. Look for a battle in that area, perhaps followed by mergers.

Slashdot Mirror

Domain: sitetruth.net

Comments · 20