Domain: sitetruth.com
Stories and comments across the archive that link to sitetruth.com.
Comments · 190
-
That's just one of many "open redirectors"
There are "open redirectors" on many major sites, including Google, AOL, eBay, and Microsoft Live. (Yahoo plugged their hole by giving their open redirector its own, easily blockable, domain.) We mentioned this on Slashdot a few days ago, and someone immediately followed up by using the Google exploit to get through Slashdot's filters.
These open redirectors are regularly exploited by phishing scams. People report them to PhishTank, and over at SiteTruth, we tie them back to the domain responsible and fix blame. PhishTank is too nice about this. They just blacklist the phishing URL. That stopped working a few months back, when phishers started generating random URLs and subdomains for each e-mail. We down-rate the whole base domain.
It's time to take a hard line on this. The Internet used to tolerate open mail relays, which were a nice feature until spammers started exploiting them. Now they're routinely blocked. Open redirectors now need similar treatment.
Beyond simple URL redirectors are exploits of JavaScript redirectors. Efforts are underway to detect and block those.
-
Google hole that allows a similar attack
There's a related hole in Google Maps, an "open redirector", that allows this exploit. Here's an example:
Caution - hostile URL. Close the page displayed; don't click on anything on it.
Note that it fools Slashdot, and most link scanners in spam filters, into accepting the URL as leading to "google.com". But, in fact, it redirects to the "malware-scan.com" hostile site, which will try to install an Active-X control.
We've been turning up attacks like this with SiteTruth, by using PhishTank information to down-rate sites that have open redirectors. We've found open redirectors on Google and AOL. They're actively being exploited.
So we're currently down-rating Google and AOL. It may seem drastic to downrate an entire major site because they have a few "minor" exploits. PhishTank itself only blacklists specific hostile URLs. But that's no longer enough. Most modern phishing attacks use a unique URL, and often a unique subdomain, for each user attacked. SiteTruth thus takes a harder line. If a domain hosts something one of the data sources says is an attack, it downrates the whole domain automatically.
It's within the power of the site operator to close such security holes. We encourage them to do so.
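To illustrate why a link scanner that only looks at the leading host is fooled, here is a minimal sketch of a check for URLs that carry a second absolute URL pointing at a different host. The function name `embedded_targets` and the example.com/example.net URLs are made up for illustration; this is not SiteTruth's or any spam filter's actual logic.

```python
import re
from urllib.parse import urlsplit, unquote

# Matches an absolute http(s) URL embedded anywhere inside another URL.
EMBEDDED_URL = re.compile(r"https?://[^\s&]+", re.IGNORECASE)

def embedded_targets(url):
    """Return hosts of absolute URLs found in the path or query of `url`
    that differ from the host the link appears to lead to."""
    parts = urlsplit(url)
    outer = (parts.hostname or "").lower()
    # Decode percent-escapes once so "http:%2F%2Fhost" becomes visible.
    tail = unquote(parts.path + "?" + parts.query)
    hosts = []
    for match in EMBEDDED_URL.finditer(tail):
        host = (urlsplit(match.group(0)).hostname or "").lower()
        if host and host != outer and host not in hosts:
            hosts.append(host)
    return hosts

# Hypothetical redirector-style link: looks like maps.example.com,
# but carries a second URL on another host.
suspicious = embedded_targets(
    "http://maps.example.com/redirect?q=http://malware.example.net/")
```

A non-empty result only says the link could be a redirector; whether the outer site actually redirects has to be confirmed by fetching it.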
-
Skipping the blogodreck, here's the real info
Skip the ad-laden overloaded blogodreck site and go directly to StupidFilter. The concept is straightforward - they're training a naive Bayesian classifier, like a spam filter, on a set of text excerpts rated by humans. You can look at random samples from the training set for amusement.
Wikipedia already has some 'bots that do somewhat similar things, looking for totally bogus edits and reverting them. Yahoo's "commercial intent" filter also does something like that, to separate commercial and non-commercial sites. We considered something like that for SiteTruth, where we need to distinguish non-commercial sites so we don't rate them by business criteria.
This approach to filtering will probably need domain-dependent filters. A political site, a social site, a sports site, and a game site all need different training sets. I'd go for a two-stage classifier, one that divided sites into about ten to twenty major categories, and then a stupidity filter trained for each of those categories.
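For readers who haven't seen one, a naive Bayesian text classifier of the kind described above fits in a few lines. This is a generic sketch with add-one smoothing, not StupidFilter's code; the class name, labels, and training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus summed log likelihoods, with add-one smoothing
            # so unseen words do not zero out the probability.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set (invented); a real one would be human-rated excerpts.
nb = NaiveBayes()
nb.train("omg lol u r so dum", "stupid")
nb.train("lol first post", "stupid")
nb.train("the parser handles malformed comments correctly", "ok")
nb.train("naive bayes works well for spam filtering", "ok")
```

The two-stage idea would use one such classifier to pick the site category, then a second instance trained only on that category's excerpts.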
Applying such a filter at blog posting time should be interesting.
And the characters in these books, and plays, and so on, and in real life, I might add, spend hours bemoaning the fact that they can't communicate. I feel that if a person can't communicate the very least he can do is to shut up. - Tom Lehrer.
-
It's a web spammer
OK, let's do some lookups.
First, the USPTO trademark database. "Simpledog" - no hits. "Simple AND dog" - three dead applications for long phrases containing those two words. Definitely not a registered trademark. File your own trademark application if you like. It's easy, the entire process is online, and the fee is a few hundred dollars.
Next, let's try DomainTools. "GNO, Inc. owns about 22,379 other domains." "1,219,449 other sites hosted on this server." That's a web spammer.
Now let's check domain dispute decisions. Here's Panthers BRHC L.L.C. v. Gregg Ostrick/GNO, Inc. (re "bocaresorts.com" dispute). The owners of a resort hotel in Boca Raton challenged GNO for using "bocaresorts.com" against their trademark "Boca Raton Resort & Club" and domain "bocaresort.com". GNO lost.
Finally, couldn't resist running "simpledog.com" through our SiteTruth system. No street address on the site. No SSL cert. Not in any of the business databases. "Site ownership unknown or questionable."
Yes, that's a web spammer all right. No sign of anything that looks like a trademark or a legitimate business.
-
Re:Businesses are not entitled to "privacy".
Your SSL certificate checker has issues. Even when checking an SSL-enabled URL, with a valid commonName, it breaks because it's the wrong host.
Your check page
Wrong host for SSL certificate. Certificate for "services.corecodec.com", actual host "www.corecodec.com". (Peer certificate commonName does not match host, expected www.corecodec.com, got services.corecodec.com)
There's a reason we don't link to https://www.corecodec.com/ - the SSL cert is appropriate for the URLs we call it under. Disregarding that, pulling a https cert for a different host, then complaining that it's not "valid" is bad practice.
Many sites don't use SSL on their main domain - they often use secure.theirdomain.com, ssl.theirdomain.com, etc. It's still a SSL cert for the domain, what's the problem?
Also, your "address checker" needs some real work too - we get a negative rating because we don't have an address on the site. We have a "Contact Us" link on nearly every page on our sites. From the details, it looks like your address regex could use some tweaking - it thinks "Windows Mobile, PocketPC" is an address, but our street address isn't.
-
Businesses are not entitled to "privacy".
The actual ICANN report shows they're deadlocked, all right. See this timeline.
Most of the privacy advocates are referring to the European Directive on Privacy. That only applies to individuals not engaged in business. For businesses, the European Electronic Commerce Directive (2000/31/EC) applies. And it's very clear. Any "natural or legal person providing an information society service" must disclose name, real-world address, and E-mail address. No exceptions.
California has a similar law. It's more narrowly drawn, only applying to sites that take credit cards, but it's a criminal law - six months in jail for not disclosing the "actual name and address" of the business.
WHOIS policy should take that into account. There's a legal obligation to disclose name and address information for businesses. It's not optional.
Our SiteTruth system is based on these laws. If a web site is selling or advertising something, and we can't find a business name and address for it, its rating is toast. We scan each site for human-readable postal addresses (some people would call this "semantic web" technology). We check commercial business databases. We check SSL certificates. We look at Open Directory. If we can't find a business name and address after doing all that, the site's rating is a red "do not enter" sign, and we kick them down to the bottom of search results. Once we have a business name and address, we have something to look up in business databases, corporation records, business license records, credit ratings, criminal records, etc. Plenty of data is available about businesses once you have a name and address. No more "on the Internet, no one knows if you're a dog". We know.
We haven't found WHOIS data very useful in doing this. WHOIS data quality is awful. Many entries are phony. Mailing addresses on the web site itself tend to be more accurate. Using a phony business address is felony fraud in most jurisdictions, so that's relatively rare, and mostly shows up on phishing sites. So we cross-check with anti-phishing databases to kick those sites out.
It's quite possible to use this approach to check WHOIS information in bulk. If ICANN actually cared about WHOIS data quality, they'd check the data against postal databases and business databases. They don't.
-
Browsers are far too forgiving
Browsers are incredibly forgiving of bad HTML. Worse, the definition of "acceptable HTML" is undocumented, both for IE and Firefox. We discovered this writing SiteTruth's parser. We started out with BeautifulSoup, which is supposed to be a "forgiving" HTML parser. By browser standards, it's not; we had to make some improvements. Here are some things that show up in real-world HTML:
- Incorrectly terminated HTML comments. These are so widespread that you have to handle them, or entire web pages are sucked into unterminated comments.
- Unescaped spaces in URLs. Spaces in URLs are supposed to be escaped, but there are A tags out there using URLs with spaces.
- Unescaped CR/LF within a URL. This is rare, and invalid, but multiline URLs are out there. Usually in hostile code.
- Unicode URLs. I've seen a Unicode "Pi" symbol, unescaped, in a URL in a UTF8 document. This was on a phishing site, so it was probably there because it broke some security product.
Part of the reason for the growth in bad HTML is that Adobe seems incapable of making a version of Dreamweaver that consistently generates correct HTML for anything later than HTML 3.2. (Create a moderately complex page in Dreamweaver 8 in HTML 4.x or XHTML mode, and run it through a validator. It will fail.) If the best tools can't get it right, why should anybody else?
Since real world HTML parsing is ambiguous, and bad HTML is widespread, differences between browser parsers and other tools can be exploited as security holes.
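The URL problems in the list above (spaces, embedded CR/LF, raw Unicode) can be normalized before a link is checked. A minimal sketch follows; the function name `clean_href` and the choice to delete line breaks rather than reject the URL are assumptions made for illustration, not documented browser behavior and not SiteTruth's parser.

```python
from urllib.parse import quote

def clean_href(raw):
    """Normalize an href roughly the way a forgiving parser might:
    drop CR/LF/tab, trim, then percent-encode spaces and non-ASCII."""
    # Remove the line breaks and tabs that split a URL across lines.
    joined = "".join(ch for ch in raw if ch not in "\r\n\t").strip()
    # Keep URL delimiters and existing %XX escapes untouched; everything
    # else (spaces, non-ASCII) becomes UTF-8 percent-escapes.
    return quote(joined, safe=":/?#[]@!$&'()*+,;=%~")
```

Whatever rule is chosen, the point is that the security tool and the browser need to agree on it; any difference between the two is what hostile pages exploit.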
-
Filtering out bogus domains - one approach
Does anyone actually buy anything from those bogus domains, or are they all making their money by what is essentially click fraud? Most of them seem to just deliver ads from the usual ad services.
We've been demoing our filter for bogus on-line businesses, SiteTruth, for a while now. Remember "on the Internet, no one knows if you're a dog?" SiteTruth can usually kick the dogs out.
The basic concept is to try to find the business behind the domain. If the web site isn't selling anything and isn't running ads, it's not rated. If it's selling something, there needs to be a business address on the site, preferably one that matches up with business records. So we look through the site for addresses, check SSL certs, look at business directories, do some crunching, and come up with a rating automatically. This is effective against link farms, spam blogs, landing pages, and most of the other trash on the Web.
We use the ratings to reorder search results. We don't block suspicious sites; they just move down in search results. It's a clue stick to apply to suspicious sites - be clear about who's behind the site, or be ignored.
This is an alpha test demo, set up as a search engine web site. The real version will be a browser plug-in. Meanwhile, feel free to try out SiteTruth and complain where appropriate; that's why we're in test. There's a link to the SiteTruth blog on the site if you want to comment. The most interesting searches to try are for heavily spammed keywords, like "herbal viagra" or "london hotels". If your own domains get low ratings, click on the rating icons to find out why. If you're legit, it's usually because the web site has some easy to fix problem.
We've been hearing some grumbling from a few domain owners about this, which indicates we're on the right track. They usually have some long, whiny explanation of why they shouldn't have to disclose the address of their "online business". Tough.
-
Phishing detection by unique URL no longer works.
It's not really enough to just check the URL against some phishing database. The phishing sites now use unique URLs for each phish going out. Some even use unique subdomains. An example is http://onlinesession-949076872.natwest.com.nigy3r.cn.
We've been struggling with this for SiteTruth, which, among other things, uses PhishTank's data. Originally, we used PhishTank's online query API, but that required an exact match on the URL, which was useless. Now we download their entire database every few hours and blacklist the entire base domain (what you buy from a domain registrar) if there's a verified, active phishing site anywhere in the domain.
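Reducing a host name to its base domain can be sketched as below. This is a simplified illustration, not SiteTruth's code: the suffix set is a small hand-picked sample, and a real implementation would need the full Public Suffix List.

```python
# Illustrative subset only; a real implementation needs the full
# Public Suffix List rather than this hand-picked set.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp", "com.cn"}

def base_domain(host):
    """Return the registrable ("base") domain for a host name."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        return ".".join(labels)
    last_two = ".".join(labels[-2:])
    if last_two in MULTI_LABEL_SUFFIXES:
        # Suffix itself has two labels, so keep one more label.
        return ".".join(labels[-3:])
    return last_two

# The phishing host quoted above reduces to the attacker's own domain,
# not to the bank name embedded in the subdomain.
example = base_domain("onlinesession-949076872.natwest.com.nigy3r.cn")
```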
That seems reasonable enough. But there's collateral damage. So, most days, we have AOL, Microsoft Live, and Yahoo blacklisted. That's because those major sites have "open redirectors" - URLs which will redirect to any specified site. For example,
- http://r.aol.com/cgi/redir?http://mgw1.haoyisheng.com/icons/asp.html
A convenient, easy to use redirection script popular with phishers. Provides a URL that appears to be on AOL, but isn't. Interestingly, AOL treats as spam any email that uses their own redirector URL. So it's only useful for attacking non-AOL users.
- http://login.live.com/logout.srf?ct=1179231565
&rver=4.0.1532.0&lc=1033&id=64855
&ru=http:%2F%2Fby117w.bay117.mail.live.com%2Fmail%2Flogout.aspx%3Fredirect%3Dtrue
%26logouturl%3Dhttp:%2F%2F62.49.9.117:443/HB.onlineserv.cgi/
The "logout" page for Microsoft Live can be abused, with some effort, to make it appear as if some hostile site is on Microsoft Live. This looks like Microsoft tried "security through obscurity" and failed.
- http://rds.yahoo.com/_ylt=A0Je5VTi9_RDDbAA3TJXNyoA;
_ylu=X3oDMTE2ZXYybGFuBGNvbG8DdwRsA1dTMQRwb3MDMQRzZWMDc3IEdnRpZANpMDIxXzQ3/SIG=15j5u6auo/
EXP=1140214114/**http://hticketing.com/www.bankofamerica.com/sslencrypt218bit/online_banking/
A Yahoo redirector URL intended to create the illusion of a Bank of America site. It may be possible to exploit this as a cross site scripting attack.
These were all active phishing sites an hour or two ago.
Yes, arguably the intelligent user should be able to visually parse the URLs above and realize that they're not really on the sites indicated. Or notice that a redirection took place. But most users don't notice that. Neither do many anti-phishing tools, especially if the attacker combines both techniques described above.
Phishing has reached the point that if you have an open redirector or proxy on your web site, someone will use it to borrow your reputation for their scam. Open redirectors are now like open mail relays - a nice Internet feature that had to be shut down because of exploits.
So fix those open redirectors, people, or expect to be listed as a phishing-friendly site.
-
How aggressive do you want rating systems to be?
How aggressive should systems be about downgrading ratings for web sites? We've been struggling with this for SiteTruth. In addition to SiteTruth's main function, checking business identity, we have some basic phishing checks. We download the PhishTank database every few hours. PhishTank has lists of bad URLs, but now that the smarter phishing sites change URL and even subdomain in each spam e-mail, blocking by URL is no longer effective. So we now flag the entire base domain.
This can have broad effects. Right now, we're blacklisting all of AOL (SiteTruth report) and all of "live.com" (SiteTruth report). Both AOL and Microsoft Live have redirectors which are being actively exploited by phishing sites. We can't tell their safe URLs from their unsafe URLs, so we have to blacklist the whole domain.
When a site with an open redirector plugs the hole, PhishTank will downgrade those "active phishes" to inactive. We'll then pick that up and rerate them within hours. But until they do, they're in the tank. The whole site.
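The "rerate within hours" behavior falls out naturally if the blacklist is rebuilt from scratch on each download rather than patched. A minimal sketch, assuming each feed record is a dict with hypothetical keys `url`, `verified`, and `online` (the real PhishTank feed layout may differ), and using a deliberately naive last-two-labels base domain:

```python
from urllib.parse import urlsplit

def naive_base_domain(host):
    # Last two labels only; wrong for suffixes like "co.uk" (simplification).
    return ".".join(host.lower().split(".")[-2:])

def build_blacklist(records):
    """Rebuild the set of blocked base domains from one feed snapshot.
    Only verified phishes that are still online count."""
    blocked = set()
    for rec in records:
        if rec["verified"] and rec["online"]:
            host = urlsplit(rec["url"]).hostname
            if host:
                blocked.add(naive_base_domain(host))
    return blocked

def is_blacklisted(url, blocked):
    host = urlsplit(url).hostname
    return bool(host) and naive_base_domain(host) in blocked

# Two snapshots a few hours apart: the redirector phish goes inactive.
url = "http://r.aol.com/cgi/redir?http://mgw1.haoyisheng.com/icons/asp.html"
before = build_blacklist([{"url": url, "verified": True, "online": True}])
after = build_blacklist([{"url": url, "verified": True, "online": False}])
```

Because nothing carries over between snapshots, a domain drops off the list as soon as its last active phish is marked inactive.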
Too harsh? Realistic? Evolution in action? Comments?
-
Corporations have to disclose more than that
Incorrect. Most states that aren't tax dodges (I'm looking at you Delaware) also require some combination of annual reports, corporate bylaws, and principals (often just managing partner).
Yes. I've been data-mining that data for SiteTruth, and it's amusing. No two states have the same format, although there's some similarity. Delaware provides less free info than most states. Some states have deliberate weaknesses in their data. Nevada, for example, doesn't require that changes in corporate officers be reported, so many out of state Nevada corporations have the same guy at a corporation service listed as President.
But that's OK. We're working on recognizing certain common patterns. Like "Incorporation state in low-disclosure states list AND NOT in business directories as having operations in incorporation state AND NOT in SEC Edgar AND NOT registered as foreign corporation in state where doing business IMPLIES slimeball".
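Written as code, that pattern is just a conjunction of lookups. A sketch, with hypothetical field names and an assumed two-entry low-disclosure list (only the two states named above; the real list isn't given here):

```python
from dataclasses import dataclass

# Assumed contents, taken from the states mentioned above.
LOW_DISCLOSURE_STATES = {"DE", "NV"}

@dataclass
class CorpRecord:
    incorporation_state: str
    operates_in_incorporation_state: bool      # per business directories
    in_sec_edgar: bool
    registered_foreign_where_operating: bool   # foreign-corporation filing

def looks_evasive(c):
    """True when every condition of the pattern holds."""
    return (c.incorporation_state in LOW_DISCLOSURE_STATES
            and not c.operates_in_incorporation_state
            and not c.in_sec_edgar
            and not c.registered_foreign_where_operating)

shell = CorpRecord("NV", False, False, False)
public_co = CorpRecord("DE", False, True, True)
```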
Anonymous businesses deserve low search rankings. We're making that happen.
-
Re:XHTML/HTML divergence
I've seen a lot of people wrongly claim that forbidding things like <b><i>...</b></i> is an advantage of XHTML -- are you making the same mistake? Such things are forbidden in HTML too, it's just people tend to get away with things like that with HTML because of its undefined error handling.
I'm looking at it from the side of someone working on a site rating engine that has to parse real-world web pages. There are many constructs that are flatly wrong, yet common enough that they have to be handled. Incorrectly terminated comments show up very frequently, and have to be handled properly. "&" outside an entity is wrong, but almost normal. A </br> without a <br> is treated as a line break by most browsers, rather than being ignored. We see hostile pages that exploit the differences between the standards and what browsers actually understand, using tricks like escaped newlines in URLs. These things are all at the level where the document is turned into a tree, not at the higher levels where we see things like <li> without an enclosing <ul>.
It would be helpful if we could at least insist that pages parse correctly into a tree. Browsers enforce that for XML, which is a good first step towards consistency.
-
The real issues, and how to fix them.
There are only a few major issues:
- Identifying sellers. If you're a seller, you can't be anonymous. That's the law in California and the European Union, but enforcement is weak. We're dealing with that at SiteTruth, where we try to find the business behind the web site. If we can't, we downgrade their search ranking.
- Identifying buyers. That's a problem for the credit card industry. If they really considered it a problem, they'd fix it. They have the tools. One-time credit card numbers, confirmation by cell phone, smart credit cards - solutions are known.
- Spam. Spam by legitimate businesses mostly died with CAN-SPAM, because anything clearly identifiable can be easily filtered. Everything left comes from crooks. And not very many different crooks. Notice how few different spams get through your filters. What's left is a law enforcement problem. Someday the main Viagra spammer will be found and arrested, and that problem will shrink. The US SEC is working the pump-and-dump problem.
- Vulnerable clients. Make Microsoft financially liable and the problem gets fixed, fast.
-
Highlighting phishing sites is nice, but weak
Just highlighting domains of phishing sites isn't going to be enough. Here's today's list of domains that "sort of look like Paypal". These are after subdomain truncation.
"paypal-checker.com"
"paypal-contact.net"
"paypal-customize.com"
"paypal-erreur2.com"
"paypal-security.com"
"paypal-web-dll-scrnupdateaccount.ici.st"
"paypal-web-scrn-dll-pl-dai-pl-webscrndllfs-wertyui.ork.pl"
"paypal.powered.at"
"paypal.q.fm"
"paypalaccverify.com"
"paypalcomcgibinwebscrcmd.by.ru"
"paypalcomcgibinwebscrcmm.by.ru"
"paypalcomcgibinwebscre.by.ru"
"paypalconstomers.com"
"paypalct.com"
"paypall.ro"
"paypalmd.com"
"paypalobjects.us"
"paypalsecuritycenter.org"
"paypalverification.org"
"paypel-acc-5.com"
"paypilpal.com"
"paypll-wscr.com"
"paypluspl.com"
These are from PhishTank, which blacklists at the URL level based on manual reports. For SiteTruth, we're in the process of converting to blacklisting phishing sites by the entire base domain. That's because we now see hundreds of entries like "session-624333.nationalcity.com.userpro.tw", which has to be treated as a bad indicator for all of "userpro.tw".
There's collateral damage. There are days when "tinyurl.com" and "notlong.com" get blacklisted, because phishing sites use them. MSN gets complaints about this. Today, anybody running something like "tinyurl" needs to continually check the phishing databases for attempts to abuse their service, or their own reputation is toast.
-
How to do ratings right
With SiteTruth, which rates web sites, we've had to face a problem similar to Avvo's. The answer seems to have two parts - integrity and transparency. This means looking at information that comes from reliable sources other than the thing being rated, and showing the information from which the ranking is derived.
Avvo is trying to do this. Avvo's information comes partly from external sources, like legal directories and records of disciplinary actions. That's less game-able than traditional web search. And Avvo shows that information, so they have transparency.
Google is slowly coming around to this point of view. Originally, Google rankings were opaque, but now they've put in various "Webmaster Console" features to show some of the information that drives their algorithm.
Google faces the problem that some of their metrics for detecting junk web sites are heuristic, and rely on "security through obscurity". They don't want to say exactly how obscure text can be before it's considered "hidden text", or exactly what they consider a "link farm", or they'll be spammed right up to the allowed limit. So they can't have full transparency. They're inherently limited by the approach of primarily looking at the web site itself, which the site operator can change freely, to rate the site.
Google does look at some external non-Web information, but mostly things like how long a domain has been registered.
Avvo has user ratings of lawyers, which probably aren't that useful. User ratings are most valuable when the universe of raters is much larger than the number of things being rated. So it's good for major movies, where there are tens of new movies and millions of fans, marginal for hotels, and weak for businesses few people have heard of. There aren't enough clients per lawyer to get a statistically valid result, and it's too easy to game when the number of raters is small.
-
Re:Blacklists don't work any more.
SiteTruth rates SiteTruth itself as "Site ownership identified but not verified." (a yellow question mark), which is correct - there's a valid name and address on the web site, but no third party verification of business identity. That's a neutral rating by our standards. The red circle with a bar through it is a bad rating. To get a good rating, a green checkmark, some third party has to verify business identity. A valid BBBonline seal (and yes, we check) or an SSL cert with a name and address will do it. We're working on verification via credit card processors for sites using off-site payment systems. Click on any rating icon for the full explanation of a rating.
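The three-level scheme described above reduces to a short decision function. This is a sketch of the rules as stated in this thread, not the production code; the argument names are hypothetical, and the GREEN label text is a placeholder since only the yellow and red wording is quoted here.

```python
from enum import Enum

class Rating(Enum):
    GREEN = "Site ownership verified."                      # placeholder wording
    YELLOW = "Site ownership identified but not verified."
    RED = "Site ownership unknown or questionable."

def rate(commercial, has_name_and_address, third_party_verified):
    """Return a Rating, or None for sites that are not rated."""
    if not commercial:
        # Sites that sell nothing and run no ads are not rated.
        return None
    if third_party_verified:
        # e.g. a checked seal, or an SSL cert carrying name and address.
        return Rating.GREEN
    if has_name_and_address:
        return Rating.YELLOW
    return Rating.RED
```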
The standard we're enforcing is rather low. Insisting that a business have a valid, published, name and address isn't an obstacle to any legitimate business. Yes, the standards we're enforcing are slightly higher than those of California law. They're consistent with the Consumers Union WebWatch guidelines ("Web sites should clearly disclose the physical location where they are produced, including an address, a telephone number or e-mail address") and the European Electronic Commerce Directive (Member States shall ensure that the service provider (defined as "any natural or legal person providing an information society service") shall render easily, directly and permanently accessible to the recipients of the service and competent authorities, at least the following information: (a) the name of the service provider; (b) the geographic address at which the service provider is established.) There's little support in law for anonymous businesses. If you're running a business without a published, valid name and address, there's something wrong.
All we ask for is a "Contact" or "About" page with a name and address in a format that would work on a mailed envelope. We can find that in HTML text; you don't have to do anything special for SiteTruth.
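To give a feel for what "a format that would work on a mailed envelope" means to a scanner, here is a deliberately simple pattern for US-style addresses in page text. It is an illustrative assumption only: it is not the recognizer SiteTruth uses, it handles one country, and the street-type list is far from complete.

```python
import re

# US-style only: number, street words, street type, city, state, ZIP.
ADDRESS = re.compile(
    r"\b\d{1,6}\s+"                       # house number
    r"(?:[A-Za-z0-9.]+\s+){1,4}"          # street name words
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Way)\.?"
    r",?\s+"                              # comma and/or line break
    r"(?:[A-Za-z.]+\s+){0,2}[A-Za-z.]+"   # city, up to three words
    r",\s*[A-Z]{2}\s+"                    # two-letter state
    r"\d{5}(?:-\d{4})?\b"                 # ZIP or ZIP+4
)

def find_addresses(text):
    """Return every substring of `text` that looks like a US address."""
    return [m.group(0) for m in ADDRESS.finditer(text)]

# Made-up sample text, as it might appear on a "Contact" page.
found = find_addresses("Contact us: 123 Main Street, Springfield, IL 62704")
```

Since `\s+` also matches line breaks, an address laid out over two lines is still found once the HTML has been reduced to text.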
Name and address is just the first stage. Once we have a solid name and address, we can match it against business databases - state incorporation records, D/B/A names, criminal records, and credit ratings. We're doing some of that now, and will be doing more.
-
Blacklists don't work any more.
Blacklists aren't really working any more. As with spam, where each spam message is now different, and as with viruses, where the smarter ones are different for each copy, the more advanced phishing sites now generate multiple sites, not just one site.
PhishTank is fooled by this. It assumes that a "phish site" is a unique URL. The phishing sites are now wise to that trick; many sites generate a new URL for each user, and some even generate a new domain. Current domains in PhishTank include "session-97701.nationalcity.com.userpro.io", "session-300962.nationalcity.com.userpro.io", "session-5489554.nationalcity.com.userpro.tw", "session-2721837.nationalcity.com.directories.io", etc. There are presumably many, many more that no user has reported yet. So the blacklist defense is failing.
It's thus too late for approaches based on manual detection. In the early days of spam, we all reported spam sites to SpamCop, which then blocked them. That stopped working years ago. The same has now happened for phishing sites.
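To see why a per-URL blacklist misses these, it helps to collapse the hostnames quoted above to their base domains. The sketch below is only an illustration: base_domain is a hypothetical helper that naively keeps the last two labels, where a real implementation would need the Public Suffix List to handle suffixes like "co.uk".

```python
from collections import Counter

def base_domain(hostname: str) -> str:
    """Naive base domain: the last two labels of the hostname.

    Assumption: good enough for this example only; suffixes such as
    "co.uk" need a public-suffix lookup instead.
    """
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])

# Hostnames quoted in the comment above.
reported_hosts = [
    "session-97701.nationalcity.com.userpro.io",
    "session-300962.nationalcity.com.userpro.io",
    "session-5489554.nationalcity.com.userpro.tw",
    "session-2721837.nationalcity.com.directories.io",
]

# Count reports per base domain instead of per unique hostname.
reports_per_domain = Counter(base_domain(h) for h in reported_hosts)
```

Four distinct hostnames collapse to three base domains, and every new "session-NNN" hostname lands on a domain that is already known.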
The hard line approach is to implement something that prevents entering credit card or bank information into forms unless the target page has a solid SSL certificate. (And not one of those "Instant SSL - Domain Control Only Validated" cheapo certs that mean nothing, either.) It's getting harder to make even that work, with more and more JavaScript processing going on in the browser. The browser may not be able to detect that the user is filling in a form.
We (SiteTruth), of course, are trying to promote the idea that you don't want to deal with a website unless the business behind the website can be clearly identified, so we do have a bias here. Nor do we have all the answers. But from the amount of activity in this area of security in the last month, it's becoming clear that some major tightening-up on business legitimacy on the web is needed.
"On the Internet, no one knows if you're a dog" just isn't good enough any more.
-
How we filter this stuff
One of the things we do with SiteTruth is filter out sites like this.
SiteTruth is looking for the name and address of the business behind any web site that's selling something. If we can't find a name and address in a place most users would look, it's an illegal business (see California B&P code section 17538, European Directive on Electronic Commerce, etc.) So they get a rating - a big red circle with a bar through it. And they go to the bottom of the search rankings.
If they do give a name and address, we look it up in business databases, and try to tie it to a corporation or a business license. "Millions of Addresses, Thousands of Sites, One Business" is something we can see - if huge numbers of domains map to one real-world business, that just screams "domain spammer".
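The "many domains, one business" signal can be sketched as a simple grouping. This is an assumed illustration, not the actual SiteTruth pipeline: the record layout, the normalization, and the threshold value are all hypothetical.

```python
from collections import defaultdict

def domains_per_business(records):
    """Group domains by the (name, address) found on them.

    `records` is an iterable of (domain, business_name, address) tuples.
    Normalization here is just case and whitespace (an assumption).
    """
    groups = defaultdict(set)
    for domain, name, address in records:
        key = (name.strip().lower(), address.strip().lower())
        groups[key].add(domain)
    return groups

def likely_domain_spammers(records, threshold=1000):
    """Return the business keys that own at least `threshold` distinct domains."""
    return [key for key, domains in domains_per_business(records).items()
            if len(domains) >= threshold]
```

The threshold is arbitrary here; the point is only that the count is taken per real-world business, not per domain.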
We're still in alpha test, so you have to go to our web site to see this, but in time there will be toolbars to squelch this junk at the browser level.
Think of it as "spam filtering 2.0".
-
Who's paying for those clicks?
Somebody is paying for all those clicks, and they're probably not getting much actual business from them. Advertisers are getting fed up with paying for "clicks", just as they did with "banner views" a few years back. The trend is towards paying only for actual sales directly derived from an ad. That's what "Google Checkout" is really about.
It's not hard to filter out typosquatting sites. We do it with SiteTruth, which tries to find the real-world business behind the web site, and down-rates the ones where it can't be found. Almost all the typosquatting sites are anonymous. Some of them have reasonably high Google rankings, because they have inbound links, but as soon as you look behind the facade of the web site, it's clear there's nothing behind them.
With all this "domaining", link-based page rank is no longer meaningful for small and medium business sites. With hundreds of thousands of phony domains, all linking to each other, a growing fraction of business links are just noise. Search engines try to filter out this stuff, but it's like spam filtering; it mostly works, but isn't airtight. With a high volume of junk sites, enough bad links get through to affect ranking.
The other two web-based sources of credibility, user-provided ratings and blogs, are also collapsing. Blog spam is a huge problem. Not only do existing blogs get spammed, millions of automatically created dummy blogs full of spam have been created. Until recently, user-provided ratings had some credibility, but now there's Collactive, which has a sort of spam engine for ratings, Digg, Reddit, and such. (Their slogan: "It's good to be popular").
Amusingly, in this world of spam, Usenet, where spam began, has become almost spam-free.
-
What's a bank? What's a legitimate business?
I posted "What's a bank?" previously, with some examples of ambiguous cases. If the criteria for some ".bank" domain are broadened to financial service businesses generally, it's even worse. That pulls in mortgage brokers, which range from major firms like Provident to the "Lenders compete for your business" spammer. Then there are the "offshore" operators, the "High Yield Investment Program" people, hedge funds of varying degrees of legitimacy, and armies of "affiliates" and "resellers". Expecting domain registrars, who have a terrible reputation as verification services, to sort this out is asking too much.
We've been struggling with this issue for SiteTruth, where we try to rate businesses for "legitimacy". Simply trying to associate the name and address of a legitimate business with a web site is enough to filter out a huge number of marginal web businesses. But it's not a solid protection against more determined fraud operations. We check against third-party sources for identity verification, which helps. We give the highest rating only to sites for which we have some source of third-party confirmation (a valid SSL cert with a name and address, a BBBOnline seal, etc.)
The Online Better Business Bureau is probably the best verification service right now. Their seal of approval actually means something. (But click on it to check that the BBB site says the seal is valid. We check that automatically with SiteTruth, and there are definitely sites out there using the BBBonline graphic that aren't entitled to do so.)
The PhishTank people have a user-reported list of "phishing sites", but it's always behind. Worse, it's by URL, not domain, so sites that generate a new URL for each spam escape that check.
There have been several previous attempts at "identify your business as legitimate by paying us money". This ".bank" scheme falls into that category. Before that, "High Assurance" certificates were touted as a similar scheme. There are several companies selling "seals of approval"; there's "ValidatedSite.com", the "International Bureau of Certified Website Merchants", "Guardian ECommerce", and the "International Chamber of E-Commerce". Most of the certificate authorities have some kind of seal program, too. This ".bank" thing is the same idea, at a higher price point.
-
Re:Web page aliasing issues.
Just for the record
SiteTruth
for http://sitetruth.com/
Site rating
Rating: "Site ownership identified but not verified."
From http://www.sitetruth.com/about.html (Web site, medium confidence)
SiteTruth
999 Woodland Avenue
Menlo Park, CA 94025
* postalcode = 94025
* location = Menlo Park
* countrycode = US
* statecode = CA
Secure certificate
No valid certificate.
Domain www.sitetruth.com
SSL error during certificate validation (certificate verify failed)
No usable certificate available.
Contents of web site
Site "SiteTruth - know who you're dealing with"
Examined 2 web pages in 1.0 seconds.
1. http://www.sitetruth.com/
2. About
Street addresses found:
* http://www.sitetruth.com/about.html (Web site, medium confidence)
SiteTruth
999 Woodland Avenue
Menlo Park, CA 94025
o postalcode = 94025
o location = Menlo Park
o countrycode = US
o statecode = CA
PhishTank fraud database report
Not in PhishTank database.
Information from secondary sources
Commercial site.
Rating history
Rating Date Site IP address Rated by
Site ownership unknown or questionable. 2007-04-15 05:11:47 69.64.67.33 SiteTruth 0.00
Site ownership identified but not verified. 2007-04-14 13:31:24 69.64.67.33 SiteTruth 0.00
and a big question mark at the top of the generated report that means:
"Site ownership not clearly verified, or some issues exist with the business."
-
"Web 2.0" redesigns bad, async uses better
Tribe.net redesigned their home page to use "Web 2.0" around the beginning of 2007. Now users could drag the various boxes around, rearrange the home page, and choose which elements they wanted. (Except for the ads, of course, which were immovable.) The main effect was that "Tribe.net bug reports" became one of the most active groups. Tribe's traffic ratings in Alexa continued to slide.
There are uses for the asynchrony of XMLHttpRequest, though. Try our search and rating box. We have a site rating engine which rates sites on demand, and it takes about 8 to 30 seconds for sites it hasn't looked at yet. We needed a way to present this to the user without stalling the user's browsing.
So we needed a truly asynchronous web page, and we have one. When you enter text into the box and click the big "Search" button, the site gets all the results it can get from the databases immediately, and updates the page. The sites for which ratings aren't yet available show as rotating "busy" icons, which are replaced over the next few seconds as the server reads the target web site, rates it, and sends the ratings back to the browser.
If we did this with stock HTML, the whole thing would feel so sluggish as to be useless. But with a dynamic page, the user gets useful results immediately, which improve over the next 8 to 30 seconds. The user's browsing isn't stalled. In fact, if you enter something new into the search box while updates are in progress, outstanding XMLHttpRequest requests are aborted, and you can do a new search without waiting for old ratings to complete.
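The abort-and-replace behavior described above is done in the browser with XMLHttpRequest. The sketch below is not that code; it uses Python's asyncio purely to illustrate the same pattern, and SearchBox, rate_site, and the delays are all hypothetical stand-ins.

```python
import asyncio

async def rate_site(domain: str, delay: float) -> str:
    """Stand-in for a slow server-side rating (hypothetical)."""
    await asyncio.sleep(delay)
    return f"rating for {domain}"

class SearchBox:
    def __init__(self):
        self.pending = []   # outstanding rating tasks
        self.results = {}   # domain -> rating, filled in as tasks finish

    async def _rate(self, domain, delay):
        self.results[domain] = await rate_site(domain, delay)

    def search(self, domains, delay=0.05):
        # A new search aborts whatever is still in flight.
        for task in self.pending:
            task.cancel()
        # Show a "busy" placeholder immediately for every result.
        self.results = {d: "busy" for d in domains}
        self.pending = [asyncio.create_task(self._rate(d, delay))
                        for d in domains]

async def demo():
    box = SearchBox()
    box.search(["old.example"], delay=1.0)
    await asyncio.sleep(0)                   # let the first search start
    box.search(["new.example"], delay=0.01)  # cancels the old request
    await asyncio.gather(*box.pending)
    return box.results
```

Running asyncio.run(demo()) returns only the rating for the second search; the first one is cancelled before it can complete, so the user never waits on it.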
Few "Web 2.0" sites seem to support as much asynchrony. Google Maps is probably the best known site that really is asynchronous in a useful way.
-
Not hard to improve search if not selling ads
It's not all that hard to improve search. The problem is improving search when you're really in the business of selling ads.
With Yahoo, this is painfully obvious. Yahoo has a good search engine, but their home page and search result pages are so ad-heavy that they're annoying to use. Google has so far resisted the temptation to run picture ads, but there's heavy pressure for them to do so, from both investors and advertisers. The smaller search entrants tend to have more ads; they need the revenue.
As a technology demo, we have SiteTruth, a consumer-oriented search and rating system, in alpha test. We rate commercial web sites based on "business legitimacy". That starts with finding the business name and address of the organization behind the web site. That bypasses most "affiliates", "doorway pages", and similar junk sites, and gets you to the actual site selling something. As we take this further, we'll be validating incorporation status, business licenses, business credit ratings, and the other things one checks when checking out a business. You know, all that stuff you're supposed to check, but nobody ever does.
Then we push the sites with bad ratings to the bottom of the search results and off into obscurity, where they usually belong.
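Pushing bad ratings to the bottom does not require rescoring anything; a stable sort is enough. The following is a minimal sketch with an assumed result format (domain, rating) and assumed rating labels.

```python
def push_bad_to_bottom(results):
    """Move badly rated sites to the end, keeping the original order otherwise.

    `results` is a list of (domain, rating) pairs in the search engine's
    own order. Python's sorted() is stable, so sites with the same key
    keep their relative positions.
    """
    return sorted(results, key=lambda r: r[1] == "bad")
```

The underlying relevance ranking is left alone; only the badly rated entries move.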
Site rating has been tried before, but usually based on user recommendations. User recommendations are too easy to fake; most of the people who write them are interested parties. (Our neighborhood video store is offering a free rental if you say good stuff about them on Yelp.) And coverage is usually narrow; there are more sites than people rating them.
-
Business records, yes
I'd like to have the name and address of every legitimate business in the United States, for web site legitimacy validation. I've purchased databases which contain an approximation to that information, but that's mostly phone book data, not Government data.
More business records need to be easily available. This varies from state to state now. Corporation records are usually freely available, although a few states (notably Delaware) charge for address information. Every US state has their own format; I've yet to find two states using the same output format on their web site. That's a hassle, but can be overcome.
D/B/A name and business license data is even harder to get. It's public record information, and you can get it from data brokers, but it's fairly expensive and not current.
It's easier for some major countries outside the US. The UK has centralized business registration at Companies House. You can get this kind of information for all the G-7 countries (although not for Russia) and most of the major exporting countries, including China.
-
We need better business validation
Without better certification standards, it won't help.
The SSL certificate industry has created something of a mess. In the beginning, it was reasonably hard to get an SSL certificate; you actually had to demonstrate business existence. Standards have since declined considerably.
We've been doing some automatic SSL certificate checking, and we keep finding dirty laundry. State name instead of ZIP code in the "postal code" field. Even incorrect corporate registration numbers in "extended validation" certificates. And this is in certificates where the information has supposedly been validated by the issuer. One major certificate issuer, asked about this, replied "That's what the customer put there", which gives a hint as to the amount of "checking" going on.
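The "state name in the postal code field" kind of error is easy to catch mechanically. Here is a minimal, US-only sketch. It assumes the certificate subject has already been extracted into a dictionary keyed by X.509 attribute names; check_subject and that dictionary layout are hypothetical.

```python
import re

# Five-digit ZIP, optionally followed by a four-digit extension.
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")

def check_subject(subject: dict) -> list:
    """Return a list of problems found in a certificate subject (US only)."""
    problems = []
    postal = subject.get("postalCode", "")
    if subject.get("countryName") == "US" and not US_ZIP.match(postal):
        problems.append(f"postalCode {postal!r} is not a ZIP code")
    return problems
```

A subject with countryName "US" and postalCode "California" is flagged, while one with postalCode "94025" passes.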
"Domain only" certificates, with no business address, have essentially no value. They shouldn't even turn on the lock icon in browsers.
"Extended validation" certificates actually have what ought to be a decent validation system, but they're incredibly overpriced. $1000 per year is overpriced, considering that all they're doing is validating corporate identity.
It's not that hard to do this right. The way it should work is that, when someone signs up for an SSL certificate of any kind, they have to give the identity of the business. That's looked up in the appropriate government records, and a passcode is sent by mail to the address associated with the business. For a corporation, the address for service of process is used, which gets it to the company's attorneys. Issuance of the SSL certificate should only happen once that passcode has been entered. This is cheap to do. You need a physical mailing operation, but that can be outsourced easily to any major direct mail firm. For Extended Validation certificates, use FedEx or registered mail, so delivery confirmation comes back.
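The software side of the mailed-passcode flow above is small. This is a sketch under stated assumptions: MailedPasscodeCheck, the business identifier, and the passcode length are hypothetical, and the government-records lookup and the mailing itself are left out.

```python
import hmac
import secrets

class MailedPasscodeCheck:
    def __init__(self):
        self._pending = {}  # business_id -> passcode awaiting confirmation

    def start(self, business_id: str) -> str:
        """Generate the passcode to print and mail to the address on record."""
        code = secrets.token_hex(4)
        self._pending[business_id] = code
        return code

    def confirm(self, business_id: str, entered: str) -> bool:
        """Allow issuance only if the mailed passcode is entered correctly."""
        expected = self._pending.get(business_id)
        if expected is None:
            return False
        # Constant-time comparison of the two codes.
        if not hmac.compare_digest(expected.encode(), entered.encode()):
            return False
        del self._pending[business_id]  # each passcode works once
        return True
```

Issuance would be gated on confirm() returning True, so the certificate only goes out after someone at the registered address has seen the letter.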
In fact, domain registration should work like that. When you register a domain, you should get postal mail back with an authorization code, and the domain doesn't go into DNS until that authorization code is input. If you're in a hurry, you can pay extra and get the authorization code sent by FedEx Overnight. This should add about $3 to the cost of registering a domain, and the Whois data would get much better.
If we can get the certificate mess under control, the next step is something in the browser's user interface that prevents putting a credit card number, recognized by its format, into a form field unless the page is secure. That might be worth putting in Firefox.
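One way to recognize a card number "by its format" is a length check plus the Luhn checksum that payment card numbers carry. The sketch below assumes that approach; the function names and the 13 to 19 digit range are illustrative, and this is not an actual browser feature.

```python
import re

def looks_like_card_number(text: str) -> bool:
    """True if `text` is 13-19 digits (spaces or dashes allowed) passing Luhn."""
    digits = re.sub(r"[ -]", "", text)
    if not re.fullmatch(r"\d{13,19}", digits):
        return False
    total = 0
    # Luhn: from the right, double every second digit; subtract 9 if over 9.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def allow_field_entry(text: str, page_is_secure: bool) -> bool:
    """Block card-like input on pages that are not secure."""
    return page_is_secure or not looks_like_card_number(text)
```

With the common test number "4111 1111 1111 1111", entry is refused on an insecure page and allowed on a secure one; ordinary text is always allowed.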
Meanwhile, over at SiteTruth, we're trying to attack this problem via search rating: lack of valid business identity + selling something = low ranking. We're still at the proof of concept stage, but it looks promising.
-
What is web spam? Ads from phony businesses.
This is good work by Microsoft. They've tracked down a few big-time web spammers, all the way up the food chain. But there are more.
We've been working on the web spam problem, from a different angle. Our starting point is the legal requirement that a business cannot be anonymous. Every legitimate business must have an identifiable person or corporation behind it. (See CA B&P code sec. 17538 ("disclosure of ... legal name and address information shall appear on ... the first screen displayed ... (or) on the screen on which a buyer may place the order for goods or services ...") and the European Directive on Electronic Commerce ("the service provider shall render easily, directly and permanently accessible to the recipients of the service and competent authorities, at least the following information: (a) the name of the service provider; (b) the geographic address at which the service provider is established...").)
Given that basis, our solution to web spam is straightforward: if we can't find a valid business name and address on a web site that's selling or advertising, it's not a legitimate business. Of course, if there is a name and address, it should match business license data, corporate registration data, fictitious name filings, and similar records of business existence.
So we have a system that parses web pages in some detail, looking for addresses. If a web site has a name and address on it that obeys postal addressing rules, we can usually find it. We have access to some business databases, and we're adding more. We look at some other info, like SSL certs and BBB seals, which has some credibility. Thus, we can check for legitimacy.
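As a rough idea of what "obeys postal addressing rules" can mean, here is a minimal US-only sketch that looks for a "City, ST ZIP" line and treats the two lines above it as name and street. The regular expression, the function name, and the three-line layout are assumptions for illustration; a real parser has to handle far more variation.

```python
import re

# Matches a last address line such as "Menlo Park, CA 94025".
CITY_STATE_ZIP = re.compile(
    r"^(?P<location>[A-Za-z .'-]+),\s*(?P<statecode>[A-Z]{2})\s+"
    r"(?P<postalcode>\d{5}(?:-\d{4})?)$"
)

def find_us_address(lines):
    """Return address fields from the first 'name / street / city' block found."""
    for i, line in enumerate(lines):
        match = CITY_STATE_ZIP.match(line.strip())
        if match and i >= 2:
            fields = match.groupdict()
            fields["countrycode"] = "US"
            fields["name"] = lines[i - 2].strip()    # assumed: name two lines up
            fields["street"] = lines[i - 1].strip()  # assumed: street one line up
            return fields
    return None
```

Fed the three text lines "SiteTruth", "999 Woodland Avenue", "Menlo Park, CA 94025", this yields the location, statecode, postalcode, and countrycode fields in the same shape a SiteTruth report lists them.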
Our goal is to feed this into search engine rankings, so that non-legitimate businesses fall out of visibility.
"Doorway pages" and "affiliates" with no business behind them aren't legitimate businesses, so they're toast. Completely phony addresses won't work, either; they won't match business records. Stealing the name and address of a legitimate business is felony identity theft, which is a place you don't want to go. (Also, sometimes, we can detect and report that.)
An early version of this is already running at SiteTruth.com. If you're responsible for a commercial web site, run it through the "Detailed SiteTruth analysis, for Webmasters" and see what SiteTruth finds. If SiteTruth can't find your business name and address, you might want to fix that. The day will come when it affects your search placement.
This is the alpha test phase for SiteTruth; there's more coming.
Web spam used to be a safe tactic. That was then. This is now.
-
Google doesn't, but it's possible
I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth) and extract data, and I've been considering how to deal with JavaScript effectively.
Conceptually, it's not that hard. You need a skeleton of a browser, one that can load pages, run JavaScript like a browser, and build the document tree, but doesn't actually draw anything. You load the page, run the initial OnLoad JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.
It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a pseudo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing, rather than rendering it.
Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text comes out.
OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.
Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.
-
Our answer for search - SiteTruth
We hadn't planned to announce this quite yet, but this is a good opportunity.
We have a new answer to search - SiteTruth. It's working, but not yet open to the public.
Other search engines rate businesses based on some measure of popularity - incoming links or user ratings. SiteTruth rates businesses for legitimacy.
What determines legitimacy? The sources anti-fraud investigators tell you to check, but nobody ever does. Corporate registrations. Business licenses. Better Business Bureau reports. The contents of SSL certificates. Business addresses. Business credit ratings. Credit card processors. All that information is available. It's a data-mining problem, and we've solved it. The process is entirely automated.
Most of the phony web sites, doorway pages, and other junk on the web have no identifiable business behind them. Try to find out who really owns them, and you can't. When we can't, we downgrade their ranking. With SiteTruth, you can create all the phony web sites you want, but they'll be nowhere near the beginning of any search result.
Creating a phony company, or stealing the identity of another company, is possible, but it's difficult, expensive and involves committing felonies. Thus, SiteTruth cannot be "gamed" without committing a felony. This weeds out most of the phonies.
SiteTruth only rates "commercial" sites. If you're not selling anything or advertising anything, SiteTruth gives you a neutral or blank rating. If you're engaged in commerce, you can't be anonymous. In many jurisdictions, it's a criminal offense to run a business without disclosing who's behind it. That's the key to SiteTruth.
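The rating rule described above can be sketched roughly as follows. This is only an illustration, not SiteTruth's actual code: the `Site` fields, the rating labels, and the `relevance` score are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    domain: str
    relevance: float           # hypothetical score from ordinary search ranking
    is_commercial: bool        # sells or advertises something
    business_identified: bool  # a real business was found behind the site


def rating(site):
    """Non-commercial sites stay neutral; commercial ones must be identifiable."""
    if not site.is_commercial:
        return "neutral"
    return "identified" if site.business_identified else "unidentified"


def rank(sites):
    """Order results so unidentified commercial sites fall to the end."""
    return sorted(
        sites,
        key=lambda s: (rating(s) == "unidentified", -s.relevance),
    )


results = [
    Site("doorway-page.example", 0.95, True, False),
    Site("hobby-blog.example", 0.60, False, False),
    Site("real-store.example", 0.80, True, True),
]

for site in rank(results):
    print(site.domain, rating(site))
```

Here the doorway page drops to last place despite having the highest relevance score, while the non-commercial blog is neither rewarded nor penalized.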
Our tag line: "SiteTruth - Know who you're dealing with."
The site will open to the public in a few months. Meanwhile, we're starting outreach to the search engine optimization community to get them ready for SiteTruth. We want all legitimate sites to get the highest rating to which they're entitled. An expired corporate registration or seal of trust hurts your SiteTruth ranking, so we want to remind people to get their paperwork up to date.
The patent is pending.