I would say the school system has already done half of the job for Microsoft.
It's not the school system. American History teachers are still emphasizing the Bill of Rights, one class period a day, three or four days out of the school year, just as they always have. The 4 or 5 hours a day of TV that teenagers are soaking up, 365 days a year, aren't so patriot^Wcourteous.
IMO, it's the TV media - and the government who's pulling the media's marionette strings - who are to blame. Yet the media is supposedly "liberal." Whatever.
Please don't blame the education system or the teachers. They aren't at fault here.
Who are you quoting? And if you can name a source for your "statistic," who funded the poll, and for what purpose? I am skeptical of most polls because their objective isn't always stated up front, their samples of the population are too small, and the questions can sometimes be misleading.
How about this: "One in three U.S. high school students say the press ought to be more restricted, and even more say the government should approve newspaper stories before readers see them, according to a survey being released today." 112,003 high school students were surveyed; that doesn't seem like too small a population to me.
Hindsight is 20/20 of course, but it seems there should have been the suspicion that someone who can discover, investigate and report on a newsworthy phenomenon every 2.5 days for 5 years straight might be cutting corners somewhere.
I take it you don't subscribe to the newspaper. Some of those journalists would probably kill for the free time afforded by a 2.5 day turnaround.
Back in 2002, Wired (the print version) decided to review one of my freeware apps. I was contacted by two different people. One was a Wired graphics person, who wanted screenshots that I - being a coder, not a graphics monkey - was unfortunately unable to provide in whatever format they wanted, so a picture of my app didn't appear in the mag. Second was someone from Conde Nast who sent a form letter saying, in a nutshell, "We're reviewing your stuff. It'll be in issue 10.7. Wanted to let you know. Thanks."
I actually had an exchange with the graphics chick, because I really wanted to give her the screenshots she was looking for. Language barrier between gfxchic and coderdude screwed things up, she couldn't describe in terms I could figure out how to take the shot that she wanted. I wound up writing her a release saying that if I can't do it, she's more than welcome to; the review went to press with no images. (At which point I started wondering, why is the graphics person asking me to provide a screenshot? Can't she just download the damn app and take her own shots? But I digress.)
Anyway, all of this was over a one paragraph software review. A nothing story and they were begging for my help. Most places would just have reviewed it without contacting me.
I don't fault Wired, per se, for Ms. Delio's journalistic transgressions. Jayson Blair was able to defraud the freaking NYT for a while; Wired rather pales in comparison. I consider Wired to be more the victim than the perp here.
You think that Internet commerce will break down if someone can sniff your credit card number. But then, when you go to a restaurant, you hand over your physical credit card to some waiter you don't know from Adam.
This analogy is horribly flawed in both the attack vector and the viability of attack.
When you go to a restaurant and hand your credit card to the waiter, the waiter swipes your card and returns it to you. There is the opportunity for the waiter himself, and potentially one or two other people who may either witness or take part in the swiping, to retain your card data. There are between 1 and 3 people who may have the capability to steal your card details, and the likelihood of it actually happening approaches zero.
When you buy something on Amazon, your credit card data passes through 6 or 8 hops on its way to Amazon (assuming you're on broadband, add a few for dialup), and several more hops in transit from Amazon to their credit card processor. You have no control over those routers, you can't see them; at the same time, you have no idea who else does "own" those routers.
Someone lifting credit cards from a physical retail outlet, aside from being incredibly stupid, is almost certain to get caught. As soon as two disputed charges share a previous charge at a common location, the credit card issuer's fraud division is going to open an investigation into the common merchant. More than two, and it becomes obvious where the card numbers are being stolen from. What waiter is going to take that risk?
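To make the "common location" reasoning concrete, here is a toy sketch in Python. It is not how any real issuer's fraud division works; the function name, the card labels, and the merchant names are all made up for illustration. It just counts which merchants show up in the purchase history of more than one card with disputed charges.

```python
from collections import Counter
from typing import Dict, List, Set


def common_merchants(histories: Dict[str, Set[str]], min_cards: int = 2) -> List[str]:
    """Return merchants that appear in the history of at least `min_cards` cards.

    `histories` maps a (hypothetical) card label to the set of merchants
    where that card was used before the disputed charges appeared.
    """
    counts: Counter = Counter()
    for merchants in histories.values():
        # Each history is a set, so a merchant counts once per card.
        counts.update(merchants)
    return sorted(m for m, n in counts.items() if n >= min_cards)


if __name__ == "__main__":
    disputed = {
        "card_a": {"Luigi's Restaurant", "Gas Station"},
        "card_b": {"Luigi's Restaurant", "Bookstore"},
    }
    # The one merchant both cards have in common is the obvious suspect.
    print(common_merchants(disputed))
```

With only two cards the overlap is already narrow; every additional disputed card that shares the same merchant makes the source harder to miss.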
On the other hand, sniffing a router somewhere between you and Amazon is far less risky and gives a much greater payoff. After all, if that router lies between you and Amazon, it lies between everyone on your ISP and Amazon. Why bother stealing 2 or 3 credit card numbers in person, at a location where you can be traced, when you can anonymously sniff a router somewhere and gather hundreds or thousands of credit cards?
If SSL is broken, it will cripple ecommerce until a replacement comes along. Punching your card number into a website is nothing at all like handing your card to a waiter.
Have you heard of localization (l10n)? Internationalization (i18n)? Whereas we geeks often use numbers to shorten words, in aviation, the letter X is the shortcut.
It seems to have started with the "ics" words, likely because of the "ix" sound. Mechanics became MX, logistics became LX, avionics became AX. It branched out from there; maintenance is now often abbreviated to MX (and is somewhat interchangeable with mechanics), and weather is WX. Not positive whether or not - pun intended - WX originated in aviation.
Got a long word, but not enough paint? Enter the X. In this case, at least, it actually makes sense. Transfer is often shortened to Xfer in any number of industries.
Currently in the open-proxy and comp-sys-ddos (obviously compromised machines) we have listed over 1.3 million machines. I honestly think that we can do better than to have 1.3 million machines which have been responsible for spewing crap since the inception of the AHBL 2 years ago.
Are you saying that there are 1.3 million positive hosts in the AHBL right now, or that over the past two years, you've had a combined total of 1.3 million hosts? There is a world of difference between these two situations, but I can't tell from your statement which one you meant.
1.3 million hosts over 2 years is only about 1800 hosts per day, which isn't a lovely picture, but doesn't seem all that bad. If you've actually got 1.3 million positives right now, I want rsync access to run a local copy!
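The per-day rate is plain division. A quick sketch, assuming the 1.3 million is a cumulative total spread evenly over two 365-day years (both assumptions are mine, not the AHBL's):

```python
# Back-of-the-envelope check: cumulative listings divided by days elapsed.
TOTAL_HOSTS = 1_300_000
DAYS = 2 * 365  # ignoring leap days


def hosts_per_day(total: int, days: int) -> float:
    """Average number of newly listed hosts per day."""
    return total / days


if __name__ == "__main__":
    print(round(hosts_per_day(TOTAL_HOSTS, DAYS)))  # prints 1781
```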
I am a hunter. When I hunt animals, I am out in the woods with them. Sometimes I find game and sometimes I do not...it all depends on how quiet I am, if I'm tracking correctly, and how well I know the behavior of the critter I'm looking for. It is NOT a sport.
I am a basketball player. When I play another team, I am out on the court with them. Sometimes I land a bucket and sometimes I do not...it all depends on how alert I am, if I'm responding to players' movements correctly, and how well I know the tactics of the other team. It is NOT a sport.
Yeah, I didn't think it made sense either.
I'm not a big proponent of hunting without using the animals you kill, and I respect the fact that you, too, will only kill what you will use. That doesn't mean that hunting isn't a sport. Even pure marksmanship, with nothing to "kill" but a target, is a sport.
That's not insightful. His site is broken, and Google shows information it shouldn't as a result.
It's not "broken" through Squid. We've got Squid running with caching, and it has no problem handling multiple logins to SA (and Slashdot, and Fark, and Broadbandreports, and Webhostingtalk, etc. etc.) from the same external interface/IP.
There are probably other cases in which the site breaks that hadn't been noticed yet because not that many people used a caching proxy before now.
Not many people use a caching proxy? AOL has managed to run a giant caching proxy farm for years now, serving its tens of millions of members, without the adverse side-effects that Google's accelerator is showing.
What's the difference between this and your ISP? your ISP could do the same stuff people claim google can do (as far as tracking). I would like to know how the hell someone got logged in as someone else.
It's pretty simple, really.
1) Bob installs Google Web Accelerator.
2) Bob visits (let's say) Slashdot, and logs in as username "Bob."
3) Bob loads a couple of pages, maybe posts a message or two, then he goes to sleep. Meanwhile, Google caches all of the pages he visits.
4) George, who also uses Google Web Accelerator, visits the same page a few minutes after Bob did.
5) No new stories have been posted since Bob visited, so as far as Google is concerned, there's no need to update the cache.
6) George sees "Bob's version" of the page, complete with "You are signed in as: Bob" type link, and other customizations.
This hit SomethingAwful pretty hard the other day when GWA first went public. Google was caching a lot of pages that admins were viewing; then regular non-admin users with GWA were getting the admin versions of pages from the cache. People were able to see each others' private messages, etc. Quite the mess.
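Here's a toy model of the failure in Python. I have no idea how GWA is built internally, so treat every name here (origin, NaiveSharedCache, PerUserCache) as hypothetical. The point is only the difference between a shared cache keyed on the URL alone and one that also keys on the user's session.

```python
from typing import Callable, Dict, Tuple


def origin(url: str, cookie: str) -> str:
    """Hypothetical origin server: the page is personalized by session cookie."""
    return f"{url} | You are signed in as: {cookie}"


class NaiveSharedCache:
    """Keys cached pages by URL only, so every user shares one copy."""

    def __init__(self, fetch: Callable[[str, str], str]) -> None:
        self.fetch = fetch
        self.store: Dict[str, str] = {}

    def get(self, url: str, cookie: str) -> str:
        if url not in self.store:
            # The first visitor's personalized page is what gets stored.
            self.store[url] = self.fetch(url, cookie)
        return self.store[url]


class PerUserCache:
    """Keys cached pages by (URL, cookie), so users never see each other's copy."""

    def __init__(self, fetch: Callable[[str, str], str]) -> None:
        self.fetch = fetch
        self.store: Dict[Tuple[str, str], str] = {}

    def get(self, url: str, cookie: str) -> str:
        key = (url, cookie)
        if key not in self.store:
            self.store[key] = self.fetch(url, cookie)
        return self.store[key]


if __name__ == "__main__":
    naive = NaiveSharedCache(origin)
    naive.get("/index", "Bob")
    print(naive.get("/index", "George"))  # George gets Bob's page

    safe = PerUserCache(origin)
    safe.get("/index", "Bob")
    print(safe.get("/index", "George"))   # George gets his own page
```

Keying on the cookie is only one way out; a proxy could also simply refuse to cache any response to a request that carried a cookie.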
I'm going to repost something that I posted here last night. I believe it's relevant to the discussion. Repost as blockquote.
Aside from the privacy issue, I have a few major hangups with GWA. One is that it's going to skew web stats; too early to tell by how much, but it's a given that it will happen. If all of a sudden, 20% of a site's traffic appears to be coming from Google, that throws off the webmaster's ability to accurately judge who comprises their userbase. I at least hope that Google will pass through the correct (well, at least the user-reported) User-Agent and Referer.
Another problem is geotargeting, especially as it relates to fraud prevention. A lot of online credit card processors will reject transactions when the purchaser's IP address doesn't match the area they've used for the billing address. For example, if you enter in credit card details for somebody in Vermont, but your IP address says you're on SBC DSL in Reno, they aren't going to accept the charge. In my case, even if the processor lets one of these through, if I'm suspicious that the IP doesn't match the billing info, I cancel the order and refund the txn.
How is Google's proxy going to factor in? Giving them carte blanche is too great a risk, but prevent all of those users from making purchases as "possible fraud and I can't track down who's really making the order" and you're going to lose money. I generally don't accept any purchases from AOL users, because it's impossible to track down fraud after the fact (AOL, like GWA, uses caching proxies). It's primarily been a non-issue because there aren't many AOL users who are the least bit interested in anything I'm selling, but there are a lot of geeks who love Google. I wonder how GWA will play out in terms of CC fraud.
Further, there are possible copyright and trade secret implications. Let's say that you're browsing LexisNexis. Is Google going to cache the LexisNexis data (which is only supposed to be available to paying members)? Will it wind up in the search index? Maybe you're logged into your company's extranet browsing the latest product plans, not the sort of thing I'd want Google seeing. Well I don't want them seeing any of my non-google.com web traffic, but the things they could theoretically have access to via GWA users, yikes.
It's already been proven that GWA is caching things that it shouldn't be caching. In SomethingAwful's case, normal users were seeing admin pages, even seeing other people's private messages. GWA would cache someone's logged-in instance of a page, then redisplay that copy (complete w
I'm in favor of the laws. It's illegal for a minor to purchase "Hustler" and illegal for an adult to give it to a minor.
You lost me when you compared video games to hardcore pornography. Anyway, I'm not in favor of the laws, for any number of reasons.
1) The "it's just a video game" argument, that's been done to death and I'm not going to waste the time.
2) It makes yet another currently legal thing illegal. In general, I don't like when this happens, even for just a certain class of people.
3) It establishes a minimum age for something; I don't tend to like when this happens, either. Over the last century or so, Americans have accepted the fact that you have to be "at least this tall" to buy four things: alcohol, tobacco, firearms (they even made an agency for these specific yet otherwise unrelated things), and porn.
The first three have been proven - by decades of research - to be physically harmful. Drinking can cause liver problems, drinking and driving kills a lot of people. Tobacco causes emphysema, lung cancer, etc. Firearms can kill people. It makes sense to restrict access to these things. Porn, well, nobody's proven anything; but so many people are so afraid of the body parts that the same God they fear gave them, that it's got a minimum age too. These four things are enough.
4) Minimum ages don't really stop anything in the end, and adding them to more classes of products is just going to create more of a pain in the ass for law-abiding citizens. Kids smoke. Kids drink. Kids have plenty of porn. Little gangbanger thugs are shooting each other every day. And, law or no law, kids are going to wind up with the video games they want, whether their parents approve or not. Everybody's got a hookup, right?
I'm 25, but I look younger and cautious cashiers routinely card me for cigarettes. That's fine, it's tobacco, it's always had a minimum age. Now I'm going to start getting carded when I want to buy a video game, because the cashier's afraid of getting busted? It's a damn video game!
I was at Wal-Mart recently (sigh, I know) and among other things, I grabbed a container of that octane booster/buildup cleaner stuff that you pour in your gas tank. When I went to check out, the cashier wanted to see ID. I was floored. It's not like they card me at the gas station when I purchase ten gallons of flammable and explosive liquid! But heaven forbid somebody under 18 get their hands on 10 ounces of it.
Minimum ages are getting out of hand. It's like the stupid laws requiring retailers to keep Sudafed behind the counter where it can't be shoplifted, and limiting sales to 1 box per customer. Meth cookers will just go and rob a warehouse instead of shoplifting from the corner store. Meanwhile, law-abiding people who get a cold are inconvenienced.
Intercepting Internet traffic is not new. Neither is DoS. But unlike more secure Internet transactions such as your Web connection for online banking, VoIP calls are not encrypted. That makes them susceptible to tapping.
This amazes me, I can't believe that the calls are floating around in raw audio. Would a little encryption add so much overhead that it would bog down the system? Or is this due to CALEA or other laws?
There's always going to be data compromise. One should be careful, and precautions should be kept in place but the long-term answer is that consumers will be protected like they are with credit card theft, the losses will become another background cost and you're going to have to live with the possibility that someone will know what movies you rent.
The problem is that one can't be careful. Before Choicepoint's data compromise went public, I don't think I'd ever heard of them before. I certainly didn't know they had a dossier on me and pretty much every other American. If I don't know who has my information, how can I be careful? What precautions are there against companies I've never heard of compiling and trading (and eventually losing) all the little bits and pieces of data that make me, me?
This is a problem that the average person can't do a damn thing about, except to cross their fingers and hope that their bank or credit card company isn't the next one to "lose" information.
Oh man, if only you'd posted two years ago, I could have likely hooked you up with the best porn organizing system ever developed. You haven't lived until you've built a categorizing system for hundreds of gigs of porn (hundreds of thousands of pics, tens of thousands of movies). 'Course having access to all that porn doesn't hurt either!
At the time I was a partner in a company that envisioned creating the mother of all porn sites. We were dealing with such an amount of content that we had to devise our own way of managing it. The solution I came up with was PHP/MySQL based. There was a database that held ungodly amounts of metadata about every file, and a "sorting" script that allowed some hired grunts to sit around all day literally reviewing thumbnails and sorting each image/movie by quality, assigning categories, etc.
A major HD death (of an unbacked-up HD*...) killed the project, and I've only got some snippets of the code left, and no database schema. As an example, though, here is some of the metadata we collected on every file - this code is not from the final revision, so some fields are missing...
In both cases (JPGs and movies), "cats" was a string built earlier in the sorting process, like "cat1=5, cat2=17, cat3=40"... Each catN corresponded to an entry in the "categories" table. So you could assign one pic/vid to categories "Asian," "Foot Fetish," "Brunette," whatever fit.
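The original code was PHP and is gone, so this is not a recovery of it, just a small Python sketch of how a "cats" string in that format could be turned back into category IDs. The parse_cats helper and the ID-to-name table below are invented for the example; the real table's contents aren't known.

```python
from typing import Dict, List

# Hypothetical stand-in for the "categories" table (id -> name).
# The real mapping is unknown; these pairings are made up.
CATEGORIES: Dict[int, str] = {5: "Asian", 17: "Foot Fetish", 40: "Brunette"}


def parse_cats(cats: str) -> List[int]:
    """Parse a string like "cat1=5, cat2=17, cat3=40" into [5, 17, 40]."""
    ids: List[int] = []
    for pair in cats.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas or empty input
        _, _, value = pair.partition("=")
        ids.append(int(value))
    return ids


if __name__ == "__main__":
    ids = parse_cats("cat1=5, cat2=17, cat3=40")
    print([CATEGORIES.get(i, "unknown") for i in ids])
```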
The movie sorter was really one of my better works ever - and it took forever to build. Unreviewed vids went into an "unsorted" dir. The person reviewing content would choose a vid to review. The script used mplayer to rip a few stills from the vid in realtime, the person sorting could choose a couple of stills to act as thumbnails. Once they assigned categories, description, quality, etc. the script would cut 3 different resolution versions (low, med, hi) of the vid, god I wish I still had all the code:/
One thing of vast importance to anyone building any sort of file tracking system - be it for porn, mp3s, whatever - is hashing. We used md5. Content providers would upload or send CD's full of zips or rars, and often times, the same set of pics would show up in multiple places. Being able to compare file hashes was essential in preventing duplicate content from going into the database. I guess for MP3s this would be a little harder, since you might have two copies of the same song ripped by different people...
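For anyone who wants the gist of the duplicate check, here is a minimal Python sketch (again, the original was PHP and this is not that code). The function names are mine. It hashes each file in chunks with md5 and groups paths that share a digest.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Group files by digest and keep only groups with more than one file."""
    by_hash: Dict[str, List[Path]] = {}
    for path in paths:
        by_hash.setdefault(md5_of_file(path), []).append(path)
    return {h: ps for h, ps in by_hash.items() if len(ps) > 1}
```

This only catches byte-identical files, which is exactly why the MP3 case is harder: two rips of the same song will almost never hash the same.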
Maybe someday I'll try to rebuild the DB schema and rewrite the missing portions of the code - which unfortunately is most of it. But if anyone else wants to write a porn organizer, maybe you picked up some pointe
Exactly. Now they can find pages that are rarely linked, yet may be valuable. I wonder if this also might allow them to search the 'deep web'. Imagine a user with this browsing an online chemistry database where the only way to find info is by filling out some text fields on a website. Now Google will be able to find these deep websites by having users do all the grunt work.
This brings up an interesting point. Some content on the web is never linked because it isn't supposed to be found. I realize that robots.txt and .htpasswd are there for a reason, but if I place a file on a website and I don't link to it from anywhere, I should be able to reasonably expect that it will never be found by a search engine.
As for your chemistry database example, that piques my interest even more. Let's say that instead of a chemistry database, you're browsing LexisNexis. Is Google going to cache the LexisNexis data, which is only supposed to be available if you're a paying member?
What about other for-pay sites, where only registered users are supposed to be able to see the content? Let's say you login to site.com and call up pagex.html through the Google proxy. A few minutes later, I call up http://site.com/pagex.html through the Google proxy. Even though I didn't log into the site, will Google show me the content they cached during your visit? Will that content show up as a Google result, or in the Google cache?
There seem to be a lot of potential problems when someone as large as Google decides to set up a public proxy.
Let's assume that Joe Schmoe installs the "web accelerator". Next he downloads child porn. Who's responsible for this? Can he sue Google, claiming they "put it there" ?
I wouldn't imagine so. If Joe Schmoe downloads child porn on his Comcast cable modem connection, he can't sue Comcast, they didn't "put it there." (Well, he can sue anyone he wants, but he's not going to win.) Google didn't "put it there" either - he's the one who requested the URL, after all.
Plus, what about copyrights and such? Will Google be held liable for pushing out outdated pages? How will the servers (from where Google is grabbing pages) get their statistics? And since Google will be sort-of screen-scraping, why does Google object to it themselves?
If Google hasn't been found liable for keeping local caches of billions of copyrighted web pages, then I don't think adding a proxy service is going to change things any. A lot of ISPs (think AOL) already do some level of transparent proxy caching; in fact, AOL by default even compresses all .JPGs into some lossy .ART files. It's not like this is really a new idea.
I am sort of curious about the web stats issue. If all of a sudden, 20% of my traffic appears to be coming from Google, that throws off my ability to judge the geographic makeup of my traffic. I at least hope that Google will pass through the correct (well, at least the user-reported) User-Agent and Referer.
Come to think of it, there could be more serious implications with the "geo IP" issue. A lot of online credit card processors will reject transactions if the purchaser's IP address doesn't match the general region they've used for the billing address. In my case, even if the processor lets one of these through, if I'm suspicious that the IP doesn't match the billing info, I cancel the order and refund the txn.
How is Google's proxy going to factor in? Giving them carte blanche is too great a risk, but prevent all of those users from making purchases and you're going to lose money.
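A rough Python sketch of the dilemma, under loud assumptions: the prefix table, the proxy range, and the function names are all invented (the addresses are documentation-only ranges), and real processors use commercial geolocation data rather than a three-line dict. It shows why a proxy address leaves the check with nothing to compare against.

```python
from typing import Dict, Optional, Tuple

# Made-up IP-prefix -> US state table standing in for a real geolocation database.
GEO_TABLE: Dict[str, str] = {"203.0.113.": "NV", "198.51.100.": "VT"}

# Made-up range standing in for a caching proxy farm's outbound addresses.
PROXY_PREFIXES: Tuple[str, ...] = ("192.0.2.",)


def region_for_ip(ip: str) -> Optional[str]:
    """Look up the state for an address, or None if the table has no match."""
    for prefix, region in GEO_TABLE.items():
        if ip.startswith(prefix):
            return region
    return None


def review_order(ip: str, billing_state: str) -> str:
    """Return "accept", "reject", or "unknown" for a purchase attempt."""
    if ip.startswith(PROXY_PREFIXES):
        # The address belongs to the proxy, not the buyer, so it proves nothing.
        return "unknown"
    region = region_for_ip(ip)
    if region is None:
        return "unknown"
    return "accept" if region == billing_state else "reject"


if __name__ == "__main__":
    print(review_order("203.0.113.7", "VT"))   # Nevada address, Vermont billing
    print(review_order("198.51.100.9", "VT"))  # address matches billing state
    print(review_order("192.0.2.1", "VT"))     # came through the proxy
```

Whatever a merchant decides to do with "unknown" is the whole question: treat it as accept and you've handed out carte blanche, treat it as reject and you turn away every proxy user.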
Just wait till everyone is using i2p. Then the RIAA can't really do anything about it.
Sure they can. They own the lawmakers, remember? When everyone is using I2P, use of I2P in the US will be made unlawful. Then they won't even have to prove you were transferring a specific file, just that you were speaking a certain protocol.
I will not only make sure her car has OnStar, but I will have one of these handy dandy GPS units surgically implanted in her hip. Then I can track her on Google maps.
Sorry, but this one is Google's bad.
It's pretty simple, really.
1) Bob installs Google Web Accelerator.
2) Bob visits (let's say) Slashdot, and logs in as username "Bob."
3) Bob loads a couple of pages, maybe posts a message or two, then he goes to sleep. Meanwhile, Google caches all of the pages he visits.
4) George, who also uses Google Web Accelerator, visits the same page a few minutes after Bob did.
5) No new stories have been posted since Bob visited, so as far as Google is concerned, there's no need to update the cache.
6) George sees "Bob's version" of the page, complete with "You are signed in as: Bob" type link, and other customizations.
This hit SomethingAwful pretty hard the other day when GWA first went public. Google was caching a lot of pages that admins were viewing; then regular non-admin users with GWA were getting the admin versions of pages from the cache. People were able to see each others' private messages, etc. Quite the mess.
I'm going to repost something that I posted here last night. I believe it's relevant to the discussion. Repost as blockquote.
Bon soir (hey, it's night for me),
1) The "it's just a video game" argument, that's been done to death and I'm not going to waste the time.
2) It makes yet another currently legal thing illegal. In general, I don't like when this happens, even for just a certain class of people.
3) It establishes a minimum age for something; I don't tend to like when this happens, either. Over the last century or so, Americans have accepted the fact that you have to be "at least this tall" to buy four things: alcohol, tobacco, firearms (they even made an agency for these specific yet otherwise unrelated things), and porn.
The first three have been proven - by decades of research - to be physically harmful. Drinking can cause liver problems, drinking and driving kills a lot of people. Tobacco causes emphysema, lung cancer, etc. Firearms can kill people. It makes sense to restrict access to these things. Porn, well, nobody's proven anything; but so many people are so afraid of the body parts that the same God they fear gave them, that it's got a minimum age too. These four things are enough.
4) Minimum ages don't really stop anything in the end, and adding them to more classes of products is just going to create more of a pain in the ass for law-abiding citizens. Kids smoke. Kids drink. Kids have plenty of porn. Little gangbanger thugs are shooting each other every day. And, law or no law, kids are going to wind up with the video games they want, whether their parents approve or not. Everybody's got a hookup, right?
I'm 25, but I look younger and cautious cashiers routinely card me for cigarettes. That's fine, it's tobacco, it's always had a minimum age. Now I'm going to start getting carded when I want to buy a video game, because the cashier's afraid of getting busted? It's a damn video game!
I was at Wal-Mart recently (sigh, I know) and among other things, I grabbed a container of that octane booster/buildup cleaner stuff that you pour in your gas tank. When I went to check out, the cashier wanted to see ID. I was floored. It's not like they card me at the gas station when I purchase ten gallons of flammable and explosive liquid! But heaven forbid somebody under 18 get their hands on 10 ounces of it.
Minimum ages are getting out of hand. It's like the stupid laws requiring retailers to keep Sudafed behind the counter where it can't be shoplifted, and limiting sales to 1 box per customer. Meth cookers will just go and rob a warehouse instead of shoplifting from the corner store. Meanwhile, law-abiding people who get a cold are inconvenienced.
There are enough minimum ages already.
It's just a damn video game.
At the time I was a partner in a company that envisioned creating the mother of all porn sites. We were dealing with such an amount of content that we had to devise our own way of managing it. The solution I came up with was PHP/MySQL based. There was a database that held ungodly amounts of metadata about every file, and a "sorting" script that allowed some hired grunts to sit around all day literally reviewing thumbnails and sorting each image/movie by quality, assigning categories, etc.
A major HD death (of an unbacked-up HD*...) killed the project, and I've only got some snippets of the code left, and no database schema. As an example, though, here is some of the metadata we collected on every file - this code is not from the final revision, so some fields are missing...
For JPGs:
For movies:
In both cases, "cats" was a string built earlier in the sorting process, like "cat1=5, cat2=17, cat3=40" ... Each catN corresponded to an entry in the "categories" table. So you could assign one pic/vid to categories "Asian," "Foot Fetish," "Brunette," whatever fit.
The movie sorter was really one of my better works ever - and it took forever to build. Unreviewed vids went into an "unsorted" dir. The person reviewing content would choose a vid to review. The script used mplayer to rip a few stills from the vid in realtime, the person sorting could choose a couple of stills to act as thumbnails. Once they assigned categories, description, quality, etc. the script would cut 3 different resolution versions (low, med, hi) of the vid, god I wish I still had all the code :/
One thing of vast importance to anyone building any sort of file tracking system - be it for porn, mp3s, whatever - is hashing. We used md5. Content providers would upload or send CD's full of zips or rars, and often times, the same set of pics would show up in multiple places. Being able to compare file hashes was essential in preventing duplicate content from going into the database. I guess for MP3s this would be a little harder, since you might have two copies of the same song ripped by different people...
Maybe someday I'll try to rebuild the DB schema and rewrite the missing portions of the code - which unfortunately is most of it. But if anyone else wants to write a porn organizer, maybe you picked up some pointe
It's a line from "The Life Aquatic with Steve Zissou."