1. [VeriSign Global Registry Services] will not show any preference or provide any special consideration to any ICANN-accredited registrar with regard to Registry Services provided for the .com TLD.
Verisign is showing preference and providing special consideration for an ICANN-accredited registrar with regard to Registry Services provided for the .com TLD by allowing themselves to use the SiteFinder service. They don't allow other registrars to do so, therefore it is special consideration.
The amendment to the Act legalized copying of sound recordings of musical works onto audio recording media for the private use of the person who makes the copy (referred to as "private copying"). [1]
If you take a look at their publications list, you'll find that the lab has been around for about a year. Tech report OR-2003-002 would have been submitted for publication to SIGIR in January 2003.
Rosie Jones, Dan Fain. Query Word Deletion Prediction, Overture Research Technical Report OR-2003-002
Some of their papers look quite interesting. Check them out at http://labs.yahoo.com/publications.xml.
If a black hatter can read your shadow file, you have bigger problems than protecting your 64-bit hashed password from them.
Aren't there more constructive ways of spending energy than complaining about a guy who is lucky enough to be able to work from his vacation home?
Yeeesh.
P.S. What does Clear Channel have to do with this, anyway?
What's the use of protecting it if I don't want it? I haven't bought anything involving DRM or copy protection (DVDs, copy protected CDs, e-books, software) for over a year and don't intend to start again in the future. Yay for freedom of choice!
While Free Software developers do "give away" copies of their software, there are still development costs associated with it. At the very least, one could show that by using the unlicensed code the offender effectively stole X hours of development time which would have cost Y dollars per hour for them to develop on their own at the industry average rate. You might also be able to show damages for the time that their developers spent on modifying the software since those lost contributions to the codebase would have to be written by someone else using the before-mentioned cost formula.
Lastly, don't forget punitive damages.
The last sentence of your post brought a question to my mind: Why don't copyleft owners ever sue for money? It's not like using unlicensed Linux kernel code is any different than using an unlicensed copy of Windows CE. In both cases you are violating the terms of the license and in both cases you are potentially causing damages against the copyright holder.
Seems as though a few were modded up to my threshold while I was typing my previous post up. Ignore my first paragraph!
Since nobody who seems to have actually read any computer science papers has posted, here are two that immediately come to my mind.
Vannevar Bush. As We May Think. Atlantic Monthly, July, 1945.
This paper put forth the very first ideas about how people can mechanically search for information. While we don't have desks with levers on them, we do have Google. :)
Tim Berners Lee. Information Management: A Proposal. 1989.
This paper is where Tim Berners Lee proposes what we now know as the world wide web. It's an interesting read if you'd like to see what the original intent of the web was so that you can compare it to what we have today.
A place to look for good old computer science papers is in older issues of Communications of the ACM. There are lots of articles in plain English that you may find of interest. If you are a university student, your school may have a subscription to the ACM Digital Library. If they do, you can read all the issues back to 1958.
Also, you can find a lot of interesting CS publications at Citeseer. They have a page with the top 200 most accessed papers of all time. When I skimmed through it, I saw quite a few titles that may be of interest.
And just what happens when they discover that none of those people are Billy the Kid?
Bye bye tourist dollars.
I shudder to think what happens when he tries to rename his files and directories.
:(
Definitely the biggest problem with CVS.
The is attributed to Revolution Books, a chain of non-profit communist bookstores around the United States.
Here is an article posted at the Columbia University school of journalism about the store. http://www.jrn.columbia.edu/studentwork/cns/2003-03-30/121.asp
Here is an interview with the manager of the store, Joan Hirsch. http://www.furious.com/rev.html
I'm glad to see that there are still some people left who have the backbone to stand up for what they believe in.
Interesting. From the "Registry Code of Conduct":
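Verisign is showing preference and providing special consideration for an ICANN-accredited registrar with regard to Registry Services provided for the .com TLD by allowing themselves to use the SiteFinder service. They don't allow other registrars to do so, therefore it is special consideration.
You'd think that it would be in the public domain by now, wouldn't you?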
Microsoft. Search experts.
What's done in the lab and what can actually be sold are very different things. The senior information retrieval researchers at MSR are *smart* people.
I had the opportunity to hear Susan Dumais' talk on "Stuff I've Seen" at SIGIR this year. SIS is a really interesting piece of software, a personal search engine. Every e-mail you send or receive, every file you create is fed into a search engine residing on your PC. You can then search for things by date, keyword, etc. and easily locate exactly what you're looking for.
Yeah, great search interface! Really inspires my confidence!
If anyone can topple Google, they can.
Forgot my citations. Sorry.
[1] Jay Currie. Blame Canada. http://techcentralstation.com/081803C.html
[2] Copyright Board of Canada. Fact Sheet: Private Copying 1999-2000 Decision. http://www.cb-cda.gc.ca/news/c19992000fs-e.html
To quote Jay Currie (emphasis mine):
Audio recording media is defined as "Analog Audio Casette Tapes," "MiniDisc, CD-R Audio and CD-RW Audio" and "CD-R and CD-RW." [2] This does not include hard drives (I recall discussion of extending the levy to hard drives), so therefore your hard drive is not "audio recording media" and thus the Act does not legalize file sharing.
This being said, it would be harder to argue if you immediately burned the downloaded songs to an audio CD, promptly deleting the copy on your hard drive.
Bills are bundled together and 100-dollar bills are bundled together.
Interesting...
What gives Verisign the right to unilaterally make this decision about how the internet will work? As it's been mentioned, it breaks a lot of stuff and from what I've heard (admittedly, I haven't paid a lot of attention), nobody except them seems to want it.
A network with no single point of failure? Pah!
If NYT wanted a security audit of their system, they would have paid someone to do it. Since they did not, they obviously didn't want one. Good intentions or not, Lamo broke the law and deserves to face the consequences of his actions.
I realize that it's "chic to be geek" here with the whole "white hat" hacking stuff, but be realistic. After all, you don't see people doing the physical analogue of white hat hacking. That's B&E.
Sam's article was a very interesting read, but his results need to be taken with a grain of salt.
To show that one piece of software outperforms another, you need to prove statistical significance. This can be done in two ways:
The first method is called the paired t-test. What you need to do is to run k tests using different training and test data. For each of these tests, you find the accuracy of each classifier (#success/#trials). Then, you form the "t-statistic," t = d/sqrt(sigma_d^2 / k), where d is the mean of the per-test accuracy differences between the two classifiers, sigma_d^2 is the variance of the difference samples and k is the number of samples. Then, you compare your t-statistic to the Student's distribution with k-1 degrees of freedom. Typically, you want a confidence level of 90% or 95%, so you look up the critical value for that confidence level and number of degrees of freedom (e.g. the one-tailed 90% critical value with 9 degrees of freedom is 1.38). If your t-statistic is greater than the critical value, then the difference between the two classifiers is statistically significant at that confidence level. Read more about this in Witten and Frank's Data Mining book.
The other method is called Analysis of Variance (ANOVA). I'm not familiar enough with this method to explain it here, but it allows you to choose from a set of experiments which ones really are above the average. Dig around in your statistics books or on the web for more information.
Sam should have made use of either of these techniques when doing his analysis. Since he only ran one experiment per configuration of his classifier, you can draw no real conclusions from the data presented (it's a Student's distribution with 0 degrees of freedom... essentially flat!).
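For anyone who wants to try the t-test described above, here is a minimal sketch in Python using only the standard library. The function name and the accuracy numbers are made up for illustration (they are not Sam's results), and the 1.383 cutoff is the one-tailed 90% critical value for 9 degrees of freedom, so it only applies when you have exactly 10 paired runs.

```python
import math
from statistics import mean, variance


def paired_t_statistic(acc_a, acc_b):
    """Return t = d / sqrt(sigma_d^2 / k) for k paired accuracy measurements."""
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(diffs)
    d = mean(diffs)            # mean of the per-run differences
    var_d = variance(diffs)    # sample variance (divides by k - 1)
    if var_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return d / math.sqrt(var_d / k)


# Hypothetical accuracies from 10 runs of two classifiers on the same test sets.
acc_a = [0.95, 0.96, 0.94, 0.97, 0.95, 0.96, 0.95, 0.94, 0.96, 0.97]
acc_b = [0.93, 0.95, 0.94, 0.95, 0.94, 0.93, 0.95, 0.93, 0.94, 0.96]

T_CRIT_90_DF9 = 1.383  # one-tailed 90% critical value, 9 degrees of freedom

t = paired_t_statistic(acc_a, acc_b)
significant = t > T_CRIT_90_DF9
print(f"t = {t:.2f}, significant at 90%: {significant}")
```

For a different number of runs or confidence level, look up the matching critical value in a t-table instead of reusing the constant.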
Since most of us only have a small number of corpora kicking around (maybe even only one!), you can use a method called "cross validation" to give yourself a larger number of data sets than you actually have. When doing a cross validation, you divide your corpus up into k "folds" and then perform k experiments. In each experiment, you set aside one fold of your data for testing and train on the other k-1 folds. Since you're using different test data each time, each experiment can be considered to be different and then you can use a pairwise t-test to prove statistical significance. There are other methods that you can use such as "leave one out" where you have as many folds as you do pieces of training data and "bootstrapping" where you sample your training data with replacement and test with whatever wasn't sampled for training.
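Here is a rough sketch of the k-fold split described above, again in plain Python. The function name, the shuffling with a fixed seed and the round-robin way of dealing items into folds are my own choices for the example, not a standard recipe; any scheme that gives k disjoint folds would do.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, test) pairs: each fold is the test set exactly once."""
    items = list(items)
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds, round-robin.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical corpus of 10 message ids, split into 5 folds.
for train, test in k_fold_splits(range(10), k=5):
    print(len(train), "training items, test fold:", sorted(test))
```

Setting k to the number of items gives "leave one out". Running both classifiers on every (train, test) pair gives you the k paired accuracies that the t-statistic needs.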
However, cross validation may not be appropriate for incremental learning algorithms if your data is on a timeline (such as e-mail). You can break your corpus up into pieces and do your evaluation on that.
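One way to read "break your corpus up into pieces" for time-ordered data is sketched below: cut the corpus into consecutive chunks and always train on everything older than the chunk you test on. This is my interpretation rather than an established procedure from the article, and the function name is invented for the example.

```python
def chronological_splits(messages, pieces):
    """Yield (train, test) pairs from messages sorted oldest first.

    The training data is always strictly older than the test piece.
    """
    messages = list(messages)
    if not 2 <= pieces <= len(messages):
        raise ValueError("pieces must be between 2 and the number of messages")
    size = len(messages) // pieces
    # Piece boundaries; the last piece absorbs any remainder.
    bounds = [i * size for i in range(pieces)] + [len(messages)]
    for i in range(1, pieces):
        yield messages[:bounds[i]], messages[bounds[i]:bounds[i + 1]]


# Hypothetical mailbox of 10 messages, already in arrival order.
for train, test in chronological_splits(range(10), pieces=5):
    print("train on", train, "-> test on", test)
```

Note that this gives you pieces - 1 experiments rather than pieces, because the oldest chunk has nothing before it to train on.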
Proving statistical significance is very easy and allows you to be confident in the conclusions that you make in your publications. It's the scientific method!
Good luck!
Henry
The trick that helped me get my undergraduate thesis done was simply to unplug myself. I'd turn off my cell phone, pick up my laptop, leave the NIC at home and go out and find a quiet place to work with no distractions.
If you simply can't unplug yourself, turn off your IRC client, instant messaging services and e-mail client. With a little self discipline (no reading /.!), you'll be surprised what you can get accomplished in a day.
I hope you get some useful tips out of people today. Good luck with your studies!
Henry
(Now a post-bachelor PhD. student, thanks to this technique.)
So when good spammers go bad, we have zombie spammers?
Where's my lawnmower? It's choppin' time.
This might be a bit off topic, but I want to use SVG for data visualization and have been having trouble finding suitable software.
The SVG implementations I've found so far either have no external user interface with nice things like scrollbars (Adobe/Corel) or can't handle my very large graphics (everything else I've seen).
I've been very disappointed about this lack of good viewers. SVG is well-suited for data visualization and could become a "killer app" with the right software support.
Whatever you do, please don't post pictures of that on Slashdot. :)
I'm not going to dignify this article or any future articles by reading or commenting on it. I suggest that you all stop feeding the SCO troll and do the same.
You'd think that slashdotters would know a troll when they see one. :) Does he need to redirect www.sco.com to www.goatse.cx to show his true colours?