We did a study on predicting when a tweet would be retweeted (this paper cites us). The dominant factor is not what you write, but how many followers you have.
Basically, a famous person can write anything and it will be retweeted. An unknown person can write the same tweet and it will be ignored.
Link to paper:
Sasa Petrovic, Miles Osborne and Victor Lavrenko. RT to win! Predicting Message Propagation in Twitter. ICWSM, Barcelona, Spain. July 2011.
http://homepages.inf.ed.ac.uk/...
People here are forgetting the costs associated with flying senior (ie expensive) people around. There is an argument that if you are billing a client three-figure sums a day, you had better ensure that the person flying arrives in good shape so they can work straight from the flight. Sending people coach can be a false economy.
A major problem with open-access journals is that there is no motivation for them to reject submissions. If anything, the more they publish the more money they make. Likewise, peer reviewers (at least in my field, natural language processing and machine learning) are never paid to review them. This is not a good combination.
I cannot see any reason for journals nowadays. Either publish in conferences (which in some fields are competitive and very tightly reviewed) or, better still, publish papers on arXiv and have some kind of citation / comment system as a way to crowd-source quality control.
If you want to go to the other extreme, look at SIGIR. They have extremely demanding standards for experimentation, along with an associated conservative nature. It is very hard to get something non-incremental (eg using some new dataset) published there.
But I agree, experiments at ACL tend to be quite sloppy.
Being plausible and being reproducible are not sufficient and necessary conditions. Science is a community, with an expectation of what a believable result should look like. This comes from actually understanding the field, including what is written and what is not written down. It is very rare for there to be some genuinely implausible result and Good Science typically seems obvious in hindsight.
Results from query logs are great, but until the raw data is made public, no-one can verify or reproduce these results. Until that is done they remain a curiosity at best.
Reproducing experiments can be hard if no-one else has the data (this can happen, for example, if you are Google and publish results using large fractions of the Web as data) or even if something as trivial as moving it from one site to another requires a lot of effort. This is not really a question of storage costs; it is a question of having the data in the first place and the mechanics of moving it around.
Models are used in Science as idealisations; but if you really really want to model the long tail of effects, then your model becomes the data. And this relates to summary statistics: all they do is capture aspects of the data (it is after all a summary). If you want the whole truth, then you can't summarise.
Fernando Pereira and Peter Norvig have a nice paper on this:
http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html [The Unreasonable Effectiveness of Data]
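The point about summary statistics can be shown with a toy example. This is only a sketch: the two samples below are hypothetical numbers picked by hand, not data from any of the work mentioned above. They share the same mean and (population) variance, yet they are clearly different datasets, so anything that depends on the tail is invisible in the summary.

```python
import statistics

# Two hypothetical samples, chosen by hand for this illustration.
a = [2, 4, 4, 4, 5, 5, 7, 9]
b = [1, 5, 5, 5, 5, 5, 5, 9]

for name, sample in (("a", a), ("b", b)):
    print(name,
          "mean:", statistics.mean(sample),
          "variance:", statistics.pvariance(sample),
          "values:", sorted(sample))

# The summaries agree even though the underlying data do not.
same_summary = (statistics.mean(a) == statistics.mean(b)
                and statistics.pvariance(a) == statistics.pvariance(b))
same_data = sorted(a) == sorted(b)
print("same summary:", same_summary, "| same data:", same_data)
```

Adding more summary statistics narrows the gap, but only keeping every value closes it, which is the sense in which the model becomes the data.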
Related to using Big Data in Business is Big Data in Science. Wired ran a nice series of articles looking at this (http://www.wired.com/wired/issue/16-07). This raises all sorts of problems (for example, how can results be reproduced? What if the model of the data is as complex as the data? Are all results obtained with Small Data simply artefacts of sparse counts?).
Well, you really want to think about bias/variance reduction, which brings the ideas of averaging and using better classifiers together. For example, "bagging" can be thought of as a variance-reduction technique; "boosting" does both if I recall.
I'm actually surprised that this hasn't been done before. You can prove that using multiple models will on average produce better results than using any single model in isolation. For example, each Netflix system will make different errors; using multiple systems will tend to average out these errors and the consensus decision is most likely to be correct.
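A minimal simulation of the averaging argument, under assumptions that are mine rather than anything from the Netflix systems themselves: each hypothetical model predicts the true rating plus independent zero-mean Gaussian noise, and quality is measured by squared error. With squared error, the averaged prediction can never do worse than the average individual model (this follows from Jensen's inequality). It is not guaranteed to beat the single best model, and the gain shrinks when the models' errors are correlated.

```python
import random


def ensemble_vs_single(n_models=5, n_items=2000, noise=1.0, seed=0):
    """Return (average single-model MSE, MSE of the averaged prediction)."""
    rng = random.Random(seed)
    single_sq_err = 0.0    # per-item squared error, averaged over models
    ensemble_sq_err = 0.0  # per-item squared error of the mean prediction
    for _ in range(n_items):
        truth = rng.uniform(1.0, 5.0)  # a made-up "true rating"
        # Assumption: model errors are independent, zero-mean Gaussian noise.
        preds = [truth + rng.gauss(0.0, noise) for _ in range(n_models)]
        single_sq_err += sum((p - truth) ** 2 for p in preds) / n_models
        avg_pred = sum(preds) / n_models
        ensemble_sq_err += (avg_pred - truth) ** 2
    return single_sq_err / n_items, ensemble_sq_err / n_items


single_mse, ensemble_mse = ensemble_vs_single()
print(f"average single-model MSE: {single_mse:.3f}")
print(f"ensemble (mean) MSE:      {ensemble_mse:.3f}")
```

With independent errors the ensemble error should come out at roughly the single-model error divided by the number of models, which is the variance reduction that bagging relies on.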
There is another possibility: automatic search (attempting to find relevant pages for everyone) has reached a plateau in terms of performance and if you want to do better, you will need to employ raters. Clearly Google would like to find "the next best thing" in Search, but that sounds quite uncertain. Employing lots of people is a much surer way to improve results.
For example, in Germany https://en.wikipedia.org/wiki/... and Thailand https://en.wikipedia.org/wiki/...
And I will claim this as a fake first post. #slashdot
There are simple ways to stop this. Whether Twitter does this is another matter.
This is a proxy for what marketers call "reach". The more followers you have, the more people will read your posts. Except here the followers are not real and so people buying this SEO snake oil are being ripped off.
This is the same as any other optimisation task (eg link farms for PageRank). People will try it and (eventually) Twitter will work out how to clamp down on it.
Rinse and repeat.
Why is this news?
There is an academic statistical machine translation system: http://demo.statmt.org/index.php This is open source. Help improve it!
You can use this to produce spam.
that is the question
There is a vogue for such terms: an improper prior is one that does not sum to one; loss is when probability mass cannot be reached.
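As a concrete instance of the first term (the example is my own choice, not something from the thread): the standard textbook case is a flat prior over the whole real line. It cannot be normalised, yet combined with a Gaussian likelihood of known variance it still yields a proper posterior.

```latex
% Flat prior on the real line: the integral diverges, so the prior is improper.
p(\theta) \propto 1, \qquad \int_{-\infty}^{\infty} 1 \, d\theta = \infty

% One observation x ~ N(theta, sigma^2) with sigma known gives a proper posterior:
p(\theta \mid x) \propto \exp\!\left(-\frac{(x-\theta)^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
\theta \mid x \sim \mathcal{N}(x, \sigma^2)
```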
do I get my bonus now?
This article http://www.insidefacebook.com/2010/10/15/as-source-for-current-facebook-employees-google-has-big-lead-on-yahoo-microsoft-oracle/ suggests there are 277 ex-Googlers at Facebook. (There are smaller numbers from other big tech employers.)
--stock options: Facebook is/was pre-IPO. If you want to get rich as an engineer you would work there. You will never get that rich at Google.
--freedom: Google is a large company and it is hard to get stuff done. Facebook is small.
--Google is perceived as no longer being the place where the best people work.
oh you mean twitter annotations http://techcrunch.com/2010/06/02/twitter-annotations-testing/
+1 insightful