Elsevier Opens Its Papers To Text-Mining

Posted by samzenpus on Monday February 3, 2014 @07:00AM from the take-a-look dept.

ananyo writes "Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely to follow suit this year, lowering barriers to the computer-based research technique. But some scientists object that even as publishers roll out improved technical infrastructure and allow greater access, they are exerting tight legal controls over the way text-mining is done. Under the arrangements, announced on 26 January at the American Library Association conference in Las Vegas, Nevada, researchers at academic institutions can use Elsevier's online interface (API) to batch-download documents in computer-readable XML format. Elsevier has chosen to provisionally limit researchers to 10,000 articles per week. These can be freely mined — so long as the researchers, or their institutions, sign a legal agreement. The deal includes conditions: for instance, that researchers may publish the products of their text-mining work only under a license that restricts use to non-commercial purposes, can include only snippets (of up to 200 characters) of the original text, and must include links to original content."

52 comments

start up the algorithms by Anonymous Coward · 2014-02-03 07:05 · Score: 0

Time to disprove some punks.
1. Re:start up the algorithms by i+kan+reed · 2014-02-03 07:14 · Score: 1
  
  What exactly are punks saying that can be deconstructed with statistical sampling of published papers?
  I mean, are there some really dumb people alleging that academics don't use enough words starting with K?
2. Re:start up the algorithms by Anonymous Coward · 2014-02-03 07:22 · Score: 1
  
  Too much E.
3. Re:start up the algorithms by i+kan+reed · 2014-02-03 07:25 · Score: 1
  
  Too much E.
  Well something has to keep those academic raves fun.
4. Re: start up the algorithms by Anonymous Coward · 2014-02-03 11:15 · Score: 0
  
  Aren't these research papers published with funding coming from grants that were originally taxpayer money? Why should I, as a taxpayer, have to pay for it again? Where's my report?
Google spamming by Florian+Weimer · 2014-02-03 07:10 · Score: 1

Isn't this called search engine spamming, and several publishing outfits have been doing it for about a decade, with varying degrees of success?
1. Re:Google spamming by Anonymous Coward · 2014-02-03 07:19 · Score: 0
  
  Google and Google, what is Google?! How does batch-download of XML documents have anything at all to do with Google?
2. Re:Google spamming by John+Bokma · 2014-02-03 07:38 · Score: 3, Interesting
  
  Several sites that have paywalled PDFs somehow manage to get the contents of those PDFs crawled by Google (probably others as well). Google has rules against this, but somehow those sites get away with it. E.g. if one googles for "some keywords filetype:pdf" (without the quotes), the results Google shows might give the impression that the full PDF is available, but when clicking, one lands on an HTML page which shows the abstract and a "buy this document" link. Access is in the 30+ USD range, so about 2 USD/page or more... One of those sites is Elsevier. Or at least was; I can't find an example.
  When this happens to me, I contact one of the authors and end up with the paper anyway, for free, most of the time.
  Another parasite is scribd.
  
  --
  
  Perl Programmer for hire
3. Re:Google spamming by pepty · 2014-02-03 08:48 · Score: 1
  
  Several sites that have paywalled PDFs somehow manage to get the contents of those PDFs crawled by Google (probably others as well). Google has rules against this.
  Really? I would have thought they would be fine with it; Google Scholar would have been hamstrung from the get go if they didn't present results from paywalled databases, and Google Books is a similar situation for books under copyright.
4. Re:Google spamming by John+Bokma · 2014-02-03 09:04 · Score: 2
  
  The technique is called cloaking. You basically check if a page request is coming from Googlebot or not to decide what to return (or redirect). See: https://support.google.com/web...
  The services you mentioned have different rules, of course.
  
  --
  
  Perl Programmer for hire
5. Re:Google spamming by jafiwam · 2014-02-03 09:51 · Score: 1
  
  The technique is called cloaking. You basically check if a page request is coming from Googlebot or not to decide what to return (or redirect). See: https://support.google.com/web...
  The services you mentioned have different rules, of course.
  Some of those tools use the browser identifier to decide to let them in or not.
  Something that in some browsers, can be modified by the end user....
6. Re:Google spamming by John+Bokma · 2014-02-03 10:07 · Score: 1
  
  Yup, see: http://johnbokma.com/mexit/200... However, this doesn't work if they check for IP ranges.
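
  The cloaking check discussed in this subthread, and why changing the browser identifier stops helping once the site checks IP ranges, can be sketched roughly as follows. This is a minimal illustration, not any publisher's actual code: the function names are made up, and the address range is a reserved documentation block standing in for a real crawler range.

```python
import ipaddress

# Hypothetical crawler address range for illustration only (a reserved
# documentation block, not a real Googlebot range).
CRAWLER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def looks_like_crawler_by_agent(user_agent: str) -> bool:
    """User-agent check: trivially spoofed by changing the browser identifier."""
    return "googlebot" in user_agent.lower()


def looks_like_crawler_by_ip(remote_addr: str) -> bool:
    """IP-range check: a spoofed user agent does not change the source address."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in CRAWLER_NETWORKS)


def page_for(user_agent: str, remote_addr: str, check_ip: bool) -> str:
    """Return which page a cloaking site would serve for this request."""
    if check_ip:
        is_crawler = looks_like_crawler_by_ip(remote_addr)
    else:
        is_crawler = looks_like_crawler_by_agent(user_agent)
    return "full text" if is_crawler else "abstract + buy link"
```

  With `check_ip=False`, a request that merely claims to be Googlebot gets the full text; with `check_ip=True`, the same request from an address outside the listed range gets the landing page.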
  
  --
  
  Perl Programmer for hire
7. Re:Google spamming by wiredlogic · 2014-02-03 10:16 · Score: 1
  
  Google will masquerade Googlebot as an ordinary browser to spot check cloaking but it isn't thorough enough to catch everything. With AJAX rendered content it is even harder for them to tell what is and isn't shown to normal users.
  
  --
  I am becoming gerund, destroyer of verbs.
8. Re:Google spamming by c0lo · 2014-02-03 10:26 · Score: 1
  
  Isn't this called search engine spamming, and several publishing outfits have been doing it for about a decade, with varying degrees of success?
  While it may be SEO spamming, I'm inclined to see this as an attempt to outsource the cost of indexing:
  Along the lines of: "You fools, I have a trove of papers you are drooling for. What about... I'll let you index it by whatever your brilliant minds discover works best for you, then I'll use it to increase the value of my trove"
  
  --
  Questions raise, answers kill. Raise questions to stay alive.
9. Re:Google spamming by Anonymous Coward · 2014-02-03 11:15 · Score: 0
  
  When this happens to me, I contact one of the authors and end up with the paper anyway, for free, most of the time.
  When I come across a research paper or article where the research has been funded by a publicly-funded college or university or a government grant, I simply search for the paper's title and the institution and retrieve it 99.9% of the time for free. I need not contact the author(s), though I commend your forthrightness.
10. Re:Google spamming by arglebargle_xiv · 2014-02-03 14:11 · Score: 1
  
  The technique is called cloaking.
  When Elsevier are doing it, it's called cloacaing.
11. Re:Google spamming by JaredOfEuropa · 2014-02-03 20:58 · Score: 1
  
  I'd be fine with this if the search results would clearly mark entries sitting behind a paywall or requiring registration to access. I'm sure we've all been frustrated multiple times by the likes of Experts-exchange (who show answers to tech questions in Google but won't let you at them unless you pay up).
  
  --
  If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
12. Re:Google spamming by Anonymous Coward · 2014-02-03 22:40 · Score: 0
  
  Google makes an exception to the rule you've stated above for academic content. ie even if a paper is paywalled, it'll crawl it. (I work for an academic publisher)
13. Re:Google spamming by Anonymous Coward · 2014-02-04 18:07 · Score: 0
  
  Not entirely true. Google doesn't like it if you show them different content from what you'd show a regular user. At the same time, they offer ways for site owners to have Google index paywalled or otherwise password-protected content.
In others words by dacullen · 2014-02-03 07:10 · Score: 1

1. Please generate as many sales leads as you can 2. Profit!!!
1. Re:In others words by Anonymous Coward · 2014-02-03 07:52 · Score: 0
  
  Elsevier doesn't bother marketing to individuals. They market exclusively to librarians, i.e. institutions.
2. Re:In others words by pepty · 2014-02-03 08:50 · Score: 1
  
  They're probably using it as a way to justify the prices the institutions are forced to pay.
IEEE by Anonymous Coward · 2014-02-03 07:10 · Score: 0

Wake me up when I can get all those taxpayer-funded IEEE papers online for free. *grumble*
200 characters by Anonymous Coward · 2014-02-03 07:15 · Score: 4, Funny

Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely t
Hey IBM! by Floyd-ATC · 2014-02-03 07:19 · Score: 1

Get Watson over here will you?

--
Time flies when you don't know what you're doing
1. Re:Hey IBM! by Anonymous Coward · 2014-02-03 07:43 · Score: 0
  
  Exactly! Indian Business Machines is the way to go!
If the Internet is killing Newspapers by ScottCooperDotNet · 2014-02-03 07:22 · Score: 1

If the Internet is killing newspapers, why isn't it killing this dead tree company?
1. Re:If the Internet is killing Newspapers by dj245 · 2014-02-03 07:30 · Score: 4, Insightful
  
  If the Internet is killing newspapers, why isn't it killing this dead tree company?
  When people stop buying newspapers, they fire the reporters and news correspondents.
  
  When people stop buying scientific journals (and electronic access to such), it doesn't matter. There are still hundreds of professors lined up around the block to try to get published, since it is basically required for them to earn tenure. Anytime you have a barrier to career advancement, the people who own that barrier have a near monopoly and can charge whatever the market will bear. And the market of people trying to advance their career will bear a lot.
  
  --
  Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
2. Re:If the Internet is killing Newspapers by John+Bokma · 2014-02-03 07:41 · Score: 3, Informative
  
  Because news or "news" [1] can be gotten for free on the Internet, while peer-reviewed scientific papers are a bit harder to get. My experience is that quite a few sites bait Google search results (see my earlier post; you google for PDFs but end up on a landing page which allows you to buy one-time access for 30+ USD for a handful of pages). My successful workaround (so far) has been contacting one of the authors for a copy (for personal study).
  [1] a lot of people don't seem to care if it's made up or not
  
  --
  
  Perl Programmer for hire
3. Re:If the Internet is killing Newspapers by Jane+Q.+Public · 2014-02-03 07:41 · Score: 3, Funny
  
  "If the Internet is killing newspapers, why isn't it killing this dead tree company?"
  It isn't a dead tree company, per se. Elsevier publishes as much online as offline. And more than most.
  
  Having said that: they can still die in a fire.
4. Re:If the Internet is killing Newspapers by Anonymous Coward · 2014-02-03 09:48 · Score: 0
  
  My successful workaround (so far) has been contacting one of the authors for a copy (for personal study).
  Yea, that's pretty much what anyone does, even at a research institute, if it isn't part of your library subscription. I don't know anyone that actually pays the $30 unless: (a) they need the data now (note: this has been me one time), or (b) they work for a company with deep pockets that is paying for them (note: this has not been me ever... if you know someone with deep pockets that is hiring, though....). Anyone else is just an idiot.
More access coming to other journals by 1_brown_mouse · 2014-02-03 07:25 · Score: 1

I like this bit from TFA:
Shillum says that Elsevier is ahead of the curve — but that other publishers are likely to follow soon. CrossRef, a non-profit collaboration of thousands of scholarly publishers, will in the next few months launch a service that lets researchers agree to standard text-mining terms and conditions by clicking a button on a publisher’s website, a ‘one-click’ solution similar to Elsevier’s set-up.
I would like to see that.
It would be nicer if... by DeadDecoy · 2014-02-03 07:31 · Score: 3

... publishers removed the paywall to publicly funded literature, or at least made the prices more sane.

Also, while we're on the topic of text mining, would it be possible to get text-only or XML-based articles, with figures attached and cross-references as needed? It's quite annoying to manually convert a PDF when trying to set up an automated analysis over several documents. I know one could set up a shell script to dump it out using the pdftoxml converter, but the output is a bit messy to parse.
1. Re:It would be nicer if... by Anonymous Coward · 2014-02-03 08:24 · Score: 0
  
  The output is a bit messy to parse? Scroll a few lines upwards... voila, Perl programmer for hire. In my experience, they are darn easy to handle, just throw a box of twinkies in the cellar workspace every few hours.
2. Re:It would be nicer if... by DeadDecoy · 2014-02-03 08:39 · Score: 1
  
  There are a few issues with the output of pdftoxml that make it difficult to parse (mostly Adobe's fault). For 2-column articles, the columns are interleaved. That means you'll get a little bit of text from column A followed by a little bit of text from column B. The XML tags contain the x/y coordinates, so you can develop some heuristics to cleave out segments of text for one journal. This is not particularly suitable when you want to analyze text across different journal formats, as you'll have to develop a one-off solution for each journal.
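
  A coordinate heuristic of that kind might look roughly like the sketch below. It assumes the converter's output has already been parsed into blocks carrying an x/y position and text; the `Block` type, the `deinterleave` name, and the fixed midline split are all illustrative assumptions, not part of pdftoxml.

```python
from __future__ import annotations

from typing import NamedTuple


class Block(NamedTuple):
    x: float   # left edge of the text block
    y: float   # top edge (assumed to grow down the page)
    text: str


def deinterleave(blocks: list[Block], page_width: float) -> list[str]:
    """Reorder blocks so the whole left column precedes the right column."""
    midline = page_width / 2
    left = [b for b in blocks if b.x < midline]
    right = [b for b in blocks if b.x >= midline]
    # Within each column, read top to bottom.
    ordered = sorted(left, key=lambda b: b.y) + sorted(right, key=lambda b: b.y)
    return [b.text for b in ordered]
```

  The fixed midline is exactly the sort of per-journal assumption that breaks on full-width titles, figures, or a different layout.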
  
  It would also be useful to have clearly demarcated sections for the abstract, results, references, etc. Again, you could set BIO (Begin-In-Out) tags based on the section title and formatting style, but you may run into a few false positives if those words are used elsewhere in the text, and the two-column issue mentioned earlier may dump in text from other sections. Finally, there's little distinction between the body of the manuscript and the header/footer information.
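
  As a rough sketch of the title-based tagging idea, assuming the text has already been split into lines: the title list and the `bio_tag` name are hypothetical, and matching on exact title lines is only one of several possible heuristics.

```python
from __future__ import annotations

# Hypothetical list of section titles; real journals vary.
SECTION_TITLES = {"abstract", "introduction", "methods", "results", "references"}


def bio_tag(lines: list[str]) -> list[tuple[str, str]]:
    """Tag each line as B-<section>, I-<section>, or O (outside any section)."""
    tagged = []
    current = None
    for line in lines:
        title = line.strip().lower()
        if title in SECTION_TITLES:
            # A line that is exactly a known title begins a new section.
            current = title
            tagged.append((f"B-{current}", line))
        elif current is not None:
            tagged.append((f"I-{current}", line))
        else:
            tagged.append(("O", line))
    return tagged
```

  A stray line reading just "Results" inside a table or caption would start a false section here, which is the false-positive problem in practice.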
  
  Overall, the text is a bit messy. If you're just looking for keywords, then it's not a big deal. If you are trying to extract more complicated syntactic structures within the document, then it becomes a problem.
3. Re:It would be nicer if... by Anonymous Coward · 2014-02-03 09:29 · Score: 0
  
  Someone has to pay for servers, archiving, management, in short general overhead. The audience for academic papers is not broad enough to fund this via ads, so either the author or the reader (or their respective proxies) has to pay. The open access movement broadens readership at the price of restricting publications to those who can afford it - pricing out those from poorer institutions/countries. For some areas (high energy physics, life sciences come to mind) the cost of the research involved makes $2000-3000 a round off in conducting research and open access makes far more sense. In areas where grants are small (humanities for instance) or those working without grants (albeit often in state institutions) that $2000-3000 has a chilling effect on publication and sticking with the subscription paywall might make more sense.
4. Re:It would be nicer if... by Anonymous Coward · 2014-02-03 12:31 · Score: 0
  
  ... I know one could set up a shell script to dump it out using the pdftoxml converter, but the output is a bit messy to parse.
  "A bit messy to parse" is quite an understatement. There is no known general purpose method for reconstructing PDF structure, and the ones that are close enough require extensive knowledge to configure for each class of documents. The authors of the Dolores system claim to have a system that can be taught in a few minutes per system at least for simple elements like titles, subtitles and paragraphs. They don't seem to handle images, though, and tables are most likely out of their scope.
5. Re:It would be nicer if... by RuffMasterD · 2014-02-03 15:16 · Score: 2
  
  Elsevier had a profit margin of 36% on revenues of US$3.2 billion in 2010. They publish about 250,000 articles a year and these are downloaded about 240 million times a year. Their content is written for them, but the authors actually have to pay (public money) for the privilege, and their peer review is free labour. Then the readers have to pay too (usually public money again), and not a cent goes to the author!
  
  Meanwhile Wikipedia's operating cost was $20.1 Million (mostly funded by donations), they had over 3 million articles, and they are one of the most visited sites on the Internet. The content is written for free and massively peer reviewed for free. All their content can be read by anyone, for free.
  
  Elsevier and Wikipedia seem to have similar technical requirements and business models, but one costs WAY more than the other. That difference is pure profit. If anything, Wikipedia should cost more than Elsevier.
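
  For scale, the figures quoted above work out as follows. This is only a back-of-the-envelope sketch using the numbers exactly as given in this comment (not independently verified); the variable names are illustrative, and the two sides are not strictly comparable, since one divides yearly revenue by yearly output and the other divides yearly cost by the total article count.

```python
# Figures as quoted in the comment (2010 for Elsevier); not independently verified.
elsevier_revenue = 3.2e9             # USD per year
elsevier_margin = 0.36               # profit margin
elsevier_articles_per_year = 250_000
elsevier_downloads_per_year = 240e6
wikipedia_cost = 20.1e6              # USD operating cost
wikipedia_articles = 3e6             # total articles

elsevier_profit = elsevier_revenue * elsevier_margin
revenue_per_article = elsevier_revenue / elsevier_articles_per_year
revenue_per_download = elsevier_revenue / elsevier_downloads_per_year
wikipedia_cost_per_article = wikipedia_cost / wikipedia_articles

print(f"Elsevier profit:            ${elsevier_profit / 1e9:.2f} billion")
print(f"Elsevier revenue/article:   ${revenue_per_article:,.0f}")
print(f"Elsevier revenue/download:  ${revenue_per_download:.2f}")
print(f"Wikipedia cost/article:     ${wikipedia_cost_per_article:.2f}")
```

  On these numbers, that is roughly $1.15 billion in profit and $12,800 in revenue per published article, against under $7 of operating cost per Wikipedia article.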
  
  --
  Human Rights, Article 12: Freedom from Interference with Privacy, Family, Home and Correspondence
6. Re:It would be nicer if... by Anonymous Coward · 2014-02-03 19:23 · Score: 0
  
  Why? Elsevier has scientific articles, while Wikipedia has endless flamewars on how to spell Aluminium. Half of Wikipedia is just plain wrong. So are half of scientific articles, but at least they are right as far as we currently know. Free access to scientific articles would be a damn good thing anyway.
7. Re:It would be nicer if... by martin-boundary · 2014-02-04 00:49 · Score: 1
  
  It wouldn't be nicer. It would be the least they should possibly do.
  Publishers like Elsevier are leeches sucking at the teat of scientific institutions, weakening their libraries, which are the cornerstone of humanity's research efforts. The sooner they FOAD the better.
One click? by Anonymous Coward · 2014-02-03 07:33 · Score: 0

Lawyers for Amazon are envisioning enlarging their swimming pools...
LongStrider by Anonymous Coward · 2014-02-03 08:00 · Score: 0

ALA Midwinter was in Philadelphia, PA this year. The upcoming ALA conference this summer will be in Las Vegas.
Elsevier hasn't DIAF yet? by atari2600a · 2014-02-03 08:05 · Score: 1

Soon... once the exclusive contracts and the End User License Agreements expire, the users will revolt. It was foretold in the Scientific Prophecy of Rebirth.
Perl Programmerq by Anonymous Coward · 2014-02-03 08:26 · Score: 0

Oh never mind, I just noticed that he charges money instead of twinkies. 120 euros, or 163 dollars, per hour. Lordy..
Elsevier is a tree by Mister+Liberty · 2014-02-03 09:17 · Score: 1

that should have been pruned long ago.
Greed by Anonymous Coward · 2014-02-03 10:17 · Score: 1

Haha, back in the 90's, I worked at a company that built some websites for Elsevier. The effort was overseen by a young Dutch woman who came to our offices and wanted to know why we didn't have orange juice and buns for her every morning.
We designed a background image that looked great at normal viewing distances from the screen, but when seen from far away it looked like it really said "GReed-Elsevier". The sites went public, but we were made to change the background about a week after launch.
The data-mining agreement seems to suck by shtrom · 2014-02-03 11:37 · Score: 1

According to “Why you and I should NOT sign up for Elsevier’s TDM service” [0], this is not all that good, as the Text and Data Mining policy is actually overly restrictive. Most notably, it forces you to go through their API to do the work, rather than parsing things locally at your leisure, and imposes conditions on the release of the uncovered data (namely a non-free CC-NC).
[0] http://blogs.ch.cam.ac.uk/pmr/...
VALE AARON by Anonymous Coward · 2014-02-03 15:12 · Score: 0

nuff said really
Free for their definition of free, not yours by ghmh · 2014-02-03 16:59 · Score: 1

Note:
If you have to sign or agree to something in order to access it, it's not free, even if they say otherwise.
1. Re:Free for their definition of free, not yours by Antique+Geekmeister · 2014-02-04 01:03 · Score: 1
  
  Even a "Public Domain" copyrighted work has rules embedded in copyright law, which apply whether you agree or not. Games played entirely without rules get very strange, very quickly, and inevitably wind up with rules evolved very quickly and not necessarily well.
  Having the rules spelled out, in writing, is very helpful to let both sides know what _is_ allowed. This is often far better than the very confusing and potentially dangerous lawsuits involving what is _not_ allowed. Whether these agreements are reasonable is a different question: they do seem pretty aggressive, and restrict the document use far more than even "fair use" restricts it.
Elsevier? Publish elsewhere! by Anonymous Coward · 2014-02-03 21:12 · Score: 0

Why? Just look at the never ending list at https://en.wikipedia.org/wiki/...