Slashdot Mirror


Elsevier Opens Its Papers To Text-Mining

ananyo writes "Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely to follow suit this year, lowering barriers to the computer-based research technique. But some scientists object that even as publishers roll out improved technical infrastructure and allow greater access, they are exerting tight legal controls over the way text-mining is done. Under the arrangements, announced on 26 January at the American Library Association conference in Philadelphia, Pennsylvania, researchers at academic institutions can use Elsevier's online interface (API) to batch-download documents in computer-readable XML format. Elsevier has chosen to provisionally limit researchers to 10,000 articles per week. These can be freely mined, so long as the researchers, or their institutions, sign a legal agreement. The deal includes conditions: for instance, that researchers may publish the products of their text-mining work only under a license that restricts use to non-commercial purposes, can include only snippets (of up to 200 characters) of the original text, and must include links to original content."

8 of 52 comments

  1. 200 characters by Anonymous Coward · · Score: 4, Funny

    Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely t
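
For illustration, the snippet condition from the summary (at most 200 characters of original text, plus a link to the original) could be sketched like this. The function name, the dictionary layout, and the example URL are hypothetical and are not part of Elsevier's API or its agreement.

```python
def make_snippet(text: str, source_url: str, limit: int = 200) -> dict:
    """Return an excerpt of at most `limit` characters plus a link back.

    Hypothetical helper: it only illustrates the 200-character cap and the
    link requirement described in the summary.
    """
    return {"excerpt": text[:limit], "source": source_url}


if __name__ == "__main__":
    long_text = "word " * 100  # 500 characters of filler text
    snippet = make_snippet(long_text, "https://example.org/article")
    print(len(snippet["excerpt"]), snippet["source"])
```

Slicing by characters cuts words in half, which is exactly what happened to the excerpt above.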

  2. Re:If the Internet is killing Newspapers by dj245 · · Score: 4, Insightful

    If the Internet is killing newspapers, why isn't it killing this dead tree company?

    When people stop buying newspapers, they fire the reporters and news correspondents.

    When people stop buying scientific journals (and electronic access to such), it doesn't matter. There are still hundreds of professors lined up around the block to try to get published, since it is basically required for them to earn tenure. Anytime you have a barrier to career advancement, the people who own that barrier have a near monopoly and can charge whatever the market will bear. And the market of people trying to advance their career will bear a lot.

    --
    Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
  3. It would be nicer if... by DeadDecoy · · Score: 3

    ... publishers removed the paywall to publicly funded literature, or at least made the prices more sane.

    Also, while we're on the topic of text mining, would it be possible to get text-only or XML-based articles, with figures attached and cross-references as needed? It's quite annoying to manually convert a PDF when trying to set up an automated analysis over several documents. I know one could set up a shell script to dump it out using the pdftoxml converter, but the output is a bit messy to parse.
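
A minimal sketch of why XML-based articles would be easier to mine than converted PDFs, assuming a made-up article layout. The tag names (`article`, `title`, `para`, `xref`) and the `extract_text` helper are hypothetical and do not reflect any publisher's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup, used only for this illustration.
SAMPLE = """<article>
  <title>Example title</title>
  <body>
    <para>First paragraph.</para>
    <para>Second paragraph with <xref ref="fig1">Figure 1</xref>.</para>
  </body>
</article>"""


def extract_text(xml_string):
    """Return the title and a list of paragraph texts from the sample layout."""
    root = ET.fromstring(xml_string)
    title = root.findtext("title", default="")
    # itertext() flattens nested elements such as cross-references into text.
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("para")]
    return title, paragraphs


if __name__ == "__main__":
    title, paragraphs = extract_text(SAMPLE)
    print(title)
    for paragraph in paragraphs:
        print("-", paragraph)
```

With structured markup the paragraph boundaries and cross-references are explicit, so nothing has to be guessed from page layout.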

    1. Re:It would be nicer if... by RuffMasterD · · Score: 2

      Elsevier had a profit margin of 36% on revenues of US$3.2 billion in 2010. They publish about 250,000 articles a year and these are downloaded about 240 million times a year. Their content is written for them, but the authors actually have to pay (public money) for the privilege, and their peer review is free labour. Then the readers have to pay too (usually public money again), and not a cent goes to the author!

      Meanwhile, Wikipedia's operating cost was $20.1 million (mostly funded by donations), it had over 3 million articles, and it is one of the most visited sites on the Internet. The content is written for free and massively peer reviewed for free. All of its content can be read by anyone, for free.

      Elsevier and Wikipedia seem to have similar technical requirements and business models, but one costs WAY more than the other. That difference is pure profit. If anything, Wikipedia should cost more than Elsevier.

      --
      Human Rights, Article 12: Freedom from Interference with Privacy, Family, Home and Correspondence
  4. Re:Google spamming by John+Bokma · · Score: 3, Interesting

    Several sites that have paywalled PDFs somehow manage to get the contents of those PDFs crawled by Google (probably others as well). Google has rules against this, but somehow those sites get away with it. E.g. if one googles for "some keywords filetype:pdf" (without the quotes), the results Google shows might give the impression that the full PDF is available, but when clicking one lands on an HTML page which shows the abstract and a "buy this document" link. Access is in the 30+ USD range, so about 2 USD/page or more... One of those sites is Elsevier. Or at least was; I can't find an example.

    When this happens to me, I contact one of the authors and end up with the paper anyway, for free, most of the time.

    Another parasite is scribd.

  5. Re:If the Internet is killing Newspapers by John+Bokma · · Score: 3, Informative

    Because news or "news" [1] can be gotten for free on the Internet, while peer-reviewed scientific papers are a bit harder to get. My experience is that quite a few sites bait Google search results (see my earlier post; you google for PDFs but end up on a landing page which allows you to buy one-time access for 30+ USD for a handful of pages). My successful workaround (so far) has been contacting one of the authors for a copy (for personal study).

    [1] a lot of people don't seem to care if it's made up or not

  6. Re:If the Internet is killing Newspapers by Jane+Q.+Public · · Score: 3, Funny

    "If the Internet is killing newspapers, why isn't it killing this dead tree company?"

    It isn't a dead tree company, per se. Elsevier publishes as much online as offline. And more than most.

    Having said that: they can still die in a fire.

  7. Re:Google spamming by John+Bokma · · Score: 2

    The technique is called cloaking. You basically check if a page request is coming from Googlebot or not to decide what to return (or redirect). See: https://support.google.com/web...

    The services you mentioned have different rules, of course.
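
The check described in this comment can be sketched as follows. This is only an illustration of how cloaking works, under the assumption that the site looks at nothing but the User-Agent header. The function names and return values are hypothetical.

```python
def is_googlebot(user_agent: str) -> bool:
    """Naive check based only on the User-Agent header.

    Any client can send this header, so this is not a reliable way to
    identify the crawler. It is simply the check the comment describes.
    """
    return "googlebot" in user_agent.lower()


def choose_response(user_agent: str) -> str:
    """Decide which page variant a cloaking site would serve."""
    if is_googlebot(user_agent):
        return "full_text"  # the crawler gets the whole paper to index
    return "abstract_and_buy_link"  # everyone else gets the paywall page


if __name__ == "__main__":
    crawler = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    browser = "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"
    print(choose_response(crawler))
    print(choose_response(browser))
```

Serving the crawler one page and visitors another is what makes the search result look like a full PDF when the click leads to a purchase page.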