Slashdot Mirror


Machine-Learning Algorithm Ranks the World's Most Notable Authors

HughPickens.com writes: Every year the works of thousands of authors enter the public domain, but only a small percentage of these end up being widely available. So how do organizations such as Project Gutenberg choose which works to focus on? Allen Riddell has developed an algorithm that automatically generates an independent ranking of notable authors for any given year. It is then a simple task to pick the works to focus on or to spot notable omissions from the past. Riddell's approach is to look at what kind of public domain content the world has focused on in the past and then use this as a guide to find content that people are likely to focus on in the future.

Riddell's algorithm begins with the Wikipedia entries of all authors in the English language edition (PDF)—more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on. This produces a "public domain ranking" of all the authors that appear on Wikipedia. For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell's new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf's work than Amisani's. Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world, the new algorithm picks out TS Eliot as the most highly ranked individual. Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.
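
As a rough illustration of how page statistics like these could be turned into a ranking, here is a minimal sketch. It is not Riddell's actual model: the `AuthorPage` fields, the hand-picked `WEIGHTS`, the log scaling, and the placeholder authors with made-up numbers are all assumptions for demonstration only. A real system would learn its weights from data rather than hard-code them.

```python
import math
from dataclasses import dataclass


@dataclass
class AuthorPage:
    """Per-article statistics of the kind named in the summary (values are invented)."""
    name: str
    article_length: int        # size of the article, in characters
    article_age_days: int      # days since the article was created
    views_per_day: float       # estimated daily page views
    days_since_revision: int   # days since the last edit


# Hypothetical weights chosen by hand; NOT taken from the paper.
WEIGHTS = {"length": 1.0, "age": 0.5, "views": 2.0, "staleness": -0.5}


def score(page: AuthorPage) -> float:
    """Combine log-scaled features into a single number (higher = more notable)."""
    return (
        WEIGHTS["length"] * math.log1p(page.article_length)
        + WEIGHTS["age"] * math.log1p(page.article_age_days)
        + WEIGHTS["views"] * math.log1p(page.views_per_day)
        + WEIGHTS["staleness"] * math.log1p(page.days_since_revision)
    )


def rank(pages: list[AuthorPage]) -> dict[str, int]:
    """Return a mapping from author name to rank, where rank 1 is the highest score."""
    ordered = sorted(pages, key=score, reverse=True)
    return {page.name: position for position, page in enumerate(ordered, start=1)}


# Placeholder authors with made-up statistics.
pages = [
    AuthorPage("Author A", 80_000, 4_000, 3_000.0, 2),
    AuthorPage("Author B", 2_000, 1_500, 3.0, 400),
    AuthorPage("Author C", 30_000, 3_500, 500.0, 10),
]
ranking = rank(pages)
```

Log scaling keeps a single very large feature, such as views per day, from swamping the others; that is a design choice of this sketch, not something stated in the summary.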

36 of 55 comments

  1. Let's see the last one of these by Crashmarik · · Score: 1, Insightful

    https://medium.com/the-physics...

    It gave us Linnaeus as the most influential person in world history.

    Just to be Anglo-centric, I don't even see William Shakespeare as eligible on the new list.

    Maybe this should be recategorized as "funny things you can do with computers"?

  2. Of the individuals who died in 1965 by Anonymous Coward · · Score: 3, Informative

    Just to be Anglo-centric, I don't even see William Shakespeare as eligible on the new list.

    Maybe this should be recategorized as "funny things you can do with computers"?

    It's only authors who died in 1965. From the SUMMARY:

    Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world,

    1. Re:Of the individuals who died in 1965 by Crashmarik · · Score: 4, Informative

      It's only authors who died in 1965. From the SUMMARY:

      RTFA MAN

      http://publicdomainrank.org/

      It starts at authors who died in 1900. If you're going to completely misunderstand the meaning of the point and nitpick on petty details, at least get them right.

  3. Do not use algorithms ! by Anonymous Coward · · Score: 2, Insightful

    What a load of crap.

    This is why you get rubbish like the BBC destroying lots of "classic" early TV series (throwing the film into skips). But they made sure there was space for old episodes of Panorama most of which involved cretins of the day talking shite which is irrelevant in a few years.

    The whole point of archiving is that you literally have *no clue whatsoever* what is going to be valuable in the future.

    If you did you would be a stock market billionaire multiple times over.

    1. Re:Do not use algorithms ! by CreatureComfort · · Score: 2

      The trouble is budgets and manpower.

      If you know you don't have the resources to save everything, you have to have some way of prioritizing.

      Personally, I would rather save one or two pieces from as many different authors as possible, rather than trying to get everything of the "most important" authors.

      --
      "Unheard of means only it's undreamed of yet,
      Impossible means not yet done." ~~ Julia Ecklar
    2. Re:Do not use algorithms ! by Bite+The+Pillow · · Score: 1

      Because the BBC was basing their decision on a machine learning algorithm?

      No, wait, you seem to be an illiterate moron who was moderated positively because people agree with your basic premise of "archive anything" without realizing that you have nothing whatsoever to do with the topic at hand.

      And when I say illiterate I mean your prostitute slash sister typed these words for you. And the two people who moderated you positively are on some unknown strain of weed that makes them agree with someone who says 3 words in a row that they agree with.

      And you, Anonymous Cretin, should stop posting when you are high.

  4. Ridiculous and sad by Katatsumuri · · Score: 4, Insightful

    Of the individuals who died in 1965 and whose work will enter the public domain next January

    This says so much about our culture...

    Are there jurisdictions where one could legally and openly operate a Project Gutenberg clone with more recent works?

    1. Re:Ridiculous and sad by CreatureComfort · · Score: 1

      Because saving history is such a load of crap?

      I guess if you never knew it existed, then you can't miss it, right?

      --
      "Unheard of means only it's undreamed of yet,
      Impossible means not yet done." ~~ Julia Ecklar
  5. Bad ranking by aBaldrich · · Score: 2

    I really like G.K. Chesterton, but how can he be ranked higher than Arthur Conan Doyle and Sigmund Freud?

    --
    In soviet russia the government regulates the companies.
  6. Life + 50 years almost everywhere by Katatsumuri · · Score: 5, Interesting

    I quickly checked Wikipedia, and most countries seem to stick with at least "Life + 50yr" term. That is a great achievement of the lobbyists.

    Some island nations seem to have no known copyright legislation, but they are still usually parties to some limiting international treaties, and also have similar restrictions under other names ("unauthorized copying", etc.)

    Seriously, is there no place on Earth with more reasonable terms?

    1. Re:Life + 50 years almost everywhere by seven+of+five · · Score: 1

      Like how much should I have paid for a movie when it was just released or how much should I be paying for it now, 20 years later.

      Check out the $1 videos at any garage sale.

    2. Re:Life + 50 years almost everywhere by tlhIngan · · Score: 2

      I quickly checked Wikipedia, and most countries seem to stick with at least "Life + 50yr" term. That is a great achievement of the lobbyists.

      Some island nations seem to have no known copyright legislation, but they are still usually parties to some limiting international treaties, and also have similar restrictions under other names ("unauthorized copying", etc.)

      Seriously, is there no place on Earth with more reasonable terms?

      You have to realize that most countries are bound by the Berne Convention w.r.t. copyrighted works. This is simply where all signatories have agreed to respect each other's copyright claims. Before that, well, an author could very well find their work pirated, and indeed one of the biggest industries in the New World Colonies was... piracy. Ben Franklin and others who owned printers realized that copyright didn't apply to them, so they promptly began making copies of everything - books, sheet music, etc.

  7. Losing Literature by Mikkeles · · Score: 2

    It may make more sense to concentrate on those lower in the list. The works of highly rated authors are likely to remain available anyway whereas those of lower rated authors are more likely to be lost.
    Admittedly, the loss may be deserved, but I am willing to bet there are some (if not many) that will be more highly appreciated in a century or so.

    --
    Great minds think alike; fools seldom differ.
    1. Re: Losing Literature by misterthirsty · · Score: 1

      Well put

    2. Re:Losing Literature by Gibgezr · · Score: 1

      I agree. The most popular ones may not all need the love and attention of the archivists anywhere near as much as some of the lesser-knowns.

  8. Translation workaround by Katatsumuri · · Score: 1

    What if I translate someone's book, and release my translation into the Public Domain immediately? Would an alternative Project Gutenberg of liberally licensed translations work?

    At least the Berne Convention says that "Translations, adaptations, arrangements of music and other alterations of a literary or artistic work shall be protected as original works without prejudice to the copyright in the original work."

    Of course the translation is not the same thing. Also, it is more complicated than that. The authors (quite reasonably) have some protection and control over translated versions. Still, even if only some parts of the world, and even only for a selected subset of all good books, could wait less than 50 years after the author's death to easily access his works free of charge, I believe that would be a good thing.

    One could imagine both "open source" and "crowdfunding" approaches to building such a library.

    It would be ironic to see the author's native language readers having more restrictions than the rest. Maybe such a reduction to absurdity could fuel an argument for a worldwide reform of copyright conventions for the digital age.

    But if history is any indication, they would just make tighter restrictions for the translations.

    1. Re:Translation workaround by bws111 · · Score: 1

      Your translation does not make the original copyright invalid, which is what your highlighted phrase means. You still need permission to make the translation in the first place, and if you don't have it, you have committed copyright infringement. However, if you have a license from the copyright holder, then your new work can be released on whatever terms you and the original copyright holder agreed to.

    2. Re:Translation workaround by Anonymous Coward · · Score: 1

      You still need permission to make the translation in the first place, and if you don't have it, you have committed copyright infringement.

      You are technically incorrect. Making a translation without the author's permission isn't copyright infringement. Distributing it is.

    3. Re:Translation workaround by david_thornley · · Score: 1

      Your translation will have at least two copyrights applying to it: the original author's and the translator's. It can't be used without licenses from both. It can't be distributed just with a license from the original author, hence the protection as an original work. It can't be distributed just with a license from the translator, since that would be prejudicial to the original author's copyright.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
  9. A riddle, wrapped in a mystery, inside an enigma by Marginal+Coward · · Score: 1

    Riddell's algorithm begins with the Wikipedia entries of all authors in the English language edition (PDF)—more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on....Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.

    For folks like Winston Churchill and Malcolm X who had notable careers outside of writing, I wonder how they distinguish what part of their Wikipedia stats is due to their writing and what part comes from the rest of their careers?

  10. Stop words? by radtea · · Score: 1

    Glancing at the partial list of topics presented suggests this work won't be too hard to improve on:

    Topic | Characteristic words
    4 | categori of birth death stub date name persondata place metadata
    20 | univers of the faculti colleg at and edu professor alumni
    31 | painter paint of art artist the and in work museum
    35 | he in his was and the to of categori at
    77 | he the his in to was of and on at
    97 | chines china hong kong zh taiwan zhang shanghai wang beij
    100 | the book writer novel fiction of and stori isbn novelist
    149 | of the and in historian univers languag histori studi translat
    160 | she her in the and was to of as with
    168 | the to that in and of ref was had by
    Table 1: Examples of topics derived from text of Wikipedia articles
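
    One obvious improvement is stop-word removal. A minimal sketch, using three topics copied from the table above and a small hand-picked stop-word list (the list is my own assumption, not something from the paper):

```python
# Hand-picked stop words for illustration only; real projects usually
# use a larger curated list from a library.
STOP_WORDS = {
    "of", "the", "and", "in", "at", "to", "he", "his", "she", "her",
    "was", "on", "as", "with", "that", "had", "by",
}

# Three topics taken from Table 1 (topic number -> characteristic words).
topics = {
    35: "he in his was and the to of categori at".split(),
    100: "the book writer novel fiction of and stori isbn novelist".split(),
    160: "she her in the and was to of as with".split(),
}


def strip_stop_words(words, stop_words=STOP_WORDS):
    """Return the words that are not in the stop-word set, preserving order."""
    return [word for word in words if word not in stop_words]


filtered = {topic_id: strip_stop_words(words) for topic_id, words in topics.items()}
```

    Topic 100 keeps its informative words, while topics 35 and 160 are left nearly or completely empty. In practice the filtering would be applied to the article text before fitting the topic model, not to the finished topics as done here.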

    --
    Blasphemy is a human right. Blasphemophobia kills.
  11. you think I am making joke... by Thud457 · · Score: 1

    ah. So since Francis Bacon isn't deceased, he's not considered. Got it.

    --

    the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

  12. Re:No Mention of Asimov by Quirkz · · Score: 1

    Asimov died something like 30 years after 1965. His works are nowhere near public domain yet.

  13. Not an independent machine ranking of the work by wagr · · Score: 1

    This is not a machine picking out which authors are worthy of digitizing; it is a computer scanning Wikipedia and a few other sites. In other words, it is meta: ranking what regular humans have already ranked by their words and effort to describe. The merit of the critics/reviewers is questionable.

    Deciding what is worth digitizing based on the merit of the work itself is not part of this article. For now, I'll stick with librarians deciding what to focus on.

  14. Where is ... by PPH · · Score: 1

    ... Edward Bulwer-Lytton?

    --
    Have gnu, will travel.
  15. Re:No Mention of Asimov by adonoman · · Score: 1

    He's 5th for the 1990s and 252nd overall.

  16. Where does the "machine-learning" come in? by wonkey_monkey · · Score: 1

    It all sounds fairly standard, as these things go. What has earned it the "machine-learning" distinction?

    --
    systemd is Roko's Basilisk.
  17. Some really weird results by jratcliffe · · Score: 1

    So, based on this algorithm, the #1 priority author would be Sherrilyn Kenyon (who writes paranormal romance), followed by Al Sarrantonio (who writes horror, and puts together a bunch of anthologies), and Muammar Gaddafi (yes, that Muammar Gaddafi). Number six is Gardner Dozois, who's also (like Sarrantonio) an anthologist.

    If this is designed to be popularity-based (e.g. designed to determine what people most want to see get scanned/uploaded/entered/produced by something like Gutenberg, rather than an assessment of the aesthetic/historical value of the works), an algorithm that puts these folks at the top, and puts massively popular authors like Stephen King (867) and Tom Clancy (1883) far down the list, is more than a bit suspect.

  18. Where's Bennett? by VorpalRodent · · Score: 1

    Based on his prolific works on Slashdot, I'm wondering where frequent contributor Bennett Haselton is on the list?

    --
    Take it to the limit, everybody to the limit, come on, everybody fhqwhgads.
  19. I'll be interested by JazzHarper · · Score: 1

    when a machine actually reads all these books and starts making comparisons based on content.

  20. *something* in, rubbish out... by serbanp · · Score: 1

    Bram Stoker being #1 in the 1910 decade, way ahead of someone like Mark Twain? In what universe?

    The list is full of mediocrity floating at the top, while profound authors are ranked way lower (Calamity Jane > Chekhov for instance).

    The complete failure of this ranking experiment just shows how true AI is still 20 years in the future (as it has been for the past 50 years)...

  21. circle-jerk by Tom · · Score: 1

    For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell's new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf's work than Amisani's.

    Which will lead to... exactly the thing we started from.

    Wikipedia is a huge circle-jerking effort. If you run this effort over the whole of it, you'll no doubt find out that the "works" of some porn stars are more influential than some of the more obscure philosophers.

    It's not so simple, and while the basic project is interesting, conclusions like "you should focus more on this" are clearly drawn by imbeciles who don't understand that influence isn't the same as citation count or page rank.

    The pre-Socratic philosophers, for example, like the sophists, probably don't rank so highly because they left little written material, but that is exactly why preserving what we have about them is so important. Among other things they invented rhetoric, made some of the earliest efforts at a systematic approach to ethics, and greatly influenced Socrates, Plato and Aristotle as well as pretty much every other Greek philosopher, though mostly through being their opponents.

    The same is true of Arabian scholars who largely go uncredited, but whose works created the foundation of much of mathematics.

    And let's not even talk about Asia. If you take WP as your frame of reference, you're doomed to failure on cross-cultural awareness. The Chinese WP is about 10% the size of the English one, but Chinese culture goes back more than a thousand years further than Western culture.

    It's a cute little project for fun, but generating serious suggestions for serious projects like Gutenberg out of it is shortsighted, stupid and self-referential.

    --
    Assorted stuff I do sometimes: Lemuria.org
  22. Not in America! by SoftwareArtist · · Score: 1

    Every year the works of thousands of authors enter the public domain

    No copyright has expired in the US since 1998, and none will expire until at least 2019. I say "at least", because you can be sure there will be lots of lobbying to extend them even further. I hope the rest of the world is enjoying their public domain... while they still have it.

    --
    "I'm too busy to research this and form an educated opinion, but I do have time to tell everyone my uninformed opinion."
  23. I Like Tom Godwin But... by crunchygranola · · Score: 1

    Take a look at the "most important" (highest ranking) deceased author from the 1980s. It is science fiction/fantasy writer Tom Godwin. Number two is Stanton A. Coblentz. Also in the top 20 (in order): Lin Carter, Robert A. Heinlein, Mack Reynolds, Theodore Sturgeon, James Tiptree, Jr., Clifford D. Simak. Forty percent of the top 20 are SF&F authors. Meanwhile we have Tuchman at 101, Sartre at 112, Borges at 254, Tennessee Williams at 439, Toynbee at 526, and so on.

    Looking at the 1990s, the top-loading by SF&F is equally extreme, with Marion Zimmer Bradley at No. 1 and William S. Burroughs at 748.

    Now I feel that SF&F authors are under-appreciated by critics and "the academy" in the English-speaking world, dismissing brilliantly inventive writing in English, when they would praise it as "magic realism" if written in Spanish or Portuguese, but this is just nerd/geek fannishness run amok.

    GIGO forever.

    --
    Second class citizen of the New Gilded Age
  24. If I misread the researcher's name as "Riddle" ... by harryjohnston · · Score: 1

    ... does that mean I've read too much Harry Potter?

  25. The Ben Franklin / Copyright "Pirate" connection by Paul+Fernhout · · Score: 1

    "Ben Franklin and others who owned printers realized that copyright didn't apply to them, so they promptly began making copies of everything - books, sheet music, etc."

    I had known that for much of US history there was no respect for foreign copyrights (from other countries). I never saw anyone connect this to Ben Franklin's success before. Interesting!

    Now that I look:
    "Benjamin Franklin, Copyright Pirate"
    http://www.tuxdeluxe.org/node/...

    And:
    "Benjamin Franklin, the first IP pirate?"
    http://arstechnica.com/informa...

    --
    A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.