Machine-Learning Algorithm Ranks the World's Most Notable Authors
HughPickens.com writes: Every year the works of thousands of authors enter the public domain, but only a small percentage of these end up being widely available. So how do organizations such as Project Gutenberg choose which works to focus on? Allen Riddell has developed an algorithm that automatically generates an independent ranking of notable authors for any given year. It is then a simple task to pick the works to focus on or to spot notable omissions from the past. Riddell's approach is to look at what kind of public domain content the world has focused on in the past and then use this as a guide to find content that people are likely to focus on in the future.
Riddell's algorithm begins with the Wikipedia entries of all authors in the English language edition (PDF)—more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on. This produces a "public domain ranking" of all the authors that appear on Wikipedia. For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell's new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf's work than Amisani's. Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world, the new algorithm picks out TS Eliot as the most highly ranked individual. Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.
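As a rough illustration of the idea in the summary (turning per-article signals into a single ordering), here is a minimal sketch. It is not Riddell's actual model: the `Article` fields, the `WEIGHTS` values, and the log-weighted sum are all assumptions made up for this example, and the two sample articles are invented.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical feature names, loosely mirroring the signals in the summary.
    title: str
    length_bytes: int
    age_days: int
    views_per_day: float
    days_since_last_edit: int


# Illustrative weights only; a real system would fit these from data.
WEIGHTS = {"length": 1.0, "age": 0.5, "views": 2.0, "staleness": -0.5}


def score(a: Article) -> float:
    """Combine log-scaled features into one number (higher = more notable)."""
    return (
        WEIGHTS["length"] * math.log1p(a.length_bytes)
        + WEIGHTS["age"] * math.log1p(a.age_days)
        + WEIGHTS["views"] * math.log1p(a.views_per_day)
        + WEIGHTS["staleness"] * math.log1p(a.days_since_last_edit)
    )


def rank(articles):
    """Return {title: rank}, with rank 1 for the highest score."""
    ordered = sorted(articles, key=score, reverse=True)
    return {a.title: i for i, a in enumerate(ordered, start=1)}


# Invented sample data, not taken from Wikipedia.
sample = [
    Article("Author B", 1_500, 1_000, 5.0, 400),
    Article("Author A", 80_000, 4_000, 3_000.0, 2),
]
ranks = rank(sample)
```

With these made-up numbers the long, heavily viewed, recently edited article ranks first, which is the kind of behaviour the summary describes.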
https://medium.com/the-physics...
Gave us the most influential person in world history was Linnaeus
Just to be Anglo-centric, I don't even see William Shakespeare as eligible on the new list.
Maybe this should be recategorized as "funny things you can do with computers"?
It's only authors who died in 1965. From the SUMMARY:
Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world,
What a load of crap.
This is why you get rubbish like the BBC destroying lots of "classic" early TV series (throwing the film into skips). But they made sure there was space for old episodes of Panorama most of which involved cretins of the day talking shite which is irrelevant in a few years.
The whole point of archiving is that you literally have *no clue whatsoever* what is going to be valuable in the future.
If you did you would be a stock market billionaire multiple times over.
Of the individuals who died in 1965 and whose work will enter the public domain next January
This says so much about our culture...
Are there jurisdictions where one could legally and openly operate a Project Gutenberg clone with more recent works?
I really like G.K. Chesterton, but how can he be ranked higher than Arthur Conan Doyle and Sigmund Freud?
In soviet russia the government regulates the companies.
Laughable!
I quickly checked Wikipedia, and most countries seem to stick with at least "Life + 50yr" term. That is a great achievement of the lobbyists.
Some island nations seem to have no known copyright legislation, but they are still usually parties to some limiting international treaties, and also have similar restrictions under other names ("unauthorized copying", etc.)
Seriously, is there no place on Earth with more reasonable terms?
It may make more sense to concentrate on those lower in the list. The works of highly rated authors are likely to remain available anyway whereas those of lower rated authors are more likely to be lost.
Admittedly, the loss may be deserved, but I am willing to bet there are some (if not many) that will be more highly appreciated in a century or so.
Great minds think alike; fools seldom differ.
What if I translate someone's book, and release my translation into the Public Domain immediately? Would an alternative Project Gutenberg of liberally licensed translations work?
At least the Berne Convention says that "Translations, adaptations, arrangements of music and other alterations of a literary or artistic work shall be protected as original works without prejudice to the copyright in the original work."
Of course the translation is not the same thing. Also, it is more complicated than that. The authors (quite reasonably) have some protection and control over translated versions. Still, even if only some parts of the world, and even only for a selected subset of all good books, could wait less than 50 years after the author's death to easily access his works free of charge, I believe that would be a good thing.
One could imagine both "open source" and "crowdfunding" approaches to building such a library.
It would be ironic to see the author's native language readers having more restrictions than the rest. Maybe such reduction to absurdity could fuel an argument for a worldwide copyright conventions reform for the digital age.
But if history is any indication, they would just make tighter restrictions for the translations.
Riddell's algorithm begins with the Wikipedia entries of all authors in the English language edition (PDF)—more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on....Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.
For folks like Winston Churchill and Malcolm X who had notable careers outside of writing, I wonder how they distinguish what part of their Wikipedia stats is due to their writing and what part comes from the rest of their careers?
Use Google. Download their ebooks for free from an open index.
It's like they just want to ignore 90% of sci-fi and probably 10-30% of all modern science and technology brought about by people he inspired.
Glancing at the partial list of topics presented suggests this work won't be too hard to improve on:
Topic | Characteristic words
4 | categori of birth death stub date name persondata place metadata
20 | univers of the faculti colleg at and edu professor alumni
31 | painter paint of art artist the and in work museum
35 | he in his was and the to of categori at
77 | he the his in to was of and on at
97 | chines china hong kong zh taiwan zhang shanghai wang beij
100 | the book writer novel fiction of and stori isbn novelist
149 | of the and in historian univers languag histori studi translat
160 | she her in the and was to of as with
168 | the to that in and of ref was had by
Table 1: Examples of topics derived from text of Wikipedia articles
Blasphemy is a human right. Blasphemophobia kills.
ah. So since Francis Bacon isn't deceased, he's not considered. Got it.
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
According to the algorithm the 10 most important authors of all times are:
Abraham Lincoln
Aristotle
Ayn Rand
Alain Connes
Allan Dwan
Andre Agassi
Aldous Huxley
American National Standards Institute
Andrei Tarkovsky
Notice the ANSI in position 9
Also Attila is in position 16...
You can check by downloading the csv file.
This is not a machine picking out which authors are worthy of digitizing; it is a computer scanning Wikipedia and a few other sites. In other words, it is meta: ranking what regular humans have already ranked through their words and their effort to describe. The merit of the critics/reviewers is questionable.
Deciding what is worth digitizing based on the merit of the work itself is not part of this article. For now, I'll stick with librarians deciding what to focus on.
Have gnu, will travel.
It must be the poems about Cats - as they rule the internet
How does frequent contributor Bennett Haselton compare?
It all sounds fairly standard, as these things go. What has earned it the "machine-learning" distinction?
systemd is Roko's Basilisk.
So, based on this algorithm, the #1 priority author would be Sherrilyn Kenyon (who writes paranormal romance), followed by Al Sarrantonio (who writes horror, and puts together a bunch of anthologies), and Muammar Gaddafi (yes, that Muammar Gaddafi). Number six is Gardner Dozois, who's also (like Sarrantonio) an anthologist.
If this is designed to be popularity-based (e.g. designed to determine what people most want to see get scanned/uploaded/entered/produced by something like Gutenberg, rather than an assessment of the aesthetic/historical value of the works), an algorithm that puts these folks at the top, and puts massively popular authors like Stephen King (867) and Tom Clancy (1883) far down the list, is more than a bit suspect.
Based on his prolific works on Slashdot, I'm wondering where frequent contributor Bennett Haselton is on the list?
Take it to the limit, everybody to the limit, come on, everybody fhqwhgads.
Seems Tom Clancy comes in 11th for 2013, which seems odd as I haven't heard of any of the top ten.
Then again, I may have just outed myself as uncultured; good thing I'm posting this as AC.
when a machine actually reads all these books and starts making comparisons based on content.
"Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world"
As far as I know, life+50 or life+70 terms count the years from the year after the author's death. This means that next January the works of authors who died in 1964 will enter the public domain in life+50 countries (1944 for life+70).
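The arithmetic above can be checked with a small sketch. It assumes a plain life+N term that runs to the end of the Nth calendar year after the year of death, with no wartime extensions or other jurisdiction-specific rules; the function name `public_domain_year` is made up for this example.

```python
def public_domain_year(death_year: int, term_years: int) -> int:
    """Year on whose 1 January the works enter the public domain.

    Assumes the term is counted from 1 January of the year following
    the author's death and runs for `term_years` full calendar years.
    """
    return death_year + term_years + 1


# Authors who died in 1964 under life+50, and in 1944 under life+70,
# both come out on 1 January 2015; a 1965 death under life+50 gives 2016.
life50_1964 = public_domain_year(1964, 50)
life70_1944 = public_domain_year(1944, 70)
life50_1965 = public_domain_year(1965, 50)
```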
The top 10 of all time:
Sherrilyn_Kenyon 1
Al_Sarrantonio 2
Muammar_Gaddafi 3
Walter_Jon_Williams 4
David_G._Hartwell 5
Gardner_Dozois 6
Mike_Ashley_(writer) 7
Jonathan_Strahan 8
Jan_Brett 9
Terri_Windling 10
Other notables that I saw:
Timothy_Zahn 47
Martin_Luther_King,_Jr. 51
Bram_Stoker 70
H._P._Lovecraft 84
Microsoft 116
George_R._R._Martin 118
R._A._Salvatore 124
Steve_Jobs 150
George_W._Bush 162
Isaac_Asimov 252
Naomi_Novik 268
Mary_Pope_Osborne 289
Lois_McMaster_Bujold 301
Orson_Scott_Card 651
Neil_deGrasse_Tyson 23568
William_Shakespeare 158490
As you can see, the list puts more modern 'crazes' towards the top with real writers below. The top 500-600 all got a maximum score of 94. If you scored 'only' a 93, you could have ended down as far as ~5000. At #158490, William Shakespeare scores an awe-inspiring 68.
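To see why a one-point difference in score can mean thousands of places in rank, here is a small sketch of how tied integer scores map to rank ranges. The scores in the example are toy values, not taken from the CSV, and `rank_ranges` is a hypothetical helper name.

```python
from collections import Counter


def rank_ranges(scores):
    """Map each distinct score to the (best, worst) rank its holders can get.

    Everyone sharing a score occupies one contiguous block of ranks, so the
    order inside a block is decided by something other than the score.
    """
    counts = Counter(scores)
    ranges = {}
    next_rank = 1
    for value in sorted(counts, reverse=True):
        ranges[value] = (next_rank, next_rank + counts[value] - 1)
        next_rank += counts[value]
    return ranges


# Toy data: three entries tied at 94, two at 93, one at 68.
toy = rank_ranges([94, 93, 94, 68, 93, 94])
```

Here every entry with 93 lands somewhere in ranks 4 to 5 regardless of how it compares with the others at 93; scale the block sizes up to hundreds or thousands and you get the spread described above.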
Bram Stoker being #1 in the 1910 decade, way ahead of someone like Mark Twain? In what universe?
The list is full of mediocrity floating at the top, while profound authors are ranked way lower (Calamity Jane > Chekhov, for instance).
The complete failure of this ranking experiment just shows how true AI is still 20 years in the future (as it has been for the past 50 years)...
For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell's new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf's work than Amisani's.
Which will lead to... exactly the thing we started from.
Wikipedia is a huge circle-jerking effort. If you run this effort over the whole of it, you'll no doubt find out that the "works" of some porn stars are more influential than some of the more obscure philosophers.
It's not so simple, and while the basic project is interesting, conclusions like "you should focus more on this" are clearly written by imbeciles who don't understand that influence isn't the same as citation count or page rank.
The pre-Socratic philosophers, for example, like the sophists, probably don't rank so highly because they left little written material, but that is exactly why preserving what we have about them is so important. Among other things they invented rhetoric, made some of the earliest efforts at a systematic approach to ethics, and greatly influenced Socrates, Plato and Aristotle as well as pretty much every other Greek philosopher, though mostly through being their opponents.
The same is true of Arab scholars, who largely go uncredited, but whose works created the foundation of much of mathematics.
And let's not even talk about Asia. If you take WP as your frame of reference, you're doomed to failure on cross-cultural awareness. The Chinese WP is about 10% the size of the English one, but Chinese culture goes back more than a thousand years further than Western culture.
It's a cute little project for fun, but generating serious suggestions for serious projects like Gutenberg out of it is shortsighted, stupid and self-referential.
Assorted stuff I do sometimes: Lemuria.org
Every year the works of thousands of authors enter the public domain
No copyright has expired in the US since 1998, and none will expire until at least 2019. I say "at least", because you can be sure there will be lots of lobbying to extend them even further. I hope the rest of the world is enjoying their public domain... while they still have it.
"I'm too busy to research this and form an educated opinion, but I do have time to tell everyone my uninformed opinion."
Take a look at the "most important" (highest ranking) deceased author from the 1980s. It is science fiction/fantasy writer Tom Godwin. Number two is Stanton A. Coblentz. Also in the top 20 (in order): Lin Carter, Robert A. Heinlein, Mack Reynolds, Theodore Sturgeon, James Tiptree, Jr., Clifford D. Simak. Forty percent of the top 20 are SF&F authors. Meanwhile we have Tuchman at 101, Sartre at 112, Borges at 254, Tennessee Williams at 439, Toynbee at 526, and so on.
Looking at the 1990s, the top loading by SF&F is equally extreme, with Marion Zimmer Bradley at No. 1 and William S. Burroughs at 748.
Now I feel that SF&F authors are under-appreciated by critics and "the academy" in the English-speaking world, dismissing brilliantly inventive writing in English, when they would praise it as "magic realism" if written in Spanish or Portuguese, but this is just nerd/geek fannishness run amok.
GIGO forever.
Second class citizen of the New Gilded Age
"Ben Franklin and others who owned printers realized that copyright didn't apply to them, so they promptly began making copies of everything - books, sheet music, etc."
I had known that for much of US history there was no respect for foreign copyrights (from other countries). I never saw anyone connect this to Ben Franklin's success before. Interesting!
Now that I look:
"Benjamin Franklin, Copyright Pirate"
http://www.tuxdeluxe.org/node/...
And:
"Benjamin Franklin, the first IP pirate?"
http://arstechnica.com/informa...
A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.