Counting the World's Books
The Google Books blog has an explanation of how they attempt to answer a difficult but commonly asked question: how many different books are there? Various cataloging systems are fraught with duplicates and input errors, and only encompass a fraction of the total distinct titles. They also vary widely by region, and they haven't been around nearly as long as humanity has been writing books. "When evaluating record similarity, not all attributes are created equal. For example, when two records contain the same ISBN this is a very strong (but not absolute) signal that they describe the same book, but if they contain different ISBNs, then they definitely describe different books. We trust OCLC and LCCN number similarity slightly less, both because of the inconsistencies noted above and because these numbers do not have checksums, so catalogers have a tendency to mistype them." After refining the data as much as they could, they estimated there are 129,864,880 different books in the world.
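To make the quoted idea concrete, here is a minimal sketch of attribute-weighted record matching. It is not Google's actual algorithm: the `Record` fields, the `same_book_score` function, and the numeric weights are all hypothetical, chosen only to reflect the ordering in the quote (a matching ISBN is strong but not absolute evidence, differing ISBNs mean different books, and OCLC/LCCN matches are trusted slightly less).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One catalog record; any identifier may be missing."""
    title: str
    isbn: Optional[str] = None
    oclc: Optional[str] = None
    lccn: Optional[str] = None


# Hypothetical weights: only their relative order comes from the quote.
ISBN_MATCH = 0.9   # very strong, but not absolute
OCLC_MATCH = 0.7   # trusted slightly less (no checksum, typos happen)
LCCN_MATCH = 0.7


def same_book_score(a: Record, b: Record) -> float:
    """Return a rough 0..1 confidence that two records describe the same book."""
    if a.isbn and b.isbn:
        # Different ISBNs are treated as definitely different books.
        return ISBN_MATCH if a.isbn == b.isbn else 0.0
    score = 0.0
    if a.oclc and a.oclc == b.oclc:
        score = max(score, OCLC_MATCH)
    if a.lccn and a.lccn == b.lccn:
        score = max(score, LCCN_MATCH)
    return score


r1 = Record("Hamlet", isbn="9780306406157", oclc="111")
r2 = Record("Hamlet (annotated)", isbn="9780306406157")
r3 = Record("Hamlet", isbn="9780000000002", oclc="111")
r4 = Record("Hamlet", oclc="111")

print(same_book_score(r1, r2))  # same ISBN
print(same_book_score(r1, r3))  # different ISBNs override the OCLC match
print(same_book_score(r1, r4))  # only the OCLC number can be compared
```

A real system would also compare titles, authors, and publishers, and would cluster records rather than score pairs; this only shows why "not all attributes are created equal".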
Don't include the copyrighted books. OPEN SOURCE
Look at textbooks - new editions that are almost indistinguishable from the previous editions have new ISBNs. Do we count every single one as a different book?
And what about self-published books? They wouldn't have an ISBN unless they became wildly successful, and maybe not even then.
An estimate would be about 130 million, not 129,864,880.
I'm almost done reading them all!
How about all the books printed in China, the rest of Asia, the Middle East, etc. that don't have ISBNs?
I can tell this topic is going to be dominated by people who never had to deal with the internals of a revision-control system, much less a configuration-management system, because the issues are somewhat trivial once you get past your fear of the variables.
Also by people who have never read the article, where it explains in some significant detail how they try to determine what constitutes "a book" for the purposes of their counting.
"This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
In order to count and house all the world's books, we, of course, are going to need a new filesystem. I propose to call it TSPFS. The fundamental unit of the said filesystem is a BLoC, representing 115M books. And of course, 640K BLoCs should be enough for anyone...
That's a stupid estimate. Since they admitted there is so much uncertainty, they should have just said 130 million. (Or better, 0.13 billion to retain the significant digits)
I'm very suspicious about their numerical precision. If it's an estimate, then they are saying it's 129,864,880 +/- 10. That is, they are pretty sure there aren't 129,864,980 books. I think they should make their estimate something like "we think there are about 130,000,000" or whatever accuracy they actually believe.
Currently hooked on AMP
-=][Interesting][=-
> Various cataloging systems are fraught with duplicates and input errors, and only encompass a fraction of the total distinct titles.
You callin' me a liar?
They should write a book!
Who cares? Does it matter?
I want a list of atrocities done in your name - Recoil
OK so now we can represent every text in the world with a 32 bit key. We just need the world's fanciest decryption algorithm to recover the texts...
If you divide the current world population by the number of books, you get one unique book for every 50 or so people. Put another way, on average one person in 50 wrote a book, and that average includes many people who are poor, illiterate, or children.
Of course, some book writers have died and many have written more than one book, but I suspect that most books have been written recently and their writers are still alive.
If you only include adults who live a comfortable western lifestyle, it may be as high as one in 10.
I'm not sure I follow.... How much is that in Libraries of Congress?
Qoh.12 [12] ... Of making many books there is no end,
In the land of the blind, the one-eyed man is king.
Do Ph.D. thesis manuscripts (and other academic writings) count as books? If so, I bet there are many more than "only" 130e6...
The same checksum they use for UPC codes. Sum up the 10 significant digits. Then take that sum(S) and push up to the next tens unit(T). The difference of T-S = check digit.
E.g. UPC code 54556 39824. Sum is 51. Next tens is 60. 60-51=9 so the check digit is 9. The same basic formula could work for ISBN numbers too.
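A quick sketch of the arithmetic above, next to the check digit that ISBN-13 already carries. The function names are mine, not from any library. Note that the unweighted sum is a simplification: ISBN-13 (like real UPC-A) multiplies alternate digits by 3, which is what lets it catch most swapped-adjacent-digit typos that a plain sum cannot see.

```python
def simple_check_digit(digits: str) -> int:
    """Unweighted scheme from the comment: pad the digit sum up to a multiple of 10."""
    total = sum(int(d) for d in digits if d.isdigit())
    return (10 - total % 10) % 10


def isbn13_check_digit(first12: str) -> int:
    """ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... over the first 12 digits."""
    digits = [int(d) for d in first12 if d.isdigit()]
    if len(digits) != 12:
        raise ValueError("expected exactly 12 digits")
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return (10 - total % 10) % 10


# The example from the comment: digit sum 51, next ten is 60, check digit 9.
print(simple_check_digit("54556 39824"))   # 9

# Swap two adjacent digits (82 -> 28): the plain sum cannot tell the difference.
print(simple_check_digit("54556 39284"))   # 9

# ISBN 978-0-306-40615-7: the weighted scheme yields the printed check digit.
print(isbn13_check_digit("978030640615"))  # 7

# Swap the last two digits (15 -> 51): the weighted check digit changes.
print(isbn13_check_digit("978030640651"))  # 5
```

So the typo-prone identifiers in the summary are the OCLC and LCCN numbers; ISBNs already have this protection built in.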
129,864,880 different books? What is that in Libraries of Congress?
You read the article?
Impostor! Burn the witch!
Check out my sci-fi/humor trilogy at PatriotsBooks.
How about the books that people write and spread around to friends or books published by small in-house printshops, often as promotional material? Books written before ISBN that are still in libraries but no longer published (Bodoni's type specimens come to mind, though it looks like some of these are indeed catalogued by WorldCat)? Books that were printed years ago that we know we lost to the ages (the lost Gospel of Barnabas--not the forged Gospel of Barnabas--comes to mind). What about the books that we never knew existed?
This estimate isn't bad for published works, but it does not adequately answer the question posed, "Just how many books are out there?"
> Look at textbooks - new editions that are almost indistinguishable from the previous editions have new ISBNs. Do we count every single one as a different book?
From TFS : if they contain different ISBNs, then they definitely describe different books
If they're using this method, GP's point is valid. The books are not really new books, they're essentially the same as previous editions but have different ISBNs. In essence, these new editions with new ISBNs are being counted twice (or more) for very small revisions to the same book.
Visit some of the stately homes of England and it will be obvious that there are lots and lots of books that are unlikely to be in very many libraries but which would contain lots of fascinating historical and geographical info. Things like the history of our county, or memoirs of my service as a priest in this parish. Many of these homes are operated by the National Trust, but often the home and contents are still privately owned. It would take a lot of work to get access to scan this stuff, but I would love to see it done. There are thousands of small local museums and libraries throughout the world with lots of regional information, garnered from the estates of prominent citizens who died. Google has only scratched the surface with their scanning to date.
From TFA: Well, it all depends on what exactly you mean by a “book.” We’re not going to count what library scientists call “works,” those elusive "distinct intellectual or artistic creations.” It makes sense to consider all editions of “Hamlet” separately, as we would like to distinguish between -- and scan -- books containing, for example, different forewords and commentaries. (emphasis mine)
For Google's definition of what constitutes a unique work as used to derive the stated quantity, the use of ISBN as described is perfectly valid. They are OK with "almost the same work" != "the same work".
So their counting methodology would consider "Fundamentals of Math 3rd Ed by I. M. Counting" to be a distinct work from "Fundamentals of Math 4th Ed by I. M. Counting".
In fact, if the publisher released a paperback version, it would be considered another separate work, because the typesetting and page layouts may differ, and it might include different forewords, different page numbers in the index, etc.
It's a separate and distinct work, from Google's point of view, where they are trying to index the works that they want to scan.
Remember, their goal is to capture as much as possible of the entire sum of human writing. A different foreword is a unique work to them.
Of course, you can then disagree with Google's counting methodology, which is fine. If you do, then the number they have reached for their purposes is meaningless to you and you'd better start counting based on your own definition.
It'll take a while, good luck, and let us know what you come up with. :)
(Bet you thought I was going to say the Bible. Wrong, I'm crazier than that!)
> there are 129,864,880 different books in the world
So how many library of congresses is that?
ISBNs suck as identifiers for digital books, especially digital books that are free. There are two problems.
Problem number one is that they cost money. Let's say someone writes up a really nice manual documenting some open-source software. He wants the manual to be free, just like the software. But now if he wants an ISBN, he has to pay money to get the ISBN, which means expending dollars on a book that is not going to be bringing in any dollars. The fact that ISBNs cost money is out of step with the fact that we have this thing called the World Wide Web, which is basically a huge machine for letting people do publishing without the per-copy costs that are associated with print publishing.
The other problem is that ISBNs are supposed to uniquely identify an edition of the book. This makes sense for traditional print publishing, where the economics of production forced people to make discrete editions widely spaced in time. It makes no sense for print on demand or for pure digital publishing. I've written some CC-licensed textbooks. When someone emails me to let me know about a typo or a factual error, I fix it right away in the digital version, and I usually update the print-on-demand version within about 6 months. No way am I going to assign a different ISBN every 6 months.
We can say that ISBNs are for printed books, not for ephemeral web pages, but that doesn't really work. The two overlap. My textbooks exist simultaneously as web pages, PDF files, and printed books. Amazon sells a book for the Kindle using one ISBN, assigning a different ISBN to the printed version. Print-on-demand books share some characteristics with printed books (e.g., they're physical objects) and some with the web (they can be updated continuously).
By the way, why do you think library catalogs don't show ISBNs? It's because ISBNs are meant as commercial tools, like the barcode on a box of cereal. If Google finds ISBNs useful for purposes other than selling copies of books, it's probably because Google is trying to deal with a massive number of books using a minimum amount of human labor.
Find free books.
A typical book is in the range of 1-2MB of text, assuming you're representing actual letters, as opposed to scanned images of the text, and ignoring illustrations, pictures, etc. So if there are about 130 million books, that's about 200TB to store them uncompressed, maybe 50TB compressed. If you've got multiple versions that are almost identical (e.g. Third Printing from Paperback Publisher B has a different copyright page than First Printing from Hardback Publisher A, and maybe a different cover illustration and blurbs on the back cover), then the different versions add a percent or two.
As a cross-check, Wikipedia says the Library of Congress has about 20 million books (in a collection of 100 million things), and The InterWebs say that the Library of Congress is about 20TB (not clear if that's just books or not). That works out to 1MB per book, so 130 million books would be about 130TB uncompressed; it fits on the back of the same envelope.
So for about $5000 of computer equipment, your town or school could have its own copy of The Library, with All The Books.
So far, The Internet Archive has digitized about a million books. At 1-2MB each that's 1-2TB of text uncompressed, which fits on a single hard drive (though not on one or two Blu-ray discs, which hold 25-50GB each).
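For anyone who wants to rerun the envelope math, here is a small sketch. The inputs are just the rough figures from this comment (1-2MB per book, about 130 million books, roughly 4:1 text compression); the function name and the decimal MB/TB units are my own choices.

```python
MB = 10**6   # decimal megabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def library_size_tb(num_books: int, mb_per_book: float,
                    compression_ratio: float = 1.0) -> float:
    """Estimated storage in TB for plain-text books of a given average size."""
    return num_books * mb_per_book * MB / compression_ratio / TB


books = 130_000_000

low = library_size_tb(books, 1.0)              # 1MB per book
high = library_size_tb(books, 2.0)             # 2MB per book
middle = library_size_tb(books, 1.5)           # midpoint of the range
compressed = library_size_tb(books, 1.5, 4.0)  # assuming about 4:1 compression

print(f"uncompressed: {low:.0f}-{high:.0f} TB (midpoint {middle:.0f} TB)")
print(f"compressed at 4:1: about {compressed:.0f} TB")

# Cross-check: a 20TB Library of Congress holding 20 million books
# implies about 1MB per book, the low end of the range.
loc_mb_per_book = 20 * TB / 20_000_000 / MB
print(f"LoC figure implies {loc_mb_per_book:.1f} MB per book")
```

The midpoint lands near the "about 200TB" figure and the compressed case near "maybe 50TB"; changing the per-book size moves everything proportionally.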
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
1) TFA actually acknowledges that the ISBN is very North America-centric, but the other cataloging types are also either N.A.-centric or at least western world-centric.
2) The entire article is based on efforts to simply compile a list of books by aggregating and loosely filtering/sorting several other lists. The lists mentioned are, as far as I know, all heavily biased toward 19th and 20th century works. (The article explicitly mentions that one problem is that it doesn't include numerous works not intended for commercial consumption, such as doctoral theses and so on.)
I would argue that the most important works to digitize first are not the low-hanging fruit of works already cataloged and, in most cases, existing in multiple copies in multiple locations. (We are at little risk of losing the works of Dan Brown (cited in the article) to the depredations of time during the scope of this project.) To me, the most important works to get digitized are those where there are only one or two copies, which are possibly hundreds of years old and are moldering away forgotten on the back shelves of some monastery or filed and forgotten in the bowels of some museum.
What I'd like to see is Google and a few other digital data industry leaders get together and create a bounty system for old books. Simply put: The Global Translation Movement will pay, say, a buck a page multiplied by the confirmed age of the book in question. (Similar pay scales would have to be worked out for those really old "books" that consist of wood tablets, bamboo or papyrus strips and so on.) The project would need to go out of its way to contact old monasteries, nunneries, temples, museums and so forth. A 200-page folio that is 250 years old nets $50,000 for the monastery that scans it and shares the digital copy with the world. My inspiration for this came from the Islamic Translation Movement of medieval times.
You could do similar bounties for translations as well into four or five of the world's most widespread languages. (Chinese, English and Arabic come to mind.)
If I were some kind of intellectual or academic authority, this is something that I'd seriously pitch at the next TED Talk...
I need a wheelchair van for my son. Help me get the word out. https://www.gofundme.com/wheelchair-van-for-jj
and there are even more in Lucien's library in the dreaming.
I was not proposing a new method of counting books... I was only supporting the OP in his assertion that their method contains limitations regarding repetition of works with minor differences.
I was mainly responding to those who just said RTFA without seeing basic facts in TFS.
So if I wrote a book about this, should I call it "The 129,864,880 Books That You Must Read Before You Die", or "The 129,864,881 Books That You Must Read Before You Die"?
I am curious about the characterization of ancient texts. Does the ISBN system take account of books written before the ISBN was created? After all, books have been around for a very long time. The printing press made books inexpensive and pervasive, but books existed long before.
Take a famous example, the Gutenberg Bible. Does it have an ISBN number? Now a much more difficult one: How about the Code of Hammurabi, which was "published" on clay tablets? How about the Dead Sea Scrolls, at least the intact ones? And what about some of the Mayan books, which are incredibly rare? How about some of the Egyptian texts, written on papyrus?
It would be interesting to know what qualifies as a "book".
Stephen King has written at least that many.
Damn, I first read "Counting the World's Boobs"!
Wish I wasn't so proficient at speed reading ... and multilingual.
Not much left to do after reading them all except to play Grand Theft Auto for the rest of my life I guess.
They seem to define what a book is, but what is it really?
That is one of the things my dad ended up defining for his book cataloging (handling books, magazines, short stories, etc).
For him, he defined that you have a work, which is contained in a volume. (OK, looking at the article's link on what a work is, that seems to be similar to my dad's definition. For him it is the actual text, regardless of what form it takes.)
A volume is a container holding anywhere from one to many works (think collections of short stories, one of the reasons he designed his database that way). Sometimes the same work is in multiple volumes. (Think: a short story runs in a magazine and is later placed into a collected-works book. Same work, multiple volumes. Same idea with paperback versus hardback: same thing, different forms, usually at least.) Or how about the same work from two different publishers? (Fellowship of the Ring: we have two copies, one from ACE and one from another publisher whose name escapes me at the moment. Assuming they are the same, should we count one or two? Content being the same, does it matter that we have two physically different objects whose contents are the same?)
How do you deal with a publisher that makes changes between printings but doesn't mark them as a new edition? (Maybe a spelling correction? That would likely end up with the same ISBN.)
Take the article's Hamlet example, with different forewords and commentaries. Are they truly unique, or could you break them down as different containers, each containing Hamlet plus something else? (OK, the containers will be unique, but some of the content inside will not be. Of course this assumes they don't modify the Hamlet text itself; not sure what they might do.)
So I guess what I am asking is: is it the content we are counting (regardless of physical container), or the physical objects? (Not sure how to label the digital works in this case.)
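The work/volume split described above could be sketched roughly like this. It is a guess at the shape of such a catalog, not the actual database: the class names, fields, and sample entries (including "Other Publisher" and "Editor A") are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Work:
    """The text itself, independent of any physical form."""
    title: str
    author: str


@dataclass
class Volume:
    """A container (book, magazine issue, ...) holding one or more works."""
    label: str
    publisher: str
    works: List[Work] = field(default_factory=list)


def count_volumes(volumes: List[Volume]) -> int:
    """Count containers: closer to the article's edition-level count."""
    return len(volumes)


def count_works(volumes: List[Volume]) -> int:
    """Count distinct texts, however many containers they appear in."""
    return len({work for volume in volumes for work in volume.works})


fellowship = Work("The Fellowship of the Ring", "J.R.R. Tolkien")
hamlet = Work("Hamlet", "William Shakespeare")
foreword = Work("Foreword to Hamlet", "Editor A")  # hypothetical extra content

shelf = [
    Volume("Fellowship, paperback", "Ace", [fellowship]),
    Volume("Fellowship, paperback", "Other Publisher", [fellowship]),
    Volume("Hamlet, plain text", "Publisher X", [hamlet]),
    Volume("Hamlet, with foreword", "Publisher Y", [foreword, hamlet]),
]

print(count_volumes(shelf))  # 4 containers
print(count_works(shelf))    # 3 distinct works
```

Counting volumes gives the bigger number; counting works collapses the two Fellowship copies and the two Hamlets. Silent changes between printings would still need something like a text hash to detect.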
Log files are typically very structured low-entropy data. With random natural-language text you seldom get better than 3 or 4 to 1 lossless compression. Image compression can do better, but that's typically already been done to get the JPG/PNG/GIF/etc., and it's typically lossy, and of course video compression is much better because most of an image doesn't change much from frame to frame. But in this case they're trying to OCR the data, so much of that image compressibility has already been replaced (because you're using one byte to represent the letter instead of a bunch of bytes representing black marks on white paper.)
Bill Stewart