War and Nookd — eBook Regex Gone Haywire
PerlJedi tips a story that highlights one of the downsides to ebooks. A blogger who recently read Tolstoy's War and Peace on his Nook stumbled upon some odd phrases, such as: "It was as if a light had been Nookd in a carved and painted lantern..." After seeing the word 'Nookd' a few more times, he found a dead-tree version of the book and discovered that the word was supposed to be 'kindled.' Every instance of the word 'kindle' in the ebook had been replaced with 'Nook.'
"The Superior Formatting Publishing version isn’t a Barnes and Noble book, so this isn’t the work of a rogue Nook marketer from B&N. Rather, it’s likely that Superior Formatting Publishing ported its Kindle version of War and Peace over to the Nook — doing a search and replace to make sure that any Kindle references they’d inserted, such as in the advertising at the end of the book about their fine Kindle products, were simply changed to Nook. The unwitting hilarity of a publisher doing a 'find and replace' and accidentally changing the text of a canonical work of Western thought is alarming. Many versions of e-books are from similar outfits, that distribute public domain works formatted for Kindle or Nook at the lowest possible prices. The great democratizing factor of the ebook formats – that anyone can easily distribute – can also mean that readers can never be quite sure that they are viewing the texts as the author intended."
But I went back and searched every kindle and cranny to set every instance of the word back to kindle to fix it.
I'm only human.
My work here is dung.
Tools such as diff and grep would probably amaze them.
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
Er, NUKE it.
"I accidentally Western Literature, is that bad?"
It's not just intentional malice you need to look out for but also just pure distilled stupidity.
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
s/Raskolnikov/Obama/g
'eBook Regex Gone Haywire'
This is a straightforward substring replace, not a regular expression. A not-completely-stupid regex would at least have only converted \bKindle\b, although obviously even then human oversight would be necessary.
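For anyone who wants to see the difference, here is a minimal Python sketch. How the publisher actually ran the replacement is not known; the case-insensitive substring replace below is only a guess that happens to reproduce "Nookd", and the advertising sentence is made up for illustration.

```python
import re

# Sentence quoted in the story, restored to the original wording.
text = "It was as if a light had been kindled in a carved and painted lantern."
# Hypothetical back-of-book advertising line (not taken from the actual ebook).
ad = "Browse our other fine Kindle titles."


def naive_replace(s: str) -> str:
    # Case-insensitive substring replace: also rewrites "kindle" inside longer words.
    return re.sub("kindle", "Nook", s, flags=re.IGNORECASE)


def word_replace(s: str) -> str:
    # \b anchors the match at word boundaries, and the match is case-sensitive,
    # so only the standalone product name "Kindle" is rewritten.
    return re.sub(r"\bKindle\b", "Nook", s)


print(naive_replace(text))  # the novel's text is damaged
print(word_replace(text))   # the novel's text is left alone
print(word_replace(ad))     # the advertising line is still converted
```

Even the word-boundary version would rewrite a legitimate capitalized "Kindle" at the start of a sentence, so it narrows the damage but does not replace a human reading the result.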
Spelling mistakes, grammatical errors, and stupid comments are intentional.
You could say it's downright medireview.
sic transit gloria mundi
So, this story is definitely an amusing anecdote, but I feel like TFA has the wrong takeaway. The fact is, while this specific issue is obviously e-book related, the overall problem of poor quality, low cost public domain publications is in no way specific to e-books. There have always been low budget publishing houses that print poorly edited, poorly translated versions of public domain works. Spend some time digging around used book sales and you'll find an endless supply of these, most notably from the 60's and 70's.
sed -i s/wand/wang/g Harry\ Potter*
Don't blame me, I voted for Kodos
It's a dangerous world of low cost ebooks out here. Try this. At least, typos are not intentional.
http://archive.org/details/warandpeace030164mbp
There is also that brand new site called Project Gutenberg, look for it.
Unless it is in Russian. Any translation runs the risk of not being "as the author intended".
Starships were meant to fly, Hands up and touch the sky - Nicki Minaj
There is no way to hide an eBook. If you cannot HIDE it, you cannot OWN it.
That which cannot be hidden will eventually be stolen.
They really shouldn't mess with the clbuttics.
:wq
Part of the problem is the grotesque need to put advertisement inside everything we do, because sweet Jebus help me if we can't find some way to squeeze another penny of profit off a dead author's moldering corpse. Sadly, this problem isn't going away any time soon. How about this, separate the "Work of Art" from the annoying bits. Literally have them be distinct and separate objects. Leave the art alone. Do not touch it. Keep your grubby mitts off my masterpiece you heathen. Dork with your part as much as you like... it is after all your part. This is about sloppy data management and publishers need to begin to understand the nature of data. That is, if they intend to sell books in an electronic format. All you publishers, please have a brief but productive conversation with a few software and IT folk about how you manage data integrity, and ensure your product doesn't A) Get stepped on by stupid stuff B) Get corrupted by lack of proper data safeguards.
The rest as they say, is business as usual... please proceed, nothing to see here.
Just more of the same clbuttic errors.
(Hint: "ass" was one of the 13 words.)
" can also mean that readers can never be quite sure that they are viewing the texts as the author intended."
As an owner of a publishing company I can assure you the author's intentions are almost never the highest priority. Having read thousands of unedited manuscripts, many by very well known modern authors, I can say with confidence that you don't want to know what the authors originally pooped out.
Probably one of the best arguments I've ever heard for reading books on tablets. The book usually weighs somewhere from four to six pounds.
I once saw the same issue when a db dump was edited. A user 'bend' was replaced with 'ainsleyj' globally - hilarity ensued.
But soft, what light through yonder Linux breaks?
It is the east, and Juliet is the Oracle(TM).
Arise, fair Oracle(TM), and kill the envious moon,
Who is already sick and pale with grief
That thou, her maid, art far more fair than she
Yes, and you can Search & Replace "dead tree" with "paper" to make sure that readers view text as originally intended.
Has anybody ever been introduced to the wonderful world of the truly dreadful unauthorized variants of canonical texts that were being hacked out while the ink on those texts was barely dry?
.99 public domain cash-ins are largely schlock (Project Gutenberg isn't world-class critical editions; but they do at least tend to be produced by people who give a damn and aren't just grubbing for cash by releasing quick and dirty repackages); but the quality of the low end of the market for printed works has always been pretty dire. At least, these days, we don't generally see physical problems like crap ink, blunt, used type, or horrid paper stock also being inflicted on the readers in the cheap seats.
Actors and/or audience members cobbling their (often surprisingly good; but not good enough) memory of a new work of Shakespeare into a cut-price unauthorized edition, some really trippy stuff in those versions... Hack printers buying first editions and setting blunt type as fast and furious as they could, to get their knockoff on the street before the other guy did... Never mind the various editorial mistakes in subsequent prints, bowdlerizations, etc.
Of course, works that started as oral traditions or assembled-by-committee mashes of existing texts are far worse than even the worst horrors of post-Gutenberg hackery. Oh, and let's not even talk about the dark history of situations where translation has been needed...
There's a whole industry, in academia, of 'critical editions' that are distinguished in no small part by the editor actually giving a damn about the sources drawn from, attempting to provide the most accurate reproduction of the original, essays and footnotes illuminating the process of choosing between manuscript A and manuscript B, and how to transliterate manuscript C's character names, and whatnot.
Sure,
Buttbuttinate. That is all.
Liberty in your lifetime
We have reentered the realm of scribes. Time to apply textual criticism.
The ridiculous fees you pay to get an ISBN for each type of distribution (one # for each hardcover, paperback, epub, .pdf, .html, etc), or new edition of the work should also include registry of a verifier code generated by Secure Hash Algorithm. A SHA verifier would be simple to validate when the work is in an electronic form. $150 and up per ISBN? DAMN, they should do SOMETHING for you other than enter a row in a DB! Unique descriptive domain names don't even cost that much. So what's the point? A: Distributors won't sell it unless you've paid the ISBN tax.
Furthermore, I wonder if the ISBN #s match between the Kindle and Nook versions? If they do match, then it's actually FRAUD. They essentially created a new "Nook" edition...
Every novel should have an MD5 hash....
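A rough Python sketch of what such a verifier could look like, using SHA-256 from the standard library rather than MD5 (MD5 is no longer considered collision-resistant). The idea of a digest registered alongside the ISBN is the proposal above, not something any registry is known to offer, and the names below are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    # Hex SHA-256 digest of the UTF-8 encoded text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def matches_registered(text: str, registered_digest: str) -> bool:
    # True only if the text hashes to exactly the registered digest.
    return fingerprint(text) == registered_digest


reference = "It was as if a light had been kindled in a carved and painted lantern."
ported = reference.replace("kindle", "Nook")

# Hypothetical: the digest the publisher would have registered for the edition.
registered = fingerprint(reference)

print(matches_registered(reference, registered))  # untouched text verifies
print(matches_registered(ported, registered))     # altered text does not
```

A digest like this only says that the bytes differ. Any reformatting, even a changed line ending, produces a different hash, so it would flag the Nook port without saying what was changed or whether the change matters.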
someone did a replace all because they were lazy, on a FREE PUBLIC DOMAIN BOOK
it's a conspiracy, B&N is not allowing any use of the word kindle in any book they sell. in fact if you go to the store you will find EVERY PHYSICAL BOOK they sell will have the word kindle crossed out and Nook written in
How do we know what the author's intentions are, especially for works whose author has been dead for at least 70 years?
This isn't a unique trait of electronic publishing -- exactly the same sort of thing happens all the time in paper-based books. That's why the second edition doesn't match the first; they corrected typesetting errors, the author fixed mis-edited sections, etc.
If you wanted to argue that many of the ported-to-electronic-version-separately-from-original-publication book releases are low-quality I'd agree 100%. Just like many of the ported-to-DVD-from-VHS-release movie releases are low quality. But that's a matter of the amount of effort they put into the conversion process, not a fundamental limitation of the format.
What that means is that your database wasn't normalized properly. In a normalized relational database, the username is stored in only one place, and all other references are through the primary key, which is typically a 32-bit integer userid.
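To make that concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names are made up for illustration; only the usernames come from the anecdote about the edited dump.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        userid   INTEGER PRIMARY KEY,   -- short surrogate key
        username TEXT NOT NULL UNIQUE   -- the name is stored in one place only
    );
    CREATE TABLE posts (
        postid INTEGER PRIMARY KEY,
        userid INTEGER NOT NULL REFERENCES users(userid),
        body   TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO users (userid, username) VALUES (1, 'bend')")
conn.execute(
    "INSERT INTO posts (userid, body) VALUES (1, 'Do not bend the backup tapes.')"
)

# Renaming the user is a single-row update; no text search over the data is needed.
conn.execute("UPDATE users SET username = ? WHERE userid = ?", ("ainsleyj", 1))

row = conn.execute(
    "SELECT u.username, p.body FROM posts p JOIN users u ON u.userid = p.userid"
).fetchone()
print(row)
```

The post body still contains the ordinary word "bend", which is exactly what a global search and replace over a dump would have mangled.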
The iwizard of a previous publisher was dawizardd by a similar problem.
Cheap, crummy ebook conversions with no editorial checking. This has been going on for years, and it will continue to be a problem for the foreseeable future.
A physical book is costly to produce. It's costly to stock and ship them as well. Given those costs, the additional cost of doing a little editing is insignificant. Ebooks, on the other hand, open up new depths of low cost publishing. It's one of those perverse, ironic results. You'd think that cutting down the reproduction and stocking costs of a book would free up money for other tasks, but in fact what happens is that editing, design and promotion become an opportunity for cutting what is now a more significant proportion of expenses.
As ebooks become the dominant form of book reading, the opportunity arises for marginal publishers to publish books with expenses cut to the bone. Eventually the role of publishers as mediators between the author and public will disappear, and authors will hire editors, story development consultants and designers themselves. Or perhaps literary agents will take the place of traditional publishers, becoming full service business management services for authors. In any case, expect a greater proportion of "published" books to be poorly designed and edited.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
There is a Wikipedia article about this issue:
http://en.wikipedia.org/wiki/Scunthorpe_problem
"The problem was named after an incident in 1996 in which AOL's dirty-word filter prevented residents of the town of Scunthorpe, North Lincolnshire, England from creating accounts with AOL, because the town's name contains the substring cunt.[1] Years later, Google's filters apparently made the same mistake, preventing residents from searching for local businesses that included Scunthorpe in their names.[2]"
There is also a stub article about a specific instance of the replacement effect: http://en.wikipedia.org/wiki/Medireview
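One cheap safeguard, sketched in Python below, is to list every word in which the target string appears only as a fragment before replacing or filtering anything, and have a human look at that list. The function name and the sample sentence are made up for illustration.

```python
import re


def embedded_hits(text: str, target: str) -> list[str]:
    # Find every whole word that contains the target (case-insensitive),
    # then keep only the words that are longer than the target itself.
    pattern = re.compile(r"\w*" + re.escape(target) + r"\w*", re.IGNORECASE)
    return [w for w in pattern.findall(text) if w.lower() != target.lower()]


sample = "They really shouldn't mess with the classics, or assassinate a character's name."

print(embedded_hits(sample, "ass"))
print(embedded_hits("a light had been kindled", "kindle"))
print(embedded_hits("Formatted for the Kindle.", "kindle"))
```

An empty list means the target only ever occurs as a standalone word in that text; anything else is a word that a blind substring replace would have damaged.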
"Dead tree version"? Really? Is that kind of asshole-ish snark really justified? If you want to read an Amazon-brand Shakespeare-flavored Licensed Advertisement-Delivery System (tm), go right ahead, but there's no reason to poke fun at actual books, which are significantly less likely to have these kinds of glaring mistakes in them.
I don't respond to AC's.
From (my contribution on) the talk page of the article on Romance Languages:
Can anything be done about the automated censorship of the Dante quotation in footnote 12, which now ends: "nam domus nova et dominus meus lo**censored**ur"? The censored part is a "c" followed by a "u" followed by an "n" followed by a "t"; the original can be found, for example, here: http://www.greatdante.net/texts/vulgari/vulgari.html (chapter XI, paragraph 7).
Apparently, their Automated Puritan can pull lady parts out of the middle of a Latin word.
"Superior Formatting Publishing"'s web site is broken. It consists mostly of "Whoops, looks like there was a problem get the book data from Amazon. Please try again in a moment" and "Amazon API error". Plus a Kindle ad. And "All of our e-books are formatted specifically for the Kindle by an expert in formatting online content using only raw code."
If they're doing this to copyrighted works, aren't they violating the copyright by making unauthorized modifications and then distributing them?
Publishing houses are unfathomably bad at editorial workflow. Consider all the official, licensed ebooks with OCR problems. The publishers didn't have a soft copy of their own books. Staggering.
Now consider that managing the editorial workflow is their only value add, and ask yourself if there's a way to short stock on the publishing industry. Direct to consumer can't come soon enough.
The first fail here is paying for an epub copy of War and Peace. Only a shoddy questionable "publishing" company would make you pay for a digital copy of a book that is in the public domain.
It took me 2 seconds to find the book on gutenberg.
http://www.gutenberg.org/ebooks/2600
I remember seeing this same sort of thing in hardcopy in the AD&D "Encyclopedia Magica". There were dozens of places where it described characters taking e.g. 2d6+1 points of dawizard.
You do realize that you can actually post the word "nigga" on slashdot, right?
apparently AKabral is one of many avatars of Ironyman.
oh, and the word being referred to is nigger
My favorite still has to be the newspaper story about the Enola Homosexual that dropped an atom bomb on Hiroshima.
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
problem solved, you're welcome.
What's disgusting and sad is how much a little human proofreading would have prevented, or at least lessened, this sort of thing. I get disgusted when I see these sorts of errors in modern day print, where all it would have taken is a few minutes that these companies don't want to pay for. When are these companies going to realize that when you take the human factor out of the equation, what you are left with is garbage?
Even if you don't ever plan to change a username, a username makes a poor primary key just for performance reasons. In MySQL, for example, primary keys should be kept short because every index will have a copy of every primary key. If your primary key is userid, only the table itself and the index on username will have usernames in it. But if your primary key is username, every index will have usernames in it, as will other tables. Given the long usernames that are possible in popular web applications like MediaWiki (200+ characters = 600+ bytes of UTF-8, compared to 4 bytes for a userid), I can't see any reason to make the username a primary key.
We had a similar experience with a 99-cent Jack London e-book from B&N, White Fang, in which a character's name was replaced throughout with "Barnes and Noblea" - made for interesting bed-time reading for our kids.
You are cordially invited to dine at my estate to discuss this matter. Please dress appropriately, it will be an African-American tie dinner.
If someone browses at -1, they will notice that you definitely can post 'nigger' on Slashdot.
I listen to both RIAA and non-RIAA stuff if I like the music, tangential business/politics notwithstanding.
Never heard of it
“He’s not deformed, he’s just drunk!”
You'd think that cutting down the reproduction and stocking costs of a book would free up money for other tasks, but in fact what happens is that editing, design and promotion become an opportunity for cutting what is now a more significant proportion of expenses.
Right. That's what happened to newspapers. Newspaper production used to require a huge labor force. Look at all those people. 67 linotypes! A room full of proofreaders to catch typesetting errors. Hundreds of people moving paper around, making printing plates, loading them onto presses, running the presses, handling the printed newspapers. Compared to the army needed to print the papers, the reporting staff was tiny, a small expense. The reporting and editing staff, the composing room, and the printing plant were all in the same building. Any separation would slow things down, and the competition would "scoop" them.
Now compare a modern large newspaper plant. There are people around, but not many. There's essentially no direct labor. All paper and plate handling is mechanized. The files to be printed are created elsewhere and come in over a data connection. The printed newspapers leave in big trucks. Many different papers are printed in the same plant. The plant is far from the reporting and editorial staff, and is run by a separate corporation from the "newspaper".
So, to newspaper management, reporters are now the big labor cost, the first thing to cut.
Obviously this is a serious issue with classic works or anything that was first printed. But it is becoming more popular for authors to bypass the printing and publishing stages and simply release their works as eBooks without the use of a publisher or distributor. These are the books I would prefer to buy and put on my Nook.
Personally I believe that a classic should be read in a classical way (on paper).
A simple lesson from Software Development which I believe would make the world a better place if applied to other fields:
Always review your diffs before pushing upstream!
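For a book, that review could be as simple as the Python sketch below, which uses difflib from the standard library to show only the lines a port actually changed. The file names and the sample lines are made up, and the replacement step is a guess at what the publisher ran.

```python
import difflib

# Hypothetical excerpt of the Kindle edition: two lines of the novel plus one ad line.
original = [
    "It was as if a light had been kindled",
    "in a carved and painted lantern.",
    "Formatted for the Kindle.",
]
# The blind port: every "kindle"/"Kindle" becomes "Nook".
ported = [line.replace("kindle", "Nook").replace("Kindle", "Nook") for line in original]


def changed_lines(before: list[str], after: list[str]) -> list[str]:
    # Unified diff with the file headers, hunk markers and context lines dropped.
    diff = difflib.unified_diff(
        before, after, fromfile="kindle.txt", tofile="nook.txt", lineterm=""
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


for line in changed_lines(original, ported):
    print(line)
```

The change to the ad line is the one that was wanted; the change inside the novel's text is the one a thirty-second look at the diff would have caught.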