Slashdot Mirror


XML for Ancients

Andrew writes: "More than 5,000 years ago, the very first information revolution occurred when some unknown research team in Mesopotamia found a way to download and store language through a killer application called "writing." The cuneiform digital library will have 60,000 texts ready in a couple of years. Using SVG and XML to represent their documents. Similar efforts are underway for hieroglyphics."

34 of 118 comments

  1. Slightly off topic..... by MisterPo · · Score: 3, Interesting

    I have been working in IT since 1997. Yeah, I know, a mere blink of an eye for some Unix Wizards (i.e. beards, strange clothing and their own arcane language). What I have noticed is that every year my handwriting has been getting progressively worse. What with my PDA, laptop, PCs etc., I just have no need to wield a pen no more :)

    Apart from signing my name on credit card chits, the only time I am required to write is for birthday/Christmas and other assorted cards. It's getting so bad now that I start to write a long word and just give up. My once pristine handwriting now looks like a doctor's prescription scrawl.

    Anyone else get this too?

    Po

  2. Is access going to be free? by A+Commentor · · Score: 2, Insightful

    Site appears to be slash-dotted already...

    So.. Are these 5000 year old documents going to be freely available or will the database of texts be copyrighted/restricted?

    --

    Looking for any old 8-bit Heathkit/Zenith software/hardware - http://heathkit.garlanger.com

    1. Re:Is access going to be free? by IHateEverybody · · Score: 2


      Why do I always see people saying things like: "Slashdotted already! What a pity... It should have been cached."

      But when I click on the link anyway, the site loads with no problem. This is the rule not the exception. The amount of times I can't get to a link from slashdot is surprisingly low.

      That's because those people are the ones who do the actual slashdotting. Usually by the time normal people like you and me click on the link, somebody at the other end has noticed that their site is down due to a DBS (Denial by Slashdot) attack and has set up a couple of mirrors that future requests can be redirected to. After all, it's not like somebody would lie about a thing like that.

      --
      Does this .sig make my butt look big?
  3. Will we have to revise unicode? by Ukab+the+Great · · Score: 2, Interesting

    With all these ancient language/hieroglyphic texts being archived, I have a feeling that we'll be hitting that 65536 character wall very shortly, since someone in the future might need that Cuneiform version of M$ Word (hey, it could happen). Is it time for UTF-32?

    1. Re:Will we have to revise unicode? by dvdeug · · Score: 2

      We've always had UTF-32. Due to some hacks in UTF-16, Unicode can include up to a million characters, more than anyone anticipates needing. Cuneiform has already been (very) tentatively allocated to U+12800-U+12C80. Apparently, no one has come up with a complete proposal for including cuneiform, though.
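
      For what it's worth, the UTF-16 "hack" is just arithmetic on the code point. A minimal sketch in Python, taking U+12800 from above purely as a sample value (the function name to_surrogates is made up for illustration):

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x12800)
print(f"U+12800 -> {high:04X} {low:04X}")

# Cross-check against Python's built-in UTF-16 codec (big-endian, no BOM).
print(chr(0x12800).encode("utf-16-be").hex())
```

      Two 16-bit units carry 10 bits each, which is where the extra million or so code points come from.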

    2. Re:Will we have to revise unicode? by hwilker · · Score: 2, Informative
      See "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations" for more information on this. It argues that even Unicode 3.1 will not contain enough characters for just East Asian languages, never mind dead, Middle Asian ones.

      The main reason seems to be that in East Asia, there are reduced character sets in daily use which contain only a couple of hundred or thousand glyphs, but to read and study classical texts, the number required quickly goes up into the tens of thousands, for each of a number of languages. Not having these glyphs in the Unicode set would be like asking English-speakers to use alphabets reduced by five or six characters (M and N are similar, X, Q, C and Z could be replaced by one character as well) and dictionaries from which three out of four words have been deleted due to redundancy or age.

      The reason for this mis-design, the article argues, is political: the nationalities in question have never been asked how many characters they would need together -- for each single language, Chinese, Korean, or Japanese, a scholar would say "Sure! 50,000 characters is enough for us!"

      --
      -- H. Wilker
    3. Re:Will we have to revise unicode? by iabervon · · Score: 2

      There are Unicode character sets in the 32-bit range; the first 16 bits are only supposed to be used for current languages in active use. So cuneiform, along with Linear B, runic, and possibly Tolkien's runes (and, unofficially, Klingon), will probably end up in the 0x1xxxx range.

      UTF-8 is actually perfectly sufficient for 32-bit characters. (And you meant UCS-32; UTF-n is an n-bit/character encoding of >n-bit characters, while UCS-n is the n-bit character set).
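
      To show how UTF-8 reaches past 16 bits, here is a rough sketch of the bit layout in Python. The name utf8_encode is hypothetical, and the sketch stops at four bytes (U+10FFFF), which is as far as Python's own codec goes; the longer five- and six-byte forms of the original design are left out:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point with the UTF-8 bit layout (one to four bytes).

    Sketch only: it does not reject surrogate code points (U+D800..U+DFFF).
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:                       # 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 16 bits -> 1110xxxx + 2 tails
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 21 bits -> 11110xxx + 3 tails
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("beyond U+10FFFF; longer forms not sketched here")


for cp in (0x41, 0x20AC, 0x12800):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```

      Anything in the 0x1xxxx range simply costs four bytes instead of one to three.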

    4. Re:Will we have to revise unicode? by Apotsy · · Score: 2
      That article is complete crap. I can't believe anyone takes it seriously.

      The author of that article doesn't seem to understand the fact that Unicode is a character set, not a font. He also doesn't seem to understand how Unicode's surrogate pairs work (which allow for encoding of more than 1 million characters). He doesn't seem to understand that Unicode is an evolving standard (i.e., 3.1 is hardly the final version). And he doesn't seem to understand that UTF-8, UTF-16, UTF-32, etc. are all just different formats, and they actually represent the exact same character set.

      But most importantly, he is flat-out wrong about how and why the decisions were made regarding encoding of East Asian languages. He needs to learn about the history of Han unification for CJK characters. If he did, he would know that linguists and computer scientists from East Asian countries have been involved in Unicode since the beginning. The unification of East Asian characters was done on purpose, and has the full support of linguists, scholars, and computer scientists from those countries.

      If the author of that article had just spent a few minutes reading a copy of The Unicode Standard, he would not have made those mistakes. He didn't even have to read the whole thing! Just the Introduction and Appendix A would have set him straight on the issues I just mentioned. The fact that he didn't means this guy really shouldn't be doing work for a company with the word "Research" in the title.

      Oh, and even though that page says the article has not been modified since June 4, you can see from the Google cache that they have since removed their promise of responding to criticism.

      And one more thing: Since he derides those mean old Westerners on the Unicode committee for being insensitive towards the peoples of East Asian countries, perhaps he should ask himself if it is considered impolite or insensitive to sweepingly refer to such peoples as "Oriental", which he does in the first few paragraphs.
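
      The point above about UTF-8, UTF-16 and UTF-32 being different formats for the same character set is easy to check. A small sketch using Python's built-in codecs (round_trip is a made-up helper name, and U+12800 is only a sample code point outside the first 16 bits):

```python
def round_trip(code_point: int) -> dict:
    """Encode one code point three ways and decode each result back."""
    results = {}
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        encoded = chr(code_point).encode(codec)
        decoded = ord(encoded.decode(codec))
        results[codec] = (encoded, decoded)
    return results


for codec, (encoded, decoded) in round_trip(0x12800).items():
    print(f"{codec:10} {encoded.hex():10} -> U+{decoded:04X}")
```

      The bytes differ per format, but every one of them decodes back to the same code point.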

  4. XML, Writing and Jabber by Jucius+Maximus · · Score: 2, Interesting
    "Using SVG and XML to represent their documents. Similar efforts are underway for hieroglyphics."

    They're using XML? They could integrate this with some sort of retrieval language and couple it with Jabber clients. That way you could send some sort of command-line search/retrieval command to the database using a regular Jabber client and have the XML data sent back, since Jabber natively supports the standard.

    1. Re:XML, Writing and Jabber by instinctdesign · · Score: 2, Informative
      Well, the best answer to this one was provided on the hieroglyphics page. I'm not sure if it was slashdotted after I got it (the UCLA one was down just after the first comment was posted) so I'll post the majority here.

      XML is a format which allows one both to describe an encoding and to write encoded files. It was chosen for a number of reasons. First, it's easy to extend an XML format. Second, it's easy to parse an XML file, and there are a lot of tools for it: people will be able to manipulate XMLMCD files without being graduates in Computer Science. Third, XML is being used for a growing number of applications --- for instance web browsers. Fourth, there's a user community for XML in the philological world: two interesting examples are the Text Encoding Initiative and the recent conference on XML and Ancient Near East.
      --
      forma3
  5. ... by evel+aka+matt · · Score: 4, Funny

    How Snowcrash.

  6. It appears that... by Teancom · · Score: 5, Funny

    they are also writing their tcp packets on clay tablets, and attempting to send them down the wire. That was the quickest /.'ing I've *ever* seen.

  7. Cuneiform writing by Alien54 · · Score: 5, Informative
    Slashed already

    [smile]

    Scientific American has this article on Information Technology, 2500 B.C. on what life was like for the information worker of that day.

    As many as half a million cuneiform tablets, hand size up to book-page size, are now available around the world. Surely many more are waiting to be found. Those samples are of every quality: once prized accounts and receipts, schoolboys' lessons, litigation profound or droll, literary essays, erotica, mathematics--and entire ancient epics, centuries older than Father Abraham's. A mostly unread treasury, comprising the equivalent of tens of thousands of large printed volumes.

    Looks like there could be a lot of fun and good stuff there.

    --
    "It is a greater offense to steal men's labor, than their clothes"
  8. First case of poor infrastructure planning... by gregwbrooks · · Score: 5, Funny
    "640 clay tablets is enough for anyone!"


    -- William "Scorpion King" Gates

    --


    "It was a summer's tale: Just a boy, his Linux, and a head full of dreams..."
  9. Wow by Anonymous Coward · · Score: 2, Funny
    "More than 5,000 years ago, the very first information revolution occurred when some unknown research team in Mesopotamia found a way to download and store language through a killer application called "writing.". The cuneiform digital library will have 60,000 texts ready in a couple of years. Using SVG and XML to represent their documents.


    Sooo... this project has been going on for about 5,000 years, they're finally going to be making a large release in a few years, and we're *JUST NOW* hearing about this?

    My *god*, talk about keeping the PR lid on tight!
  10. Actually... by recursiv · · Score: 4, Interesting

    Unicode is often referred to as a 16-bit system, which would allow for only 65,536 characters, but by reserving some code points for mapping into additional 16-bit planes, it has the potential to cope with over one million unique characters.

    The current version (3.1) of the Unicode Standard, developed by the Unicode Consortium, assigns a unique identifier to each of 94,140 characters.
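
    The "over one million" figure falls out of the size of the reserved ranges. A back-of-the-envelope sketch in Python (the constant names are mine; the ranges are the standard surrogate blocks U+D800..U+DBFF and U+DC00..U+DFFF):

```python
BMP_SIZE = 0x10000                  # 65,536 values addressable with 16 bits
HIGH_SURROGATES = 0xDC00 - 0xD800   # 1,024 reserved "high" values
LOW_SURROGATES = 0xE000 - 0xDC00    # 1,024 reserved "low" values

# Every (high, low) pair names one code point beyond the first 16 bits.
supplementary = HIGH_SURROGATES * LOW_SURROGATES
planes = supplementary // BMP_SIZE  # additional 16-bit planes
total = BMP_SIZE + supplementary    # whole code space, U+0000..U+10FFFF

print(f"{supplementary:,} supplementary code points in {planes} planes")
print(f"{total:,} code points in total")
```

    Not every code point is usable as a character (the surrogate values themselves are not), so the real ceiling is a little lower.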

    --
    I used to bulls-eye womp-rats in my pants
  11. XML is a poor choice for cuneiform by Waffle+Iron · · Score: 5, Funny

    IIRC, cuneiform writing is composed entirely of angle brackets. To write this in XML, every character is going to have to be escaped!

  12. all bound for mu-mu land by rfsayre · · Score: 4, Insightful

    "justified.dtd" >

    The cuneiforms are justified and ancient.
    and well formed.

    XML is gonna rock you.

  13. XML Hieroglyphics by darkov · · Score: 2, Funny

    I believe the ancient Egyptians avoided using XML at the time because of concerns over RAND licensing and preferred the patent-free ideograms.

    No, really.

  14. Should story links also have [url] notation? by KNicolson · · Score: 2, Funny

    I was worried I might end up here instead...

  15. XML Overrated? by ffatTony · · Score: 2

    Correct me if I'm wrong, but what is XML doing that some homegrown solution couldn't? Obviously clients would have to know the protocol, but with XML that is also the case.

    I use XML all the time, mainly because of XSLT, but I think it's less functional and more hype. Feel free to enlighten me.

    1. Re:XML Overrated? by ukryule · · Score: 2, Interesting
      When you're coding up ancient writing, you want to store much more information about each character or word than with normal text (colour, angle, depth etc.). XML is quite good at storing these attributes, so it makes sense to use it.

      Taking a quote from the hieroglyphics link (can't comment on the cuneiform link as it's /.ed):

      Let's illustrate these points. In the current MCD, data about an individual sign is scattered around it. Look for example at :

      =A1\\r1 -i

      It means "Sign Gardiner A1", as both grammatical and word ending, reversed, rotated. Fine positional data, colour data, and more are hard to add. On the other hand, the current proposal would represent the same sequence as

      <hieroglyph code="A1" gramend="y" wordend="y" rot="90" reversed="y">
      <hieroglyph code="i">

      Of course, as with any use of XML, you could do it with a 'homegrown' solution, but the point is that using XML gives you a well known (and well supported) framework which everyone can standardise on. (And yes I know the XML in the example is malformed ...)
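
      As a rough illustration of the "well supported framework" point, here is a sketch that reads a well-formed variant of the snippet above with Python's standard library. The self-closing tags, the <text> wrapper element and the read_signs helper are my assumptions, not part of the proposal:

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the quoted example; <text> is a hypothetical root.
DOC = """<text>
  <hieroglyph code="A1" gramend="y" wordend="y" rot="90" reversed="y"/>
  <hieroglyph code="i"/>
</text>"""


def read_signs(xml_text):
    """Return one attribute dictionary per <hieroglyph> element, in order."""
    root = ET.fromstring(xml_text)
    return [dict(sign.attrib) for sign in root.iter("hieroglyph")]


for sign in read_signs(DOC):
    print(sign)
```

      Adding colour or fine positional data would then just be more attributes, with no change to the parsing code.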
  16. Protocol implementation by xant · · Score: 2, Informative

    Clients would have to know and implement the protocol. But since XML always looks the same, implementing the protocol is just a matter of linking the standard XML library in the language of your choice and using the DTD to decide what you want your client to understand.

    There's other advantages, but that's a big one.

    --
    It's rare that you're presented with a knob whose only two positions are Make History and Flee Your Glorious Destiny.
  17. Missing marks by os2fan · · Score: 2
    Unfortunately, the documents must be transcribed, which means that we may well miss out on the doodles and other things that get written with writing.

    Consider, for example, the carry dots that some people use to add up numbers. Dots and things like that in the text may well uncover the way that calculations were done.

    --
    OS/2 - because choice is a terrible thing to waste.
    1. Re:Missing marks by rodentia · · Score: 2

      Considering that these materials were typically baked or kiln-fired to ensure permanency, it is unlikely that there is much in the way of doodles and annotation. Such ephemera were lost with the next rains.

      Interestingly, the developers of cuneiform also developed the first envelopes. The main message was kiln-fired and then wrapped with a new layer of clay, the address incised and the result merely air-dried. The recipient then gave the lot a crack against a nearby stone and brushed away the *envelope* to read his mail.

      --
      illegitimii non ingravare
  18. Who supports SVG? by roystgnr · · Score: 2

    I haven't looked in almost a year now, but the last time I did, there was an alpha (rendered lots of graphics correctly, lots incorrectly) patch for Mozilla and no SVG support for IE or any other browser. Did everybody catch up while I wasn't looking?

  19. XML for Ancients? by brad-d · · Score: 5, Funny

    All I can think of now is the new book series:

    "XML for Mummies"

    At least in this case when you see the reviews "this book will put you to sleep" it really doesn't matter.

    --
    -Brad
  20. ICE by zephc · · Score: 2, Troll

    the xml.org link for cuneiform encoding initiative is at http://www.jhu.edu/ice/

    There is an initiative for almost every ancient language that is known (and decipherable). I'm sure digging thru xml.org will turn up a bounty of results =]

    --
    "I would say that 99 per cent of what my father has written about his own life is false." - L. Ron Hubbard Jr.
  21. Copyright is 70 years on books 8) by da5idnetlimit.com · · Score: 2, Informative

    So a 5000-year-old original text should be no problem.

    The case will happen if you ask for the translation (What, you are not Cuneiform literate? Talk about education 8)

    --
    It takes 40+ muscles to frown, but only four to extend your arm and bitchslap the motherfucker
  22. Sonny Bono Act by yerricde · · Score: 2

    Copyright is 70 years on books

    No, 95 years on all works first published on or after January 1, 1923. See also Sonny Bono Copyright Term Extension Act. And it'll get even longer before 2020 as Di$ney frantically bribes Congre$$ to pass yet another corporate-welfare copyright extension.

    The case will happen if you ask for the translation

    ...even into XML.

    --
    Will I retire or break 10K?
  23. English works just fine with only 18 letters by yerricde · · Score: 2

    Not having these glyphs in the Unicode set would be like asking English-speakers to use alphabets reduced by five or six characters (M and N are similar, X, Q, C and Z could be replaced by one character as well)

    Spelling reform. China (outside Taiwan) has had it. It's perfectly possible to write English with only 18 letters.

    and dictionaries from which three out of four words have been deleted due to redundancy or age

    So? Desk dictionaries aren't nearly as comprehensive as the Oxford English Dictionary or even the unabridged Webster's Third New International Dictionary.

    --
    Will I retire or break 10K?
  24. How Modern by rakerman · · Score: 2

    Then I can write a washing bill in Babylonic cuneiform

    1. Re:How Modern by bartle · · Score: 2

      Then I can write a washing bill in Babylonic cuneiform

      But it still won't help you learn about Caractacus's uniform. You've got to keep these things in perspective.

  25. Obvious answer by TheInternet · · Score: 2

    What are the earliest statements they've found? What do they say?

    "First post"

    - Scott

    --
    Scott Stevenson
    Tree House Ideas