Slashdot Mirror


Independent Data and Formatting with Microformats

IdaAshley writes to tell us IBM DeveloperWorks is running an article about how to best utilize microformats to embed data within standard XHTML code. From the article: "Microformats are a pragmatic approach to solving the issue of structured data on the Web. Is it as architecturally pure as XML-encoded data separated from its formatting through a mechanism such as XSLT style sheets? No. But I think this approach is a realistic middle step that will help build a more intelligent Web that is easier to use and provides better search and data integration."

15 of 99 comments

  1. Geez, man... by Chysn · · Score: 3, Insightful

    Some of us have been doing this for YEARS. At least now we have a buzzword for it.

    --
    --I'm so big, my sig has its own sig.
    -- See?
    1. Re:Geez, man... by frisket · · Score: 3, Insightful
      Some of us have been doing this for YEARS. At least now we have a buzzword for it.

      There is already a buzzword: tag abuse. It's the last resort of the untalented.

      This particular version is known as semantic imputation (giving things meanings they don't inherently have). It's neither new, special, exciting, nor useful, but at least we now know how little the people at IBM and Leverage Software know about markup and XML.

      I guess I'd better add a warning to the XML FAQ about it...

  2. LISP by Anonymous Coward · · Score: 5, Insightful

    I'm sure the LISP community would love to hear about this brand-new idea of embedding special, or domain-specific if you will, languages and data. How extraordinarily novel.

    You'll be running a limited LISP implementation on every browser in no time!

  3. Standardization is the problem by Anonymous Coward · · Score: 5, Insightful

    This suffers from the same thing XML did. Remember when XML was going to revolutionize communication between computers by structuring everything consistently? Then <lname> tripped over <lastname> which was crawling on the floor after being decked by <name last="Henry"/> who was rather pissed off after an argument with <name><last>Henry</last></name> and the whole thing went down in a pile of flames and is now relegated to being a 2MB configuration parsing library to embrace and extend "option=value".

    So now why is this "vevent" class special, and who decided it would be "vevent" and not "scheduledevent" or "calendarevent" or "microsoftcalendarhassomethingforyoutodotoday"? Clearly as a human I can look at "dtstart" and think about it and realize that this means the starting date, but how does a computer know this? If the "semantic web" is going to take off, then we need semantics, and pronto.

    Hopefully any standardization doesn't turn into a nightmare though. I used to develop in the healthcare insurance claims field, and the old NSF format for transmitting an insurance claim electronically was a horrible death-by-committee piece of work. It was as if nobody could come to a consensus and the committee decided to just throw everything in. You might look at your insurance card and think "gee, I have an insurance ID number" but no, in the NSF, there were about 10 different blanks for insurance IDs, depending. Is it a Medicare number? Then it goes in the Medicare blank. God forbid the computer would have just one blank and assume that if you're billing Medicare then the number in the blank is probably a Medicare ID. Medicare was easy, there's just one. Medicaid in most states has a billion subcontractors, all with names that have nothing to do with "Medicaid", so you simply had to maintain a magic list of insurance plans that changed every other year or so that used the Medicaid ID field. Or the separate fields for Blue Cross and Blue Shield. What about the states where you have BCBS as a single entity?

    Anyway, I'm digressing (and ranting about a chunk of my life I'd much rather forget). What's important in standardizing in semantics is identifying everywhere where things are identical and reusing semantics whenever possible. Decisions have to be made up front as to what is the relationship between "name" and "last name" (people have a name, which has a last name, yet companies have names that typically don't have a last name. What about a cat named "John K. Wibblesworth"; how is that different from one named "Tama"?) Yet, take dtstart which is used here for a calendar event. Should we have "dtclassstart" for the first day of school?

    1. Re:Standardization is the problem by Bogtha · · Score: 4, Insightful

      Remember when XML was going to revolutionize communication between computers by structuring everything consistently?

      No. I do remember how a lot of clueless PHB-types ran around telling everybody that though. XML solves the parsing problem, not the semantics problem. It's languages built on top of XML that handle semantics.

      XML was never meant to solve the problem you are talking about. Parsing markup into a tree is a totally different concept to figuring out what the stuff in the tree means. The only people who ever thought XML had something to do with what you say were totally clueless about XML.
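
      To make the parsing-versus-meaning split concrete, here is a rough Python sketch (my own illustration, not anything from the article). It uses the standard xml.etree.ElementTree module on the four spellings of a last name quoted in this thread. The parser handles all four without complaint; the last_name() helper is hypothetical and stands in for the agreed vocabulary that has to be layered on top.

```python
import xml.etree.ElementTree as ET

# Four well-formed ways to say "the last name is Henry", as quoted in this thread.
VARIANTS = [
    '<lname>Henry</lname>',
    '<lastname>Henry</lastname>',
    '<name last="Henry"/>',
    '<name><last>Henry</last></name>',
]

def last_name(xml_text):
    """Hypothetical per-vocabulary lookup: the XML parser cannot supply this part."""
    root = ET.fromstring(xml_text)  # parsing is identical for every variant
    if root.tag in ("lname", "lastname"):
        return root.text
    if root.tag == "name":
        if "last" in root.attrib:
            return root.attrib["last"]
        child = root.find("last")
        if child is not None:
            return child.text
    return None  # a vocabulary nobody told us about

for doc in VARIANTS:
    print(ET.fromstring(doc).tag, "->", last_name(doc))
```

      Every branch in last_name() is knowledge that came from a human agreement about the vocabulary, which is the part XML itself never claimed to provide.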

      So now why is this "vevent" class special, and who decided it would be "vevent" and not "scheduledevent" or "calendarevent" or "microsoftcalendarhassomethingforyoutodotoday"?

      It's special because it appears in the hCalendar specification. The people who wrote the specification decided it would be "vevent". They intend to submit it to a standards body.

      --
      Bogtha Bogtha Bogtha
    2. Re:Standardization is the problem by stonecypher · · Score: 3, Insightful

      This suffers from the same thing XML did. Remember when XML was going to revolutionize communication between computers by structuring everything consistently?

      Yeah. It works when you use the same DTD, which was the promise. It's not XML's fault that you and your supplier can't get your ducks in a row. The purpose of XML is to provide a medium that two ends can use to standardize a communications format of their own design, while giving a regular form to said formats so that arbitrary formats could be supported by arbitrary tools. It fulfills this ideal quite well, as anyone even vaguely familiar with web standards knows. It is not meant to magically merge two inconsistent standards.

      Then <lname> tripped over <lastname> which was crawling on the floor after being decked by <name last="Henry"/> who was rather pissed off after an argument with <name><last>Henry</last></name>

      Yeah. And that's XML's fault how? Get a DTD and stick to it.

      and the whole thing went down in a pile of flames

      Yeah, essentially every office suite, database, most graphics editors, many layout programs, and quite a few games support XML. Jabber / Google Chat run on XML. The web is built on an SGML dialect, which is largely being converted into an XML dialect; XML is itself an SGML dialect. Web 2.0 (god I hate that name) is an outcropping of XML's parsability. XML is so useful that Microsoft was able to use it to ward Massachusetts' lawsuits off. The United Nations now releases their transcripts solely in XML. XML is now the second most pervasive data storage format on earth, after CSV/TSV, and it's gaining fast. (Don't bother saying SQL - it's an API, not a storage format.)

      Exactly what is your definition of "going down in flames" ?

      and the whole thing went down in a pile of flames and is now relegated to being a 2MB configuration parsing library to embrace and extend "option=value".

      Uh, TinyXML has a footprint of 40k, champ. Also, that's not what "embrace and extend" means.

      So now why is this "vevent" class special, and who decided it would be "vevent" and not "scheduledevent" or "calendarevent" or "microsoftcalendarhassomethingforyoutodotoday"?

      What a surprise, the guy who couldn't standardize on a DTD now fails to understand other format standardizations. Read the article, champ. It's not Slashdot's job to read for you, and this one's honestly pretty simple. Indeed, the specific purpose of microformats is to address your whining, but you don't see the point. Cough.

      Clearly as a human I can look at "dtstart" and think about it and realize that this means the starting date, but how does a computer know this?

      Er, by supporting a specific microformat. Are you putting in effort to be dense? It's the same way they support iCal, or MS Word files, or in fact any format at all, ever.

      If the "semantic web" is going to take off, then we need semantics, and pronto.

      This has nothing to do with the semantic web. You want to drop another? Web Ontology Language sounds important too. Use that one more often: fewer people will see through you.

      God forbid the computer would have just one blank and assume that if you're billing Medicare then the number in the blank is probably a Medicare ID.

      Yes, I'm sure the people billing Medicare who aren't using Medicare IDs will be greatly amused that your application just fails for them. Why is it that I don't believe you had much to do with the design of the system?

      What's important in standardizing in semantics is identifying everywhere where things are identical and reusing semantics whenever possible.

      "Semantics" aren't reusable. They're not arbitrarily applied. Please stop using words you fail to understand. Not every markup of data is semantic, even if the markup means something. Semantics are the work of understanding context, not identifying relationships. Telling the difference between two kinds of ID code isn't semantic. Telling the difference between bug (insect) and bug (Volkswagen), however, is.

      --
      StoneCypher is Full of BS
    3. Re:Standardization is the problem by grcumb · · Score: 2, Insightful
      " Then <lname> tripped over <lastname> which was crawling on the floor after being decked by <name last="Henry"/> who was rather pissed off after an argument with <name><last>Henry</last></name> "
      "Yeah. And that's XML's fault how? Get a DTD and stick to it."

      Well, actually, schema and RDF were supposed to address exactly that issue. So, in the opinion of the W3C, at least, it appears 'Get a DTD and stick to it' isn't the complete answer.

      But that's a simplistic retort. The truth is that there are many cases (especially when individual business-to-business transactions are concerned) where 'Get a DTD and stick to it' is probably the right answer. It's simpler, if nothing else.

      That's not the end of the conversation, though. There are a number of cases where future communications and permutations simply can't be known, and in situations like that, the option of sticking to a single DTD simply doesn't exist. In theory at least, schema and RDF supply the means to handle semantic translation of data.

      '"Semantics" aren't reusable. They're not arbitrarily applied. Please stop using words you fail to understand. Not every markup of data is semantic, even if the markup means something. Semantics are the work of understanding context, not identifying relationships. Telling the difference between two kinds of ID code isn't semantic. Telling the difference between bug (insect) and bug (Volkswagen), however, is.'

      That may be true, but I remember very clearly listening to Tim Berners-Lee introducing the Semantic Web in Toronto back in '99, and the example he used of how the Semantic Web would work showed A being determined to be semantically the same as C because A and B were known to be equivalent, and B and C were known to be equivalent as well. So while it's technically correct to say that semantics has nothing to do with translation, the promise of the Semantic Web is that one is able to translate between ad hoc data types precisely because their semantics can be inferred.
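
      The A = B, B = C, therefore A = C step can be sketched in a few lines of Python. This is only my own toy illustration using a union-find table; it is not how RDF tooling actually works, and the term "family-name" is a made-up third vocabulary.

```python
def find(parent, term):
    """Return the representative of the equivalence class that contains term."""
    parent.setdefault(term, term)
    while parent[term] != term:
        parent[term] = parent[parent[term]]  # shorten the chain as we walk it
        term = parent[term]
    return term

def declare_equivalent(parent, a, b):
    """Record a single known equivalence between two terms."""
    parent[find(parent, a)] = find(parent, b)

def same_meaning(parent, a, b):
    """True when a and b are linked by any chain of declared equivalences."""
    return find(parent, a) == find(parent, b)

known = {}
declare_equivalent(known, "lname", "lastname")        # A is equivalent to B
declare_equivalent(known, "lastname", "family-name")  # B is equivalent to C (hypothetical term)

print(same_meaning(known, "lname", "family-name"))  # A and C were never declared directly
print(same_meaning(known, "lname", "dtstart"))      # unrelated terms stay unrelated
```

      Nobody ever states that "lname" and "family-name" match; the link falls out of the two declarations that were made, which is roughly the inference being promised.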

      I won't comment on the effectiveness of schema and RDF in practice. Suffice it to say that no one's found many compelling (or at least popular) uses for either so far. That said, we still don't take advantage of much of HTML and CSS, so the problem may be PEBCAK (or just impatience) rather than poor design.

      --
      Crumb's Corollary: Never bring a knife to a bun fight.
    4. Re:Standardization is the problem by Anonymous Coward · · Score: 2, Insightful
      XML solves the parsing problem, not the semantics problem.

      What parsing problem? Parsing is one of the most well-understood areas of computer science. Any comp. sci. graduate should be able to knock up a simple recursive descent parser, and there are dozens of good parser generators out there. It is the lack of semantics that makes XML little better than plain text — all the hard problems are left to applications.
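
      For the sake of argument, a recursive descent parser for a toy subset of XML (elements and text only; no attributes, entities, comments or self-closing tags, which is my simplification) fits in a few dozen lines of Python:

```python
import re

# One token is an opening tag, a closing tag, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)>|([^<]+)")

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        closing, name, text = m.groups()
        if text is not None:
            tokens.append(("text", text))
        else:
            tokens.append(("close" if closing else "open", name))
        pos = m.end()
    return tokens

def parse_element(tokens, i):
    """element := OPEN(name) (element | TEXT)* CLOSE(name)"""
    kind, name = tokens[i]
    if kind != "open":
        raise ValueError(f"expected an opening tag, got {tokens[i]}")
    i += 1
    children = []
    while i < len(tokens) and tokens[i][0] != "close":
        if tokens[i][0] == "text":
            children.append(tokens[i][1])
            i += 1
        else:
            child, i = parse_element(tokens, i)  # recurse into the nested element
            children.append(child)
    if i >= len(tokens) or tokens[i][1] != name:
        raise ValueError(f"<{name}> is not closed properly")
    return (name, children), i + 1

def parse(src):
    tokens = tokenize(src)
    if not tokens:
        raise ValueError("empty document")
    tree, end = parse_element(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing content after the root element")
    return tree

print(parse("<name><last>Henry</last></name>"))
```

      The result is a tree of (name, children) tuples and nothing more. Deciding what "last" means is still left to the application.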

  4. Re:Tagging in Text by cdcarter · · Score: 2, Insightful

    The difference between this and text tagging is that this has a set structure.

    --
    "Love is like a trampoline, first it's like "SWEET!!" then it's like *BLAMM!*"
  5. I don't get it... by grumbel · · Score: 4, Insightful

    Ok, so this "microformats" thing is about encoding extra data inside an HTML file by abusing CSS class names for markup. Isn't that completely unnecessary and nothing more than an ugly hack? Don't we have XML namespaces for exactly that reason? Wouldn't something like:

    <span style="display: none">
       <vevent:event>
         <vevent:dtstart>20060501</vevent:dtstart>
         <vevent:dtend>20060502</vevent:dtend>
         <vevent:summary>My Conference opening</vevent:summary>
         <vevent:location>Hollywood, CA</vevent:location>
       </vevent:event>
    </span>

    be the 'right'[tm] way to do it?

    1. Re:I don't get it... by jandrieu · · Score: 2, Insightful
      Your technique hides the semantic data from normal view and forces the author to replicate what they don't want hidden.

      With microformats, the data is presented once, with a few simple tags, and is then available to both HTML viewers/users and semantic parsers.
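
      As a rough illustration of the "presented once, read twice" idea, here is a Python sketch that pulls an event out of ordinary visible markup using only the standard html.parser module. The sample page is my own, reusing the values from the parent's example; the class names follow hCalendar as discussed in this thread. The parser class is hypothetical and deliberately naive: it assumes every element is explicitly closed and that property text is not wrapped in further nested tags.

```python
from html.parser import HTMLParser

class VEventParser(HTMLParser):
    """Collects a few properties from elements nested inside class="vevent"."""
    FIELDS = ("dtstart", "dtend", "summary", "location")

    def __init__(self):
        super().__init__()
        self.events = []
        # One entry per open element: "vevent", a field name, or None.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        role = None
        if "vevent" in classes:
            role = "vevent"
            self.events.append({})
        elif "vevent" in self._stack:
            for field in self.FIELDS:
                if field in classes:
                    if tag == "abbr" and attrs.get("title"):
                        # abbr pattern: machine value in title, human text as content
                        self.events[-1][field] = attrs["title"]
                    else:
                        self.events[-1].setdefault(field, "")
                        role = field  # collect the element's text instead
                    break
        self._stack.append(role)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1] in self.FIELDS:
            self.events[-1][self._stack[-1]] += data

PAGE = """
<div class="vevent">
  <span class="summary">My Conference opening</span>,
  <abbr class="dtstart" title="20060501">May 1</abbr> to
  <abbr class="dtend" title="20060502">May 2</abbr>, at
  <span class="location">Hollywood, CA</span>
</div>
"""

parser = VEventParser()
parser.feed(PAGE)
print(parser.events)
```

      A browser renders the same markup as a normal sentence, so nothing has to be hidden or written twice.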

    2. Re:I don't get it... by jandrieu · · Score: 2, Insightful

      *Any* design activity is more complicated than copying a proven, open source design. And if you want that design to be understood by someone else, you still need to (correctly) use a common vocabulary.

      It is easier to use what you know (HTML+CSS) and rely on the technology you understand (IE/Firefox/etc). That's it. Some people like to play in new techno sandboxes. Others just need to publish their kid's soccer schedule on their webpage and aren't about to read the help files at their ISP--or SourceForge or the W3C or wherever--about how they install, configure, and use that XML/XSLT stuff. And given how vendors like to extend the functionality of "standards-based" technology, I expect it will take about as long for XML/XSLT to settle as it did for HTML. And if you've ever worked with HTML developers learning XML, you'll see how frustrating it is to transition from the extremely forgiving realm of HTML to the rigor of XML.

      Easier is better for many.

      The point is not for browsers to ignore anything. Browsers (or extensions) will build in, or are already building in, tools to respond intelligently to embedded microformats. Microformats make it easy to transform content that would otherwise be thrown up in basic HTML+CSS, so that it is semantically accessible for those systems that are looking for it.

      It's a pretty straightforward premise that the easier a technology is, the more people will use it, assuming there is value for doing so. If you still want to develop your own XML and write XSLT to generate HTML, go for it. If you think more people would rather learn XML/XSLT than use the HTML/CSS they already know plus a few microformats, then there isn't much more I can say.

      -j

  6. History, failures, doomed to repeat by ekhben · · Score: 5, Insightful

    This is a kind of neat idea, except, of course, if I have CSS that does something with, oh, say, a class of "dtstart". Sure, it's easy to recognise that ".vevent > .url > .dtstart" is a microformat data item for an hCalendar, but if I'm already using "dtstart" or "url" regularly in my markup so I can apply styles to those kinds of things, I'm pretty much SOL. Rewrite all your markup and CSS to stop using those names.

    There's no namespacing. There's not even an ATTEMPT at namespacing. This will fast become an unmanageable hodge-podge of insanity, with common words used willy-nilly in class attributes.

    The class attribute is defined as CDATA. That's it. You can use pretty much ANY character in it. There are a lot of characters that can't be used unescaped in a CSS selector, though, such as ":". See where I'm going with this? <div class="mf:vevent"> for a start. Better yet, <div class="hidden mf:vevent"> such that you can hide (or format) the block of data separately.
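
    A consumer of such prefixed classes would only need to split the attribute on whitespace and sort the tokens. A minimal Python sketch, assuming the "mf:" prefix proposed above (which is this post's suggestion, not part of any microformat specification); split_classes is a made-up helper name:

```python
def split_classes(class_attr, prefix="mf:"):
    """Separate prefixed data tokens from ordinary styling classes."""
    data, styling = [], []
    for token in class_attr.split():
        if token.startswith(prefix):
            data.append(token[len(prefix):])  # strip the prefix, keep the property name
        else:
            styling.append(token)
    return data, styling

print(split_classes("hidden mf:vevent"))
print(split_classes("dtstart url"))
```

    With a prefix, an existing stylesheet class called "dtstart" or "url" stays plain styling and is never mistaken for event data.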

    Now, as if that wasn't bad enough, and, trust me, it IS bad enough, there's also the misuse of the "title" attribute and the "abbr" element. A machine formatted date is not the expanded version of a human formatted date, which is not an abbreviation. A renderer trying to make sense of <abbr class="dtstart" title="10034134134T00">17th Smarch</abbr> will think "AHA! This here is an abbreviation, I will provide unto the user some means to see what that '17th Smarch' abbreviation stands for!" Usability disasters follow.

    So, in summary, this is the worst idea I've seen in HTML space since some bright spark said, "let's suggest that people use the 'text/html' content type for their XHTML markup!"

  7. Re:Tagging in Text by stonecypher · · Score: 2, Insightful

    I do like the idea of being able to move XML around without having to parse to view the basic file in a formatted fashion. So, you're mixing HTML with a tag. Again, SO WHAT? But what about the encapsulated text, what's the point?

    To make things application parsable. Try reading the article before complaining that you don't see the point.

    If you're going to use a viewer eventually

    If you'd bother to read the article, which is about comparing one application parsable format (iCal) to the new microformat, you'd understand that the web is moving towards human-readable things being software-readable too.

    (because you have the encapsulated text)

    That's like referring to a car as a pile of steel and glass: it completely ignores the purpose of something in favor of describing its construction. You might as well refer to a database as a large string of bytes, then complain that it's not solely focussed on human readability either.

    use a viewer

    Most of us would like to be able to use more than a web browser, by now. Try stepping out of the early 90s. The air's better up here.

    This would only help in reading the actual data

    Or machine parsing.

    but not in bug fixing

    Well, that isn't the point at all, so oh well. 'Course, since it's machine parsable, it actually would be quite a bit easier to find markup errors (which aren't the same as bugs.) So even though that's not the point, you're still wrong.

    because the XML is that much more unreadable.

    Er, XHTML is an XML dialect. The difference between XML and XHTML/HTML, unless you're dealing with XSL or XPath, is negligible. Thanks for pretending to know things you don't, though; it always makes for entertaining reading.

    Moderators: informative means "gives us new information we didn't previously have." The moderation you were looking for was insightful, except of course that parent isn't that either.

    --
    StoneCypher is Full of BS
  8. HoTMetaL by Doc+Ruby · · Score: 2, Insightful

    And I think that muddling data and presentation without explicit distinction is exactly what was wrong with HTML. Which we just spent a decade slightly recovering from. I guess IBM has made a lot of money on crappy tools, good tools to extract data from crappy data, and extra money for doing it right.

    --
    make install -not war