Independent Data and Formatting with Microformats
IdaAshley writes to tell us IBM DeveloperWorks is running an article about how to best utilize microformats to embed data within standard XHTML code. From the article: "Microformats are a pragmatic approach to solving the issue of structured data on the Web. Is it as architecturally pure as XML-encoded data separated from its formatting through a mechanism such as XSLT style sheets? No. But I think this approach is a realistic middle step that will help build a more intelligent Web that is easier to use and provides better search and data integration."
Some of us have been doing this for YEARS. At least now we have a buzzword for it.
--I'm so big, my sig has its own sig.
-- See?
How much of this could have been done 5 years ago if the structured-HTML community hadn't blindly rejected META headers?
I didn't know IBM used Firefox; I'd have figured that they had their own "in-house" browser. Neato.
"linux is just DOS with a UNIX like syntax" -- Galactic Dominator (944134)
This is just tagging in text; it's exactly what you do for CSS: You're saying this text is of a certain class. And you contain it in a box. All this is doing is using the same stuff and storing a little variable name and using it later. One might argue you are already doing that with CSS, it's just formatting stuff you're attaching to the variable rather than, ah, data structure..
I do like the idea of being able to move XML around without having to parse to view the basic file in a formatted fashion. So, you're mixing HTML with a tag. Again, SO WHAT? But what about the encapsulated text, what's the point? If you're going to use a viewer eventually (because you have the encapsulated text), use a viewer. This would only help in reading the actual data, but not in bug fixing, because the XML is that much more unreadable.
On the other hand, this is kind of like the PDF format, with text as text. The PDF client renders it as a font bitmap but it's rendered from TEXT in the PDF, therefore you can do things like cut/paste/etc. This takes it a step further by adding a data structure around it which allows you to import rows of things. Pretty sweet, I might use this somewhere. I can see it being useful in mobile stuff, so you don't have to muck with a client parser.
Cool! Amazing Toys.
I'm sure the LISP community would love to hear about this brand-new idea of embedding special, or domain-specific if you will, languages and data. How extraordinarily novel.
You'll be running a limited LISP implementation on every browser in no time!
This suffers from the same thing XML did. Remember when XML was going to revolutionize communication between computers by structuring everything consistently? Then tripped over which was crawling on the floor after being decked by who was rather pissed off after an argument with Henry</name> and the whole thing went down in a pile of flames and is now relegated to being a 2MB configuration parsing library to embrace and extend "option=value".
So now why is this "vevent" class special, and who decided it would be "vevent" and not "scheduledevent" or "calendarevent" or "microsoftcalendarhassomethingforyoutodotoday"? Clearly as a human I can look at "dtstart" and think about it and realize that this means the starting date, but how does a computer know this? If the "semantic web" is going to take off, then we need semantics, and pronto.
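To make the "how does a computer know this?" question concrete, here is a minimal Python sketch of a consumer. It only recognises names listed in a shared vocabulary, which is the whole point: the meaning lives in the agreement, not in the markup. The sample markup, the KNOWN_PROPERTIES set and the helper names are assumptions made up for illustration, not anything taken from the article.

```python
from html.parser import HTMLParser

# Shared vocabulary (assumed for this sketch): software recognises these names
# only because publishers and consumers agree on them in advance.
KNOWN_PROPERTIES = {"summary", "dtstart", "dtend", "location"}
# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_ELEMENTS = {"br", "img", "hr", "input", "meta", "link"}

# Hypothetical sample markup using the class names discussed in the thread.
SAMPLE = """
<div class="vevent">
  <span class="summary">My Conference opening</span>
  <abbr class="dtstart" title="20060501">May 1</abbr>
  <abbr class="dtend" title="20060502">May 2</abbr>
  <span class="location">Hollywood, CA</span>
</div>
"""


class VEventParser(HTMLParser):
    """Collects one dict of properties per element with class "vevent"."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._depth = 0   # 0 means "not inside a vevent"
        self._open = []   # property name (or None) for each open child element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._depth == 0:
            if "vevent" in classes:
                self.events.append({})
                self._depth = 1
            return
        self._depth += 1
        prop = next((c for c in classes if c in KNOWN_PROPERTIES), None)
        if prop and attrs.get("title"):
            # Machine-readable value carried in the title attribute.
            self.events[-1][prop] = attrs["title"]
            prop = None
        self._open.append(prop)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or self._depth == 0:
            return
        self._depth -= 1
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open and self._open[-1]:
            prop = self._open[-1]
            self.events[-1][prop] = self.events[-1].get(prop, "") + text


def extract_vevents(html_text):
    """Return a list of property dicts, one per vevent block in html_text."""
    parser = VEventParser()
    parser.feed(html_text)
    parser.close()
    return parser.events
```

Rename "dtstart" to "scheduledstart" in the markup and this consumer silently stops seeing the date, which is exactly the standardization problem described above.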
Hopefully any standardization doesn't turn into a nightmare though. I used to develop in the healthcare insurance claims field, and the old NSF format for transmitting an insurance claim electronically was a horrible death-by-committee piece of work. It was as if nobody could come to a consensus and the committee decided to just throw everything in. You might look at your insurance card and think "gee I have an insurance ID number" but no, in the NSF, there were about 10 different blanks for insurance IDs, depending. Is it a Medicare number? Then it goes in the Medicare blank. God forbid the computer would have just one blank and assume that if you're billing Medicare then the number in the blank is probably a Medicare ID. Medicare was easy, there's just one. Medicaid in most states has a billion subcontractors, all with names that have nothing to do with "medicaid" so you simply had to maintain a magic list of insurance plans that changed every other year or so that used the Medicaid ID field. Or the separate fields for Blue Cross and Blue Shield. What about the states where you have BCBS as a single entity?
Anyway, I'm digressing (and ranting about a chunk of my life I'd much rather forget). What's important in standardizing semantics is identifying everywhere things are identical and reusing semantics whenever possible. Decisions have to be made up front as to what the relationship is between "name" and "last name" (people have a name, which has a last name, yet companies have names that typically don't have a last name. What about a cat named "John K. Wibblesworth"? How is that different from one named "Tama"?) Yet, take dtstart which is used here for a calendar event. Should we have "dtclassstart" for the first day of school?
Ok, so this "microformats" thing is about encoding extra data inside an HTML file by abusing CSS class names for markup. Isn't that completely unnecessary and nothing more than an ugly hack? Don't we have XML namespaces for exactly that reason? Wouldn't something like:
<span style="display: none">
<vevent:event>
<vevent:dtstart>20060501</vevent:dtstart>
<vevent:dtend>20060502</vevent:dtend>
<vevent:summary>My Conference opening</vevent:summary>
<vevent:location>Hollywood, CA</vevent:location>
</vevent:event>
</span>
be the 'right'[tm] way to do it?
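For comparison, a namespaced version like this can be read with a stock XML parser once the prefix is actually bound to a namespace. A minimal Python sketch follows; the namespace URI and the read_events helper are hypothetical, since the snippet above never declares what "vevent:" maps to.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the snippet above does not declare one.
VEVENT_NS = "http://example.org/vevent"

DOC = f"""
<span xmlns:vevent="{VEVENT_NS}">
  <vevent:event>
    <vevent:dtstart>20060501</vevent:dtstart>
    <vevent:dtend>20060502</vevent:dtend>
    <vevent:summary>My Conference opening</vevent:summary>
    <vevent:location>Hollywood, CA</vevent:location>
  </vevent:event>
</span>
""".strip()


def read_events(xml_text):
    """Return one dict per vevent:event element, keyed by local tag name."""
    root = ET.fromstring(xml_text)
    ns = {"vevent": VEVENT_NS}
    events = []
    for event in root.findall(".//vevent:event", ns):
        # ElementTree reports tags as "{namespace-uri}localname".
        events.append(
            {child.tag.split("}", 1)[1]: (child.text or "").strip() for child in event}
        )
    return events
```

Note that this only works when the page is parsed as XML; a tag-soup HTML parser will not resolve the prefix this way.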
If the "semantic web" is going to take off, then we need semantics, and pronto.
as:
If the "semantic web" is going to take off, then we need semantics, and porno.
That is all.
Mixing presentation and data - good... bad... good. But it gets better a little, each time (maybe more of a spiral than a wheel).
We're using them on aim pages for module development (I cover it a bit here). It's a nice simple standard, and the idea needed SOME name - don't make more of it than it is.
-----
graphically speaking
This is a kind of neat idea, except, of course, if I have CSS that does something with, oh, say, a class of "dtstart". Sure, it's easy to recognise that ".vevent > .url > .dtstart" is a microformat data item for an hCalendar, but if I'm already using "dtstart" or "url" regularly in my markup so I can apply styles to those kinds of things, I'm pretty much SOL. Rewrite all your markup and CSS to stop using those names.
There's no namespacing. There's not even an ATTEMPT at namespacing. This will fast become an unmanageable hodge-podge of insanity, with common words used willy-nilly in class attributes.
The class attribute is defined as CDATA. That's it. You can use pretty much ANY character in it. There's a lot of characters that can't be used in a CSS selector, though, such as ":". See where I'm going with this? <div class="mf:vevent"> for a start. Better yet, <div class="hidden mf:vevent"> such that you can hide (or format) the block of data separately.
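A consumer for that kind of prefixed class would be trivial. A rough Python sketch, where the "mf:" prefix is the one suggested above and the function name is made up for illustration:

```python
def microformat_classes(class_attr, prefix="mf:"):
    """Return the class tokens that carry the prefix, with the prefix removed."""
    return [
        token[len(prefix):]
        for token in class_attr.split()
        if token.startswith(prefix)
    ]
```

With this, "hidden mf:vevent" yields only "vevent", so an ordinary styling class such as "dtstart" or "url" is never mistaken for data.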
Now, as if that wasn't bad enough, and, trust me, it IS bad enough, there's also the misuse of the "title" attribute and the "abbr" element. A machine formatted date is not the expanded version of a human formatted date, which is not an abbreviation. A renderer trying to make sense of <abbr class="dtstart" title="10034134134T00">17th Smarch</abbr> will think "AHA! This here is an abbreviation, I will provide unto the user some means to see what that '17th Smarch' abbreviation stands for!" Usability disasters follow.
So, in summary, this is the worst idea I've seen in HTML space since some bright spark said, "let's suggest that people use the 'text/html' content type for their XHTML markup!"
And I think that muddling data and presentation without explicit distinction is exactly what was wrong with HTML. Which we just spent a decade slightly recovering from. I guess IBM has made a lot of money on crappy tools, good tools to extract data from crappy data, and extra money for doing it right.
--
make install -not war
The VERY relevant site that Jack Herrington forgot to mention there is Pingerati. That is THE site through which all these Microformats are shared. The system is based on pings, much like the rest of the blogosphere. Both Pingerati and Microformats have a major force behind them - Technorati.
Simpy
We're looking to implement hResume on Emurse.com web resumes here in the next couple of days.
I'm really excited about being able to push the standard some. We've been wondering what the negative effects of it could be though, in terms of screen scrapers (alex.emurse.com, for instance). Anyone have any thoughts?
We've built hResume support to be configurable by the user, if it proves to be an issue. Just wondering how we should initially offer it.
HTML,DHTML,XML,XHTML,XML etc. etc. ad freaking nauseam, uhhhhhg!
This is turning into PURE alphabet soup and thus into pure GARBAGE! CSS,
CSS2, CSS3 and more garbage yet to come, I am quite sure.
How about textbox(OrgPoint,EndpointXY,Font,Color,data) called like:
TextBox('10,10','100,100','arial','Red','Hello world!')
Or let's do it one better?
How about textbox(OrgPointXY,EndPointXY,Layer,Font,Color,data) called like:
TextBox('10,10','100,100','1','arial','Red','Hello Nurse!')
Or how about:
Image(OrgPointXY,ImageName,ScaleFactor,Layer) called like
Image('0,0','HotBabe.jpg','100','1');
Hmmm let's see, the browser would render the image of the hot babe and then render the text 'Hello Nurse' on top of it!
WOW! Now how many lines of HTML & CSS would I have to write to do that?
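For what it's worth, the simplest textbox call proposed above maps fairly directly onto one absolutely positioned element. A rough Python sketch of what it could expand to, assuming the second point is the opposite corner of the box and that font and color are trusted values; the text_box name and the pixel units are assumptions for illustration.

```python
from html import escape


def text_box(org_point, end_point, font, color, data):
    """Expand a TextBox-style call into one absolutely positioned div."""
    left, top = (int(v) for v in org_point.split(","))
    right, bottom = (int(v) for v in end_point.split(","))
    style = (
        f"position:absolute; left:{left}px; top:{top}px; "
        f"width:{right - left}px; height:{bottom - top}px; "
        f"font-family:{font}; color:{color}"
    )
    # Only the text content is escaped; font and color are assumed to be trusted.
    return f'<div style="{style}">{escape(data)}</div>'
```

Called as text_box('10,10','100,100','arial','Red','Hello world!'), it produces a single div with an inline style.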
The problem with the web is it has been designed by a bunch of academics who do not have to do real actual work aside from getting papers published.
Publishing to the web could be made easier by an order of magnitude by that one simple concept; being able to put something where you wanted it, absolutely, with a direct statement.
Ohh you want a scroll bar for that text box? How about:
TextBox(OrgPointXY,EndPointXY,Layer,Font,Color,Decoration,data) called like:
TextBox('10,10','100,100','1','arial','red','VScroll,HScroll',data)
Imagine how much faster a browser would be if it didn't have to parse a few thousand lines of CSS.
KISS!
I was going to say "I Don't Get It" but somebody beat me to it.
I think the title of TFA "Separate data and formatting with microformats" is a bit ironic since it's about wedging your data into a web page in such a fashion that somebody might be able to pull it back out.
If you want to make your data available there are all sorts of standard and more efficient ways of doing it than embedding it in the presentation layer. If somebody is going to all the trouble to create a parseable human-readable page, why wouldn't they go to about the same amount of trouble and make a far more efficient and standard RSS feed? What about the buzzword of the last few years, SOAP? Hell, what about XML?
From TFA:
I agree. This reminds me of the lame number tricks where you have somebody pick a number, add something, multiply it by something, blah blah blah, you take the result, divide it by 7 and then you give them their original number because you had it all set up ahead of time. If they screw up in their calculations, the trick doesn't work. In this thing, if you screw up embedding the text within the HTML (plenty of ways to do that), the trick doesn't work - and doesn't accomplish much even if it does.
Look into JSON. It's basically JavaScript data structures that you eval on the client. Why bother assembling thick XML that needs to be parsed on the client? XML is slow, and even slower if you have to XSLT it out of the XHTML.
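As a rough illustration of the JSON route, here is the same kind of event record read with a JSON parser. This sketch is in Python and uses json.loads rather than an eval on the client, which is a deliberate swap; the payload and its field names are assumptions borrowed from the calendar example elsewhere in the thread.

```python
import json

# Hypothetical payload; field names mirror the hCalendar class names.
PAYLOAD = (
    '{"summary": "My Conference opening", "dtstart": "20060501", '
    '"dtend": "20060502", "location": "Hollywood, CA"}'
)


def parse_event(payload):
    """Parse one event record from JSON text into a dict."""
    event = json.loads(payload)
    if not isinstance(event, dict):
        raise ValueError("expected a JSON object describing one event")
    return event
```

A real parser instead of eval means a malformed or hostile payload raises an error rather than running as code.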
I don't believe it was intended to contain an alias (in Sowa's sense) or a general nomenclatura, however. This innovation actually undercuts the *semantic web* fairly radically, by confusing names with types, proper nouns with classes, as discussed in the second chapter of his Knowledge Representation.
This method seems to be a social convention relying upon some contemporary user-agent (and user) behaviors. The article itself apparently conflates the functional separation of data and formatting with a system of semantic definition; though we can credit the author for recognizing this and other shortcomings in the article ("This code looks a bit complicated . . ."). A far cleaner method by any measure is to mediate the relationship between domain semantics and presentation or syndication semantics via a SAX-driven XSL transform performed by either the client or the server.
XML, as pointed out clearly elsewhere in the thread, is a conventional syntax for the representation of heterogeneous schemata. An XSL stylesheet is a deterministic means of defining the relationship between such schemata and mediating their discrepancies and gaps.
illegitimii non ingravare
Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. -- Philip Greenspun's 10th Rule of Programming
illegitimii non ingravare
when browsers have built in support.
illegitimii non ingravare
It appears you were thinking about the data URI scheme. Unfortunately, and very much like modern CSS standards, the only browser to not support it is the one with the greatest market share.
Join Tor today!
exactly! sure the idea of using css class names to represent something for a machine to read is not new as it is an obvious one. I thought of it too when I first saw CSS used just like I thought of using made-up tags to represent things when I first saw html ... but THAT IS NOT THE POINT -
- the STANDARDISATION, the fact that LOTS OF PEOPLE ARE ACTUALLY STARTING TO USE IT, and the SIMPLICITY is what makes microformats interesting -
For someone like me who has been looking for many years for ways to make it easy for an events promoter to provide machine readable data for a nightlife listings website ( www.spraci.com ) without needing to provide them with special software and then having to teach them how to use it, it's an exciting thing!
Sure the preferred way to add an event is to use the forms on the site -
but not all promoters have the time to do it and some may already have their events listed on their own sites - why should they have to enter the same data over and over to get it listed on a few listings sites? ... see the problem?
You might ask "what about RSS?" .. think about it ...
Events listings are calendar data - they need DATES ... plain old rss does not do that ....
(unless you use extended versions like RSS+Event - but not much software out there uses that - so that inevitably means people need to modify their software - not much good for most event promoters!)
spraci.com and many other listings sites require event dates to be separate and machine-readable
because people can look up events by date.
"What about iCal?"
Is there a way to represent cities/countries/etc in iCal?
Listings sites that deal with more than one city need that kind of information.
If you use hCalendar you can combine it with hCard to specify the city/country!
For some of us who have been trying to get data syndication of this kind happening for years and having to deal with a lack of standards and software using them that is suitable for the average event promoter to use I see microformats as a very good thing.
1. they are easy for people to understand and use without needing to spend hours reading documentation to figure out the basics of what it does... a simple example is almost self-explanatory
2. not hard to parse with very basic xml/html/etc tools - you don't need anything exotic or overly bloated.
3. lots of people are actually already using it - that is pretty rapid uptake!
(what use is a "standard" if nobody is using it?)
4. it is actually trying to address the real world situation in a real world way.
- html is everywhere
- people want to create and consume data feeds containing data not handled well by plain old rss
- people also want to embed data in other places where they might be using html
- people want the minimum of installing or modifying software to do it - they want it NOW with a minimum of fuss
- there might be more than one item to be represented on one page (that pretty much rules out using meta)
- it tries to work with other existing standards where possible (eg hCalendar is based on iCal / hCard is based on vCard)
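Since hCalendar borrows its property names from iCal, turning scraped values back into a calendar entry is mostly string assembly. A simplified Python sketch, assuming the event has already been pulled out of the page as a dict of date-only values; the to_vevent name is made up, and a complete iCalendar file would also need the VCALENDAR wrapper plus UID and DTSTAMP, which are left out here.

```python
def escape_text(value):
    """Escape backslash, semicolon, comma and newline for an iCalendar TEXT value."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def to_vevent(event):
    """Build a simplified VEVENT block from a dict of hCalendar-style properties."""
    lines = ["BEGIN:VEVENT"]
    for key in ("dtstart", "dtend"):
        if key in event:
            # Values are assumed to be date-only strings such as 20060501.
            lines.append(f"{key.upper()};VALUE=DATE:{event[key]}")
    for key in ("summary", "location"):
        if key in event:
            lines.append(f"{key.upper()}:{escape_text(event[key])}")
    lines.append("END:VEVENT")
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"
```

The promoter keeps one HTML page and a listings site can still derive calendar data from it, which is the use case described above.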
yes do check out http://microformats.org/wiki/ ...and if you are still not sure check out some of the links on there to other sites using microformats for more real-world examples.
If the parent document is XHTML, and the browser understands that, CSS can easily be used to style these additional non-XHTML elements any way you like.
This "Microformatting" concept is predicated on the idea that data is (or should be) human-readable in its default state, but with mechanisms that make it easier to translate it into something machine-readable. This seems backwards to me.
Humans only need to be able to comprehend the data structure at two points: input and output. In between, computers may perform a thousand different transfers and transformations on the data, and at those points, the ability to see the data in plain English (or plain Anyotherlanguage) is just excess baggage.
He mentions Webmonkey and Technorati as computer services which essentially work by screen-scraping content intended for humans and hacking it into something for computers. This is not to be encouraged.
The XML output of the author's sample transformation seems like a more logical default storage format for the data. It's easy and flexible to transform this data back into any format desired, and certainly easier than transforming from "Microformatted" XHTML to intermediate XML to target format.