Ask Slashdot: What Happened To Semantic Publishing?
An anonymous reader writes: There has always been a demand for semantically enriched content, even long before the digital era. Take a look at the New York Times Index, which has been continuously published since 1913. Nowadays, technology can meet the high demands for "clever" content, and big publishers like the BBC and the NY Times are opening their data and also making good use of it.
In this post, the author argues that Semantic Publishing is the future and talks about articles enriched with relevant facts and infoboxes with related content. Yet his example dates back to 2010, and today arguably every news website suggests related articles and provides links to external sources. This raises several questions: Why is there not much noise on this topic lately? Does this mean that we are already in the future of Online (Semantic) Publishing? Do we have all the tools now (e.g. Linked Data, fast NoSQL/Graph/RDF datastores, etc.) and what remains to be done is simply refinement and evolution? What is the difference in "cleverness" of content from different providers?
I don't want Symantec publishing. Costs too much to renew every year while hogging all my available CPU and RAM
There is a fine line between "clever" and "annoying". Very often, what gets considered "related" content is only tangentially related, and sometimes the way it is displayed makes it indistinguishable from the content of the current article. Add to that all of the surrounding clickbait, and it just becomes a confusing mess.
Proverbs 21:19
I hate, hate, hate, hate web pages that have hot-linked words with popups. It is even worse when it is an advertisement. And those "recommended articles" at the end are just as bad. Click-bait links to content that is of no value.
The trouble is that this is both boring (for a person) and hard (for a computer.)
So nobody wants to do it manually, and while everybody's got an algorithm to mark up text, they're all terrible and prone to being gamed by unscrupulous advertisers.
How many websites have you gone to and seen some random word in the middle of the text that's bolded, double-underlined, larger font and a completely different color to really draw your eye to it (and away from what you're actually there to read, i.e. be as annoying as fucking possible) and then you hover over it and discover it's a Wikipedia link to a house or something equally pointless?
This has been the problem with "the semantic X" ever since link farms were invented. They usually don't provide a whole lot of additional information (if any) and they distract from what you're trying to see.
If you really want a semantic experience, go to basically any popular wiki. They're explicitly curated and therefore the links you find are (usually) actually both informative and relevant. Of course they do this by going the boring (manual) route and compensating for it by having a million people doing the job instead of just a handful.
Go back and read that "mundane" Wikipedia article about the house and, if you have even the slightest amount of curiosity about anything, you can probably spend several hours link chaining. There are links to construction, history, archaeology, anthropology, etc., and they're all placed in such a way that they're relevant to the article and yet kept subtle enough that you can read over the ones you aren't interested in without a significant drain on attention.
I don't think it's that computers and machine learning really trump an exact model. It's more that manually curated semantic information is difficult to do well and, even when done well, is simply the curator's interpretation of the key points. Ontologies and controlled vocabularies (necessary to make semantic solutions work) are always biased towards their creators' view of the world. Orthogonal interpretations rarely fit with the ontologies and require mapping between knowledge systems. Rather than simplifying things, this just creates another layer of abstraction and meta-data that now must be managed.*
Machine learning, on some level, basically admits this flaw in structured knowledge representation and punts. Instead, it provides tools for querying knowledge bases and finding patterns in them. I think the latter part is just as flawed as manual curation, but the query tools combined with a human are incredibly powerful.
A simple example: Yahoo originally indexed and categorized the Web. When I interviewed there in '96 (and, silly me, turned down the offer), they had a room full of people that did just that. Google, on the other hand, used a graph algorithm combined with standard text search methods to leverage the structure of the web to give good search results. Yahoo eventually bailed on manual curation and we learned how to leverage Google's approach to search to mine knowledge.
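To make the contrast with hand-built categories concrete, here is a rough sketch of the "graph algorithm plus text search" idea. The comment above doesn't say which algorithm is meant, so this assumes a simplified PageRank-style power iteration paired with a naive keyword match. The function names, the toy link graph and the damping value of 0.85 are all made up for illustration and are not anyone's production method.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def search(query, documents, links):
    """Keep documents containing every query word, ordered by link rank."""
    ranks = pagerank(links)
    terms = query.lower().split()
    hits = [
        name
        for name, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    ]
    return sorted(hits, key=lambda name: ranks.get(name, 0.0), reverse=True)


# Hypothetical three-page web: both "a" and "b" link to "hub".
links = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
documents = {
    "a": "house history",
    "b": "house prices",
    "hub": "house construction history",
}

if __name__ == "__main__":
    print(search("house", documents, links))
```

Nobody had to file "hub" under a category by hand. It ranks first only because other pages link to it, which is the structure-of-the-web signal the comment is pointing at.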
tl;dr: manual and automated curation will never properly capture humans' representation of knowledge. Instead, better tools plus the human brain will improve our ability to leverage knowledge.
-Chris
*and there's that old saying: every software problem can be solved with another layer of abstraction.