I guess in the early days of the Web many people said the same thing - why bother if nobody is providing HTML pages and nobody is using HTML browsers (in fact, I remember that time very well).
Of course building a web of data is more demanding - the infrastructure is far more complicated.
But we have made tremendous progress over the last few years - to the point where structured data coming from applications like wikis, mailing lists and bulletin boards can, should and will be integrated.
And progress is being made - e.g., with things like FOAF or SIOC (see http://sioc-project.org/).
The service http://pingthesemanticweb.com/ provides a good overview - progress may be slow, but Metcalfe's Law did prevail in the past. Why should it not in the future?
And what is the alternative?
Zarf,
you are absolutely correct that raw RDF data can indeed be polluted if crawled naively.
That is exactly the reason why newer applications do not use the simple triple model but quads, where the fourth element may represent the source of the data. This data model is called named graphs.
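To make the quad idea concrete, here is a minimal sketch in Python. The `Quad` type, the `group_by_source` helper and the example URLs are all made up for illustration; this is not how any particular store represents named graphs, just the general shape of "triple plus source".

```python
from collections import defaultdict
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # the named graph: where the statement was found


def group_by_source(quads):
    """Group statements into named graphs keyed by their source."""
    graphs = defaultdict(set)
    for q in quads:
        graphs[q.source].add((q.subject, q.predicate, q.obj))
    return dict(graphs)


# Hypothetical example data; the URLs are placeholders.
quads = [
    Quad("ex:alice", "foaf:name", '"Alice"', "http://example.org/alice.rdf"),
    Quad("ex:alice", "foaf:knows", "ex:bob", "http://example.org/alice.rdf"),
    Quad("ex:alice", "foaf:name", '"Cheap pills"', "http://spam.example.com/x.rdf"),
]

graphs = group_by_source(quads)
for source, triples in sorted(graphs.items()):
    print(source, len(triples))
```

Because every statement carries its source, the conflicting foaf:name claim stays attributable to the document it came from instead of being merged anonymously into one big graph.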
So once you have the source recorded, you are able to do trust computations over the graph and its sources - e.g., using PageRank-like algorithms. Some sources can be assigned a low trust value, others a higher one, based on their spam content and their adoption by a web community (just like conventional web pages using PageRank).
Indeed, the implementation that DERI reported on implements named graphs for exactly that reason, and Aidan is working on a ranking algorithm that takes the source of the data into account.
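As a rough illustration of source-based trust (and explicitly not the ranking algorithm being developed at DERI), the sketch below runs a plain PageRank-style power iteration over a hypothetical source-to-source link graph and then orders statements by the score of their source. The function name `source_rank`, the damping factor and the example hosts are assumptions.

```python
def source_rank(links, damping=0.85, iterations=50):
    """PageRank-style power iteration over a source-to-source link graph.

    links maps each source to the list of sources it links to.
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling source: spread its rank evenly over all sources.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: nobody links to the spam source.
links = {
    "community.example": ["blog.example"],
    "blog.example": ["community.example"],
    "wiki.example": ["community.example", "blog.example"],
    "spam.example": ["community.example"],
}
trust = source_rank(links)

# Quads as (subject, predicate, object, source); rank by source trust.
quads = [
    ("ex:alice", "foaf:name", "Cheap pills", "spam.example"),
    ("ex:alice", "foaf:name", "Alice", "community.example"),
]
ranked = sorted(quads, key=lambda q: trust[q[3]], reverse=True)
```

A source that nobody links to ends up with only the baseline score, so its statements sort to the bottom. A real system would presumably also feed in spam signals, which this sketch leaves out.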
First: The experiments were done on an 18-node cluster of cheap servers.
Second: There are other ways to get metadata - e.g., via SIOC (see http://sioc-project.org/). But true, trust is an issue. And some people in DERI Galway are working on ranking algorithms on top of the search engine.
We have a Technical Report available at http://www.deri.ie/fileadmin/documents/DERI-TR-2007-04-20.pdf that should answer most of the technical questions.
From the abstract:
"We present the architecture of an end-to-end search engine that uses a graph data model to enable interactive query answering over structured and interlinked data collected from many disparate sources on the Web.
In particular, we study distributed indexing methods for graph-structured data and parallel query evaluation methods on a cluster of computers.
We evaluate the system on a dataset with 430 million statements collected from the Web, and provide scale-up experiments on 7 billion synthetically generated statements."
Fully agreed.
But it worked for RSS - and it also seems to work for SIOC (see http://sioc-project.org/).
Other XML-structured formats are also catching on - e.g., XBRL. All of them can be (quite easily) translated into a graph and integrated.
So there is hope.
However, Andreas and Aidan's work reported on in the press release enables us to build scalable engines - scalability was a major headache before.
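On the point that XML formats like RSS translate quite easily into a graph, here is a small sketch using only the Python standard library. The feed content, the `rss_to_triples` name and the "rss:" predicate prefix are invented for illustration; a real converter would map tags to a proper vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used as sample input.
RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>Hello</title>
    <link>http://example.org/posts/1</link>
    <pubDate>Mon, 07 May 2007 10:00:00 GMT</pubDate>
  </item>
</channel></rss>"""


def rss_to_triples(xml_text):
    """Turn each RSS <item> into (subject, predicate, object) triples.

    The item's <link> is used as the subject; every other child element
    becomes a predicate named after its tag.
    """
    triples = []
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        subject = (item.findtext("link") or "").strip()
        if not subject:
            continue  # no usable identifier for this item
        for child in item:
            if child.tag != "link" and child.text:
                triples.append((subject, "rss:" + child.tag, child.text.strip()))
    return triples


triples = rss_to_triples(RSS)
for t in triples:
    print(t)
```

Once the items are triples keyed by URL, they can be merged with statements from FOAF or SIOC sources that mention the same URLs.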
OK, I concede. You won. ;-)
Some people can overestimate the importance