Domain: purl.org
Stories and comments across the archive that link to purl.org.
Comments · 28
-
Data & Software Citation.
The top 100 most cited papers are actually a motley crew of methods, data resources and software tools that through usability, practicality and a little bit of luck have propelled them to the top of an enormous corpus of scientific literature.
The article itself never mentions 'data resources' that I saw, but there's a problem in many fields that the standards are to cite the 'first results' paper for that data
... for which the results portion may have already been disproved or otherwise be crap. There are a number of efforts working on being able to cite 'data' separately from 'results of the data', and in a manner that's consistent across all disciplines (as we don't know in advance who might make use of our data). You also run into problems, as the paper being cited may describe the initial release of the data, and not be useful to determine which edition was used (as that may be significant to recreate their results). See the Joint Declaration of Data Citation Principles, DataCite (metadata schema + DOI registry system), and the 2012 CODATA-ICSTI report, "Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data".
There are similar issues with software citation -- everyone's citing the announcement of the existence of the software, but how can you track who might've relied on a buggy version to let them know that they may need to re-run their analysis? I'm not as active in this field, but the arguments remain the same (giving proper attribution, documenting everything to make it reproducible, etc.). See the 2013 Knepley et al. paper, "Accurately Citing Software and Algorithms used in Publications" and the work of the Software Sustainability Institute (which also covers topics on writing better research software, as was alluded to in the article).
It's probably also worth mentioning that our current ways of tracking 'importance' of papers are flawed. See the Altmetrics Manifesto for a collection of links to efforts to come up with other metrics, and CiTO, the Citation Typing Ontology, which enables a way to classify why something was cited (it might be for criticism; in most of the cases in the article, it would be "uses method in", which not all disciplines feel needs to be cited).
-
Re:rel=shortlink could eradicate URL shorteners
I wrote the shortlink specification a few months ago (based on similar work done by others), released it into the public domain using CC Zero and went about soliciting feedback.
So, are you going to just put it on a random website out there or are you going to do the proper thing and get it on a standards track somewhere? (Maybe IETF or W3C.) That's the only way to get it really trusted by the bulk of users, since they trust those organizations to keep on what they've been doing for years.
-
rel=shortlink could eradicate URL shorteners
I've had a beef with URL shorteners for a long while now for reasons that have been covered ad nauseam (not the least of which being that in addition to adding significant overhead - typically hundreds of milliseconds per request - they are just plain evil). IMO the best solution is to let webmasters create and advertise their own short links using the "shortlink" link relation (e.g. rel="shortlink" in the HTTP headers and/or HTML HEAD) such that they can be auto-detected by clients who then no longer need to generate their own using 3rd party services. I wrote the shortlink specification a few months ago (based on similar work done by others), released it into the public domain using CC Zero and went about soliciting feedback. The standard got a big shot in the arm last week when WordPress.com announced support for rel=shortlink on over 100 million pages. I've since requested support be introduced into the top 20 Twitter clients (representing over 80% of Twitter usage) and have had only positive feedback so far. A number of other high profile sites like PHP.net and Ars Technica have also jumped on board. Anyway if you, like me, are sick of URL shorteners then you're welcome to give me a hand making them go away...
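To illustrate the auto-detection idea, here is a minimal sketch of how a client might look for an advertised short link in a page's HTML HEAD. It uses only Python's standard library; the class and function names and the example.com URL are made up for illustration, and a real client would probably also want to check the HTTP Link header.

```python
from html.parser import HTMLParser


class ShortlinkFinder(HTMLParser):
    """Collects href values of <link> elements whose rel includes 'shortlink'."""

    def __init__(self):
        super().__init__()
        self.shortlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel holds a space-separated list of link relations; compare case-insensitively.
        rels = (attr_map.get("rel") or "").lower().split()
        if "shortlink" in rels and attr_map.get("href"):
            self.shortlinks.append(attr_map["href"])


def find_shortlink(html_text):
    """Return the first short link advertised in the page, or None if there is none."""
    finder = ShortlinkFinder()
    finder.feed(html_text)
    return finder.shortlinks[0] if finder.shortlinks else None


# Hypothetical page used only to exercise the sketch.
page = (
    '<html><head><link rel="shortlink" href="http://example.com/s/abc">'
    "</head><body></body></html>"
)
print(find_shortlink(page))
```

A client that finds nothing can still fall back to a 3rd party shortener, so sites that don't advertise a short link are no worse off.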
Sam
-
Re:Why can't we just move it?
> Perhaps you mean "long-lived, mostly reliable URI".
Yes, and I have a couple of name suggestions for this, we could call it a "permanent URI" or "persistent URI".
Purl may be a good choice for this DTD.
-
Re:Why would this break RSS readers?
According to the Internet Archive, the DTD was last changed in February 2003. Here's the latest copy of RSS 0.91. Perhaps someone should set up a redirect at PURL.
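For anyone unfamiliar with the idea, a persistent URL is essentially a lookup table in front of a redirect. The sketch below is a toy model of that, not how purl.org is actually implemented; the path and target URL are invented placeholders.

```python
# Hypothetical table: persistent path -> current location of the resource.
PURL_TABLE = {
    "/example/rss-0.91.dtd": "http://archive.example.org/rss-0.91.dtd",
}


def resolve(path, table=PURL_TABLE):
    """Return an (HTTP status, Location) pair for a persistent path."""
    target = table.get(path)
    if target is None:
        return 404, None
    # A temporary redirect keeps clients coming back to the persistent address.
    return 302, target


print(resolve("/example/rss-0.91.dtd"))
```

If the document moves again, only the table entry changes; every reader that points at the persistent address keeps working.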
-
RSS 1 by the W3C
RSS 1.0 is also the only syndication format endorsed by the World Wide Web Consortium. RSS 0.9 and 2.0 were created at the companies Netscape and UserLand.
-
Re:URI to the Rescue
Something like this?
-
Esperanto GPL dictionary
ReVo is a GPL Esperanto dictionary. It also provides translations to other languages.
-
I Used to Work for OCLC
I obviously can't speak for them, but I can provide some background on what they do. OCLC is a nonprofit org providing services for approx 45,000 libraries around the world. If you are a librarian and need to figure out how to catalog a new book in your collection, you go to OCLC to see how others have done it. Ever needed an item that wasn't in your library? OCLC handles the system for arranging inter-library loans. They do a fair amount of original research for libraries and they even open source some of the results. PURL is another OCLC project that some of you may be familiar with. The Dublin Core Metadata Initiative was co-founded by a researcher who got his start at OCLC and is now running the W3C's Semantic Web Initiative. OCLC is very well known and respected in the library community.
Library budgets the world over are under attack given the current economic situation. This leaves less and less money available for building the kind of common infrastructure that will help libraries continue to provide new and relevant services for their patrons as more and more of the content becomes digital. OCLC certainly has both the right and the need to defend the Dewey Decimal trademarks from infringers.
-
Re:The Semantic Web is already here
I agree that RSS is probably the most popular semantic web thingy at the moment.
However, only RSS 1.0 is based on RDF and is a standard with an open committee in control; RSS 2.0 is plain XML and is controlled by Dave Winer...
One of the best collections of RSS resources is the one Ben Hammersley maintains on DMOZ.
-
'PURL' is an infringement on OCLC
'PURL' has been used as an acronym for "Persistent URL" (http://www.purl.org) since at least 1995. I believe that Tumbleweed's use of the 'PURL' name is an infringement, specifically as it is being used within the same knowledge/technical domain.
-
Forking RSS
I hate to say it, but this is YA example of why forking over petty differences can be A Bad Thing. If one stubborn contingent hadn't steadfastly clung onto the deprecated RSS 0.91 spec instead of moving to RSS 1.0 (which returns to RSS's more dynamic roots), said contingent wouldn't be locked out because *a single document was removed from a web site*. Yeah, Netscape did the wrong thing. But the proponents of the outdated and outmoded spec should have seen this coming a mile away.
-
Re:Directories are not search engines
doomed to failure until someone implements something like the Dewey Decimal System for web pages
Yes, we're stuffed -- but Dewey Decimal isn't the answer (we can do a lot better than that).
There's an initiative around that's gaining considerable momentum - the Semantic Web. It starts from one bright idea by one guy, but as the guy in question is Tim B-L, then he gets listened to. There are solutions to all this. We've barely started on what we could easily achieve for indexing the web, without even trying for the really hard stuff.
Once basic semantic-level indexing becomes commonplace, through tools like Dublin Core, then take a look at ontological descriptions and projects like DAML.
There's a huge amount happening in this field research-wise, it just hasn't hit the punter's web yet.
-
Semantic Web is a layered framework
As the Semantic Web is a layered framework, the actual vocabularies you use to describe things are applications of the framework rather than the framework itself.
One such application that might prove useful in what you're tackling is RSS. What you seem to be looking for is a taxonomy against which you can classify things. These are expensive to develop and hence rare, the ODP being one of the few that are public. My advice is, if the ODP doesn't fit, classify by topic yourself (but avoid the mistake of struggling to produce a hierarchical system, this is rarely appropriate). At a later stage you can express equivalences to other folks' categories. Folks on the RSS-DEV mailing list would be happy to share experience of categorization.
Anyone seeking more information as to what the Semantic Web actually is and how it fits together might be interested in some of the articles I've written on the topic, which give an overview both of the vision and of ways you can get started with tools:
-- Edd Dumbill, Editor, XML.com.
-
This is where SWAG comes in
I feel sort of bad plugging my own group, but this is exactly the problem that SWAG is meant to solve. SWAG is the Semantic Web Agreement Group, and we bring different users of the SWeb together to try and build sets of common terms. Our current project is to build a dictionary of common terms, which you can find at: WebNS.net.
Obviously, the Semantic Web won't work if we only have one dictionary, but it will work much better if we agree on the terms we use when possible. So SWAG isn't trying to enforce terms, but merely recommend them.
We work on a process of consensus so that we can move quite quickly and new terms don't get bogged down in endless talking.
So I hope you'll visit us, once again the address is: http://purl.org/swag/.
-
XML considered harmful
This is another example of What's Wrong With XML (and particularly, what's wrong with proliferating schemas all over the place).
A schema isn't a means of publishing your data to a wider audience, it's a means of locking-out everyone who doesn't have a copy of it.
Look at real users of RDF for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.
A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.
-
RDF
and this is different from RDF how? an RDF inference engine (together with agreed metadata conventions, like those being worked on by dublin core) would provide the basis for queries against metadata on the web, and that, in my opinion, is what's important for the future evolution of the web.
if you haven't looked into RDF and the importance of metadata on the web, there's no time like the present.
it wouldn't hurt to read weaving the web, by tim berners-lee, the inventor of the world-wide web, either. he has chapter 1 online.
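as a rough picture of what "queries against metadata" means, here is a toy sketch: metadata as (subject, predicate, object) triples with a pattern-matching lookup. it does plain matching only, not inference, and the example.org documents and names are invented; the predicate names assume the dublin core element namespace at purl.org.

```python
# Assumed namespace for Dublin Core element names.
DC = "http://purl.org/dc/elements/1.1/"

# Invented sample metadata, one (subject, predicate, object) triple per statement.
triples = [
    ("http://example.org/doc1", DC + "creator", "Alice"),
    ("http://example.org/doc1", DC + "title", "Weather Report"),
    ("http://example.org/doc2", DC + "creator", "Bob"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Which documents did Alice create?
for subject, _, _ in match(triples, predicate=DC + "creator", obj="Alice"):
    print(subject)
```

the point of agreed conventions is that the same predicate means the same thing on every site, so one query works across all of them.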
-
Metadata, URI, mirrors etc.....
Sorry for self-quotation (from the TERENA Technical Report FTP Mirror Tracker):
Unfortunately, there is still no coherent architecture for mirroring and for mirror sites to register their collections with the sites which they mirror. In fact, we lack even a common (de facto) standard for recording this replication information in a machine readable format. Some progress was made on this a few years ago by the Internet Engineering Task Force's [1] working group on Internet Anonymous FTP Archives, with the creation of the so-called IAFA Templates [2]. These provided a simple machine readable format for recording per-resource or collection metadata, which could easily be created by hand or programmatically. Although support for IAFA templates was integrated into some software packages, e.g. the ALIWEB search engine [3] and the ROADS resource discovery system [4], this approach never became successful on a large scale. The World Wide Web Consortium's Resource Description Format (RDF) [5] and the Dublin Core metadata effort [6] may eventually provide a viable machine readable interchange format.
Another attempt to create a framework for such metadata was an "Open-Software-Index" that Oliver Maruhn and I tried to create almost 2 years ago. After this document, some discussion started (code name "Russian Freshmeat") that shifted mostly to localisation of such metadata. Unfortunately, no working code was produced.
Currently, the database underlying the freshmeat.net weblog [7] is perhaps the closest thing we have to a genuine mirror registry - though it focuses almost exclusively on software packages and operating system distributions, and only offers limited mirror information. RDF is also being used in this capacity as part of rpmfind.net [8], although mirror information is very limited in this case too. The Internet Engineering Task Force's Uniform Resource Names effort [9] is also relevant here, since it would be very useful if there were persistent and location independent names for these collections of replicated resources.
[1] http://www.ietf.org/ Internet Engineering Task Force website
[2] http://info.webcrawler.com/mak/projects/iafa/ IAFA Working Group & IAFA Templates homepage
[3] http://aliweb.emnet.co.uk/ ALIWEB website
[4] http://roads.opensource.ac.uk/ ROADS website
[5] http://www.w3.org/RDF/ World Wide Web Consortium Resource Description Format (RDF) homepage
[6] http://purl.org/dc/ Dublin Core website
[7] http://freshmeat.net/ freshmeat.net website P. Lenz & Andover Advanced Technologies, Inc.
[8] http://rpmfind.net/ rpmfind.net website
[9] RFC 1737, Functional Requirements for Uniform Resource Names K. Sollins & L. Masinter December 1994
And at the end, something somewhat less relevant to the topic.
This kind of metadata should be extremely valuable for implementation of the URIs and particularly for the I2C(s) (URI to URC). Quote from the RFC 2483:
"Uniform Resource Characteristics are descriptions of resources. This request allows the client to obtain a description of the resource identified by a URI, as opposed to the resource itself or simply the resource's URLs. The description might be a bibliographic citation, a digital signature, or a revision history. This memo does not specify the content of any response to a URC request. That content is expected to vary from one server to another."
Hopefully we already have a mechanism for the I2L(s) (FTP Mirror Tracker).
-
Re:XML all the way
Moderate this UP. This actually isn't as flaming as it comes off to be. Everyone is saying "use XML" but in reality, XML doesn't define anything. XML just makes it standard. In the real world, DTDs need to be written, and this is a HARD task. Fortunately, there are already metadata standards for XML, most notably RDF and Dublin Core. Check out purl.org for Dublin Core info.
-
Some Standards are out there
While there are no set-in-stone standards for describing electronic media, including code, there are some evolving standards in this area. One place to start would be to look at the Dublin Core website - this is a descriptive schema that provides a 'core' or basis for metadata schemes. Also look at the work on RDF (links can be found from the Dublin Core site) for info on structuring metadata.
-
2 suggestions - LSM and Dublin Core
I'd check out the Linux Software Maps, hosted at metalab (now ibiblio). More interesting is probably the Dublin Core Project. Although it's not set up specifically for software, it's extensible. You could create an XML DTD (which many have suggested) using the Dublin Core standard for such a purpose.
links:
iBiblio Linux archives
Dublin Core homepage
-
Re:URIs don't change: people change them
There's also things like PURL, but they haven't really caught on.
-
Re:Centralized Searching is the Wrong Approach
What you're talking about is search brokering. Basically, you have one server which acts as a "broker" (or a Meta-Search engine, if you will). It sends queries out to search engines and collates the results.
The difference between brokering and meta-searching is that each of the search engines in a brokered system categorises and ranks its results in a consistent manner.
Thus the brokering engine can return you a list of results that is meaningful.
The protocol can be simple HTTP. Instead of indexing a remote site, you just call a standard URL such as http://site.name/cgi-bin/broker-server?...
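A rough sketch of what the broker side could look like, assuming each site exposes the /cgi-bin/broker-server endpoint mentioned above. The parameter names (title, subject) and the "score" field on each result are my own placeholders, not part of any agreed protocol.

```python
from urllib.parse import urlencode


def broker_url(site, **fields):
    """Build a query URL for a site's broker endpoint from field=value pairs."""
    # Sorting keeps the URL stable regardless of argument order.
    query = urlencode(sorted(fields.items()))
    return f"http://{site}/cgi-bin/broker-server?{query}"


def collate(result_lists):
    """Merge per-site result lists into one list, highest score first."""
    merged = [hit for hits in result_lists for hit in hits]
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)


print(broker_url("site.name", title="weather", subject="forecast"))

# Invented results from two sites, already ranked on the same scale.
site_a = [{"url": "http://a.example/1", "score": 0.9}]
site_b = [{"url": "http://b.example/7", "score": 0.95}, {"url": "http://b.example/2", "score": 0.4}]
print([hit["url"] for hit in collate([site_a, site_b])])
```

Merging by score only makes sense because the engines rank consistently; with ordinary meta-searching the scores from different engines wouldn't be comparable.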
The arguments/parameters for this search could be based on the fields used in Dublin Core (or just skip to RFC 2413 - Dublin Core Metadata for Resource Discovery). However, Dublin Core is quickly being converted into a really complicated library-style cataloguing system. Perhaps something else exists that suits the purpose.
-
Re:XML?
You may be thinking of The Dublin Core.
-
Indexing dynamic content (was Re:Customers :)
Is it even possible to index dynamic pages? They don't really exist until the page is generated.
Yes, for a very large category of dynamic pages, it is. For example, in an online shop, the actual number of a particular product in stock at the moment may vary from minute to minute, the price of that product in the user's preferred currency may change from week to week, but the product itself doesn't change much over months or years. It makes perfect sense to index the product page, because although some of the contained data may be transient, a great deal more is not.
Or take another example: the weather forecast for a particular area. The forecast itself may change regularly, but the page always contains a current forecast and that fact is worth indexing. The best technology available for this sort of thing is probably RDF and the Dublin Core metadata specification. Of course, the search engines still have to be persuaded to take heed of this...
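As a sketch of how a dynamic page could advertise its stable facts, the snippet below renders Dublin Core elements as HTML meta tags using the "DC." prefix convention. The function name and the product details are invented; the transient data (stock level, price) is deliberately left out of the metadata.

```python
from html import escape


def dublin_core_meta(fields):
    """Render a dict of Dublin Core element names and values as HTML head tags."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for name, value in fields.items():
        # Escape the value so quotes and angle brackets can't break the attribute.
        lines.append(f'<meta name="DC.{name}" content="{escape(value, quote=True)}">')
    return "\n".join(lines)


# Invented product page: only the parts that stay true for months are described.
product = {
    "title": "Blue Widget",
    "description": 'The "classic" widget, sold since 1998',
    "type": "Product page",
}
print(dublin_core_meta(product))
```

The page generator would emit these tags on every request, so whatever an indexer fetches always carries the same description even though the stock count next to it changes.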
-
My take on RDF
I'm currently looking into the alphabet soup of standards coming out of the W3C, trying to decide which ones are useful and how they might be applied to free software and Gnome in particular.
There's a lot of interesting things out there. In particular, I think XML and DOM could be the basis for a very good component framework in which powerful components would be easy to write, and would integrate nicely without a lot of hassle. I'm looking at RDF as a piece of this.
But, as far as I can tell, the problem that RDF solves is a bit different than the one mentioned in this article. RDF is a way of representing documents as graph structures, allowing individual files to contain both local and external pieces without everything getting tangled up.
The problem of representing metadata unambiguously is a tricky one, but is not yet solved. The RDF spec presents an interesting outline about how this might be done, but it doesn't quite tell me what I need to do to get my own Web pages to be correctly meta'ed. If I were a library, then the Dublin Core would start to give me the specific markup I needed, but that's just for libraries. What do I use as metadata for my free software efforts?
It seems like the combination of XML plus XML-NameSpaces plus Dublin Core plus all the other recommendations, specifications, and standards analogous to the Dublin Core but for domains other than libraries might cohere into a workable metadata system for the Web, but on the other hand, the complexity and fuzziness of specification could very easily prevent the beast from flying.
When you're dealing with software, precise specification is key. Some metadata standards have succeeded pretty well in this regard - take MIME content types, for example. If you have a JPEG image, you know that the content type should be "image/jpeg". But the XML crew hasn't even managed a consistent namespace name for HTML 4.0 (I've seen "urn:w3-org-ns:HTML", "http://www.w3.org/TR/REC-html40" and others).
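To show how settled the MIME side is, here is a tiny sketch using Python's standard mimetypes module, which maps file names to content types. The file name is a made-up example.

```python
import mimetypes

# guess_type returns a (content type, encoding) pair based on the file extension.
content_type, encoding = mimetypes.guess_type("photo.jpeg")
print(content_type)
```

There is one agreed answer for a JPEG, which is exactly what the competing HTML namespace names lack.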
For those hoping for a more technical discussion of RDF, I recommend the Mozilla page on RDF and of course the specification itself.