Is There A Standard for Software Metadata?
"It's one thing to make this stuff available, but if people can't find it I'm wasting my time. Of course there are places I can go to publicise what I've done (Freshmeat, Jars, Gamelan, and Servletcentral in this case) and those services perform a valuable function, but in practice it is still quite hard for someone to find some code in language X that performs function Y in a way that complies with constraint Z. There's no search engine that finds reusable code based on variable criteria and, given the number of incompatible ways source code can be packaged, described and distributed, little prospect of anyone building one.
Right now, when I release this code, if I want people to find it I have to:
- write a description of it
- set up a home page for it
- register that page with numerous search engines possibly using the description I wrote
- visit the appropriate repository and announcement sites making submissions at each
- find out whether there's an appropriate usenet group and post to it
Assuming that such a standard doesn't exist, does anyone want to get together with me and devise one? I'm thinking of something (human) language independent, simple, capable of encompassing all types of code, amenable to automatic processing. What about it?"
As much as I agree that something like that should exist, I believe that if you feel strongly about your code, then a home page is a must for your project (as well as writing descriptions about your project and registering it with search engines). A metadata standard would be a big help in this respect, but it's not going to be a replacement for going out there and spreading the word yourself as best you can.
With that said, what current data formats could be extended to serve as such a metadata standard? And if none of them is completely sufficient for this type of application, what would such a format need in order to be robust and flexible enough to serve this purpose?
Got Rhinos?
Sounds like a good idea
XML would be the current standard to start with; you'd just need to develop a schema that contains the data you want to share. It'd definitely help code repositories.
Trolls throughout history:
Jonathan Swift
$PROGRAM_NAME:$LICENSE_NAME:(Commercial|Shareware|Freeware|Semi-free|Free|Public-Domain):$AUTHOR:$WEBSITE:$EMAIL
Then hack something together with cpio to include it with your program.
See? Problem solved.
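For what it's worth, a one-line record like the one above is not quite as trivial to read back as it looks, because the website field will normally contain a colon of its own. Here is a minimal Python sketch of a parser, assuming exactly the six fields shown and no colons in the first four or in the e-mail address. The names `FIELDS`, `CATEGORIES`, and `parse_record` and the sample record are made up for illustration.

```python
# Hypothetical parser for the colon-separated record sketched above.
FIELDS = ("program", "license", "category", "author", "website", "email")
CATEGORIES = {"Commercial", "Shareware", "Freeware", "Semi-free", "Free", "Public-Domain"}


def parse_record(line):
    # The website usually contains ':' (as in "http://"), so take the first
    # four fields from the left and the e-mail address from the right.
    head = line.strip().split(":", 4)
    if len(head) != 5:
        raise ValueError("expected six colon-separated fields")
    program, license_name, category, author, rest = head
    website, sep, email = rest.rpartition(":")
    if not sep:
        raise ValueError("expected six colon-separated fields")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    values = (program, license_name, category, author, website, email)
    return dict(zip(FIELDS, values))


# Made-up sample record, for illustration only.
record = parse_record(
    "frobnicate:GPL:Free:A. Hacker:http://example.org/frob:hacker@example.org"
)
print(record["website"])
```

Splitting from both ends is a workaround, not a fix: any format that uses a separator which can also appear inside a value needs either an escaping rule or a different separator.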
The leader of the project, SF Perl Mongers' own Rich Morin, is being very circumspect about it, trying to gather lots of information from experts in different OSs and distributions, and of course working on it in his free time, so the product is not there now--but if you're interested in contributing to such an effort, this would be the place to help out.
Vovida, OS VoIP
Beer recipe: free! #Source
Cold pints: $2 #Product
"The LSM is a directory of information about each of the software packages available via FTP for the Linux operating system. It is meant to be a public information resource. All entries have been entered by volunteers all over the world via email using the template below..."
ftp://ftp.execpc.com/pub/lsm/LSM.README
I'm not aware of one that is cross-platform, though there is one for Linux called the "Linux Software Map."
The format includes the following fields:
Title
Version
Entered-date
Description
Keywords
Author
Maintained-by
Primary-site
Alternate-site
Original-site
Platforms
Copying-policy
Given that there is a platform field, despite it being referred to as the *Linux* Software Map, this does qualify on most of the criteria that you mentioned.
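To show how amenable to automatic processing such a template is, here is a rough Python sketch that reads a "Field: value" entry into a dictionary. The sample entry, the marker lines, and the rule that indented lines continue the previous field are assumptions made for illustration; check the LSM README for the real syntax.

```python
def parse_lsm(text):
    """Parse 'Field: value' lines; indented lines continue the previous field."""
    entry = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw[0] in " \t" and current is not None:
            # Continuation line: append to the field we are currently reading.
            entry[current] += " " + raw.strip()
            continue
        name, sep, value = raw.partition(":")
        if not sep:
            continue  # a line without a colon is treated as a marker and skipped
        current = name.strip()
        entry[current] = value.strip()
    return entry


# Made-up entry using the field names listed above.
SAMPLE = """\
Begin4
Title:          frobnicate
Version:        1.0
Entered-date:   2000-09-01
Description:    Frobnicates files in place.
                Works on several Unix systems.
Keywords:       frob, files
Author:         hacker@example.org (A. Hacker)
Primary-site:   ftp.example.org /pub/frob
Platforms:      Linux, FreeBSD
Copying-policy: GPL
End
"""

entry = parse_lsm(SAMPLE)
print(entry["Platforms"])
```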
Freshmeat, though not a format, is also a fairly comprehensive database of software which provides much the same information as you mentioned, including:
Title
Description
Author
Licence
Category
Download
Packages
Homepage
Changelog
Freshmeat, aside from providing updates on their site, also provides them via text files, which are suitable for simple automated parsing.
Though neither solution is entirely perfect, both are definitely close to what you're looking for.
What I would like to see is an SQL backend, with a simplified query engine on top of it that returns an XML formatted document. This would take care of the extensibility portion of it, as fields could be added to the backend and XML format without breaking compatibility with the client.
Likewise, I would like the database to be available as a download so that mirrors could be created and/or alternative front-ends.
(E.g. The search functions of Freshmeat aren't always flexible enough for me to easily pinpoint what I am looking for. I would definitely prefer being able to download a snapshot of the database and run custom SQL queries locally.)
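Roughly what I have in mind, as a Python sketch using the standard `sqlite3` and `xml.etree.ElementTree` modules. The `packages` table, its columns, the sample rows, and the `search_as_xml` helper are all invented for illustration; a real service would need a proper schema.

```python
import sqlite3
import xml.etree.ElementTree as ET


def search_as_xml(conn, **criteria):
    """Match rows on exact field values and return them as an XML string."""
    # Only accept field names that are real columns, so they can be
    # interpolated into the query text safely; values go in as parameters.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(packages)")]
    unknown = set(criteria) - set(columns)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    where = " AND ".join(f"{name} = ?" for name in criteria) or "1=1"
    cursor = conn.execute(
        f"SELECT * FROM packages WHERE {where}", tuple(criteria.values())
    )
    names = [col[0] for col in cursor.description]
    root = ET.Element("packages")
    for row in cursor:
        pkg = ET.SubElement(root, "package")
        for name, value in zip(names, row):
            if value is not None:  # empty fields are simply left out
                ET.SubElement(pkg, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (title TEXT, language TEXT, license TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?, ?)",
    [("jfrob", "Java", "GPL"), ("cfrob", "C", "BSD")],
)
# A field added later shows up as an extra element in new records;
# clients that ignore elements they don't know keep working.
conn.execute("ALTER TABLE packages ADD COLUMN homepage TEXT")

xml_doc = search_as_xml(conn, language="Java")
print(xml_doc)
```

With a downloadable snapshot of the database, the same file could be opened locally and queried with arbitrary SQL instead of going through the simplified front end.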
In any case, freshmeat and lsm are likely your best choices for the time being.
There are several such initiatives under way in the on-line library community -- librarians collect so much cruft, and since they tend never to throw things out, they feel an even stronger need than you for good metadata. The Dewey Decimal System is one such (very simple) metadata standard (sort of). Anyway, SunSITE.UNC.edu -- now iBiblio.org -- has required Linux developers uploading software to /incoming to include a Linux Software Map (lsm) file for quite a long time now. The .lsm file is a basic metadata file in a fairly simple format. So, you might look at that: http://www.ibiblio.org/pub/Linux/LSM-TEMPLATE
The Dublin Core initiative is a more generalized attempt to answer the question, "How do we standardize on a metadata format?" Dublin Core is using XML and XML DTDs as the basis for their work. It applies not only to software but also to other online resources. So, as one might guess, it's arcane and difficult to understand at best and completely impenetrable most of the time. You can find more about Dublin Core at http://www.purl.org/dc
Sadly, most search engine companies focus on searching a specific kind of document type -- like HTML -- for arbitrary content. Interestingly, searching metadata is both an easier computational problem to solve and more productive for the user. Unfortunately, it's also a far more difficult social problem. Getting everyone to write common metadata is very, very difficult. Going back and writing metadata for any sizeable archive (say, iBiblio, for example) is a Herculean task. I think most of the coders who write search engines are more interested in the actual mathematics behind searching than they are in actual Document Retrieval. You might also check out http://www.cnidr.org, who were the authors of Isearch and some other good searching tools.
"He wrested the world's whereabouts from the heavens And locked the secret in a pocketwatch." - Dava Sobel
--
The most notable places where RDF is presently used for real things (as opposed to "we'd like it to be used here vaporware") include:
The latter is exceedingly relevant, as it represents an encoding of metadata about Linux software packages in RDF form.
If you're not part of the solution, you're part of the precipitate.
OSD (Open Software Description), implemented by Microsoft in XML. Of course, it is the work of Satan.
There's even real code available, in Python, which I confess I haven't looked at, so I'm vague on what it does or doesn't do yet. I suspect it's worth a look.
Doesn't seem to have an element to store the license the software is distributed under. Or a homepage. Or a download link. Or a file size.
I guess XML is a good idea, but how about trying to mirror everything one can enter for a freshmeat.net entry? And consider what .deb, .rpm etc. have to offer.
Perhaps people find the Javadoc Conventions to be just a little confusing?
(Anybody who knows me knows I have a personal bias on things Javadoc. Probably not worth discussing on Slashdot. I mention it just to keep myself honest.)
Some parts of such a metadata standard are easy: language, compiler, platform, architecture, etc. But once you start trying to document the actual functionality of your code, you get into some sticky territory that is still the domain of researchers at a number of universities. The problem first is to devise a language powerful enough to facilitate formal methods. The next problem is actually convincing people that it's worth all the effort to formalize their specs (I think it is, but there are many who disagree). The last problem is coming up with a search algorithm that is able to match specs. For this part, you can't just use a string match or unification algorithm... there's some deeper semantic and structural analysis that needs to be done to determine that a certain fragment of code meets the constraints you want. To make the whole problem even worse, we don't even know if such an algorithm is computable! So, a full-blown metadata standard seems a bit out of the question now, but if you're willing to lower your standards a bit, I bet you can whip up a more practical implementation (with some natural language thrown in).
Name, Title, Organization, Address, Phone #, Fax #, URL, Date, etc... These metadata get assigned to the higher level of metadata such as: Originator, Copyright holder, Maintainer, Mirror Sites, etc...
It gets more complicated at the next description level. For instance, a set of metadata for software would be something like: Programming language, Operating System, Compiler, Library requirements (dependencies), Hardware requirements, License, Distribution restrictions, Lines of code, etc... Along with this would be tags like version number.
Then comes the software description: Application type (e.g., graphic converter, audio playback), User interface, Data input, Data output, Data formats, Batch/Interactive, Algorithms, Previous versions, Code stolen from, etc...
Metadata should be flexible enough to take in new types. Metadata sometimes points to more metadata which points to more metadata. Not all the metadata attributes need to be filled in. One should strongly attempt to standardize some of the key words. Metadata are a bitch to come up with. :-)
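To make the layering concrete, here is one possible way to model it, sketched with Python dataclasses. Every class name, field name, and vocabulary entry below is an assumption chosen for illustration, not a proposed standard: contact details are reused under named roles, nearly every attribute is optional, an `extra` dictionary leaves room for new types, and a small lookup table stands in for a standardized key-word list.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional

# Hypothetical controlled vocabulary: maps free-form spellings to one key word.
KEYWORDS = {"gfx": "graphics", "graphic": "graphics", "sound": "audio"}


def normalize(word: str) -> str:
    word = word.strip().lower()
    return KEYWORDS.get(word, word)


@dataclass
class Contact:  # lowest level: who and where
    name: str
    organization: Optional[str] = None
    url: Optional[str] = None


@dataclass
class Package:  # next level: facts about the software itself
    title: str
    version: Optional[str] = None
    language: Optional[str] = None
    operating_systems: List[str] = field(default_factory=list)
    license: Optional[str] = None
    dependencies: List[str] = field(default_factory=list)


@dataclass
class Description:  # top level: what the software does
    application_type: Optional[str] = None
    data_formats: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


@dataclass
class Record:
    package: Package
    description: Description = field(default_factory=Description)
    roles: Dict[str, Contact] = field(default_factory=dict)  # e.g. "Maintainer"
    extra: Dict[str, str] = field(default_factory=dict)  # room for new types


record = Record(
    package=Package(title="frobnicate", language="C"),
    description=Description(keywords=[normalize(w) for w in ["GFX", "Sound "]]),
    roles={"Maintainer": Contact(name="A. Hacker")},
    extra={"code-stolen-from": "nobody"},
)
print(asdict(record)["description"]["keywords"])
```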
Of course, this is most relevant to Open Source projects that make extensive use of CVS, but in a few years there will be no conflict to worry about.
----
I believe this is what you want: Open Software Description Format (OSD) from w3.org.
Abstract: This document provides an initial proposal for the Open Software Description (OSD) format. OSD, an application of the eXtensible Markup Language (XML), is a vocabulary used for describing software packages and their dependencies for heterogeneous clients. We expect OSD to be useful in automated software distribution environments.
_______
2B1ASK1
I don't mean this to be flamebait or a troll, please don't read it this way.
I'm not sure about most people and can't speak for any of them, but I've personally never visited either the Linux map or the Meta project.
For me (and millions of others?), the tried-and-true method of software distribution (tarball, website, and all-important README file) has been sufficient all these years.
Of course it's not efficient, it doesn't encourage searchability, etc... But it's what everybody uses and is used to.
I personally haven't seen enough discipline in the Open Source community with regards to a metadata project. The RM and Meta projects are great, but people need to use them.
Don't sweat the petty things. But do pet the sweaty things.
that would mean that you have to move to some sort of symbolic token language. The fact is, the problem is even greater with programming languages. If you program in C/C++, Pascal, COBOL etc, you're really programming in the English version of that language. That leads to very bizarre and sometimes funny code in languages other than English, where keywords are English but identifiers and comments are in some other language.
I think what's important to realize here is that we're not trying to find the meaning of life. XML is simply a standardized way of tagging information and it's not perfect. But the quest for perfection can sometimes prevent us from arriving at a solution at all. The whole struggle to standardize on various industry-specific markup languages is difficult enough and has led to enough feuds and confusion. Let's not make it even more difficult by obfuscating the whole issue with another order of complexity. Once XML has done its job well, we can worry about the finer points.
I'm sitting here preparing some Java source to release under the GPL
Java.. GPL.. problems talks problems, questions, concerns.. ehhhhh.. geez man.. just code !!
Are we coders or politicians ?
Just code !!!
I already have a system that does most of what is described in the story for in-house projects I do. The system uses XML in the back end and adds the file into the package. I also have an automated program that creates a readme.txt and readme.html from the data.
The XML holds a short changelog, date/times of all builds, long descriptions of changes for each build, misc project information, a major and a minor general description field, and some other minor information. The generator appends the file size of both the final package and all files contained. This system is designed for in-house use only, and is in no way currently usable as a solution for public use (part of a requirement I've made to co-ordinate development between our 4 coders). Also, many of you may be disappointed to know it's win32 only, as that is all the work we do internally. But if anyone wants to know some of the specifics of such a system, let me know.
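The readme generation step is the easy part. As a rough idea of how it can work, here is a short Python sketch using the standard `xml.etree.ElementTree` module. This is not the in-house tool described above: the element and attribute names (`project`, `summary`, `build`, `number`, `date`) and the `make_readme` function are invented for this example.

```python
import xml.etree.ElementTree as ET

# Made-up project file; the real in-house schema is not shown in this thread.
PROJECT_XML = """\
<project name="frobnicate">
  <summary>Frobnicates files in place.</summary>
  <build number="2" date="2000-09-02">Fixed crash on empty input.</build>
  <build number="1" date="2000-09-01">First internal build.</build>
</project>
"""


def make_readme(xml_text):
    """Render the project metadata as plain readme text."""
    root = ET.fromstring(xml_text)
    name = root.get("name", "unnamed project")
    lines = [
        name,
        "=" * len(name),
        "",
        root.findtext("summary", default="").strip(),
        "",
        "Changelog",
        "---------",
    ]
    # Builds are listed in document order, newest first in this sample.
    for build in root.findall("build"):
        note = (build.text or "").strip()
        lines.append(f"Build {build.get('number')} ({build.get('date')}): {note}")
    return "\n".join(lines) + "\n"


readme = make_readme(PROJECT_XML)
print(readme)
```

An HTML readme could be produced from the same parsed tree by swapping the line formatting for markup, which is the main attraction of keeping the data in one structured file.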
And at the end somewhat less relevant to the topic.
This kind of metadata should be extremely valuable for implementation of the URIs and particularly for the I2C(s) (URI to URC). Quote from RFC 2483:
Hopefully we already have a mechanism for the I2L(s) (FTP Mirror Tracker).
Right now I'm working on a proof-of-concept kinda thing that will test the current implementation of the XML format and the tools. Another guy is working on a Windows implementation, using the same standard.
If all goes well I'll have something to post by the end of this week. Keep an eye out on Freshmeat.
The National Institute of Standards and Technology has a division in the Info.Tech. Lab that has metadata as one of their projects. Looks like they're thinking XML.
"Even if you're on the right track, you'll get run over if you just sit there" - Will Rogers
XMI is a metadata standard recently developed by the OMG. It's encoded in XML and is used by some tools: the ArgoUML CASE tool (http://www.argouml.org), Rational Rose, and IBM's XMI Toolkit (http://alphaworks.ibm.com/tech/xmitoolkit), which automatically extracts metadata from a Java program. The standard is rather complex. To be productive with it, you have to understand things like metadata architecture and the MOF meta-meta-metadata standard (also by OMG). The alternative, XIF, used on its repository tool, is not much simpler, and there's very little information available for writing a XIF-compliant tool.
You left out the solution SourceForge.net is working on, called Trove, or simply the Software Map. It contains fields for Development Status, Environment, Intended Audience, License, Operating System, Programming Language, Topic, and Description, and is centrally served from http://sourceforge.net/softwaremap/trove_list.php.
Several of the categories are even hierarchical, which helps validate the values used. Another benefit is that if the license is open 'enough', you can host your web page and downloads at SourceForge, at which point it will help you track versions and release notes.
Wow -- that sounds like I'm a SourceForge PR person. Please understand that I'm not necessarily advocating them as the best solution -- I think freshmeat and lsm are extremely valuable. I just wanted to make sure the SourceForge solution was mentioned.
--Chouser
"To stay young requires unceasing cultivation of the ability to unlearn old falsehoods." -LL
Whatever format is used, it should
Idiot Alert! Learn something about both sites before you try to figure them out. Slashdot has editors that post stories. This was a question, so it was posted by an editor to "Ask Slashdot", a section for questions. Kuro5hin, OTOH, works by having the users vote on the stories to determine whether they should be posted. The users apparently decided the story should be posted.
So, to repeat what I just said so that your small pea-sized brain can comprehend it, there was one editor, and a bunch of users.
g'd day!