Advice for Building a Multi-Platform Lyrics Database?
AntonOnymous,Cowherd asks: "I am in the process of designing an application for general public use. The application will allow end users to search and display a large collection of songs (both lyrics and tunes) with annotations, all in text format. The intent is for this application to run cross-platform (Linux, Windows, Mac, and whatever else), so I want to avoid platform-specific binaries as much as possible. I also believe that the program should be Open Source. The end users will not necessarily be computer experts, so I want to avoid as much additional setup on their computers as possible. The application (data and program) will all be stored on a CD or DVD, and it should be able to be run locally. The most important part of this application is the data, not the program, so the guts of it should be fairly simple with a decent user interface. Does anyone have any suggestions as to general approach to setting this up, or have any pointers to existing open source programs which already perform a similar function?"
"One way to implement this would be to set up each song (with lyrics, tune, and annotations) as a single record in a database. I would like to avoid the inherent security issues and overhead of setting up and running a database on a user's computer.
Another possibility, which is fairly appealing, is to use a Web Browser to provide the user interface, and to use Open Source text indexing/searching programs (such as Lucene or Egothor) as the engine. It is probably safe to assume that most users have a Browser. However, most users probably would not have a web-server (even a local one) on their computer, and going by the principle of as little messing around with the user's computer as possible, I would like to avoid having to set one up, even a local one."
Since when did spam for someone's program get promoted to the front page of /.? /// heads back over to digg
Whatever you do, please store everything in UTF-8 encoding, since most of the lyrics of the world's music are not in English. I was outraged the day I discovered that the old CDDB system required everything to be in ISO-8859-1. What is someone to do with music in foreign scripts? ISO-8859-1 doesn't even have the necessary characters from standard Latin transliterations (such as the characters with carons for Cyrillic transliteration).
If you don't have any experience with Unicode issues, a problem shared by a regrettable number of developers, try Gilliam's Unicode Demystified.
Store all of the data on a server and write either a .NET or Java EE program to share the information as a web service.
Then just have a desktop client people download, which contacts the webservice to request the information.
Free continuous multi-player strategy http://www.holy-war.com/
Sounds like a perfect application of Wiki on a stick. I set one up in a few hours, most of which I wasn't even sober - and it can install with a zero-footprint (designed to run from a thumbdrive.)
I have a little more write-up in my Journal, along with links.
Glonoinha the MebiByte Slayer
You're making this more complicated than it needs to be. Since the DB will be frozen, you can pre-compute the index and store that. The front-end can be browser-based, since browsers are on every platform and, within a common subset of features, are compatible. O'Reilly used to do this with some of their book/CD combinations.
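To make the "pre-compute the index" idea concrete, here is a minimal sketch in Python. It assumes the songs are plain text keyed by an ID; the names `build_index`, `search`, and the sample data are hypothetical, and a real disc would probably want stemming and a more compact format than JSON.

```python
import json
import re
from collections import defaultdict

def build_index(songs):
    """Build an inverted index: each lowercase word maps to the song IDs containing it."""
    index = defaultdict(set)
    for song_id, text in songs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(song_id)
    # Sets are not JSON-serializable, so store sorted lists instead.
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return the IDs of songs that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)

# Hypothetical sample data standing in for the real collection.
SONGS = {
    "0001": "Greensleeves was all my joy",
    "0002": "Joy to the world",
}

# Build time: compute the index once and serialize it for the disc.
serialized = json.dumps(build_index(SONGS), ensure_ascii=False)

# Run time: load the stored index and query it without rescanning any lyrics.
loaded = json.loads(serialized)
print(search(loaded, "joy world"))
```

Because the collection never changes after the disc is pressed, the expensive step happens once at build time, and the front-end only has to load and query the stored file.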
The record companies will come after you sooner or later if your site is successful.
Never mind that you're providing a service that they no longer care to provide. I remember when albums and even early CDs came with the lyrics to every song in a booklet. Don't see that very often anymore, if at all. And I have probably dozens of songs in my iPod that I would never have found if lyrics databases didn't exist (because God knows the shitty DJs on radio today can't be bothered to give the title and artist when you really want to know it).
So, good luck to you, but form an LLC to own what you're building so when the record companies come after you, they won't bankrupt you personally.
Music lyrics, unfortunately, are copyrighted. Every db on the web that's gained real size has been shut down by the RIAA. Whatever you do needs to be hosted out of a country that doesn't do copyrights, or you're dead in the water.
I still have more fans than freaks. WTF is wrong with you people?
Webservice.
Lots of websites already do this, so why bog yourself down with something that has already been done? Unless it's for some kind of research project for university/college, of course.
Open source solutions which do the same? Amarok has a "lyrics" tab which brings up the lyrics to the playing song. I think they are pulled from Wikipedia, but I'm not sure.
Also, MusicBrainz has a huge database of music too, which is why they are seemingly linked in Amarok.
So basically you're not onto a winner with this unless you're going to offer something all the hundreds of others fail to offer.
Amarok, Wikipedia and MusicBrainz are all open source.
I'm not sure, however, how all of these cope with non-English alphabets, which is something lots of people tend to bring up.
- http://www.milkme.co.uk
I like how you plan to use open source software so that you can then violate someone else's copyright. You do realize that you won't have the rights to distribute the music and lyrics to these songs, don't you? That is unless, of course, you plan to only distribute songs that are in the public domain. In which case, you'll have a fairly small market (yes, I realize there are some instances where this wouldn't be true--church hymns for instance).
This guy's the limit!
You will be sued the minute you launch the service. Lyrics are copyrighted, and fiercely protected by the copyright owners.
In Soviet Russia, I ruled you
You think setting up a local database is a security risk, but setting up a local web server isn't? Why? You are aware that databases don't have to be servers listening on public ports, don't you? You could use something like SQLite.
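For what it's worth, here is a rough sketch of the SQLite route using Python's standard `sqlite3` module. The table layout, the `find_songs` helper, and the sample rows are assumptions made up for illustration; the point is only that the database is a file opened in-process, with no server and no listening port.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; on a real disc this
# would be a single file shipped alongside the program.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " lyrics TEXT NOT NULL,"
    " annotation TEXT)"
)

# Hypothetical sample rows (traditional songs, one line each).
conn.executemany(
    "INSERT INTO songs (title, lyrics, annotation) VALUES (?, ?, ?)",
    [
        ("Greensleeves", "Alas, my love, you do me wrong", "Traditional English"),
        ("Scarborough Fair", "Are you going to Scarborough Fair", "Traditional English"),
    ],
)
conn.commit()

def find_songs(conn, term):
    """Return titles of songs whose lyrics contain the term (simple substring match)."""
    rows = conn.execute(
        "SELECT title FROM songs WHERE lyrics LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]

print(find_songs(conn, "love"))
```

For a database file on read-only media, `sqlite3.connect("file:songs.db?mode=ro", uri=True)` opens it without trying to write; `songs.db` is a placeholder name.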
The important thing is not the implementation itself. It's the data format and/or API. Make the data available, and plenty of people will be willing to write web interfaces, Qt interfaces, GTK interfaces, etc. Expose the API as plain C, and make the data easily importable/exportable, and it really doesn't matter if you produce the crappiest proprietary implementation imaginable, because both the backend and the frontend can be replaced individually.
Bogtha Bogtha Bogtha
He seems to be saying that everything will be on the person's computer. He doesn't say where the lyrics are coming from. Perhaps the user is to cut and paste them from online. I don't see how this will be much of an improvement on either Googling for lyrics, or using local search on text files on your hard drive.
However, http://www.animelyrics.com/ is one database of lyrics that isn't getting sued, like most things anime.
I get that it's "nice" to be able to encode something in Latin, Greek, Russian, Arabic, Chinese, etc. But what I don't get is the fascination with putting multiple code pages in the same document. Seriously: I don't get it.
Can someone please explain why it's a good idea to have the option of changing the code page on every single character in the document? To me that feels like a step backwards. Why not just define the code page ONCE per document -- perhaps even in metadata?
I'm serious.
You can start with the MusicBrainz codebase. The schema already supports albums, tracks, and annotations. You could extend it for your purpose to add lyrics. A daily dump of the database is available as is the source code to the server application.
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
Do you know nothing about Unicode? UTF-8 is a single character set, or "code page" as you say. You just define it once in the first characters of the document, and then you can put nearly every script ever created in the same document without changing anything.
Copyright issues aside (I'm assuming that you're talking about lyrics that you have the legal right to use) I'd say that there's a pretty simple answer to your problem. You're thinking through the pros and cons of using a back-end database versus a browser front-end, and you're not keen on running any flavor of server.
You can get both the database and browser advantages without having to set up a separate server by building your app on the Mozilla platform. You can utilize its built-in RDF capabilities to store your data in a clean, extensible way, and fairly quickly put together a user interface using XUL and CSS that can work with Firefox, Seamonkey, Flock, etc., or even just the XUL app runner for a more stand-alone user experience.
Because all of your data (and even interfaces) will be XML-compliant, you'll even be making it easier for third party apps to work with your stuff.
Now you've reached the point of actually needing a clue to accomplish it.
Just pay someone, you obviously don't have a clue.
...but it seems to me writing binaries would be a mistake, and that the best route would be browser-based. If you're doing this as a "public consumption" application, requiring novice PC users to learn a special application to do something like this will make part of your audience reluctant to try your application. If you tell them "Open your web-browser and go to this private web-site at http://www.whatever/" it won't seem as imposing. So many musicians have MySpace pages that seeing it as a web-page to visit a la Google is vastly superior to perceiving it as a "Program I have to learn how to use." Musicians like simple: If you can do it in a browser with a handful of elegant controls, you are better off, since a well-implemented browser-based application is effectively client-OS-independent. It might matter what runs on your web, DB, and app servers, but otherwise, not an issue unless you're talking about an Apple IIgs running ProDOS 16 and an alpha release of Mosaic over your token-ring LAN... But how many of your users are going to be in that weird of a configuration?
Solaris, BSD, Linux, Windows, Mac OS, all of the above have standards-compliant web-browsers, and all have a Java Virtual Machine--a challenge to make all the platforms have identical performance, but easier than writing ten different client applications, and more likely to be usable.
Who did what now?
Forgive my use of the outdated term "code page" (which was actually the proper term before Unicode changed the terminology). Reread my post with s/code page/script/ and you should be satisfied.
Now, can someone please explain why it's a good idea?
As for the database handling, since this will be static (if you want it to run off a CD, it's static), here is what I can think of. You can embed an SQL server (I know there is one, can't remember the name) and do it that way. I don't know if that is an option for Java. Your other option is to store it in files. You could easily make a bunch of directories (made by a script) that would give you a large directory tree. Simply assign an ID to each song. The first digit of the ID is the first folder (there are 0-9), the second folder inside the first works the same way for the second digit, etc. You can go as far as you need. You can either keep the individual files, or inside each directory that you make (maybe you only want to go 2 levels deep) you keep ALL the data for those songs in a compressed file (XML, serialized objects, whatever). Then you have an index file that you load that tells you how to find the songs (which ID goes to which artist/CD/track), and you could add a second file that holds a database of words and the IDs in which they appear, for searching purposes.
I'd say go Java. You can include the JRE on the disc (at least for OS X and Windows). Java is very stable and mature, whereas something else like wxWindows may not be (don't know how well it performs on various platforms). Plus wxWindows or Qt would require extra libraries.
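The digit-per-folder layout described above can be sketched in a few lines. This is in Python rather than Java purely for brevity, and the six-digit zero-padded ID, the `songs` root folder, and the `.txt` extension are all assumptions for illustration.

```python
from pathlib import PurePosixPath

def song_path(song_id, levels=2, root="songs"):
    """Map a numeric song ID to a nested path, one directory per leading digit."""
    digits = f"{song_id:06d}"      # assumed: IDs are zero-padded to six digits
    parts = list(digits[:levels])  # the first `levels` digits become folder names
    return PurePosixPath(root, *parts, f"{digits}.txt")

print(song_path(123456))   # songs/1/2/123456.txt
print(song_path(42))       # songs/0/0/000042.txt
```

With two levels there are at most 100 leaf directories, so even a collection of a million songs keeps each directory at a manageable size for the file systems typically found on CDs and DVDs.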
If the user didn't want to run the app off the CD, they'd just have to copy it all to a folder they have access to. If you put the code in a JAR file, not only is it cleaner but you can run the program simply by double clicking on it in either Windows or OS X (might need an extra property or two in the manifest file for OS X).
For the web part, you could have a little application that launches the web server and closes the server when you close the program. You could embed the web browser in that. It would be more complex though, especially when going cross platform.
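As a rough illustration of the "launcher that starts and stops a web server" idea, here is a sketch using Python's standard library instead of Java. The function name `start_local_server` is hypothetical; the server binds only to the loopback address and lets the OS pick a free port, so nothing is exposed to the network.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not print a log line for every request."""
    def log_message(self, format, *args):
        pass

def start_local_server(directory):
    """Serve `directory` on 127.0.0.1 using an OS-assigned port; return the server."""
    handler = functools.partial(QuietHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    # A daemon thread means the server cannot keep the process alive on exit.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

# Typical use (commented out so the sketch has no side effects when run):
#   import webbrowser
#   server = start_local_server("/path/to/disc/contents")   # placeholder path
#   webbrowser.open(f"http://127.0.0.1:{server.server_address[1]}/")
#   ... wait until the user quits ...
#   server.shutdown(); server.server_close()
```

This sidesteps embedding a browser: the launcher opens whatever browser the user already has and shuts the server down when the program exits.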
I'm going to have to stick with using Java. I think that will be your best bet.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
But what's the point? Lyrics are copyrighted, and how useful is it to have "annotations" attached to a song? And is it that hard to just listen to the lyrics? Isn't that why we have music in the first place?
Choose a country that has sane copyright, data, and privacy protection laws. The Netherlands, maybe. Not in the US.
How are you meant to encode the lyrics to, say, Baby Love Child, where it's mainly English, but with some Japanese? No single code page contains both English and Japanese characters.
So, a friend of mine wrote one of the first online lyrics servers.
Here's his story.
Do we need another lyrics database? Hell no, we don't. I can't turn around without bumping into 3 or 4 of them. Every time I search for something on Google, it finds at least 10 lyrics sites that have songs including those words. Do you know how many artists wrote songs about elephantitis, or psoriasis, or ranitidine????
Seriously, search for any lyrics and you get hundreds of sites, all with the same spelling or typo errors. We don't need another one!!!!
Well, "script" doesn't really make sense in the context of your original post, but I'll take you at your word that you don't see the appeal of mixing scripts on one page.
To start, I'll direct you to the Japanese codepage 932, which includes at least four scripts: basic Latin alphabet, katakana, hiragana, and kanji. People seem to have thought it was necessary to be able to use all of those on one page, perhaps because Japanese tends to mix three of them together on a regular basis and likes to throw in English words for flavor. (No doubt, Latin characters helped to write computer programs as well.)
Unicode just extends the principle so that you can do things like:
...and so on. The Unicode character set is just a big flat space, just like ASCII except with a lot more code points.
The point about internationalization perhaps shouldn't focus on UTF-8 specifically -- one could use UTF-16 instead -- but both encodings give you access to the Unicode character set, which allows you to, as you put it, "define the code page once per document."
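A small Python sketch of that point: one string mixing several scripts survives a round trip through either UTF-8 or UTF-16, while a single-script encoding such as ISO-8859-1 cannot represent it at all. The sample text and the `can_encode` helper are made up for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Illustrative text only: English, Latvian, Russian, and Japanese on one line.
line = "love / mīlestība / любовь / 愛"

# The encoding is chosen once for the whole document, never per character.
utf8_bytes = line.encode("utf-8")
utf16_bytes = line.encode("utf-16")

print(utf8_bytes.decode("utf-8") == line)     # round trip through UTF-8
print(utf16_bytes.decode("utf-16") == line)   # round trip through UTF-16
print(can_encode(line, "iso-8859-1"))         # Latin-1 cannot hold this line

# Every character is just a number in one flat space of code points.
print([f"U+{ord(ch):04X}" for ch in "aā愛"])
```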
Nothing is internationalized by default. There is no magic that converts a program's English strings into their Traditional Chinese translations.
"Internationalized" means capable of working with user data in multiple languages and does not imply ability to translate user data from one language to another. "Localized" means that the interface is available in more than one language.
Wow. Reading that I suddenly remembered my own experience providing a lyrics service on the web.
Back in 1995, I put together a website that cross-referenced the lyrics to Les Misérables in English, French and German (all typed in by hand from the CD liner notes). At first it was hosted on webspace at AOL, but I later moved it to some space I had at college. From 1996-2000 I added songs in more and more languages, each time carefully cross-referencing and linking so that you could jump from each song straight to the same song in each other language. I had a modern French version (the original was considerably different from the show as it opened in London and Broadway) in all-caps, and a French speaker agreed to provide all the accents and diacritical marks. People sent me, sometimes one song at a time, lyrics in Hungarian, Norwegian, and Swedish. I tracked down import CDs of more languages that I could type myself. People even started sending me songs in Chinese and Japanese, first as GIF images, later in text. I learned a lot about cross-platform use of character encodings and fonts, and about website accessibility.
After I graduated from college, a friend at the lab agreed to keep my site running for a few months while I found new hosting. In January 2000, I bought a domain name. In February, I transferred my entire website from www.arts.uci.edu to hyperborea.org. In March I received a cease-and-desist letter. Knowing I had no legal right to keep the lyrics online, I took the Les Mis section down that afternoon, leaving only the parts that weren't subject to copyright.
Now, keep in mind that I ran this site for five years at AOL and UCI, making no effort to hide it. Within a month of setting up my own domain name, suddenly the lawyers were after me? It seemed too much of a coincidence.
Even today, there are still pages on the net that link to "Les Mis: The Complete Multilingual Libretto." (Of course, many of them are Geocities sites that haven't been updated since 1997, or exported bookmarks files languishing on some university server.) And I still get the occasional request for lyrics by email.
Seriously!
I disagree. Ask Slashdot used to be specific questions about a technology or how to go about something. Lately, however, it's been one question after another that goes:
"I'm working on this project that will be able to do X? How do I do it?"
There's a big difference between learning how to do something and asking somebody else to figure it out for you.
Don't get sued by the RIAA; many a lyrics website has been taken down for copyright infringement.
Advice for Building a Multi-Platform Lyrics Database?
Try not to get sued.
Actually, it is the NMPA (National Music Publishers' Association) and the publishing companies, who shut down the lyrics databases. The RIAA has no jurisdiction over the use of lyrics.
If you want to start such a database, my advice would be to lay the groundwork for it, software-wise. Then, contact the publishers to get lyric reprint licenses. It would be nice to say that the publishers would be happy to provide you with such licenses, but chances are they would be difficult to obtain, since you are not actually making a recording with said lyrics attached.
To find out who owns the publishing rights (and therefore lyric reprint rights) to songs, at least in the U.S. you can search databases like ascap.com and bmi.com. You will find out quickly that the major music groups, i.e. Warner Brothers, EMI, Sony/BMG, and Universal, own a vast majority of these rights. ASCAP and BMI will provide you with contact information for the publishers.
In short, get the licenses, or at least investigate how much it would cost to do so before you start such an undertaking. Chances are it will be more than you expected, but a lot less than court costs when you get sued for copyright infringement. Don't say I didn't warn you.
You mention open source. What about the lyrics themselves? If you are the single provider of that CD or DVD, I don't care if the programs are open source or not. All I care about is that the data is in an open format so I can code against it myself. Closed-format content is useless to me.
That's easy: a song in Russian with one author from Latvia. You won't be able to write the author's name in Latvian.
Besides, there are about 6 Russian codepages: Win1251, KOI8-R, CP866, ISO, MacCyr, GOST-Cyr. What codepage are you going to use?
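To see the problem in practice, here is a short Python sketch that takes one Russian word, encodes it in Windows-1251, and then reads those same bytes back under the other Cyrillic code pages that Python's standard codecs provide. GOST-Cyr is left out because I don't know of a standard codec for it; the word and the author name are hypothetical examples.

```python
# Cyrillic code pages available as standard Python codecs.
CODEPAGES = ["cp1251", "koi8_r", "cp866", "iso8859_5", "mac_cyrillic"]

def decode_everywhere(raw):
    """Decode the same bytes under each code page; unmapped bytes become U+FFFD."""
    return {cp: raw.decode(cp, errors="replace") for cp in CODEPAGES}

def fits(text, encoding):
    """Return True if the text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

word = "песня"               # "song" in Russian
raw = word.encode("cp1251")  # bytes as a Windows-1251 file would store them

# Only the code page that wrote the bytes reads them back correctly.
for codepage, reading in decode_everywhere(raw).items():
    print(f"{codepage:13} {reading}")

author = "Jānis"                            # hypothetical Latvian author name
print(fits(author, "cp1251"))               # the Cyrillic code page has no "ā"
print(fits(f"{author} {word}", "utf-8"))    # UTF-8 holds both in one document
```

Without metadata saying which code page was used, the reader has to guess; with UTF-8 there is nothing to guess.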