Rudi+Cilibrasi · Slashdot Mirror

translation scripts are the key on Pair-Programming with a Wide Gap in Talent? · 2006-03-30 04:54 · Score: 2, Informative

The best way to bring up a newbie in pair programming is to give her simple stuff that she can do reasonably quickly and well enough. The most common highly delegatable task by far for text-based programmers like me is the translation script. It's a breeze to specify: you need to figure out what other programs you are going to interface with and then just tell them what you need in and out. Best to have real files to work with. For example, in my field I use a lot of distance matrices that consist of an array of labels and a two-dimensional square of numbers that is symmetric about the diagonal; but there are many different formats for these, and my CompLearn system only supports one text matrix format. So already there are a bunch of simple translation programs that I can easily delegate by saying "See this distance matrix output from program X? Translate it to something that works with CompLearn!" This is also useful for the output side of most programs that need to work with the real world. And of course the venerable and highly useful HTML screen scraper is a great subtype of this big class of programs. The reasons this is so good to delegate are as follows:

It only needs to work; it doesn't need to interface to the rest of your program, so if they really mess up the interface or have bad style it is probably no big deal.
It is usually simple enough to dispense with all custom objects/classes, and sometimes even functions are unnecessary for these things. The goal should be to get it working as quickly and easily as possible with a minimum of fuss: perfect for agile goal-oriented development and extremers.
Once they write a half-dozen or a dozen of these they are ready to start playing with lambdas, functions or maybe small classes. As a scientific programmer, half my coding is translation scripts, and as I get more and more into text-based systems the percentage seems to go up. Text goes a long way, and the skills are very composable with existing Unix utilities and pipelines without any hassle at all.

My favorite language for this kind of thing is Ruby, but I think Perl or Python would be fine too; anything that supports regexp and decent strings is a winner.
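To make the distance-matrix example concrete, here is a minimal Python sketch of such a translation script. Both formats are assumptions made up for illustration: the input is imagined as "labelA,labelB,distance" lines from some program X, and the output is a plain text matrix with one labeled row per item. The real CompLearn matrix format is not described here, so check its documentation before relying on this layout. The function names are hypothetical too.

```python
def pairs_to_matrix(lines):
    """Turn 'a,b,distance' lines (assumed input format) into labels + square matrix."""
    dist = {}
    labels = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        a, b, d = [field.strip() for field in line.split(",")]
        for label in (a, b):
            if label not in labels:
                labels.append(label)
        # Store both orders so the matrix is symmetric about the diagonal.
        dist[(a, b)] = dist[(b, a)] = float(d)
    # A pair that never appeared in the input raises KeyError here.
    matrix = [[0.0 if r == c else dist[(r, c)] for c in labels] for r in labels]
    return labels, matrix


def format_matrix(labels, matrix):
    """Assumed output format: label, then its distances, space-separated."""
    rows = []
    for label, row in zip(labels, matrix):
        rows.append(" ".join([label] + ["%.4f" % x for x in row]))
    return "\n".join(rows)


# Made-up sample data standing in for "output from program X".
sample = ["cat,dog,0.5", "cat,fish,0.9", "dog,fish,0.7"]
labels, matrix = pairs_to_matrix(sample)
print(format_matrix(labels, matrix))
```

To use it as a pipeline filter, pass sys.stdin to pairs_to_matrix instead of the sample list. Note that it needs no classes and barely any functions, which is the point.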

Happy Pair Programming and good luck! Rudi

Here's an easy one that works on Linux + Windows on Open-Source Bioinformatics Programs? · 2005-07-23 19:51 · Score: 1

Check out http://complearn.org/. I've used it for many different applications, including genomics and proteomics. It can be used easily by novices or experts, as it is parameter-free.

It's not just movies and dollars, it's lives here on Copyright Issues in the Mainstream · 2005-07-01 03:06 · Score: 2, Interesting

Good day. My name is Rudi Cilibrasi. You may email me at cilibrar@gmail.com

I am a lifelong computer programmer and open source author. I have contributed to the Linux kernel. I have also worked at Microsoft for a few months. You can see some of the software I am now writing at

http://complearn.org which allows you to do advanced data-mining for free.

I am writing this now to address what I consider to be a very serious matter. It is relevant to the moral basis upon which Intellectual Property is founded. As a scientist and programmer, I am a very technical person and tend to get very involved in my own health decisions. It happens that I was born prematurely in 1974, and received a blood transfusion from my mother who was infected with Hepatitis C (HCV). For those that are not aware, this causes a lifelong degenerative liver disease. Both of my parents have already died young due to its effects, and I am HCV+ as well and have been slowly suffering liver degeneration my entire life as a result.

This concerns IP because some years ago I did some research online about my possible treatment options. In the year 2000 the possibilities were "old, normal" interferon or pegylated interferon, taken in both cases in combination with ribavirin. These are chemotherapy-type drugs, with very harsh side effects, and you must take them for a year in order to have a decent chance of curing yourself of HCV. The problem is, with my genotype, 1b, the chances of success using the old medicine were only about 30%. The new medicine had about a 60% chance. But the FDA did not approve the new medicine until years after Europe did, for reasons which are not entirely clear, given the solid research findings in its favor. So I flew to Europe, got the new pegylated medicines for about twenty-five thousand dollars of my own carefully saved money, and flew back to the USA.

I spent about 3 months treating myself with this medicine that was not yet approved in the USA and then checked my viral counts to find great news: I had lowered my viral count to undetectable levels, suggesting that if I just continued with the yearlong course of treatment, I would probably be permanently cured. What great news!

Imagine my dismay, then, when I received a note from the customs office saying they had blocked shipment of the second half of my pegylated interferon + ribavirin. The reason, apparently, was that there was a patent or IP law problem restricting the European branch of the pharmaceutical industry from selling these drugs to Americans, even if I bought them in Europe with my own money for personal use. I figured it would not be a big deal -- I would just explain to the customs officers that this was a life-threatening illness, and they would help me find some way to appeal the block before it was too late.

The big problem is that if you skip your medicine for more than a week or two before the full year, you may as well stop entirely because the virus will almost certainly come back in full force.

So, having explained this to the customs official over the phone, I was shocked to find that it seemed there were no provisions in place to handle the case where an IP restriction is in direct conflict with human life. My life. I am still HCV positive now.

It's now several years later. My parents have since died and my liver has gotten worse. I would enjoy being around to continue to contribute for free (because I love programming) and would enjoy talking more with all of you about many things. But this will not be possible unless we reframe the IP debate in terms of human-centric goals. It should not be the case that a creative scientist and programmer with a lifelong history of giving away his technological creations for free would be denied the resources he needs to satisfy one of the simplest and most basic human needs -- to have his illness treated in the most effective way possible -- because of a mere Intellectual Property dispute. It should not be

Re:Limits to semantic derivations from Google on Deriving Semantic Meaning From Google Results · 2005-01-30 20:55 · Score: 1

I'm glad to see you are interested in our work. I applaud parallel and different efforts like your own system; however, I think you are making at least one misleading and factually false assumption that I would like to correct. By coincidence, I have already done an unpublished experiment that involved Tom Cruise. Contrary to your assertion that it's impossible to get useful data, in fact I have already gotten the data that:

Tom Cruise is an actor/actress more than something else
Tom Cruise is an actor more than an actress

This was actually one of the first experiments that I tried and the results were about 85% accurate for this actor / non-actor classification problem. But most of my experiments cannot fit in the paper.

I got these results from trained classification using an SVM in binary mode, with about the same amount of training data as in all my other WordNet experiments. If you review my paper you will find my program looks only at page counts and does not look at results yet. Therefore, your claim that we cannot gain "useful" data from page counts is patently false.

One of the most astounding results of my research is in fact precisely the opposite of your assertion: namely that we can in fact derive useful data from just page counts alone. And another clear conclusion of my experiments is that the problem you are imagining, that the web is somehow too low quality to be useful, does not hold. In fact there is a very active branch of learning theory that deals with boosting accuracy of imperfect heuristics using a variety of techniques such as majority-voting schemes, multiple trials, etc. But my experiments show that you can achieve good accuracy even without these techniques.

For the skeptical, I invite you to try to replicate my experiment using Tom Cruise yourself using your favorite scripting language. Just make a list of actors and a list of non-actor (but famous) people. Then choose anchor terms as I did in my paper and train an SVM. Then test it with Tom or whoever else you like as a test case, and tell us about it. If you don't script then you can do this by hand in a few hours using web searches in your browser and a calculator to calculate the NGD grid. Then just feed that into an SVM package of your choice (or try any other learning algorithm if you like). Best regards, Rudi
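For anyone attempting the replication, the NGD grid can be computed from page counts along these lines. This is a sketch, not the program used in the paper: it assumes the usual NGD definition, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f is a page count and N is an assumed total number of indexed pages. The function names, the dictionary layout, and every count in the sample are invented for illustration.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw page counts.

    fx, fy: hits for each term alone; fxy: hits for both terms together;
    n: assumed total number of indexed pages (a normalizing constant).
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # no (co-)occurrences: treat as maximally distant
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def ngd_grid(names, anchors, counts, pair_counts, n):
    """One feature vector per name: its NGD to every anchor term."""
    return [
        [ngd(counts[name], counts[a], pair_counts[(name, a)], n) for a in anchors]
        for name in names
    ]


# All numbers below are made up; real ones come from search-engine page counts.
counts = {"Tom Cruise": 100, "actor": 100, "senator": 100}
pair_counts = {("Tom Cruise", "actor"): 50, ("Tom Cruise", "senator"): 2}
grid = ngd_grid(["Tom Cruise"], ["actor", "senator"], counts, pair_counts, 10000)
print(grid)
```

Each row of the grid is the feature vector for one person, and those rows are what would be fed to the SVM package.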

Re:Limitations of NGD (Normalized Google Distance) on Deriving Semantic Meaning From Google Results · 2005-01-29 12:14 · Score: 2, Interesting

You are right that Google may be performing estimation and this could affect results, and I don't really know what sort of rounding they do at this time. Perhaps more will become apparent. But your other assertion about no higher order statistics is incorrect. See the earlier Clustering by Compression paper for more info. Quickly, the reason is as follows:

I use NGD to convert arbitrarily-large lists of search-terms into feature-vectors of arbitrary dimension. The only limit to this is the max query length for Google, and this is just a detail.
I use a Support Vector Machine with a Radial Basis Function kernel. The RBF kernel has an effectively infinite dimension and so can learn any function. SVM is a universal learner like neural nets and many other famous algorithms. So higher-order features (composed of products of several NGD) can indeed be used in learning.
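As a rough illustration of the second point, this is the standard RBF (Gaussian) kernel applied to NGD feature vectors, written in plain Python rather than with any particular SVM package. The function names, the gamma value, and the sample vectors are assumptions for the sketch. Expanding the exponential as a power series produces products of the vector coordinates of every order, which is why products of several NGD values are available to the learner.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance between u and v)."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def gram_matrix(vectors, gamma=1.0):
    """Kernel value for every pair of feature vectors."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]


# Made-up NGD feature vectors: one row per search term, one column per anchor.
features = [[0.15, 0.85], [0.20, 0.80], [0.90, 0.10]]
for row in gram_matrix(features, gamma=2.0):
    print(["%.3f" % k for k in row])
```

An SVM with this kernel only ever sees the pairwise kernel values, so similar feature vectors score close to 1 and dissimilar ones close to 0.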

The main purpose of the research is in extending generality of automatic learning. See the earlier papers in the series including Algorithmic Clustering of Music, and the earlier theoretical work. NGD is a special case of NCD. NCD is a family of functions that can be used as the basis of a universal learning system in a variety of ways. Our theory justifies this innovation and leads to a whole class of easy-to-write algorithms.
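To show how short such an algorithm can be, here is a sketch of NCD, assuming the usual definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Using zlib as the compressor is an assumption made for convenience; any real-world compressor can be swapped in, and the results depend on which one is chosen. The helper names and sample strings are illustrative only.

```python
import zlib


def compressed_size(data):
    """Compressed size in bytes, with zlib standing in for the compressor C."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up sample data: a text compared with itself and with a different text.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"pack my box with five dozen liquor jugs now " * 20
print("same text:     ", ncd(a, a))
print("different text:", ncd(a, b))
```

Similar inputs compress well together and score near 0, while unrelated inputs score closer to 1.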

Thanks for your interest; it is good to see that this research is striking a chord with the Slashdot community. I hope this leads to a whole lot more easy-to-use semi-intelligent software. Cheers!

Super Scary Climate Blog (SSCB) begun on New Climate Change Warning · 2005-01-26 21:11 · Score: 2, Interesting

A while ago I was inspired to create this blog, and ever since it seems to be writing itself. I have set up Super Scary Climate Blog. I've got an Instiki Wiki started there for the purpose of tracking climate variability. Please feel invited to contribute.
