I'm as big a privacy wonk as you could ask for. But I'm also in favor of individual rights. I don't think anybody should be able to casually stroll onto your property and peruse the contents without a warrant, obtained through due process as reasonably defined by current law.
On the other hand, I also don't think it's anybody's responsibility to avert their eyes from anything you are foolish enough to throw out. Still less is it their responsibility to destroy it for you.
We call it throwing away because that's what it is. If you have reason to be concerned, destroy it before tossing it. If you don't, and you literally drop it on public property, well, whose fault is that?
In fact, everybody ought to practice this to an extent, because of the number of f****'ing idiotic companies who seem to think it desirable to print your CC number on every receipt or payment you make with it. I think that ought to be criminal negligence personally, especially the ones that also print the expiration on it (combined with your address which the dumpster diver can easily obtain, that's enough to rack up charges), but in the meantime, anyone with a CC should at least try to destroy anything with the number on it before it leaves the house. You don't need a shredder per se, it may do just to rip it up. Or burn your mail, if you live in the country. That's not paranoia, that's just prudence.
This problem is a specific case of a more general problem. Going backwards and forwards is basically undoing and re-doing actions. Any undo/redo system has the same problem.
The problem is that while the solution is obvious (store all the commands as a tree), both the UI and user models suck. How are you going to show the structure? How are most users going to mentally model some sequence like "A, B, C, undo, D, E, undo, undo, redo C, F, undo, redo D, redo E, G, H, I"? (You may need to draw that to see what I mean, which is exactly my point; even techies don't think this way. (Yet.))
In fact I intend to find out whether people can understand this in the most direct way possible, which will be to implement it in my project and see what happens. I have no idea if people will be able to understand it. (It happens in my case that the project naturally fits into this anyhow, so it's worth a try.)
In the meantime, the current model (undoing a few things and then doing something else completely destroys the remaining redo list) has the virtue of simplicity, something that no other proposal, including the one in this article, can have over the current system. I suspect that the current system is the best choice if we can't have the full tree-based implementation.
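For what it's worth, the "store all the commands as a tree" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the implementation from the project mentioned above: the class name UndoTree and its methods are hypothetical, and "redo" takes an optional command name to pick a branch.

```python
class UndoTree:
    """Undo history kept as a tree, so an undo never throws away a redo branch."""

    def __init__(self):
        self.parent = {0: None}    # node id -> parent id (node 0 = empty document)
        self.children = {0: []}    # node id -> child ids, oldest first
        self.command = {0: None}   # node id -> command that produced the node
        self.current = 0

    def do(self, command):
        """Apply a new command as a fresh child of the current node."""
        node = len(self.parent)
        self.parent[node] = self.current
        self.children[node] = []
        self.command[node] = command
        self.children[self.current].append(node)
        self.current = node

    def undo(self):
        """Step back to the parent node, if there is one."""
        if self.parent[self.current] is not None:
            self.current = self.parent[self.current]

    def redo(self, command=None):
        """Redo into the named branch, or into the newest branch if none is named."""
        branches = self.children[self.current]
        if command is not None:
            branches = [n for n in branches if self.command[n] == command]
        if branches:
            self.current = branches[-1]

    def applied(self):
        """Commands currently in effect, oldest first."""
        path, node = [], self.current
        while self.parent[node] is not None:
            path.append(self.command[node])
            node = self.parent[node]
        return path[::-1]


history = UndoTree()
history.do("A")
history.do("B")
history.undo()
history.do("C")          # the linear model would have destroyed B here
history.undo()
history.redo("B")        # the tree still has it
print(history.applied())
```

The data structure is the easy part, as claimed; the open question is still how to show those branches to a user.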
So they're trying to point out that source code is free speech.
You totally missed the point. The point is not the words "free", "public domain", or "free software". The point is the word speech. There's a big legal fight going on right now over whether software is speech at all, let alone "free" or "public domain" speech.
Your quote should say "So they're trying to point out that source code is speech." The rest of the message following that is just pointing out stuff unrelated to artwork.
Despite your protest, that is the answer. The IETF has formed the XMPP working group, where XMPP stands for "Extensible Messaging and Presence Protocol", which should become the interoperability standard for instant messaging services; it's certainly the closest thing we've got. And XMPP is basically Jabber with a spit-polish.
"Blogging", besides being an extremely annoying term has way too much attention paid to it nowadays.
I always get a kick out of seeing this kind of comment on Slashdot. It makes me wonder what the poster thinks a weblog is... because by most definitions, Slashdot is one.
Yeah, it's now a multiple-author weblog with a very well-established comment system, both traits somewhat unusual, but it's a weblog. Many people use Slash to run more traditional weblogs.
Does your post count as part of the "too much attention" paid to it?
This is all theory because I don't run an Open Source project myself yet, but I have something that I want to be a successful open source project someday. So I've spent a lot of time thinking about this, and also looking around at what is successful, and these are my thoughts:
Have a solid proof-of-concept implementation. You must, on your own, produce something that is usable, by which I mean "can actually do something useful better than anything else", not just "it executes without crashing". Until you have this, you will not be able to attract serious help of any kind. This is easily the hardest hurdle to jump in getting an open source project going. In my case, I've been working on mine for six months and I've probably got another six months to go before I can successfully jump this. But if I released what I have now, it would almost certainly get just a "so-what?" reaction from anybody who took the time to download it.
This may require a bit of creativity. If you're going to create some sort of stand-alone web server middleware thing, then you need to find the coolest, most unusual aspect of the final design, and implement it inside of Apache, with an eye towards making the code translate easily back to your eventual own server. If you're making a new widget set, you might write a few widgets of your own, but display them inside of a GTK window until you have your own window class. This principle doesn't necessarily mean you need the whole final system in skeleton form (though that's better if you can get it), but you need something that shows you're both serious and capable, which sets you apart from the riff-raff.
It must be easy to do the kind of work you are looking for people to do. That's a general principle that must be specialized to whatever it is you are looking for.
If you are looking for programmers, you must make it relatively easy to add functionality. That means a careful design with strongly separated concerns is vital. If nobody can pick up your program and twiddle a line without the whole house of cards coming down, nobody will bother with it. If you're looking for graphics help, you'll need to make it easy to add (or change) the graphics in the program without knowing how to use the compiler. If you're looking for documentation help, it must be easy to document the program. (Pick the doc standard, make it easy to add help to the program without knowing how to use the compiler.) Ideally, for all of these, you should include a tutorial of some kind; how to create a plug-in from scratch, a step-by-step guide to changing the title screen's font (hopefully not too many steps, the act of writing the guide will show you what to make easier!), a step-by-step guide on adding help.
It's not so much the obvious "the easier it is, the more likely it is someone will do it", though that's true. It's more of a gambling payoff thing; you need your contributors to be able to sit down with your product and experience a "payoff" in the form of an actual improvement as quickly as possible, so that they'll keep working on it.
Documentation, damn it! The bastard step-child of Open Source projects. Look, the fact is, anything you want people to do early on is going to require documentation. Maybe your final product will be so easy to use that nobody will need docs (usually because you're doing Yet-Another-(something) that we all know how to use, like a GUI word processor or something), but to get the project off the ground, you need docs for all of your target users. To an open source development leader, "users" aren't just the people running the program to get work done, those are special users called "end-users". You also have:
Programmers. Docs for this bunch means well-formatted and commented code (documenting anything not obvious to someone on first glance), and for any non-trivial project, some sort of overview of the design, ideally detailed enough that someone can read the overview, pick up any source code file, and (in conjunction with the comments) figure out where that source fits into the project as a whole. It probably also means some sort of coding standard for code submitted back to you.
Documenters: You'll need an overview of what the program is, and probably a skeleton of the necessary docs.
Testers: If you have a testing framework, document it. What, you don't? Get one. Document things that have been problem spots (bugs keep re-appearing), so they can be scrutinized. Make a release checklist.
Already described some of the docs the artists and such need. Note that you only need docs for the groups of people you are trying to attract. In the early phases of an open-source project, you may neither need nor even necessarily want end users. In that case, no need to write docs for them (or at least don't make them public). The only docs the end-users need is "This program is not ready for end-users. Please check back later to see if it's usable.", changed appropriately as you start to need actual beta-testers and stuff. It is a serious mistake for a project to attract end-users before the project is ready, as the users will sap away development time with support needs, with no infrastructure/community to support them.
A solid design. I want to mention this explicitly, though it's already strongly hinted at in what programmers need. You need to start with a strong design for the program, because it is likely that the initial design will be in place for a long time. If it sucks, it could even kill a project that would otherwise be very successful. If it's OO, learn what OO means and how to do good OO design, same for any other paradigm you might want to use. The problem is that no contributor, no matter how good, will want to completely rearchitect any project, because they won't know it as well as you and they won't know what will bite them. (In fact, for my project, there are in theory other Open Source projects I could have at least started with, but the designs sucked so horribly they were unsalvageable.) I'm not certain, but it seems like one of my favorite Open Source projects, LyX, has been bogged down the last two or three years trying to re-architect the program into something that can grow again, more-or-less treading water in the meantime (and re-organizing the damn menus every release... but that's another story...). While you're at it, make sure you include hooks for everything that you determined you need, like documentation places and such.
Certainly there are successful projects that haven't done some of these things, but I think that's success in spite of bad planning, not because of it.
Everything boils down to it must be easy to contribute. The number of projects out there bitching because nobody is willing to program for them, when the leader can't be bothered to even format the code decently, let alone comment the code, provide a blue-print for the design, or even have a design in the first place, boggles my mind. It's like the usability folks have found... every barrier to entry knocks away a percentage of people, and any you can remove helps. Even if a hypothetical wizard coder in the language your project is written in can understand your code by reading it for three or four hours and getting the Big Picture, doesn't mean they aren't more likely to join your project if there's a document they can read that gives them the Big Picture in five minutes (leaving them that much more time to actually contribute, or learn something else), for instance, and that goes for everything.
Too many projects make the mistake of expecting the contributors to jump through hoops to contribute, instead of making it easy. I think it's part of the Open Source hubris that we see so much of. Don't fall victim to it.
A closed mouth gathers no foot? (on "DSL Rising")
"A closed mouth gathers no foot?" That's not half bad; I may just use that. Very similar to "Better to keep one's mouth shut and be thought a fool, than to open one's mouth and remove all doubt.", but, obviously, much less overused.
Irrelevant. "No Shirt No Shoes No Service" is not a statement of conditions for doing business. It is actually a statement of the law. It is illegal for public health and safety reasons to enter a store without shoes, due to the spreading of foot disease. "No shirt" is probably more a community standards thing, but it's still the law.
We didn't sign an agreement because regardless of the agreement, the law still binds. Technically, the store owner doesn't have a choice, they must evict a bare-footed customer, even if they agree with the customer not to.
Try again, perhaps this time without metaphors... concentrate on the real differences.
You're worried about securing the unsecurable. If you're a terrorist and you want to shoot down a plane, and you don't care which one, it's pretty damned easy to find one.
Procedure:
Go outside on a cloudless day.
Look around.
If you're within 10 miles of an airport (even a minor one), odds are, hey, there's a plane!
GPS coords don't add much to information already so available that all you have to do is literally open your eyes and it comes streaming in. From what I've seen on the news, most missiles are fired at planes taking off or landing (usually taking off from what I've seen), in plain sight. You just can't hide a plane taking off, so please, on behalf of all us freedom-loving citizens, don't propose half-assed "solutions" to the non-problem; we've got government officials working on that full-time already and God forbid one of them see this "non-problem" of yours and decide to try your non-solution.... more freedoms gone for no gain whatsoever, just to make someone look like they're "doing something".
The civil service seem very eager for there to be a national identity card, and keep proposing it as a solution for a variety of different problems.
To be fair, this is largely due to the Two Great Bureaucratic Myths, "More Data Is Good" and "More Centralization Is Good". Note the lack of qualifiers on those statements; while they are true in some instances, a Bureaucracy (with a capital B, which fits most/all government agencies) sees them as always good, even when they are totally, transparently untrue.
You already mentioned that they are drowning in information, which is why the first myth is wrong in this case, and see the latest Cryptogram for a good discussion of why the second myth (centralizing everything, especially security) is wrong.
Note that all bureaucracies can be expected to produce those myths after a certain size. Part of the challenge of building a truly dynamic company is trying to keep the bureaucracy to a minimum, lest it strangle you. I know hating Microsoft is standard around here, but they're actually a fairly admirable example of a company becoming huge and yet managing to keep the bureaucracy largely in check. (Whether they can keep that going once they cease growing like wildfire is an open and interesting question.)
The point here being, for what it's worth, trying to convince the government itself that these things are wrong is trying to make the government into something it can never be. Our only hope is to get this killed by Congress or perhaps better yet, the Supreme Court (although the latter case means that someone has to be hurt enough by the system to sue, which means lots of other people will be hurt but not sue).
The cynic in me says this is going to happen no matter what and the best thing we can do is stop spending energy fighting it and spend it sensitizing everybody around us to the consequences, so they see it when it happens. I think it's obvious that us civil rights folks don't have the power to stop this directly; we will need people voting in Congressfolk or Presidents based on whether they will promise to dismantle the Normal American surveillance machinery.
I also find it neat that the Toyota Echo was expressly designed for older people... Makes me laugh seeing twenty-somethings driving them...
Ah, but that's the greatest unsung benefit of usability engineering... by making something easier to use for diminished perception or interaction users, it becomes easier for everyone. Just because I can see eight-point font doesn't mean I want to, or that it wouldn't be easier in 14-point, even for me, a user of average vision.
for every burnt out admin thats going to quit theres 5 more waiting to take his place
I've seen that comment several times but it finally occurred to me what's wrong with it: KM. (That's "Knowledge Management".)
When an admin leaves, he or she takes a tremendous amount of knowledge away, which must be painstakingly re-acquired by the next admin. This can easily take in excess of a year, depending on how well the original admin did the job. (It takes less if everything's broken; well-oiled parts of the machine may not require attention for a long time, so those take longer to learn.)
Changing admins is far, far from free, and business will eventually notice that. (In fact, they are starting to, but my impression is that KM is still a "kooky" field, not yet mainstream. Corrections from those closer to that community welcome.)
Personally, and I'm totally serious about this, I'd blame it on the assignments we get in both high school and college wherein the teacher/professor, in a well-meaning attempt to indoctrinate us in the ways of the academic, says "You must include (5/10/30) citations in your final paper!" (And no more than X may come from whatever bad thing students are using... encyclopedias in my day, now the Internet.)
Totally naturally, we go out, find 1.5*X citations, winnow out the obvious losers, and randomly cite them at the end of our papers, having read maybe one of them. Because we all know the teacher/prof doesn't have time to check even one of them from each of our papers, let alone check them all. How many of us have completely manufactured a citation from whole cloth for one of these things and totally gotten away with it? (I haven't myself, but I certainly thought about it; the only reason I didn't is it was generally easier to just go get likely looking citations on the Internet. Teacher never realizes you "used the Internet" if you cite paper journals....)
Certainly you don't think this habit is going to go away just because they got a degree, when the stakes are even higher? Everybody else's six-page research papers have 40 citations at the end, if yours don't you'll stick out, and that's bad.
It would probably be better to require that students cite as appropriate, and require at least a spot check of the citations for at least one random assignment at some point in a student's career.
I'm writing something in my spare time that might in some sense be considered an academic paper, but I just use footnotes as appropriate. Citations are often overrated when they are used as a cover for "We've known this and endlessly debated this in the field for the past 50 years, but I can squeeze seven pointless, information-free citations out of this" sorts of things.
Note I'm not saying that citations are unimportant or that they should be abolished; they are legitimately important and useful. I'm just saying that the stupid way they are handled in school has natural consequences in the resulting academics, and their value is unnecessarily diminished as a result.
Just imagine how big a computational problem could be solved in 50 years with contemporary P4 hardware.
No need to imagine. Suppose we round Moore's Law to a simplistic "double every year", which is about right. (Processors may not move that fast but remember it's the whole computer that affects processing time; add up processor advances, disk advances, memory advances, graphical advances etc. and you get probably more than a doubling per year, so this is conservative.)
I can start my 50-year computation on my P2000 (processor 2000, not Pentium 2000) in 2000 and be done in 2050.
Or I can wait a year and buy the P2001 and be done in 25 years, in 2026.
Or I can wait two years and buy the P2002, and be done in 12.5 years, in 2014.5.
Or I can wait three years and buy the P2003, and be done in 6.25 years, and finish in 2009.25.
Or I can wait four years and buy the P2004, and be done in 3.125 years, and finish in 2007.125.
If I wait five years and buy the P2005, I can be done in 1.5625 years, and finish in 2006.5625.
If I wait six years and buy the P2006, I can be done in 0.78125 years, but I don't finish until 2006.78125, so waiting longer than five years stops paying off.
Because of the continuing exponential growth in power, the value of keeping a fifty-year-old processor online for fifty years is nearly zero once you get past the first few years. Note the P2050 finishes your P2000-50-year task in 50/(2^50) years, or about 4.44 x 10^-14 years, which if my calculations are correct is about 1.4 microseconds. Actually I think computational power bottoms out before then, but the principle holds. More specifically to your post, the value of the "contemporary P4 hardware" over 50 years is effectively negative; instead of waiting 50 years, you could have spent the same amount of dough and been done in well under seven years! Until we stop exponentially advancing, the value of old chips drops like a rock until they are nearly worthless in a mere 3 to 4 years for any serious long-term computation.
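The wait-or-buy arithmetic is easy to check mechanically. Here is a rough Python sketch under the same simplifying assumptions as above (speed doubles every year, the job takes 50 years on year-2000 hardware); the function name finish_year and its parameters are just illustrative.

```python
def finish_year(wait_years, start_year=2000, job_years=50.0):
    """Year the job completes if we wait, then buy a machine 2**wait_years times faster."""
    return start_year + wait_years + job_years / 2 ** wait_years


# Print the finish year for each purchase year in the first decade.
for wait in range(0, 9):
    print(f"buy P{2000 + wait}: done in {finish_year(wait)}")

# The waiting time that finishes earliest over the whole 50-year window.
best_wait = min(range(51), key=finish_year)
print("best number of years to wait:", best_wait)
```

Under these assumptions best_wait comes out as 5; change the doubling period or the job length and the sweet spot moves, but there is always one.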
This isn't just theory, either; for some computations, it is more cost-effective to wait for better computers. The constants in the analysis in the first part of this message change (usually an analyst would look at "spending $X" rather than "buying one computer"), but it works out the same. Sometimes you're better off waiting.
Now, for some people in some situations, practically, old computers can be useful. Don't extend my post past the context I've placed it in. I've got a happily cranking 233MHz P1 at home... but I don't do weather simulations on it for profit, I use it for some web scanning as a personal use in preference to throwing it out. (Even so, in ten years or so, it would be cheaper to turn it off and buy a lower-power-consumption computer...)
On a smaller scale (personal), this is essentially what I do.
First, only some personal data is critical, not the GBs of operating systems and programs I can redownload/recompile if necessary. Things like documents, saved games (you'd think it's unimportant until you play the first 2/3s of Fallout 2 five times and can't stomach getting far enough to see how it all turns out, because you'd have to play that 2/3s again...), email maybe, whatever, but some limited amount. 10MB can go a long way... that's a lot of programming, for instance. (Been working on a project for about half a year now and I'm just ready to break 300KB of code...)
Then, set up a live backup amongst all the disks you have on various machines. I use Unison so that I can change files in the repository on any machine and have the changes propagate correctly, instead of the unidirectional updates rsync does.
Use symlinks to put everything you need into one directory, and tell Unison to follow the symlinks, not archive them directly. Then just run that every so often on the machines, and you're set.
Once more of my family gets set up with always-on connections, I intend to set up a family-level repository of backed up files with Unison, so that "off-site backups" are a weekly script run without intervention by the family, making off-site backups across the state (or country, or world) easy. This will protect the scanned pictures and other things in the family heritage easily and effectively.
Which reminds me, the first always-on connection just came online and I really ought to talk to that member about a reciprocating backup setup...
That's actually what I do now. A filter for each of my mailing lists (which are all nice enough to use the [abbrev.] convention in every title), and a filter per left-over person that goes into my generic "Keepers" folder.
No spam filtering in terms of what most people mean, but it turns out that unless you have a lot of people emailing you out of the blue (tech support, maybe an Open Source project lead), this means that around 90%-95% of the stuff that *isn't* filtered into a folder is spam, and the percentage is going up every week. That's for my personal account, which is less focused than the average work account, where I think your numbers would hold.
This can't be perfect, but it also can't be fooled or defeated in the general case. It's a hell of a lot less sexy than the latest Bayesian filters, but in another six months, the whitelists will work better.
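In case the setup isn't obvious, the whole "filter" amounts to something like the following Python sketch. It's only an illustration of the whitelist idea, not my actual mail client rules: the function sort_message, the folder names other than "Keepers", and the sample addresses and list tags are all made up.

```python
import re

# A mailing-list tag at the start of the subject, e.g. "[abbrev] ..." or "Re: [abbrev] ...".
LIST_TAG = re.compile(r"^\s*(?:re:\s*)*\[([^\]]+)\]", re.IGNORECASE)


def sort_message(sender, subject, known_senders, list_tags):
    """Return the folder a message belongs in, using only whitelists."""
    match = LIST_TAG.match(subject)
    if match and match.group(1).lower() in list_tags:
        return "lists/" + match.group(1).lower()
    if sender.lower() in known_senders:
        return "Keepers"
    return "Inbox"  # nothing vouched for it, so in practice this is mostly spam


known = {"mom@example.org", "friend@example.net"}   # hypothetical whitelist
tags = {"unison", "lyx"}                             # hypothetical list abbreviations

print(sort_message("someone@example.com", "Re: [unison] symlink question", known, tags))
print(sort_message("Mom@example.org", "Dinner Sunday?", known, tags))
print(sort_message("deals@example.biz", "You have WON", known, tags))
```

Note there is no attempt to recognize spam at all; the rules only recognize mail somebody has already vouched for.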
what i really wonder though is how many legitimate (non-spams) emails i never receive because of filtering software!
That is how the spam war will end: The spammers will become sophisticated enough that no matter what we do, any filter we try to use will result in too many false positives (falsely labelled "spam") to be of any use.
(False positives, of the four possible outcomes, are by far the worst, if you think about it.)
Nice article overall, but what the hell is up with that picture? If the Earth bulged that much, we'd all have noticed the incredible changes in gravity between 45 degree N and 0 and 90 degree north. I mean, yikes, that's gotta be at least a 10% difference between the two Earth-like planets in that picture.
Realistically, the shift must be vanishing fractions of a percent, and you wouldn't be able to find a difference between the two Earths ("pre-bulge" and "post-bulge"), even in principle, on a low-resolution picture like that; the effects they are talking about would be sub-pixel, to say the least.
I'd do the math, but there aren't any numbers in the linked text and it's too late to go out and try to find them. (Perhaps someone else will... I'll lay my money down on, ohhh, within an order of magnitude of "one ten-millionth of a pixel difference" between pre- and post-bulge Earth.)
First, your friend's idea is either fatally flawed, or he has made a breakthrough of fantastic proportions, because this set of "computable" numbers would have a cardinality between that of aleph_0 and c, violating the continuum hypothesis.
No. A computable number is defined as having a TM that will output it, though possibly in infinite time. Thus they have the same cardinality as the set of TMs, which is the same as the set of integers. They are interesting only because they seem to give us all the practical (and I can't emphasize that strongly enough) usefulness of the reals while technically only having the same cardinality as the ints.
You can eliminate that flaw by restricting the input to a fixed value (say, the null string).
You can do that "without loss of generality", to use the math phrase. Figuring out the transform for "TM + input" -> "TM" is left as an exercise for the reader.
We do this all the time in proving things about computability; since we can just suck the input into the TM, it removes one (useless!) variable from the proof, which makes them that much cleaner.
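For the programmers in the audience, "sucking the input into the TM" is the same move as closing over an argument. A loose Python analogy, with ordinary functions standing in for Turing Machines (the names hardwire and repeat_twice are mine, purely for illustration):

```python
def hardwire(machine, fixed_input):
    """Build a zero-input program that behaves exactly like machine(fixed_input)."""
    def program():
        return machine(fixed_input)
    return program


def repeat_twice(tape):
    """A stand-in 'machine' that takes an input tape."""
    return tape + tape


no_input_machine = hardwire(repeat_twice, "01")
print(no_input_machine())
```

Every (machine, input) pair becomes a single input-free machine, so a proof only has to talk about machines started on the null string.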
This is a strong statement, one that must be proven.
Well, yes and no. It's an English statement, not a math statement, so proof would tend to look like proof by definition. It would basically run as "By the act of pointing to a claimed incomputable number, you are either showing me how to compute it, or you are not pointing at a unique, well-defined number." A sibling to your post did construct a unique, well-defined real number that is not in the computable set to my satisfaction, so to the extent that my phrase had any mathematical meaning, it has already been contradicted. However, that was my error, not my friend's.
BTW, note that nobody claims these "computable numbers" are good for anything; it's mostly a thought experiment. I tend to see it as a nifty demonstration that the integers are more flexible than many people give them credit for.
What will gzip-the-decompressor do if it encounters a 1 in the middle of the message ?
This is a stupid question in two ways:
1. 1 only decompresses to pi if the message is of length one, and == 1. Only if this is not true do I invoke gzip. gzip under my definition is exactly the same as the gzip on your hard drive (except running on an infinite memory machine).
2. You clearly don't understand anything about how gzip works. gzip-compressed text is not even remotely substring-invariant... a 1101110 string in one part of the compressed file may mean "Hello!", and in another part of the compressed file may actually be parts of three tokens, or merely a part of a larger token. Thus, your question What will gzip-the-decompressor do if it encounters a 1 in the middle of the message ? is just about meaningless; the answer totally depends on the context it is encountered in, since on average it's encountered roughly half the time. The question of "What would gunzip do if it encountered a (anything here)?" is a valid one and had to be answered before it could be written! Well-defined answers exist.
As for the rest of your "What Would Gunzip Do?" questions, I suggest you run the program and find out for yourself. Alternatively, consult RFC 1951 and RFC 1952.
Even just the same algorithm can be coded purely differently in c an in pl.
Oh, so you mean compressing source code. The sentence was incredibly ambiguous: My idea is that the c compression algorithm would be beat by a perl compression. sounds like "c compression algorithm" is a compression algorithm written in C.
Somebody took something similar to that idea and ran with it: You may want to look in Google for some programming comparisons based on taking a benchmark task in many different languages, gzipping the code, and comparing that size, instead of the raw text size. The idea being that the gzip would tend to factor out the verbosity differences and touch on the actual complexity (though of course it's far from a perfect match, and it's hard to even define the relative complexity of two implementations in two languages in a way that captures everything we intuitively mean, if you think about it). Interesting results.
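The measurement itself is trivial to reproduce. A minimal Python sketch using the standard gzip module; the function sizes and the two toy snippets are my own stand-ins, not the benchmark programs from those comparisons:

```python
import gzip


def sizes(source_text):
    """Return (raw byte count, gzipped byte count) for a piece of source code."""
    raw = source_text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Toy stand-ins for "the same task in two languages"; real comparisons use whole programs.
samples = {
    "verbose": "for (int i = 0; i < n; i = i + 1) { total = total + values[i]; }\n" * 20,
    "terse": "total = sum(values)\n" * 20,
}

for name, text in samples.items():
    raw, packed = sizes(text)
    print(name, "raw:", raw, "gzipped:", packed)
```

Keep in mind that gzip adds a fixed header and trailer, so very short files can come out larger than they went in; the comparison only means something on programs of reasonable size.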
I never chose binary encoding
"Your choice of encoding" means here that it's true for all encodings of binary real numbers that are reasonable for infinite-length numbers. Sorry, didn't mean you personally.
Mathematically, and in a very provable sense, the cardinality of the reals is greater than the cardinality of the integers.
Yes, I know. You miss the point: All computable numbers, by definition, map to integers, because the TMs do. Thus, there are countably infinite computable numbers, despite the fact that those computable numbers include reals, transcendentals, etc.
Not all reals are computable; in fact uncountably many of them are not, because there are more reals than integers in every mathematical sense.
BTW, I use "larger" in the human intuitive sense in that case: the set of computable numbers is larger than the rationals because it contains all the rationals, plus more numbers.
Of course mathematically, both sets are the same size, the cardinality of the set of integers; we can talk of Turing Machines running forever but not of "infinitely long" Turing Machines, which is counter to the definition.
(Which highlights the interesting point of that idea, that all the numbers we ever use are still just the integers in a very real sense, even when we talk about "pi" or "e". Not necessarily groundbreaking stuff, but interesting to some of us math wonks.)
I post this in an effort to forestall the inevitable "correction"...;-)
A compression function is a mapping from input to output. A decompression function maps from all possible outputs of the compression function, back to all possible inputs (though there may be some illegal input to the decompression function). As long as decode(code(x)) = x for any x in the domain, it's a "compression" function, even if possibly a really bad one. There's an infinite number of such functions but most of them are terribly uninteresting. For instance, a particular 'code' might repeat x twice and one of its corresponding 'decode's might cut the input in half again; it meets the definition but we'd never be interested in that.
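The "repeat x twice" example is small enough to write out. A quick Python sketch, using byte strings for convenience (the names code and decode follow the paragraph above; everything else is illustrative):

```python
def code(x: bytes) -> bytes:
    """A legal but useless 'compression' function: every output is twice as long."""
    return x + x


def decode(y: bytes) -> bytes:
    """Inverse of code(); inputs that code() could never produce are rejected."""
    half = len(y) // 2
    if len(y) % 2 or y[:half] != y[half:]:
        raise ValueError("not an output of code()")
    return y[:half]


message = b"hello"
assert decode(code(message)) == message   # the only property the definition demands
print(code(message))
```

The ValueError branch is the "illegal input to the decompression function" case: decode only has to be defined on things code can actually emit.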
Different functions perform better or worse in different domains, which is why we have "zip", "gzip", "bz2", "shn" or whatever the lossless audio encoder is, and all kinds of other compression schemes.
It is trivial to define a function that maps one bit to pi, even if pi is defined as some infinite sequence, instead of a finite symbol representing the infinite concept. You just do it.
Where all numbers are in binary:
decompress(x) = { (the infinite binary encoding of pi) if x == 1
                  what gunzip would do if x != 1 }
Perfectly permissible since "1" isn't a legit gunzip file.
compress(x) = { 1 if x == (the infinite binary encoding of pi)
                what gzip would do if x != (the infinite binary encoding of pi) }
For your choice of binary encodings of real numbers that makes sense in this domain.
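Here is a finite sketch of that pair of functions in Python, using the standard gzip module. Two loud assumptions: SPECIAL_INPUT is a short stand-in for the infinite encoding of pi (which obviously can't be stored), and SPECIAL_CODE is one byte standing in for the single bit "1". Neither name comes from anywhere but this example.

```python
import gzip

# Assumption: a finite stand-in for "the infinite binary encoding of pi".
SPECIAL_INPUT = b"3.14159265358979323846"
# Assumption: one byte standing in for the one-bit message "1".
SPECIAL_CODE = b"\x01"


def compress(data: bytes) -> bytes:
    """Map the special input to the short code; gzip everything else."""
    if data == SPECIAL_INPUT:
        return SPECIAL_CODE
    return gzip.compress(data)


def decompress(blob: bytes) -> bytes:
    """Map the short code back to the special input; gunzip everything else."""
    if blob == SPECIAL_CODE:
        return SPECIAL_INPUT
    return gzip.decompress(blob)


print(len(compress(SPECIAL_INPUT)))
print(decompress(compress(b"anything else")) == b"anything else")
```

The two cases can't collide, because real gzip output always begins with the two magic bytes 0x1f 0x8b, so gzip.compress never returns the single byte 0x01.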
You seem to have neglected that strings have length, and that just because a given thing compresses down to one bit, does not mean that all things the compression scheme produces will be one bit. In fact, that's impossible for obvious reasons.
There's a perfectly well defined mapping that exists. Of course you can't implement this directly since x can be infinite in this case, and would thence take an infinite amount of time to check if x is pi for the compression case, but it's the same kinda thing as "you can't implement a Turing Machine because you can't have an infinite tape." The function itself, like Turing Machines, is perfectly well defined.
There's nothing unrealistic about this, either; the same principles underlie the proof that no compression algorithm can compress all input. You forget that there is no "one true representation" of anything; we can define symbols to mean whatever the hell we want.
(This assumes gzip is defined for infinite input, which IIRC it is, since it's a stream-based compressor; conceptually, there's no reason that gunzip won't perfectly happily run forever on an infinite input, giving perfectly well-defined output, as long as the machine in question has infinite memory.)
Pi would not compress at all, given it's an infinitely long number.
Trivially wrong anyhow, even with your misunderstandings. The people in the article who generated over a trillion digits of pi did not pull them out of their ass; there's a mathematical procedure that produces the digits of pi, as many as you have time to compute. Realistically, that means that pi is compressed as the Turing Machine that spits these digits out, and this Turing Machine is fed to the Universal Turing Machine, which "decompresses" (normally we wouldn't use that word, but a UTM fits the definition of a decompression function, mapping input to output) it into the string of digits. The Pi TM is finite, the output is not. Again, you can't run it in finite time, but conceptually, the TM represents all of Pi, given enough time. (It "limits" to it, if you like, as time goes to infinity.)
(The corresponding compression routine, with the UTM as the decompression routine, is much, much tougher, beyond human capability to perform optimally, and often at all; many interesting things about that have been proven.)
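To make "a finite program that stands for all of pi" concrete, here is a sketch in Python of Gibbons' unbounded spigot algorithm. This is my choice of procedure for illustration, not whatever the people in the article used; the function name pi_digits is made up for this example.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one at a time, forever.

    Gibbons' unbounded spigot: the state is six integers that grow
    without bound, so this needs unbounded memory, just like the TM.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is now certain; emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough information yet; fold in one more term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


# Take as many digits as you have time for.
print(list(islice(pi_digits(), 10)))
```

The source is a few hundred bytes and never changes, however many digits you pull out of it.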
A friend of mine has toyed with a theory of "computable" numbers, lying somewhere between the reals and the rationals. A "computable" number is one where there exists a Turing Machine that will output it, as time goes to infinity. Since there are fewer TMs than real numbers, it's clearly smaller than the set of reals, yet equally clearly, it's larger than the rationals, since it includes things like Pi, e, and, most interestingly, any number we could ever conceivably communicate to each other in such a way that we could construct it. That's the most interesting part of it; it's not the full reals, yet you can't point to a real number or reference one that is not in this "computable" set. Not directly germane, but perhaps interesting to anybody following the posts this deeply.
Anyway, the most optimal compression for pi is probably saying "Pi" by itself.
Ironically, you further demonstrate a decompression algorithm ("simplifying an expression into its decimal equivalent according to the corpus of human mathematical knowledge") that decompresses the sixteen-bit phrase "Pi" into the infinite decimal sequence.
My idea is that the c compression algorithm would be beat by a perl compression.
And what is that supposed to mean, anyhow? Algorithms exist independently of their implementation in a given language!
Your understanding of information theory is skin deep; you recall some of the results but you do not understand the deeper logic. I'm not an expert but I'm pretty confident that this post is accurate enough for Slashdot. (I'd be a bit more careful with definitions and domain specifications for a class assignment, but this isn't, and it's long enough.) The exact compression techniques you learned are just a special case that happens to be useful in the real world, not the be-all end-all of compression.
There are matters of privacy here.
I'm as big a privacy wonk as you could ask for. But I'm also in favor of individual rights. I don't think anybody should be able to casually stroll onto your property and peruse the contents without a warrant, obtained through due process as reasonably defined by current law.
On the other hand, I also don't think it's anybody's responsibility to avert their eyes from anything you are foolish enough to throw out. Or even more, destroy it.
We call it throwing away because that's what it is. If you have reason to be concerned, destroy it before tossing it. If you don't, and you literally drop it on public property, well, whose fault is that?
In fact, everybody ought to practice this to an extent, because of the number of f****'ing idiotic companies who seem to think it desirable to print your CC number on every receipt or payment you make with it. I think that ought to be criminal negligence personally, especially the ones that also print the expiration on it (combined with your address which the dumpster diver can easily obtain, that's enough to rack up charges), but in the meantime, anyone with a CC should at least try to destroy anything with it before it leaves the house. You don't need a shredder per se, it may do just to rip it up. Or burn your mail, if you live in the country. That's not paranoia, that's just prudence.
This problem is a specific case of a more general problem. Going backwards and forwards is basically undoing and re-doing actions. Any undo/redo system has the same problem.
The problem is that while the solution is obvious (store all the commands as a tree), both the UI and user models suck. How are you going to show the structure? How are most users going to mentally model some sequence like "A, B, C, undo, D, E, undo, undo, redo C, F, undo, redo D, redo E, G, H, I"? (You may need to draw that to see what I mean, which is exactly my point; even techies don't think this way. (Yet.))
In fact I intend to find out whether people can understand this in the most direct way possible, which will be to implement it in my project and see what happens. I have no idea if people will be able to understand it. (It happens in my case that the project naturally fits into this anyhow, so it's worth a try.)
In the meantime, the current model (undoing a few things and then doing something else completely destroys the remaining redo list) has the virtue of simplicity, a virtue no other proposal, including the one in this article, can match. I suspect that the current system is the best choice if we can't have the full tree-based implementation.
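For the curious, here is a minimal sketch of the "store all the commands as a tree" idea in Python. The class and method names (Node, UndoTree, do, undo, redo) are hypothetical, and it replays a shorter sequence than the one quoted above. It only tracks the history; it doesn't apply commands to a document, and it doesn't touch the hard part, which is the UI.

```python
class Node:
    """One command in the history, plus every branch that grew from it."""

    def __init__(self, command, parent=None):
        self.command = command
        self.parent = parent
        self.children = []


class UndoTree:
    def __init__(self):
        self.root = Node(None)  # the untouched document, no command yet
        self.current = self.root

    def do(self, command):
        """Record a new command as a fresh branch under the current state."""
        node = Node(command, self.current)
        self.current.children.append(node)
        self.current = node

    def undo(self):
        """Step back to the parent; the undone branch stays in the tree."""
        if self.current.parent is None:
            raise IndexError("nothing to undo")
        self.current = self.current.parent

    def redo(self, command):
        """Step forward into the named branch below the current state."""
        for child in self.current.children:
            if child.command == command:
                self.current = child
                return
        raise KeyError(command)

    def branches(self):
        """The redo choices available from the current state."""
        return [child.command for child in self.current.children]

    def path(self):
        """Commands from the root down to the current state."""
        node, commands = self.current, []
        while node.parent is not None:
            commands.append(node.command)
            node = node.parent
        return commands[::-1]


tree = UndoTree()
for cmd in "ABC":
    tree.do(cmd)
tree.undo()            # back at B; C survives as a branch
tree.do("D")           # B now has two children, C and D
tree.undo()
print(tree.branches())  # the redo choices at B
tree.redo("C")
print(tree.path())
```

Note that redo has to be told which branch to take. That one extra argument is exactly where the user-model trouble starts.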
So they're trying to point out that source code is free speech.
You totally missed the point. The point is not the words "free", "public domain", or "free software". The point is the word speech. There's a big legal fight going on right now over whether software is speech at all, let alone "free" or "public domain" speech.
Your quote should say "So they're trying to point out that source code is speech." The rest of the message following that is just pointing out stuff unrelated to artwork.
And I'm not talking about something like Jabber.
Despite your protest, that is the answer. The IETF has formed the XMPP working group, where XMPP stands for "Extensible Messaging and Presence Protocol", which should become the interoperability standard for instant messaging services; it's certainly the closest thing we've got. And XMPP is basically Jabber with a spit-polish.
So, yes, you are talking about XMPP, né Jabber.
"Blogging", besides being an extremely annoying term has way too much attention paid to it nowadays.
I always get a kick out of seeing this kind of comment on Slashdot. It makes me wonder what the poster thinks a weblog is... because by most definitions, Slashdot is one.
Yeah, it's now a multiple-author weblog with a very well-established comment system, both traits somewhat unusual, but it's a weblog. Many people use Slash to run more traditional weblogs.
Does your post count as part of the "too much attention" paid to it?
This may require a bit of creativity. If you're going to create some sort of stand-alone web server middleware thing, then you need to find the coolest, most unusual aspect of the final design, and implement it inside of Apache, with an eye towards making the code translate easily back to your eventual own server. If you're making a new widget set, you might write a few widgets of your own, but display them inside of a GTK window until you have your own window class. This principle doesn't necessarily mean you need the whole final system in skeleton form (though that's better if you can get it), but you need something that shows you're both serious and capable, which sets you apart from the riff-raff.
If you are looking for programmers, you must make it relatively easy to add functionality. That means a careful design with strongly separated concerns is vital. If nobody can pick up your program and twiddle a line without the whole house of cards coming down, nobody will bother with it. If you're looking for graphics help, you'll need to make it easy to add (or change) the graphics in the program without knowing how to use the compiler. If you're looking for documentation help, it must be easy to document the program. (Pick the doc standard, make it easy to add help to the program without knowing how to use the compiler.) Ideally, for all of these, you should include a tutorial of some kind; how to create a plug-in from scratch, a step-by-step guide to changing the title screen's font (hopefully not too many steps, the act of writing the guide will show you what to make easier!), a step-by-step guide on adding help.
It's not so much the obvious "the easier it is, the more likely it is someone will do it", though that's true. It's more of a gambling payoff thing; you need your contributors to be able to sit down with your product and experience a "payoff" in the form of an actual improvement as quickly as possible, so that they'll keep working on it.
- Programmers. Docs for this bunch means well-formatted and commented code (documenting anything not obvious to someone on first glance), and for any non-trivial project, some sort of overview of the design, ideally detailed enough that someone can read the overview, pick up any source code file, and (in conjunction with the comments) figure out where that source fits into the project as a whole. It probably also means some sort of coding standard for code submitted back to you.
- Documenters: You'll need an overview of what the program is, and probably a skeleton of the necessary docs.
- Testers: If you have a testing framework, document it. What, you don't? Get one. Document things that have been problem spots (bugs keep re-appearing), so they can be scrutinized. Make a release checklist.
Already described some of the docs the artists and such need.
Note that you only need docs for the groups of people you are trying to attract. In the early phases of an open-source project, you may neither need nor even necessarily want end users. In that case, no need to write docs for them (or at least don't make them public). The only docs the end-users need is "This program is not ready for end-users. Please check back later to see if it's usable.", changed appropriately as you start to need actual beta-testers and stuff. It is a serious mistake for a project to attract end-users before the project is ready, as the users will sap away development time with support needs, with no infrastructure/community to support them.
Certainly there are successful projects that haven't done some of these things, but I think that's success in spite of bad planning, not because of it.
Everything boils down to this: it must be easy to contribute. The number of projects out there bitching because nobody is willing to program for them, when the leader can't be bothered to even format the code decently, let alone comment the code, provide a blue-print for the design, or even have a design in the first place, boggles my mind. It's like the usability folks have found... every barrier to entry knocks away a percentage of people, and any you can remove helps. Even if a hypothetical wizard coder in the language your project is written in can understand your code by reading it for three or four hours and getting the Big Picture, that doesn't mean they aren't more likely to join your project if there's a document they can read that gives them the Big Picture in five minutes (leaving them that much more time to actually contribute, or learn something else), for instance, and that goes for everything.
Too many projects make the mistake of expecting the contributors to jump through hoops to contribute, instead of making it easy. I think it's part of the Open Source hubris that we see so much of. Don't fall victim to it.
"A closed mouth gathers no foot?" That's not half bad; I may just use that. Very similar to "Better to keep one's mouth shut and be thought a fool, than to open one's mouth and remove all doubt.", but, obviously, much less overused.
Irrelevant. "No Shirt No Shoes No Service" is not a statement of conditions for doing business. It is actually a statement of the law. It is illegal for public health and safety reasons to enter a store without shoes, due to the spreading of foot disease. "No shirt" is probably more a community standards thing, but it's still the law.
We didn't sign an agreement because regardless of the agreement, the law still binds. Technically, the store owner doesn't have a choice, they must evict a bare-footed customer, even if they agree with the customer not to.
Try again, perhaps this time without metaphors... concentrate on the real differences.
Procedure:
- Go outside on a cloudless day.
- Look around.
If you're within 10 miles of an airport (even a minor one), odds are, hey, there's a plane!
GPS coords don't add much to information already so available that all you have to do is literally open your eyes and it comes streaming in. From what I've seen on the news, most missiles are fired at planes taking off or landing (usually taking off from what I've seen), in plain sight. You just can't hide a plane taking off, so please, on behalf of all us freedom-loving citizens, don't propose half-assed "solutions" to the non-problem; we've got government officials working on that full-time already and God forbid one of them see this "non-problem" of yours and decide to try your non-solution.... more freedoms gone for no gain whatsoever, just to make someone look like they're "doing something".
The civil service seem very eager for there to be a national identity card, and keep proposing it as a solution for a variety of different problems.
To be fair, this is largely due to the Two Great Bureaucratic Myths, "More Data Is Good" and "More Centralization Is Good". Note the lack of qualifiers on those statements; while they are true in some instances, a Bureaucracy (with a capital B, which fits most/all government agencies) sees them as always good, even when they are totally, transparently untrue.
You already mentioned that they are drowning in information, which is why the first myth is wrong in this case, and see the latest Cryptogram for a good discussion of why the second myth (centralizing everything, especially security) is wrong.
Note that all bureaucracies can be expected to produce those myths after a certain size. Part of the challenge of building a truly dynamic company is trying to keep the bureaucracy to a minimum, lest it strangle you. I know hating Microsoft is standard around here, but they're actually a fairly admirable example of a company becoming huge and yet managing to keep the bureaucracy largely in check. (Whether they can keep that going once they cease growing like wildfire is an open and interesting question.)
The point here being, for what it's worth, trying to convince the government itself that these things are wrong is trying to make the government into something it can never be. Our only hope is to get this killed by Congress or perhaps better yet, the Supreme Court (although the latter case means that someone has to be hurt enough by the system to sue, which means lots of other people will be hurt but not sue).
The cynic in me says this is going to happen no matter what and the best thing we can do is stop spending energy fighting it and spend it sensitizing everybody around us to the consequences, so they see it when it happens. I think it's obvious that us civil rights folks don't have the power to stop this directly; we will need people voting in Congressfolk or Presidents based on whether they will promise to dismantle the Normal American surveillance machinery.
I also find it neat that the Toyota Echo was expressly designed for older people... Makes me laugh seeing twenty-somethings driving them...
Ah, but that's the greatest unsung benefit of usability engineering... by making something easier to use for diminished perception or interaction users, it becomes easier for everyone. Just because I can see eight-point font doesn't mean I want to, or that it wouldn't be easier in 14-point, even for me, a user of average vision.
for every burnt out admin thats going to quit theres 5 more waiting to take his place
I've seen that comment several times but it finally occurred to me what's wrong with it: KM. (That's "Knowledge Management".)
When an admin leaves, he or she takes a tremendous amount of knowledge away, which must be painstakingly reacquired by the next admin. This can easily take in excess of a year, depending on how well the original admin did the job. (It takes less if everything's broken; well-oiled parts of the machine may not require attention for a long time, so those take longer to learn.)
Changing admins is far, far from free, and business will eventually notice that. (In fact, they are starting to, but my impression is that KM is still a "kooky" field, not yet mainstream. Corrections from those closer to that community welcome.)
Personally, and I'm totally serious about this, I'd blame it on the assignments we get in both high school and college wherein the teacher/professor, in a well-meaning attempt to indoctrinate us in the ways of the academic, says "You must include (5/10/30) citations in your final paper!" (And no more than X may come from whatever bad thing students are using... encyclopedias in my day, now the Internet.)
Totally naturally, we go out, find 1.5*X citations, winnow out the obvious losers, and randomly cite them at the end of our papers, having read maybe one of them. Because we all know the teacher/prof doesn't have time to check even one of them from each of our papers, let alone check them all. How many of us have completely manufactured a citation from whole cloth for one of these things and totally gotten away with it? (I haven't myself, but I certainly thought about it; the only reason I didn't is it was generally easier to just go get likely looking citations on the Internet. Teacher never realizes you "used the Internet" if you cite paper journals....)
Certainly you don't think this habit is going to go away just because they got a degree, when the stakes are even higher? Everybody else's six-page research papers have 40 citations at the end, if yours don't you'll stick out, and that's bad.
It would probably be better to require that students cite as appropriate, and require at least a spot check of the citations for at least one random assignment at some point in a student's career.
I'm writing something in my spare time that might in some sense be considered an academic paper, but I just use footnotes as appropriate. Citations are often overrated when they are used as a cover for "We've known this and endlessly debated this in the field for the past 50 years, but I can squeeze seven pointless, information-free citations out of this" sorts of things.
Note I'm not saying that citations are unimportant or that they should be abolished; they are legitimately important and useful. I'm just saying that the stupid way they are handled in school has natural consequences in the resulting academics, and their value is unnecessarily diminished as a result.
Just imagine how big a computational problem could be solved in 50 years with contemporary P4 hardware.
No need to imagine. Suppose we round Moore's Law to a simplistic "double every year", which is about right. (Processors may not move that fast but remember it's the whole computer that affects processing time; add up processor advances, disk advances, memory advances, graphical advances etc. and you get probably more than a doubling per year, so this is conservative.)
I can start my 50-year computation on my P2000 (processor 2000, not Pentium 2000) in 2000 and be done in 2050.
Or I can wait a year and buy the P2001 and be done in 25 years, in 2026.
Or I can wait two years and buy the P2002, and be done in 12.5 years, in 2014.5.
Or I can wait three years and buy the P2003, and be done in 6.25 years, and finish in 2009.25.
Or I can wait four years and buy the P2004, and be done in 3.125 years, and finish in 2007.125.
If I wait five years and buy the P2005, I can be done in 1.5625 years, and finish in 2006.5625.
If I wait six years and buy the P2006, I can be done in .78125 years, and finish in 2006.78125.
Because of the continuing exponential growth in power, the value of keeping a fifty-year-old processor online for fifty years is nearly zero once you get past the first few years. Note the P2050 finishes your P2000-50-year task in 50/(2^50) years, or .00000000000004440892 years, which if my calculations are correct is 1.4 microseconds. Actually I think computational power bottoms out before then, but the principle holds. More specifically to your post, the value of the "contemporary P4 hardware" over 50 years is effectively negative; instead of waiting 50 years, you could have spent the same amount of dough and been done in a mere six and a half years! Until we stop exponentially advancing, the value of old chips drops like a rock until they are nearly worthless in a mere 3 to 4 years for any serious long-term computation.
This isn't just theory, either; for some computations, it is more cost-effective to wait for better computers. The constants in the analysis in the first part of this message change (usually an analyst would look at "spending $X" rather than "buying one computer"), but it works out the same. Sometimes you're better off waiting.
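If you'd rather let the machine do the arithmetic, here is a small Python sketch of the wait-then-buy calculation. It bakes in the same simplification as above (speed doubles every year, starting from a 50-year job in 2000); the function name finish_year and its parameters are made up for this example.

```python
def finish_year(wait_years, start_year=2000, base_job_years=50.0):
    """Year the job finishes if we wait, then buy that year's machine.

    Assumption: every year of waiting halves the run time.
    """
    run_time = base_job_years / (2 ** wait_years)
    return start_year + wait_years + run_time


for wait in range(8):
    print(f"wait {wait} years -> finish in {finish_year(wait)}")

# Waiting stops paying off once the extra year costs more than it saves.
best_wait = min(range(51), key=finish_year)
print("best wait:", best_wait, "years")
```

Under these assumptions the minimum falls at a five-year wait; after that, each extra year of waiting costs a full year and saves less than one.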
Now, for some people in some situations, practically, old computers can be useful. Don't extend my post past the context I've placed it in. I've got a happily cranking 233MHz P1 at home... but I don't do weather simulations on it for profit, I use it for some web scanning as a personal use in preference to throwing it out. (Even so, in ten years or so, it would be cheaper to turn it off and buy a lower-power-consumption computer...)
On a smaller scale (personal), this is essentially what I do.
First, only some personal data is critical, not the GBs of operating systems and programs I can redownload/recompile if necessary. Things like documents, saved games (you'd think it's unimportant until you play the first 2/3s of Fallout 2 five times and can't stomach getting far enough to see how it all turns out, because you'd have to play that 2/3s again...), email maybe, whatever, but some limited amount. 10MB can go a long way... that's a lot of programming, for instance. (Been working on a project for about half a year now and I'm just ready to break 300KB of code...)
Then, set up a live backup amongst all the disks you have on various machines. I use Unison so that I can change files in the repository on any machine and have the changes propagate correctly, instead of the unidirectional updates rsync does.
Use symlinks to put everything you need into one directory, and tell Unison to follow the symlinks, not archive them directly. Then just run that every so often on the machines, and you're set.
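Roughly what that looks like as a script, sketched in Python. Everything here is an assumption for illustration: the function names, the directory layout, and the remote root are made up, and the Unison options (-batch, and -follow with a "Name *" pattern) are from my reading of the Unison manual, so check them against your version before trusting this with real data. The script only builds the command line; it doesn't run Unison.

```python
import os


def build_backup_dir(backup_dir, sources):
    """Symlink each source path into backup_dir under its base name."""
    os.makedirs(backup_dir, exist_ok=True)
    for src in sources:
        link = os.path.join(backup_dir, os.path.basename(src.rstrip(os.sep)))
        if not os.path.lexists(link):
            os.symlink(src, link)


def unison_command(backup_dir, remote_root):
    """Command line to sync backup_dir with remote_root.

    -batch: run without asking questions (suitable for cron).
    -follow "Name *": treat every symlink as the thing it points to,
    so the real files get synced rather than the links themselves.
    """
    return ["unison", backup_dir, remote_root, "-batch", "-follow", "Name *"]


if __name__ == "__main__":
    home = os.path.expanduser("~")
    # Hypothetical choice of what counts as critical data.
    build_backup_dir(
        os.path.join(home, "critical"),
        [os.path.join(home, "Documents"), os.path.join(home, "projects")],
    )
    print(" ".join(unison_command(os.path.join(home, "critical"),
                                  "ssh://otherbox//home/me/critical")))
```

Hand the resulting list to subprocess.run (or paste the printed line into cron) once you're happy with it.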
Once more of my family gets set up with always-on connections, I intend to set up a family-level repository of backed up files with Unison, so that "off-site backups" are a weekly script run without intervention by the family, making off-site backups across the state (or country, or world) easy. This will protect the scanned pictures and other things in the family heritage easily and effectively.
Which reminds me, the first always-on connection just came online and I really ought to talk to that member about a reciprocating backup setup...
I recommend a read through this book, currently partially available online.
It will certainly provide food for thought.
That's actually what I do now. A filter for each of my mailing lists (which are all nice enough to use the [abbrev.] convention in every title), and a filter per left-over person that goes into my generic "Keepers" folder.
No spam filtering in terms of what most people mean, but it turns out that unless you have a lot of people emailing you out of the blue (tech support, maybe an Open Source project lead), this means that around 90%-95% of the stuff that *isn't* filtered into a folder is spam, and the percentage is going up every week. That's for my personal account, which is less focused than the average work account, where I think your numbers would hold.
This can't be perfect, but it also can't be fooled or defeated in the general case. It's a hell of a lot less sexy than the latest Bayesian filters, but in another six months, the whitelists will work better.
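The whole scheme fits in a dozen lines. Here is a sketch in Python; the function name, the folder names other than "Keepers", and the sample tags and addresses are all made up for illustration, and a real mail client would express this as filter rules rather than code.

```python
def file_message(sender, subject, list_tags, known_senders):
    """Pick a folder for one message using only whitelists.

    list_tags maps a subject tag such as "[abbrev.]" to a folder;
    known_senders is a set of lower-case addresses of real people.
    """
    # One filter per mailing list, keyed on the tag in the subject line.
    for tag, folder in list_tags.items():
        if tag in subject:
            return folder
    # One filter per left-over person, all going to the same folder.
    if sender.lower() in known_senders:
        return "Keepers"
    # Whatever is left was matched by nothing: mostly spam.
    return "Unfiltered"


list_tags = {"[unison]": "Lists/unison"}
known_senders = {"mom@example.com"}

print(file_message("someone@example.org", "[unison] sync question",
                   list_tags, known_senders))
print(file_message("Mom@example.com", "Dinner Sunday?",
                   list_tags, known_senders))
print(file_message("deals@example.net", "You have won!!!",
                   list_tags, known_senders))
```

Nothing here looks at the content of the message, which is why a spammer can't tune the wording to get past it.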
what i really wonder though is how many legitimate (non-spams) emails i never receive because of filtering software!
That is how the spam war will end: The spammers will become sophisticated enough that no matter what we do, any filter we try to use will result in too many false positives (falsely labelled "spam") to be of any use.
(False positives, of the four possible outcomes, are by far the worst, if you think about it.)
Spam is only going to get worse.
Nice article overall, but what the hell is up with that picture? If the Earth bulged that much, we'd all have noticed the incredible changes in gravity between 45 degree N and 0 and 90 degree north. I mean, yikes, that's gotta be at least a 10% difference between the two Earth-like planets in that picture.
Realistically, the shift must be vanishing fractions of a percent, and you wouldn't be able to find a difference between the two Earths ("pre-bulge" and "post-bulge"), even in principle, on a low-resolution picture like that; the effects they are talking about would be sub-pixel, to say the least.
I'd do the math, but there aren't any numbers in the linked text and it's too late to go out and try to find them. (Perhaps someone else will... I'll lay my money down on, ohhh, within an order of magnitude of "one ten-millionth of a pixel difference" between pre- and post-bulge Earth.)
First, your friend's idea is either fatally flawed, or he has made a breakthrough of fantastic proportions, because this set of "computable" numbers would have a cardinality between that of aleph_0 and c, violating the continuum hypothesis.
No. A computable number is defined as having a TM that will output it, though possibly in infinite time. Thus they have the same cardinality as the set of TMs, which is the same as the set of integers. They are interesting only because they seem to give us all the practical (and I can't emphasize that strongly enough) usefulness of the reals while technically only having the same cardinality as the ints.
You can eliminate that flaw by restricting the input to a fixed value (say, the null string).
You can do that "without loss of generality", to use the math phrase. Figuring out the transform for "TM + input" -> "TM" is left as an exercise for the reader.
We do this all the time in proving things about computability; since we can just suck the input into the TM, it removes one (useless!) variable from the proof, which makes them that much cleaner.
This is a strong statement, one that must be proven.
Well, yes and no. It's an English statement, not a math statement, so proof would tend to look like proof by definition. It would basically run as "By the act of pointing to a claimed incomputable number, you are either showing me how to compute it, or you are not pointing at a unique, well-defined number." A sibling to your post did construct a unique, well-defined real number that is not in the computable set to my satisfaction, so to the extent that my phrase had any mathematical meaning, it has already been contradicted. However, that was my error, not my friend's.
BTW, note that nobody claims these "computable numbers" are good for anything; it's mostly a thought experiment. I tend to see it as a nifty demonstration that the integers are more flexible than many people give them credit for.
What will gzip-the-decompressor do if it encounters a 1 in the middle of the message ?
This is a stupid question in two ways:
1. 1 only decompresses to pi if the message is of length one, and == 1. Only if this is not true do I invoke gzip. gzip under my definition is exactly the same as the gzip on your hard drive (except running on an infinite memory machine).
2. You clearly don't understand anything about how gzip works. gzip-compressed text is not even remotely substring-invariant... a 1101110 string in one part of the compressed file may mean "Hello!", and in another part of the compressed file may actually be parts of three tokens, or merely a part of a larger token. Thus, your question What will gzip-the-decompressor do if it encounters a 1 in the middle of the message ? is just about meaningless; the answer totally depends on the context it is encountered in, since on average it's encountered roughly half the time. The question of "What would gunzip do if it encountered a (anything here)?" is a valid one and had to be answered before it could be written! Well-defined answers exist.
As for the rest of your "What Would Gunzip Do?" questions, I suggest you run the program and find out for yourself. Alternatively, consult RFC 1951 and RFC 1952.
Even just the same algorithm can be coded purely differently in c an in pl.
Oh, so you mean compressing source code. The sentence was incredibly ambiguous: My idea is that the c compression algorithm would be beat by a perl compression. sounds like "c compression algorithm" is a compression algorithm written in C.
Somebody took something similar to that idea and ran with it: You may want to look in Google for some programming comparisons based on taking a benchmark task in many different languages, gzipping the code, and comparing that size, instead of the raw text size. The idea being that the gzip would tend to factor out the verbosity differences and touch on the actual complexity (though of course it's far from a perfect match, and it's hard to even define the relative complexity of two implementations in two languages in a way that captures everything we intuitively mean, if you think about it). Interesting results.
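The measurement itself is a one-liner. Here is a sketch in Python using the standard gzip module; the function name gzipped_size and the two toy "programs" are mine, invented purely to show the effect, and they are not from any of the comparisons mentioned.

```python
import gzip


def gzipped_size(source: str) -> int:
    """Size in bytes of the source text after gzip compression."""
    return len(gzip.compress(source.encode("utf-8")))


# Two toy "implementations" of the same task, one padded with ceremony.
terse = "print(sum(range(10)))\n"
verbose = "total = total + 1  # add one to the running total\n" * 50

for name, src in (("terse", terse), ("verbose", verbose)):
    print(name, "raw:", len(src), "gzipped:", gzipped_size(src))
```

Repetitive boilerplate mostly vanishes under gzip, so the gzipped sizes end up far closer together than the raw ones. Very short inputs can actually grow, because the gzip header and trailer cost about 18 bytes on their own.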
I never chose binary encoding
"Your choice of encoding" means here that it's true for all encodings of binary real numbers that are reasonable for infinite-length numbers. Sorry, didn't mean you personally.
Mathematically, and in a very provable sense, the cardinality of the reals is greater than the cardinality of the integers.
Yes, I know. You miss the point: All computable numbers, by definition, map to integers, because the TMs do. Thus, there are countably infinite computable numbers, despite the fact that those computable numbers include reals, transcendentals, etc.
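A minimal sketch of why anything described by a finite program maps to an integer, using program text as a stand-in for a TM description. The helper names program_to_int and int_to_program are my own invention for this example.

```python
def program_to_int(source: str) -> int:
    """Map a finite program text to a unique non-negative integer."""
    # Prefix a 0x01 byte so leading zero bytes in the source are not lost.
    return int.from_bytes(b"\x01" + source.encode("utf-8"), "big")

def int_to_program(n: int) -> str:
    """Invert program_to_int for any integer it produced."""
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:].decode("utf-8")  # drop the 0x01 marker byte

src = "print(22 / 7)"
n = program_to_int(src)
print(n)
print(int_to_program(n) == src)
```

Since distinct texts give distinct integers, there can be no more programs (and so no more computable numbers) than there are integers.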
Not all reals are computable; in fact uncountably many of them are not, because there are more reals than integers in every mathematical sense.
Touche. I was too glib. Good catch.
BTW, I use "larger" in the human intuitive sense in that case: The set of computable numbers is larger than the rationals because it contains all the rationals, plus more numbers.
;-)
Of course mathematically, both sets are the same size, the cardinality of the set of integers; we can talk of Turing Machines running forever but not of "infinitely long" Turing Machines, which is counter to the definition.
(Which highlights the interesting point of that idea, that all the numbers we ever use are still just the integers in a very real sense, even when we talk about "pi" or "e". Not necessarily groundbreaking stuff, but interesting to some of us math wonks.)
I post this in an effort to forestall the inevitable "correction"...
A compression function is a mapping from input to output. A decompression function maps from all possible outputs of the compression function, back to all possible inputs (though there may be some illegal input to the decompression function). As long as decode(code(x)) = x for any x in the domain, it's a "compression" function, even if possibly a really bad one. There's an infinite number of such functions but most of them are terribly uninteresting. For instance, a particular 'code' might repeat x twice and one of its corresponding 'decode's might cut the input in half again; it meets the definition but we'd never be interested in that.
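The "repeat x twice" example, written out as a minimal Python sketch. The function names follow the code/decode wording above; working on bytes rather than bits is just a convenience for the illustration.

```python
def code(x: bytes) -> bytes:
    """A terrible but legal 'compressor': output the input twice."""
    return x + x

def decode(y: bytes) -> bytes:
    """Cut the input in half again; reject anything code() cannot produce."""
    half = len(y) // 2
    if len(y) % 2 != 0 or y[:half] != y[half:]:
        raise ValueError("not a legal output of code()")
    return y[:half]

message = b"hello"
print(decode(code(message)) == message)
```

It doubles every input, but decode(code(x)) == x holds for every x, which is all the definition asks for.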
Different functions perform better or worse in different domains, which is why we have "zip", "gzip", "bz2", "shn" or whatever the lossless audio encoder is, and all kinds of other compressions.
It is trivial to define a function that maps one bit to pi, even if pi is defined as some infinite sequence, instead of a finite symbol representing the infinite concept. You just do it.
Where all numbers are in binary:
decompress(x) = { (the infinite binary encoding of pi) if x == 1
what gunzip would do if x != 1 }
Perfectly permissible since "1" isn't a legit gunzip file.
compress(x) = { 1 if x == (the infinite binary encoding of pi)
what gzip would do if x != pi }
For your choice of binary encodings of real numbers that makes sense in this domain.
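A finite analogue of that definition can actually be run. This is only a sketch under loud assumptions: SPECIAL is an arbitrary fixed message standing in for the infinite encoding of pi, and the one-bit "1" becomes the single byte 0x01, since Python's gzip module works on bytes. The special case is safe because every gzip stream starts with the magic bytes 0x1f 0x8b, so a lone 0x01 is never a legitimate gzip file.

```python
import gzip

# Stand-in for "the infinite binary encoding of pi": any one fixed
# message we choose to special-case (an assumption of this sketch).
SPECIAL = b"3.14159265358979323846264338327950288419716939937510" * 1000
ONE = b"\x01"  # stand-in for the single bit "1"

def compress(x: bytes) -> bytes:
    """Map SPECIAL to one byte; hand everything else to gzip."""
    if x == SPECIAL:
        return ONE
    return gzip.compress(x)

def decompress(y: bytes) -> bytes:
    """Map the one byte back to SPECIAL; hand everything else to gunzip."""
    if y == ONE:
        return SPECIAL
    return gzip.decompress(y)

print(len(SPECIAL), "->", len(compress(SPECIAL)))
print(decompress(compress(b"anything else")) == b"anything else")
```

Note that the message b"\x01" itself still round-trips: as an input it goes through gzip like any other message, and only the compressed form 0x01 is special.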
You seem to have neglected that strings have length, and that just because a given thing compresses down to one bit, does not mean that all things the compression scheme produces will be one bit. In fact, that's impossible for obvious reasons.
There's a perfectly well defined mapping that exists. Of course you can't implement this directly since x can be infinite in this case, and would thence take an infinite amount of time to check if x is pi for the compression case, but it's the same kinda thing as "you can't implement a Turing Machine because you can't have an infinite tape." The function itself, like Turing Machines, is perfectly well defined.
There's nothing unrealistic about this, either; the same principles underlie the proof that no compression algorithm can compress all input. You forget that there is no "one true representation" of anything; we can define symbols to mean whatever the hell we want.
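The counting argument behind that proof, as a small Python sketch (function names are mine): there are 2^n bit strings of length n, but only 2^n - 1 bit strings of any shorter length, so no lossless scheme can shrink every length-n input.

```python
def count_strings_of_length(n: int) -> int:
    """Number of distinct bit strings of exactly n bits."""
    return 2 ** n

def count_strings_shorter_than(n: int) -> int:
    """Number of distinct bit strings of length 0 through n-1."""
    return sum(2 ** k for k in range(n))

# For every n there is one more input than there are shorter outputs,
# so at least one input of each length cannot get shorter.
for n in range(1, 9):
    inputs = count_strings_of_length(n)
    outputs = count_strings_shorter_than(n)
    print(n, inputs, outputs, inputs > outputs)
```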
(This assumes gzip is defined for infinite input, which IIRC it is, since it's a stream-based compressor; conceptually, there's no reason that gunzip won't perfectly happily run forever on an infinite input, giving perfectly well-defined output, as long as the machine in question has infinite memory.)
Pi would not compress at all, given it's an infinitely long number.
Trivially wrong anyhow, even with your misunderstandings. The people in the article who generated over a trillion digits of pi did not pull them out of their ass; there's a mathematical procedure that produces the digits of pi, as many as you have time to compute. Realistically, that means that pi is compressed as the Turing Machine that spits these digits out, and this Turing Machine is fed to the Universal Turing Machine, which "decrypts" (normally we wouldn't use that word, but a UTM fits into the definition of a decryption function, mapping input to output) the output into the string of numbers. The Pi TM is finite, the output is not. Again, you can't run it to completion in finite time, but conceptually, the TM represents all of Pi, given enough time. (It "limits" to it, if you like, as time goes to infinity.)
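For a concrete finite program whose output is the unbounded digit stream, here is a commonly circulated Python form of Gibbons' unbounded spigot algorithm. To be clear, this is my choice of illustration, not the method the people in the article used, and it is far too slow for record-setting runs.

```python
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi one at a time, forever."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled; emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet; fold in another term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )

# A dozen lines of code stand in for as many digits as you care to wait for.
print(list(islice(pi_digits(), 10)))
```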
(The corresponding encryption routine for UTM as a decryption routine is much, much tougher, beyond human capability to perform optimally, and often at all; many interesting things about that have been proven.)
A friend of mine has toyed with a theory of "computable" numbers, lying somewhere between the reals and the rationals. A "computable" number is one where there exists a Turing Machine that will output it, as time goes to infinity. Since there are fewer TMs than real numbers, it's clearly smaller than the set of reals, yet equally clearly, it's larger than the rationals, since it includes things like Pi, e, and, most interestingly, any number we could ever conceivably communicate to each other in such a way that we could construct it. That's the most interesting part of it; it's not the full reals, yet you can't point to a real number or reference one that is not in this "computable" set. Not directly germane, but perhaps interesting to anybody following the posts this deeply.
Anyway, the most optimal compression for pi is probably saying "Pi" by itself.
Ironically, you further demonstrate a decompression algorithm ("simplifying an expression into its decimal equivalent according to the corpus of human mathematical knowledge") that decompresses the sixteen-bit phrase "Pi" into the infinite decimal sequence.
My idea is that the c compression algorithm would be beat by a perl compression.
And what is that supposed to mean, anyhow? Algorithms exist independently of their implementation in a given language!
Your understanding of information theory is skin deep; you recall some of the results but you do not understand the deeper logic. I'm not an expert but I'm pretty confident that this post is accurate enough for Slashdot. (I'd be a bit more careful with definitions and domain specifications for a class assignment, but this isn't, and it's long enough.) The exact compression techniques you learned are just a special case that happens to be useful in the real world, not the be-all end-all of compression.