The only known copy? I'm fairly confident that there are tons of copies (maybe not in private hands, possibly not in the context of the 1984 Super Bowl). I've seen and re-seen that commercial several times. I saw it during the original 1984 broadcast. Heck, one of the cable channels actually played the top 10/top 25/top N most influential commercials of all time, and they replayed the entire spot then. That, Mean Joe Greene throwing his jersey to a kid for Coke, and the "Where's the Beef" commercials for Wendy's were all at the top of the list. I believe they were showing the top TV commercials as picked by TV Guide sometime in the late 1990s.
Heck, I know some football fans who probably have the Super Bowl on tape from that year. So the "only known" surviving copy is sorta like saying my copy of some obscure TV show is the only known copy because I've never seen anyone else with it.
I can't play it, but here's a link I found for it by Googling: Link.
If you go Googling for it, there are lots of references to it, and plenty of places that appear to be able to display it. I can't play them, as I can't run QuickTime or .mov files at work (I'm not installing a movie player at work just to know that they are correct). Clicking on it gets me a 2.2MB file.
In answer to your question, "Smitty" was the tool that all the AIX admins I've ever met knew about. Generally they hated it.
Most of the people I know who tell me about old-time UNIX tell me:
AIX was in general a real pain in the ass (this might have changed in the intervening 5-10 years since their experience), and it was even more different than Solaris was in terms of the arcane knowledge you needed to administer it properly.
I've known more than a few DBAs and SAs who have told me that at various points, it was easier to just run screaming from AIX than to deal with it. Eventually, most all of it could be overcome, but learning all of its pitfalls could be a very painful experience (after reviewing more than a few of the Oracle bug reports, I can see that it would terrify me to run Oracle on anything from the 4.[23] era). When given a choice, 9 times out of 10, everyone I've ever dealt with would rather run Solaris than AIX. Maybe it's because I work in a city where most everyone uses Solaris, so its flaws are just well-known potholes everyone avoids out of habit.
I've heard horror stories about AIX, HP-UX (HP's UNIX), DYNIX (Sequent's UNIX), IRIX (SGI's UNIX), SCO, and OSF/1 (DEC's UNIX). Actually, I can't remember too many SCO administrative nightmares, but that might be because not too many people I know have ever dealt with UnixWare.
Most of them, I can't even recall, but I remember the AIX goop quite clearly, as it never sounded very UNIX'y to me.
In the end, I've always been told that doing anything not thru SMIT (I believe it was referred to as "smitty") was a bad idea, and that just hand editing files was a recipe for disaster in their experience. I thought they said that the files in /etc got output for compatibility, but that there was a binary backend that was authoritative and could be accessed via a programmatic API. All that sounded like a disaster waiting to happen. Now, I might have been informed incorrectly, or my knowledge might be years and years out of date.
Finally, I've found that shipping two sets of commands is a recipe for disaster when shell scripting. I'd much rather have one or the other, but not both (a shell script run as one person won't work when another runs it on the same machine). In the end, it's a source of more problems than just learning the native tools. I never minded having two command sets (gmake, gcc, gawk), but having to figure out if make is IBM make or GNU make always seemed silly to me.
Well, the problem is that on Sun, you are using SysV style arguments, not BSD style. To solve this, you usually have to include "/usr/ucb" and such in your PATH (it's not perfect). The other option is to install the GNU versions of all of those utilities.
I agree with you, I find some of the incompatibilities scary. "reboot" and "halt", for example, either had or still have very different behavior on Solaris than they do on Linux. On Solaris, they are immediate panic type commands; on Linux it's an orderly shutdown. A friend of mine always uses init 0 or init 6 to get the orderly behavior on both.
However, there are probably just as many Solaris heads out there going, "I hate Linux, why can't I just re-use my everyday Solaris knowledge?" AIX is so different to administer, I'm shocked you include it as "Linux-like". (Note, I've never used AIX, but from what I've been told, everything runs thru some admin tool that edits binary files for configuration instead of the standard human-readable text files used under Linux.)
Hmmm, I guess. My guess is that they have implemented something akin to SQL for data streams. You define a message format. Think of each message as a row in the table. The message format is the table schema.
You have a "standing query". So you can ask things, like, what's the rolling average for the last 60 seconds for this ticker name. What's the minimum price for this commodity.
You can ask to correlate things. Store the last 90 minutes worth of transactions on these commodities. Search for these types of patterns.
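To make the "standing query" idea concrete, here's a rough sketch of the rolling average case in Python. This is just my guess at the general shape, not how their product actually works; the RollingAverage class, the on_message method and the (ticker, timestamp, price) message layout are all made up for illustration.

```python
from collections import deque


class RollingAverage:
    """Hypothetical standing query: average price per ticker over a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.ticks = {}  # ticker -> deque of (timestamp, price), oldest first
        self.sums = {}   # ticker -> running sum of the prices still in the window

    def on_message(self, ticker, timestamp, price):
        """Feed one message (one 'row') and return the current rolling average."""
        q = self.ticks.setdefault(ticker, deque())
        q.append((timestamp, price))
        self.sums[ticker] = self.sums.get(ticker, 0.0) + price
        # Evict rows that have aged out of the window instead of storing everything.
        while q and q[0][0] <= timestamp - self.window:
            _, old_price = q.popleft()
            self.sums[ticker] -= old_price
        return self.sums[ticker] / len(q)
```

The point is that the answer is updated as each message arrives and old rows are thrown away, so nothing has to hit disk before you can ask the question.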
It sounds like what they have done is build an OLAP cube that builds its dataset on the fly by processing messages coming over a streaming interface.
It's much smarter to do that than to write every last transaction to disk and then query the transactions after the fact. That'd be the natural way to think about it if you used a relational database.
Essentially, it sure sounds like he's written a generalized packet filter, that can compute interesting functions on the data. Think snort, think ethereal, think iptables, think policy routing. Now apply those kinds of technology to "The price of this stock", "the location of that soldier", where those values are embedded in a network packet frame somewhere.
While each single application of this sounds trivial to implement, if he has done it in a generalized way that can keep pace with larger systems, bully for him.
The irony of all this for me is that at a former job, I used to process medical data exactly this way. It sounds like the HL7 interface issues we used to have. You couldn't possibly take a full HL7 stream and process it, so you'd filter it down to just the patients that this department was interested in. Then only process messages about those patients.
Even for those patients, there were rows you weren't interested in that you had to filter out. You spent a bunch of time filtering and re-filtering.
We wrote the raw messages to disk and spooled them to ensure we didn't miss messages due to database problems (if the database was down, you had to spool until the database came back up; it was unacceptable to miss patient records because of database maintenance).
I don't know about the Columbia accident (it might be a freakish thing), but from the review I've read of the Challenger accident, two things come to mind:
First, political pressure was put upon NASA to launch a vehicle during this launch window. I forget the details behind what it was. If I remember right, there wasn't another window for several weeks if they missed this one.
Second, the O-ring problem was blatantly known. There's a reason the demonstration the NASA Engineer put on, where he pulled the O-ring out of water and pulled on it, was so blatantly bad. It's my understanding from reading Richard Feynman's comments, which he insisted be added as an appendix to the report, that essentially the right people in NASA knew it was going to blow up. However, they justified it with "Well, the O-ring is three times as thick as it needs to be, so the problem it is showing where it has a 1/3rd erosion is not a problem". You can read up on it here.
Feynman essentially accuses them of using previous success as evidence that all future launches will be a success. That's not good science or good engineering. I think Columbia was screwed from the moment they made orbit (they might have been able to abort pre-orbit, post foam collision. I'm not sure on that). However, with Challenger, they KNEW they had an important piece behaving oddly in a way they didn't understand, while launching under extreme conditions. That's not being particularly safe.
While I agree with you on Columbia, I strongly disagree with your characterization of Challenger.
Also, I'm absolutely positive those parts haven't been sitting in a warehouse since 1965. We were fairly busy with the Mercury and Gemini missions in that time frame. My guess is they got invented no earlier than 1975, and made no earlier than 1980 or so. I'd have to go look into the history, but I'm reasonably sure the drawings hadn't even been brought out before 1972 or 1973, never mind making parts to a specification.
As to your RAID thoughts. You are clinically insane. man "mdadm". On a RedHat machine "service mdmonitor start".
    mdadm --scan --detail >/etc/mdadm.conf
    echo "DEVICE /dev/[sh]d[a-z][1-9]" >>/etc/mdadm.conf
    echo "DEVICE /dev/[sh]d[a-z][a-z][1-9]" >>/etc/mdadm.conf
    echo "MAILADDR alert_email@domain.com" >>/etc/mdadm.conf
    chkconfig mdmonitor on
    service mdmonitor start
You can easily adapt the RedHat scripts to run on Slackware. Personally I would recommend setting up nagios or some other monitoring software. Every time something goes wrong on a machine, we write a script to monitor that. Now, very few things go wrong unnoticed.
We write monitoring scripts that run via nagios to check that. Within 10 minutes of a drive failing I have a page, within 5 I have an e-mail (there's a five minute latency on nagios recognizing the problem, and about a five minute latency from the time the paging company gets the page until the pager goes off). That's pretty much the worst case scenario.
I'd really much prefer that to not having a RAID array. We've used that system (*knock*,*knock*,*knock*) for 4 years, and with about 5TB of filesystems at work, we've never ever lost a RAID'ed filesystem (worst case was the SCSI bus locking up due to a driver failure, but I think that would have happened even with no RAID configuration; the machine had to be power cycled, but the filesystem was still intact).
We have lost several, incredibly important filesystems that weren't RAID'ed. Technically speaking the filesystem wasn't important, however, the downtime was really bad during the rebuild/recover phase. The first time we lost $10K due to a failed IDE disk that was 4 years old, we convinced the boss that he should really purchase us mirrored SCSI disks for all the OS drives, it was a cheaper one time cost.
If you have spare drives around, you can configure mdadm to automatically add them into the system. Unlike the standard md tools, you can have one spare for any number of md arrays.
Nope. I'm saying that would be an outstanding way to do phishing attacks.
Next, I'm saying that I'm confident that if a phisher can figure out how to write to your /etc/hosts file, it's merely a matter of time until they write to wherever your certs are installed. They will install a cert that makes them the equivalent of Verisign. There's a file on your machine that is all you have that makes you trust Verisign. I can create one of those files and call it "Phisher Certs R US".
Then any site that has a cert signed by "Phisher Certs R US" will not give you an alert in IE.
If you aren't actively checking your cert files, that could be a serious issue. To the best of my knowledge, your cert files aren't cryptographically checked in any manner. I know you can just add a cert to your own machines to make self-signed cert messages go away.
If you have to contact Verisign in order to authenticate your cert with them, that's not a problem for them either. They control your DNS via the hosts file. They will direct you to their site and feed you bogus information. What a wonderful thing.
The problem is that, to the best of my knowledge, nothing will alert you that your cert files have been tampered with or added to. However they're signed, I have to be able to add certs to them myself. Phishers can just set themselves up as a cert provider you trust.
Yes and no. Remember, they control the DNS; they wrote to the /etc/hosts file (wherever it is they bury that on Windows, C:\Windows\System32\drivers\etc\hosts if I remember correctly). How long until they add a file to your cert list, so it looks like a trusted host when you go to their site?
Besides all that, I'm fairly sharp about my security, and I know most of the fundamentals of the math behind it, and I wouldn't be shocked if my bank switched SSL keys because their old one just expired. Imagine the bedlam that would ensue if everyone did freak out, just because a key had changed.
Now, if they hijack a DNS server, or break into Verisign and get the secret key they are in (or more likely, one of the smaller SSL Key providers that have default keys on Microsoft IE installs).
I don't remember the exact details of how you use the certs on your desktop machine, but if at any point you have to connect to Verisign, they have you. They control the IP where you believe Verisign is located. The trick will be having to establish cryptographic trust of the files you use, and of every bit of information between you and completing the transaction. If they can control any stage of the transaction, they can wreak havoc on you.
This is in breach of the Acceptable Use Policy and could potentially put the university's network and core business at risk.
Okay, I understand the whole AUP piece. I understand that it could be a problem for the network.
What I'm not sure I understand is how a simple program could "put the University's core business at risk". If that is a publicly funded University, I really object to that statement (it's not a business, it's a public service. It's nice if it's self funding, but the objective is not to turn a profit). If it's a private University, I suppose it is in fact a business. I really don't see how this will in any way interfere with teaching students and collecting fees. While I suppose the degradation of internet service and the raising of ISP charges would affect the bottom line, it surely doesn't affect the ability of the faculty to interact with students.
Yes, yes. I know, I've been down that path. My problem is twofold. First off, I need to upgrade for several reasons. However, in my experience, upgrades lead to new bugs and new performance problems, in general new problems all around. Right now the machine is stable and runs just fine, so I'm leaving it alone. I need to get to 9i. That's all there is to it. I believe even 8.1.7.4 is being EOL'ed (I thought it was on Dec 31, 2004). I need to get to 9i for support reasons. I'll get the security fixes then. The machine is due to get upgraded RSN (and has been for 18 months *grin*).
Second, if anyone who isn't part of my organization can get anywhere near that machine, I'm already so incredibly compromised it's not even funny. There are several layers of firewalls between it and anyone not physically in the server room. There is an application server that is allowed to access it, and that's pretty much it. Any hacker worth his salt would have complete control of all of my machines by the time he could get past the firewalls to get to that machine.
Hmmm, curious, I might actually have to take back what I said. You are correct, my copy of Oracle doesn't either (8.1.7.2). However, if you do a count on a field that allows NULLs, it doesn't count the NULLs. I was taught that count(*) was only supposed to count rows that have at least one field that is not NULL. Sometimes there is a difference, and you need to be aware of it. However, the Oracle documentation (Oracle 8i: The Complete Reference) describes the behaviour you state. Hmmm, maybe that's in a different database (the guy I learned a lot of SQL from was a Firebird head, back when it was still a commercial product by Borland called Interbase). I of course double checked what I wrote above by counting a field that was allowed to be NULL, and assumed that its behavior was consistent when applied to all fields.
I don't have a copy of the SQL standard handy to look at to see what behavior it describes.
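For anyone who wants to poke at this without an Oracle box, here's a minimal sketch using the SQLite engine that ships with Python. It's not Oracle, so it only shows what one engine does, and the table t and its columns are made up for the test.

```python
import sqlite3


def count_variants():
    """Return (COUNT(*), COUNT(1), COUNT(note)) for a tiny table containing NULLs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, note TEXT)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [(1, "a"), (2, None), (None, None)],  # the last row is NULL in every field
    )
    row = conn.execute("SELECT COUNT(*), COUNT(1), COUNT(note) FROM t").fetchone()
    conn.close()
    return row


print(count_variants())  # (3, 3, 1)
```

In SQLite at least, count(*) and count(1) both count the all-NULL row, and only count(note) skips the NULLs.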
I know it's a joke, but 2038 will be a lot more like 1902 or 1901 -> 1970 - (2038 - 1970) = 1902 (it could be like 1901 because of the extra months not included in the calculation). time_t's are signed. You can represent time fairly far back with a time_t. There's a 68-odd year range on either side.
Sorry, I know, I'm just being pedantic. The joke is more obvious and a lot funnier the way you are presenting it.
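If you want the exact dates, here's a quick sketch of the arithmetic, assuming a signed 32-bit time_t counting seconds from 1970-01-01 UTC. The helper names from_time_t and wrap_int32 are mine, not anything from a real library.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def from_time_t(seconds):
    """Convert a (possibly negative) time_t value to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def wrap_int32(value):
    """Wrap an integer the way two's-complement 32-bit overflow would."""
    return (value + 2**31) % 2**32 - 2**31


print(from_time_t(INT32_MAX))                  # 2038-01-19 03:14:07+00:00
print(from_time_t(wrap_int32(INT32_MAX + 1)))  # 1901-12-13 20:45:52+00:00
```

So one second past the 2038 limit lands in December 1901, which is where the "extra months" come from.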
In my experience, /bin/ls can be aliased just as well as ls can (I'm assuming that's how it was set up in .cshrc). If you want to make sure junk like that isn't there in your scripts, set up the environment from scratch. There are ways to start a shell with absolutely no environment, the easiest being to start the shell with no login scripts and no interactive scripts turned on.
Just hard coding /bin/ls is just as susceptible to the problem you are talking about. The real problem here is that you are using shell scripts to do real work. Stop that. If you really want something to work, write it in a real language that you really have control of. That's reliability. Shell scripts are nice, they are wonderful. However, it's precisely these sorts of problems that lead me to believe that re-writing the scripts in python, C, or perl is a good idea, especially if you avoid "system" and "popen" like the plague. In those cases, you control the environment much better, and have native data structures with well defined interfaces. Instead of using "ls", you use "readdir" and a loop of some kind.
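As a small illustration of the "readdir and a loop" approach, here's a sketch in Python, where os.scandir does the readdir work. The list_names function is just a name I picked for the example.

```python
import os


def list_names(directory):
    """Return the sorted entry names in a directory without spawning ls.

    No shell, no aliases, no PATH lookup: the names come straight from the
    OS as native strings. Unlike plain ls, dotfiles are included.
    """
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


if __name__ == "__main__":
    for name in list_names("."):
        print(name)
```

Nothing in somebody's .cshrc can change what that returns.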
Not true. count(*) will only count rows that have a non-NULL field in them. count(1) will return all rows whether or not they have a non-NULL field in them. Run count(fieldName) where fieldName has NULLs in it. It should not count the NULLs.
They are not the same. However, for 99.999% of all sane schema designs they are in fact the same. However, count(*) has to pull the data to ensure that there is a non-NULL field. More than likely, if it was clever, it would not pull the data if there was a primary key on any of the tables in the select statement.
If the data inside the tags is an important modifier (which in my case it won't be, but I'll consider it as a theoretical possibility), I can still parse 99% of the data correctly, and I might get 1% wrong until I realize it's a problem (furthermore, I could just completely skip the records that have sections I don't understand and import the rest that I do understand; the skipped records get logged and investigated).
The problem is I can't extract the documentation, either because said documentation doesn't exist, or because the vendor involved feels that the documentation is proprietary and won't give it to us. They want to get into the business we are in, and thus won't help their clients give data to us. If it was just standard to get data in XML format, I could easily get that done.
I really don't need clear documentation. Honest, what I deal with is fairly simple. I need to extract roughly 10-15 fields of information. I know what they are up front. 95% of the time, I get way, way, way more data than I ever needed. It's not like I have to parse every last bit. I just need to extract the 10 fields I need out of the up to several hundred I was sent. My problem is never that I can't find the data in the file. My problem is always that the export didn't handle the case where a delimiter was in a data field. (I have an amazing number of people who give us data that has commas in comma delimited data and who failed to use quotes or any other escape.) XML, if you use a simple off the shelf library, will solve all of those problems. I have a lot less to worry about if the guy who designed the export file format was an idiot.
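To show the delimiter problem concretely, here's a small sketch using Python's stock csv module. The to_csv and from_csv helpers and the sample row are made up; the point is only that an off the shelf writer quotes the awkward fields while a naive split on commas does not survive them.

```python
import csv
import io


def to_csv(rows):
    """Serialize rows with the stock writer, which quotes fields that need it."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


rows = [["Smith, John", 'said "hi"', "42"]]
text = to_csv(rows)

# The comma and the quotes inside the data survive the round trip.
assert from_csv(text) == rows

# An export that just joins on commas turns two fields into three.
broken = ",".join(["Smith, John", "42"])
assert len(broken.split(",")) == 3
```

The exports I get handed are the second kind.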
Next, sure, the STL isn't the end all be all of data structures. I hate some of the limitations you mentioned. I'm a professional C++ programmer, and I see a lot of what you are saying (I don't use a lot of STL algorithms, or other things, precisely because implementing a class to write a functor so I can avoid writing a three line for loop seems like overkill). However, if you need a linked list, a dynamically resizing array, a simple dictionary, or a simple set, the STL has it, and for 90% of the things I work on those semantics are exactly what I need. Writing your own linked list, binary tree, or resizing array is just stupid. The STL code will take your code and eat it for lunch speedwise, and when you realize you need an extra feature, it will already be implemented. Yes, it can't do circular lists.
I've seen an amazing number of people who think: "Linked lists aren't hard, I can write my own that's at least as fast as the one in the STL, and I can debug it easier". Uh huh. Sure. Right! Take the one that is already speed tested, well tuned, and really well debugged. When it comes right down to it, the only thing you'll need to customize to make it go faster is the allocator, which can make a large difference. About the only thing that drives me nuts about the STL is that it can segfault if you have a bad "operator less than".
No, I saw it. Conceptually, you are correct. In order to solve Social Security, people should just save money. In order to solve homelessness, people with spare rooms should take people in. In order to solve viruses, no one should open mail from people they don't know. In order to solve the Middle East problem, we should just declare a rotating schedule of who gets access to what religious sites. Wouldn't it be lovely if I could enforce my will on the world. I'd solve all sorts of problems quickly and easily.
I hate to beat you with reality, but declaring "XML has nothing over a properly documented file format" is fairly akin to my above solutions. NASA, the world's most anal retentive organization, has lost large amounts of data because they lost the documentation. I deal with lots of companies that didn't originate their data export. They have no control over it, and no documentation. There's nothing anyone involved can do about it (sometimes because the vendors are out of business, other times because the vendors feel that giving out their documentation for the export gives up a competitive advantage).
XML has beauty two ways. First off, you'll never have to write a proper parser ever again in your life. Yeah! Second, the documentation should exist in the DTD. The DTD should wander around with the data. Even if the DTD doesn't exist, the file format is still easy to parse, even the portions you have never seen before, because it is a well specified format. Unlike CSV. Unlike most of the 250-odd file formats I have had to deal with in real world situations. On top of all that, why on earth do you want everyone and their brother to make up their own silly file format? Dedicate the resources to make one good one that can cover 90% of the cases. (I know there are cases where XML won't fit.)
Next, I agree with you, "XML Databases" and "XML Language Modeling" are silly constructs. Unless you know the data could end up with someone you'll never meet or do business with, it shouldn't be in XML.
Well, I guess I'm not seeing the problem with modeling it. It shouldn't be that hard; you can represent the hierarchy fairly straightforwardly in a C++ parse tree. It should be blazingly obvious how to convert the C++ parse tree into XML to store an instance. (Which either still has the problem you describe, at which point I'd guess it is a flaw in C++, or you have a natural non-redundant XML representation.) It could be that teaching a datatype how to serialize itself at runtime doesn't naturally lead to encapsulating itself that way, but that is the obvious way that the XML should be structured.
As an aside, I've seen very, very few cases where multiple inheritance made much sense unless it was of the "virtual base class that defines only methods" type (a rough description of Java's single inheritance + the "implements" model). Multiple inheritance was fairly vile for a long time, especially before C++ was standardized. Personally, I never ever use it (for the problems I work on it's not terribly natural). Almost always, I can use templated derived classes, or implement some functionality in helper functions up and down the class hierarchy and then pick the functionality I want in the classes you can actually create instances of. That generally eliminates all the uses of multiple inheritance I see people use, and makes the code easier to maintain (I've generally found that if you can eliminate multiple inheritance you have simplified the API and programming model, which is uniformly a good thing in my experience).
Maybe I spent too much time writing Objective C, and just got used to designing code with Protocols and single inheritance in my former life.
You've never ever had to deal with an external company to get data, have you? I deal with thousands (literally thousands). We have defined our own CSV format, of which in the end we have had to accommodate about 4-6 variations depending on the type of CSV (gotta love CSV, there are different variations on how to encapsulate/escape commas and quotes).
Then there are the 100 (it's probably closer to 250 formats, of which only 100 are still being used) other formats we accept in. Most of them had to be manually reverse engineered. We are exporting data out of a third party software. The people on the other end of the phone didn't write the software, they don't know how the software works. So we get to deduce all of that from them.
The 10th time you come across a variable length record file where the sample didn't have all of the record types that exist, you'll be more than a touch peeved about it. Okay, so *K records are 432 bytes, *M records are 345 bytes. Aggg! I've never seen a *I record before. Whoopie, I get to go add one line of code, recompile, and deploy the software to overcome the lack of file documentation. You'll think to yourself, you know, if this was in XML, I could just skip this whole sub-tree. All the data I need, I've already found, so just clue the parser in to read to the end of the record, and I'll start on the next record.
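Here's roughly what "just skip the sub-tree" looks like with a stock parser, sketched in Python with xml.etree.ElementTree. The element names (record, name, account, amount) are invented for the example and so is extract_records; real exports would obviously use whatever tags the vendor picked.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names: the handful of values we actually care about.
WANTED = ("name", "account", "amount")


def extract_records(xml_text, record_tag="record"):
    """Pull the wanted fields out of each record, ignoring everything else."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(record_tag):
        # Only the children we know about are read. A sub-tree we have never
        # seen before is simply never looked at; no new code, no redeploy.
        records.append({tag: rec.findtext(tag) for tag in WANTED})
    return records


sample = """<export>
  <record>
    <name>A. Customer</name>
    <mystery><deep><deeper>never seen this before</deeper></deep></mystery>
    <amount>12.50</amount>
  </record>
</export>"""

print(extract_records(sample))
# [{'name': 'A. Customer', 'account': None, 'amount': '12.50'}]
```

A missing field comes back as None, which is something you can log and investigate instead of a parser that falls over.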
In the end, I'd much rather throw a bit of RAM and some CPU at the problem and teach the software the semantics it needs to know to read XML DTDs to figure out how to extract data, instead of trying to reverse engineer a binary file format (gotta love people who use file formats for export but forget that endianness exists, so the same file format isn't portable across platforms). Figuring out the parsing of each file type is the part that is time consuming. Once you get the actual state machine to parse the data built, it's cake to actually extract the information I need.
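For the endianness gripe, a tiny sketch with Python's struct module shows what goes wrong and what a well behaved binary export would do. The record layout (a 32-bit id followed by a 64-bit amount in cents) and the function names are invented for the example.

```python
import struct

# "<" pins the byte order to little-endian with standard sizes and no padding,
# so the bytes are the same no matter what machine wrote them.
RECORD_FORMAT = "<Iq"  # hypothetical layout: uint32 record id, int64 amount in cents


def pack_record(record_id, amount_cents):
    return struct.pack(RECORD_FORMAT, record_id, amount_cents)


def unpack_record(data):
    return struct.unpack(RECORD_FORMAT, data)


data = pack_record(1, 1250)
assert unpack_record(data) == (1, 1250)

# Reading the same four bytes with the wrong byte order gives garbage.
assert struct.unpack(">I", data[:4])[0] == 16777216
```

Formats that just dump native structs to disk skip the first step, and you get to find that out from the garbage values.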
There's a reason everyone documents a file format. Why not just use a general case one until you see that there is in fact a speed problem? 90% of the time, the cost of parsing the damn data and the storage overhead it incurs are minimal relative to the amount of resources wasted writing efficient parsers for each file type. The documentation and validation of the data is built in to XML (yeah, no more hand-rolled error detection for systems that are newline oriented when the input system accepts carriage returns in the fields, or for one particular corrupted row in a huge CSV file). The file format is well maintained and can express all data desired (I've seen more than my fair share of file formats that, if you put the delimiter in a data field, won't bother to escape it. I love those).
Just like I've learned to love the STL. The STL is better, faster, more memory efficient, and better organized than any code I'll probably ever write in my life. Your arguments sound very similar to the idiots I see espousing how they can outdo the STL, right up until I make them benchmark their code against the most naive STL implementation you could imagine of their code.
I'm fairly sure the same is true of the XML parsers. Sure, the files themselves are overly verbose. However, barring just ludicrous constraints (either enormous files, or really limited resources on the device), does it make any difference?
I'd never use XML internally to transport data around (in that case your arguments make some sense). My problem is that there are thousands of people who might want data from me. I'd much rather use XML and tell them to use a stock XML parser than give them the EBNF to read my format and some sample Lex/Yacc on how to process it. I would provide an XML export to any third party who wanted one. It'd be extremely simple, and it'd solve a ton of problems for me if I could just send and receive data in XML. I'd have saved several man years here at my job if everyone who handed me data did it in an XML format. Sure, I'd have had to spend $5-10K extra on CPU and drive space (given that I've spent on the order of $250K-$500K, it would have been no big deal).
For starters, that's the first time you said it only catches script kiddies (that might be true). I can't argue that for sure either way (I don't know that any of the honeypots have ever given out personal information about the people they catch; several of the logs have shown them to be script kiddies, but that's not conclusive evidence). You claim that anyone but a script kiddie would know after 15 minutes. Well, they've already given the game away in the first 15 minutes. You now have successfully logged how they got there. On unpatched machines that might be less interesting. So set up a completely up-to-date one. It's not complex.
Prior to the previous post, you said, "It can't catch zero day exploits", which it can, and has in the past. I've pointed that out, and offered to cite a source on it (it's "Honeypots" by Lance Spitzner, I don't have a page reference right off hand).
Second, you might not find that valuable data, however "valuable" is in the eye of the beholder.
So set up a completely patched honeypot, and watch that one. Christ, they haven't, but that doesn't mean it can't be done or isn't interesting. One of the more interesting things, if you track down the original paper and read it, is that 2 of the cracks didn't get cracked via binary flaws; they were brute force password attacks (which in and of itself is interesting, to me at least). Plenty of people could set up production honeypots. I'll bet Google does. I'll bet Yahoo does. I'll bet American Express does. They have machines that are there to be attacked, and serve no other purpose.
I'll bet they have machines setup in the middle of their internal network that are specially logged via a transparent bridge (I've set one of these up before), that sits and captures all packets that cross the interface (make sure it doesn't munge the MAC addr is about the only trick). It's in the DNS server. It's fully operational just like the 10 other machines just like it. It just sits there in the middle of any number of other machines. When traffic crosses that bridge that isn't arp traffic, bells and whistles go off.
The reason they use unpatched machines is to keep the deterrence factor low, so people will easily be successful in the attack. My guess is that Amazon, AmEx, Yahoo, Google and any number of others have machines they want to get attacked with the full security setup, specifically so they have machines that are safe to pull off line once they realize a hack is being attempted. I wouldn't be shocked to see that they have a network of such machines that communicate with each other. That way the entire system looks busy enough to be a live system and doesn't give the game away too quickly to the hacker. So data is flowing thru the system, just not data you really care about. It wouldn't be too incredibly hard to just replay data from yesterday thru the system. In a well designed message passing system, that's all you have to do. Treat it just like every other machine, and make sure it has load passing thru it. Log all the packets via transparent bridges that have no TCP/IP configured, just plain-Jane Ethernet 802.3 repeaters (use a Linux box, it's trivial). Put in scads of hard drive space that writes really fast. Spool it to tape with a big tape drive. That's a production honeypot on a production system, indistinguishable to a blackhat from the production system until after he has broken into a larger number of machines. Honeypots are left easy to break into specifically so they will succeed first. So have an easy set, and a hard set. Geez.
This point of data is interesting to me, as it clues me in that I can't just update a Windows machine over the internet from a fresh install. I'll have to have the security patches, or I'm screwed. However, it appears that with a Linux box, assuming I shut down enough services, I can feel relatively safe updating it via the network even from a scratch install (generally I never ever do an install off known media, but it's a warm fuzzy to know that I have less and less to worry about a hack being available and me not having the update immediately).
You have a novel definition of "a little bit of your paycheck".
Last time I checked, SS is somewhere between 12% and 15% of my paycheck (Yes, I only pay 6% or 7% of it, but I could negotiate a better salary from my employer if they didn't have to pay the other half. I'm smart enough to know that the other half they are taking from me is mine, but they are smart enough not to put it on the paystubs of every last worker in the U.S.).
I'm fiscally responsible, and given another 12%-15% of my paycheck back, I could retire 20 years earlier than the SSA will pay me a dime, and as a bonus, because people in my family generally don't live long enough, I could even enjoy some of the benefits of the system I'll end up paying into right up until the day I die! Maybe I'd even get to pass that money on to others in my family to help them form the nest egg they could use to retire on. I can't ever use a lot of the retirement funding they set up tax shelters for. Unless I'm special in my family, I won't ever get the 401K or SS benefits. I still put money into the 401K as I can use it in other ways.
This is a simple demographics problem. The simple problem is that the boomers didn't pay in enough while they were younger (and also that a lot of people got benefits who never paid into the system when it was originally started). Some of the rest of the problem is that people are living much longer, and are consuming more resources than had been anticipated.
It was an ill-conceived system. If you think that individuals can't manage to plan for their future, take a good long hard look at what the Gov't is doing to try and plan for an entire country's future. I'd much rather leave it to individuals, and let them participate in other private sector mechanisms (like say, insurance).
You keep stating that like it's a fact, despite counter evidence. Honeypots have discovered previously unknown attacks in the wild. Full stop. (Unless you have a different definition of "zero-day exploit" than I do, that's a counter example to "Honeypots can't possibly capture a zero-day exploit".) Who used them, and how they discovered them, is unknown.
I'm not saying they have discovered them all, but they do discover them. Any number of honeypots are intentionally put into the middle of existing production networks by people who do have valuable data, specifically so a blackhat will attack it with all his best tools and they can be aware of them. Things that are dead giveaways to blackhats would be avoided in those situations.
The Honeynet Project does do some things that make it fairly obvious that you are being captured. However, don't fall into the trap of believing that blackhats are all knowing, omniscient gods of computing. Some of them are very, very good at what they do. Any number of master criminals get caught because they've been lured into doing something silly, in both the real world and the computer crime world.
For example, if someone got root using some local exploit no-one had seen before we could reverse engineer the script they used and fix the bug. But this has never happened
You really should read up on the honeynet project sometime before saying silly things like this.
For starters, they have in fact found previously unknown exploits (at least one, but possibly several). I forget the exact details offhand, but it is covered in "Honeypots" (a pretty decent book). They cover it in the section about different types of honeypots and what they are good for. They discovered a hole in a network service that was previously unknown on Linux machines several years ago when the project first started. I can cite it tomorrow if you really don't believe me (the book is at home, I'm not). A lot of blackhats give out zero days as a way of gaining credibility. While it wasn't a zero day, a honeypot was one of the first things used to figure out how one of the major worms worked (Code Red I think, but it might have been one of the others).
Also, black hats need a platform to mount their attack from that they can easily own without worry. So they attack home networks, knowing that they can completely own a box and wipe the logs. Meanwhile, they can mount attacks from those machines onto others that are important. They need the intermediate machines to be anonymous. They might want to attack "American Express", or "Amazon.com". Anyone with any brains doesn't attack those from the IPs known to be in their basement. They find other machines that will have no logging, or logging that can be completely compromised, to use as a base of attack. Then the trail to find them dies at these random machines on the internet.
Besides that, anyone wanting to implement an "Andy Warhol Worm" needs to find a set of machines that have an exploit available. In order to find those, one has to start attacking random machines on the internet. The honeypot project could capture that (I don't know that they have, but it would be a very good use of it).
Finally, I don't have any important machines, so information about random machines on the internet fits me to a "T". I am more interested in what the script kiddies are doing, and what sorts of attacks they are making. The honeynet project does provide details about what JRandom guy with an IP on the internet can expect to be hit with.
I'd venture to say that no science experiment ever conducted has ever been under "the same conditions". It's merely a matter of how close the conditions are, and why everything else doesn't matter. You figure that out by starting with measurements, and when you can't explain something, you guess why and form a model. Then you try to set up a situation to measure if your guess is correct. Any number of "Scientific" measurements aren't repeatable (the analyses of any number of astronomical events are unique to our lifetimes and are unrepeatable in the sense you are using).
You can only draw those conclusions about water because someone has done all the scientific measurements before you.
We didn't figure out gravity all at once. Some guy started dropping balls and measuring time. Some guys started measuring the time it took to roll down planks. Eventually they made lots of measurements that were a "big boiling pot of useless variables", and figured out that air resistance makes a difference. That if you measure incredibly accurately, the latitude and longitude (more specifically your distance from the center of the earth) matter. Even more accurately, the time of year does matter (our distance from the sun changes). They sorted out the patterns in the data. What they are doing is called "basic science". It isn't sexy, and it isn't useful right away. However, to start something that is a "science", you have to start by making measurements and then explaining them. Explain to me, roughly speaking, how one makes "Scientific" measurements on the internet where you have control groups? How precisely does one set up a second worldwide internet that is identical in all ways except one has an extra Linux machine on it? Maybe if they continue to make such measurements, they might figure out what the variables are.
That's precisely what they are doing. I'd have to read the actual statement they made to see how well they are lying with statistics. My guess is the statement they made was accurate and accurately captured what it was they measured.
Also, I'm going to guess they used the same RedHat distributions (or at least had all of the old ones, and some new ones), and they used all the same old IPs (or at least used all the old ranges, and some new ranges). So I'd further venture to guess that your "boiling water" analogy is incorrect. I've read about these guys quite often. They are fairly "scientific" about what they do, and how they do it. The biggest problem they have is the manpower to set up and analyze the machines and attacks. Which is really a function of their other big problem, a serious lack of financial resources. What they are doing, on a large scale, would result in really useful measurements. Sure, what they are doing is on the level of "Grade School Science Projects" in terms of the scale and quality of science. However, that doesn't make it any less "scientific".
As to this:
get an experiment that is so wildly useless that you can't honestly call it scientific
Useful science, is called "Engineering". Useless science is all over the place. Science is about forming a hypothesis, setting up a way of measuring your hypothesis, then analyzing the data after the fact. This sure seems to fit the bill. Useless Science, is how all science started. Next you'll tell me Linux isn't at all like Unix, because it started out life as a useless terminal program.
You are approaching that all incorrectly. I haven't read the study, but from a general understanding of honeypot theory it is "scientific".
They have an experiment they run, and they measure the outcomes. The measurements over time have changed. They compared the measurements.
That's pretty much the textbook definition of "scientific" and "statistics".
No, this "study" might be an anecdote (I'm unaware of how many machines they have). However, it is a "fact" that N months ago, an unpatched Linux system put on the Internet used to last X minutes on average. A more recent measurement shows that it now lasts M * X minutes before being compromised. I'm fairly sure these people have several measurements at several points in time (I've read similar measurements like this from the same people a number of times).
That's a controlled experiment (technically speaking, the old measurement is the "baseline"). It's an interesting fact. It doesn't mean "Linux is getting more Secure". It means that on average it appears that a Linux machine without security patches lasts longer before being compromised. That could be because of the cost of beef in Tokyo. It could be because Linux is more secure. It could be because Linux is a low priority target for blackhats. It could be because the IP ranges used this time are known honeypot addresses by the blackhats (which is one of the few causes of problems that would make this "fact" useless to me).
It's not a measurement of causation. It's not a measurement of security. It's a scientific measurement of a length of time. Just like measuring the length of daylight outside. You can measure that scientifically. It won't explain seasonality. It won't explain the tilt of the earth. It won't explain the nature of quantum mechanics. However, it will be an accurate measurement of what it is: "How long the sun was up". Sure, it's not the world's most useful fact that Linux machines are lasting longer before being successfully attacked, but it is novel for those of us who have Linux machines on the Internet. However, its lack of being the end-all-be-all theory of Linux security doesn't mean it isn't a well defined measure.
In this case, there really isn't. Perl is for pattern matching, so that's probably one of the most commonly referred to pages in the world.
If I was a perl-head, and that was my reference material (which it would be), I'd have known the answer in an interview. Especially if I had just started learning perl. (I know some perl, but not enough to actually read up on it. C and C++ interest me enough that I consider reading technical documentation about them "fun").
Heck, that was my guess as to what it was (I knew it had to be one of a handful of concepts: pattern matching, a section describing the difference between scalar and associative context, or the first section that mentioned CPAN, in that order of likelihood). Pattern matching is what perl was invented for, if I remember my history correctly. That page wasn't chosen at random by the questioner, it was chosen for a specific reason. They didn't pick page 12, they didn't pick page 275. They picked the page that had documentation about the essence of what Perl is, and a page that would likely be referred to early and often.
When we interview people, we have questions of that nature (not insane questions, but questions that we don't expect anyone to answer correctly, but if you do we feel that you can easily be assimilated).
Heck, I know some football fans who probably have the Superbowl on tape from that year. So, "only known" surviving copy is sorta like saying my copy of some obscure TV show is the only known copy, because I've never seen anyone else with it.
I can't play it, but here's a link I found for it by Googling: Link.
If you go googling for it, there are lots of references to it, and plenty of places that appear to be able to display it. I can't play them, as I can't run QuickTime or .mov's at work (I'm not installing a movie player at work just to know that they are correct). Clicking on it gets me a 2.2MB file.
Kirby
Most of the people who I know who tell me about the old time UNIX, tell me:
AIX was in general a real pain in the ass (this might have changed in the intervening 5-10 years since their experience). That it was even more different than Solaris was in terms of the arcane knowledge you needed to administer it properly.
I've known more than a few DBA's and SA's who have told me that at various points, it was easier to just run screaming from AIX than to deal with it. Eventually, most all of it could be overcome, but learning all of its pitfalls could be a very painful experience (after reviewing more than a few of the Oracle bug reports, I can see that it would terrify me to run Oracle on anything from the 4.[23] era). When given a choice, 9 times out of 10, everyone I've ever dealt with would rather run Solaris than AIX. Maybe it's because I work in a city where most everyone uses Solaris, so its flaws are just well known potholes everyone avoids out of habit.
I've heard horror stories about AIX, HP-UX (HP's UNIX), DYNIX (Sequent's UNIX), IRIX (SGI's Unix), SCO, and OSF/1 (DEC's UNIX). Actually, I can't remember too many SCO administrative nightmares, but that might be because not too many people I know have ever dealt with UnixWare.
Most of them, I can't even recall, but I remember the AIX goop quite clearly, as it never sounded very UNIX'y to me.
In the end, I've always been told that doing anything not thru SMIT (I believe it was referred to as "smitty") was a bad idea. That just hand editing files was a recipe for disaster in their experience. I thought they said that the files in /etc got output for compatibility, but that there was a binary backend that was authoritative, and could be accessed via a programmatic API. All that sounded like a disaster waiting to happen. Now, I might have been informed incorrectly, or my knowledge might be years and years out of date.
Finally, I've found that shipping two sets of commands is a recipe for disaster when shell scripting. I'd much rather have one or the other, but not both (a shell script run as one person won't work when another runs it on the same machine). In the end, it's a source of more problems than just learning the native tools. I never minded having two command sets (gmake, gcc, gawk), but having to figure out if make is IBM make or GNU make always seemed silly to me.
Kirby
I agree with you, I find some of the incompatibilities scary. For example, "reboot" and "halt" either had, or still have, very different behavior on Solaris than they do on Linux. On Solaris, they are immediate panic type commands; on Linux it's an orderly shutdown. A friend of mine always uses init 0 or init 6 to get that behavior on both.
However, there are probably just as many Solaris heads out there going, "I hate Linux, why can't I just re-use my everyday Solaris knowledge". AIX is so different to administer, I'm shocked you include it as "Linux-like". (Note, I've never used AIX, but from what I've been told, everything runs thru some admin tool that edits binary files for configuration instead of the standard human readable text files used under Linux).
Kirby
You have a "standing query". So you can ask things, like, what's the rolling average for the last 60 seconds for this ticker name. What's the minimum price for this commodity.
You can ask to correlate things. Store the last 90 minutes worth of transactions on these commodities. Search for these types of patterns.
It sounds like what they have done is build an OLAP cube that builds its dataset on the fly by processing messages coming over a streaming interface.
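To make the "standing query" idea concrete, here is a minimal sketch in Python. I have no idea how their product actually does it; the class name, the (timestamp, price) tick shape, and the 60 second default are all my own assumptions for illustration. The point is just that the answer is kept up to date as each message arrives, instead of being computed from disk after the fact.

```python
from collections import deque


class RollingWindow:
    """Hypothetical standing query: rolling average and minimum for one ticker."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.ticks = deque()  # (timestamp, price), oldest first
        self.total = 0.0      # running sum of the prices still in the window

    def add(self, timestamp, price):
        """Feed one message from the stream, then drop anything too old."""
        self.ticks.append((timestamp, price))
        self.total += price
        cutoff = timestamp - self.window_seconds
        while self.ticks and self.ticks[0][0] <= cutoff:
            _, old_price = self.ticks.popleft()
            self.total -= old_price

    def average(self):
        """Rolling average over the window, or None if the window is empty."""
        return self.total / len(self.ticks) if self.ticks else None

    def minimum(self):
        """Minimum price in the window, or None if the window is empty."""
        return min(price for _, price in self.ticks) if self.ticks else None
```

You would keep one of these per ticker name and call add() for every message that comes off the wire; nothing ever has to hit a disk to answer the question.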
It's much smarter to do that than to write every last transaction to disk, and then query the transactions after the fact. That'd be the natural way to think about it if you used a relational database.
Essentially, it sure sounds like he's written a generalized packet filter, that can compute interesting functions on the data. Think snort, think ethereal, think iptables, think policy routing. Now apply those kinds of technology to "The price of this stock", "the location of that soldier", where those values are embedded in a network packet frame somewhere.
While each single application of this sounds trivial to implement, if he has done it in a generalized way that can keep pace with larger systems, bully for him.
The irony of all this for me is that at a former job, I used to process medical data exactly this way. It sounds like the HL7 interface issues we used to have. You couldn't possibly take a full HL7 stream and process it, so you'd filter it down to just the patients that this department was interested in. Then only process messages about those patients.
Even for those patients, there were rows you weren't interested in that you had to filter out. You spent a bunch of time filtering, and re-filtering.
We wrote the raw messages to disk, and spooled them to ensure we didn't miss messages due to database problems (if the database was down, you had to spool until the database came back up; it was unacceptable to miss patient records because of database maintenance).
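Roughly, the spool-then-filter pattern looks like the sketch below. This is not the code we ran; the function names and the one-message-per-line spool format are made up for illustration, and it assumes plain pipe-delimited HL7 v2 messages with carriage returns between segments and the patient identifier in the first component of PID-3.

```python
import os


def patient_id(message):
    """Return the first component of PID-3, or None if there is no PID segment."""
    for segment in message.split("\r"):
        fields = segment.split("|")
        if fields[0] == "PID" and len(fields) > 3:
            return fields[3].split("^")[0]
    return None


def spool_then_filter(messages, wanted_ids, spool_path):
    """Write every raw message to the spool first, then keep only wanted patients."""
    kept = []
    # newline="" stops Python from translating the carriage returns in the messages.
    with open(spool_path, "a", encoding="utf-8", newline="") as spool:
        for message in messages:
            # Spool before doing anything else, so a database outage loses nothing.
            spool.write(message + "\n")
            spool.flush()
            os.fsync(spool.fileno())
            if patient_id(message) in wanted_ids:
                kept.append(message)
    return kept
```

The filtered list is what goes on to the database; if the database is down, the spool file is what you replay once it comes back.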
Kirby
Political pressure was put upon NASA to launch a vehicle during this launch window. I forget the details behind what it was. If I remember right, there wasn't another window for several weeks if they missed this one.
Second, the O-Ring problem was blatantly known. There's a reason the demonstration the NASA Engineer put on, where he pulled the O-Ring out of water and pulled on it, was so blatantly bad. It's my understanding from reading the comments Richard Feynman insisted be added as an appendix to the report, that essentially the right people in NASA knew it was going to blow up. However, they justified it with "Well, the O-Ring is three times as thick as it needs to be, so the problem it is showing where it has a 1/3rd erosion is not a problem". You can read up on it here
Feynman essentially accuses them of using previous success as evidence that all future launches will be a success. That's not good science or good Engineering. I think Columbia was screwed from the moment they made orbit (they might have been able to abort pre-orbit, post foam collision. I'm not sure on that). However, with Challenger, they KNEW they had an important piece behaving oddly in a way they didn't understand, while launching under extreme conditions. That's not being particularly safe.
While I agree with you on Columbia, I strongly disagree with your characterization of Challenger.
Also, I'm absolutely positive those parts haven't been sitting in a warehouse since 1965. We were fairly busy with the Mercury and Gemini missions in that time frame. My guess is they got invented no earlier than 1975, and made no earlier than 1980 or so. I'd have to go look into the history, but I'm reasonably sure the drawings hadn't even been brought out before 1972 or 1973, screw making parts to a specification.
Kirby
You can easily adapt the RedHat scripts to run on Slackware. Personally I would recommend setting up nagios or some other monitoring software. Every time something goes wrong on a machine, we write a script to monitor that. Now, very few things go wrong unnoticed.
We write monitoring scripts that run via nagios, that check that out. Within 10 minutes of a drive failing I have a page, within 5 I have an e-mail (there's a five minute latency on nagios recognizing the problem, and about a five minute latency from the time the paging company gets the page until the pager goes off). That's pretty much the worst case scenario.
I'd really much prefer that to not having a RAID array. We've used that system (*knock*,*knock*,*knock*) for 4 years, and with about 5TB of filesystems at work, we've never ever lost a RAID'ed filesystem (worst case was the SCSI locking up due to a driver failure, but I think that would have happened even with no RAID configuration; the machine had to be power cycled, but the filesystem was still intact).
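For what it's worth, a check along these lines doesn't have to be fancy. Below is a rough sketch of one in Python, not the script we actually run. It assumes Linux software RAID (md), the usual /proc/mdstat layout where each array's status line ends in something like [UU] or [U_], and the standard nagios plugin convention of exit code 0 for OK, 2 for CRITICAL and 3 for UNKNOWN. The function names are made up.

```python
import re
import sys

OK, CRITICAL, UNKNOWN = 0, 2, 3  # nagios plugin exit codes


def check_mdstat(text):
    """Return (exit_code, message) for the contents of /proc/mdstat."""
    degraded = []
    arrays = 0
    current = None
    for line in text.splitlines():
        name = re.match(r"^(md\S*)\s*:", line)
        if name:
            current = name.group(1)
            arrays += 1
            continue
        # The status line ends in e.g. [UU] (healthy) or [U_] (a member is missing).
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if status and current and "_" in status.group(1):
            degraded.append(current)
    if arrays == 0:
        return UNKNOWN, "UNKNOWN: no md arrays found"
    if degraded:
        return CRITICAL, "CRITICAL: degraded arrays: " + ", ".join(degraded)
    return OK, "OK: %d md arrays healthy" % arrays


def main(path="/proc/mdstat"):
    """Entry point for nagios: print one status line and exit with the code."""
    with open(path) as f:
        code, message = check_mdstat(f.read())
    print(message)
    sys.exit(code)
```

Have nagios run a small wrapper that calls main(); it only needs the one line of output and the exit code to decide whether to page you.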
We have lost several, incredibly important filesystems that weren't RAID'ed. Technically speaking the filesystem wasn't important, however, the downtime was really bad during the rebuild/recover phase. The first time we lost $10K due to a failed IDE disk that was 4 years old, we convinced the boss that he should really purchase us mirrored SCSI disks for all the OS drives, it was a cheaper one time cost.
If you have spare drives around, you can configure mdadm to automatically add them into the system. Unlike with the standard md tools, you can have one spare for any number of md arrays.
Kirby
Next, I'm saying that I'm confident that if a phisher can figure out how to write to your /etc/hosts file, it's merely a matter of time until they write to wherever your certs are installed. They will install a cert that makes them the equivalent of Verisign. There's a file on your machine that is all you have that makes you trust Verisign. I can create one of those files, call it "Phisher Cert's R US".
Then any site that has a cert signed by "Phisher Cert's R US" will not give you an alert in IE.
If you aren't actively checking your cert files, that could be a serious issue. To the best of my knowledge, your cert files aren't cryptographically checked in any manner. I know you can just add a cert to your own machine to make self-signed cert messages go away.
If you have to contact Verisign in order to authenticate your cert with them, that's not a problem either. They control your DNS via the hosts file. They will direct you to their site and feed you bogus information. What a wonderful thing.
The problem is that, to the best of my knowledge, nothing will alert you that your cert files have been tampered with or added to. However they're signed, I have to be able to add certs to them myself. Phishers can just set themselves up as a cert provider you trust.
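If you wanted to actively check those files yourself, the crude way is to fingerprint them and compare against a baseline you took when you trusted the machine. A minimal sketch in Python follows. The function names are mine, where your cert files live depends on your OS and browser (so the paths are left to you), and the baseline is only worth anything if it's stored somewhere the phisher can't also write to.

```python
import hashlib


def file_fingerprint(path):
    """SHA-256 of a file's contents, as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(baseline, paths):
    """Return the paths that are new or whose contents no longer match the baseline.

    `baseline` is a dict of path -> fingerprint recorded earlier.
    """
    return [p for p in paths if baseline.get(p) != file_fingerprint(p)]
```

This only tells you that something changed, not whether the change was legitimate (a normal update will trip it too), but that's still more of an alert than you get by default.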
Kirby
Besides all that, I'm fairly sharp about my security, and I know most of the fundamentals of the math behind it, and I wouldn't be shocked if my bank switched SSL keys because their old one just expired. Imagine the bedlam that would ensue if everyone did freak out, just because a key had changed.
Now, if they hijack a DNS server, or break into Verisign and get the secret key, they are in (or more likely, one of the smaller SSL key providers that have default keys on Microsoft IE installs).
I don't remember the exact details of how you use the certs on your desktop machine, but if at any point you have to connect to Verisign, they have you. They control the IP where you believe Verisign is located. The trick will be that you have to establish cryptographic trust of the files you use, and of every bit of information between you and completing the transaction. If they are able to control any stage of the transaction, they can wreak havoc on you.
Kirby
Okay, I understand the whole AUP piece. I understand that it could be a problem for the network.
What I'm not sure I understand is how a simple program could "put the University's core business at risk". If that is a publicly funded University, I really object to that statement (it's not a business, it's a public service. It's nice if it's self funding, but the objective is not to turn a profit). If it's a private University, I suppose it is in fact a business. I really don't see how this will in any way interfere with teaching students and collecting fees. While I suppose the degradation of internet service and the raising of ISP charges would affect the bottom line, it surely doesn't affect the ability of the faculty to interact with students.
Kirby
Second, if anyone who isn't part of my organization can get anywhere near that machine, I'm already so incredibly compromised it's not even funny. There are several layers of firewalls between it and anyone not physically in the server room. There is an application server that is allowed to access it, and that's pretty much it. Any hacker worth his salt would have complete control of all of my machines by the time he could get past the firewalls to get to that machine.
Kirby
I don't have a copy of the SQL standard handy to look at to see what behavior it describes.
Kirby
Sorry, I know, I'm just being pedantic, the joke is more obvious and a lot funnier the way you are presenting it.
Kirby
Just hard coding /bin/ls is just as susceptible to the problem you are talking about. The real problem here is that you are using shell scripts to do real work. Stop that. If you really want something to work, write it in a real language that you really have control of. That's reliability. Shell scripts are nice, they are wonderful. However, it's precisely these sorts of problems that lead me to believe that re-writing the scripts in python, C, or perl is a good idea. Especially if you avoid "system" and "popen" like the plague. In those cases, you control the environment much better, and have native data structures with well defined interfaces. Instead of using "ls", you use "readdir" and a loop of some kind.
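As a sketch of what I mean, here is the "readdir and a loop" version in Python, using os.scandir (which is the readdir-style call in the standard library). The function name is made up. Nothing here depends on $PATH, on which ls is installed, or on how anyone's shell splits words.

```python
import os


def list_regular_files(directory):
    """Return the names of the regular files in `directory`, sorted.

    Names come back as one string each, so spaces or other odd characters
    in a file name can't be mis-split the way parsed `ls` output can.
    """
    names = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                names.append(entry.name)
    return sorted(names)
```

You get back a native list you can loop over, instead of a blob of text you have to parse.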
Kirby
They are not the same. However, for 99.999% of all sane schema designs they are in fact the same. The difference is that a count over a column has to pull the data to make sure the field is non-null, while count(*) just counts rows. More than likely, if it was clever, it would not pull the data if the column being counted was a primary key.
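Here's a quick way to see where the two can differ, assuming the comparison is count(*) against a count on a nullable column. The table and column names are made up, and I'm using SQLite via Python only because it's handy; in SQLite, at least, count(*) counts every row and count(column) skips the NULLs.

```python
import sqlite3


def count_both(rows):
    """Return (COUNT(*), COUNT(note)) for a throwaway table filled with `rows`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, note TEXT)")
    conn.executemany("INSERT INTO t (id, note) VALUES (?, ?)", rows)
    star, column = conn.execute("SELECT COUNT(*), COUNT(note) FROM t").fetchone()
    conn.close()
    return star, column
```

If the column you count can never be NULL (a primary key, say), the two numbers always agree, which is why it rarely matters in a sane schema.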
Kirby
The problem is I can't extract the documentation, either because said documentation doesn't exist, or because the vendor involved feels that the documentation is proprietary and won't give it to us. They want to get into the business we are in, and thus won't help their clients give data to us. If it was just standard to get data in XML format, I could easily get that done.
I really don't need clear documentation. Honest, what I deal with is fairly simple. I need to extract roughly 10-15 fields of information. I know what they are up front. 95% of the time, I get way, way, way more data than I ever needed. It's not like I have to parse every last bit. I just need to extract the 10 fields I need out of the up to several hundred I was sent. My problem is never that I can't find the data in the file. My problem is always that the export didn't handle the case where a delimiter was in a data field. (I have an amazing number of people who give us data that has commas in comma delimited data, who failed to use quotes or any other escape). XML, if you use a simple off the shelf library, will solve all of those problems. I have a lot less to worry about if the idiot who designed the export file format was an idiot.
Next, sure, the STL isn't the end all be all of data structures. I hate some of the limitations you mentioned. I'm a professional C++ programmer, and I see a lot of what you are saying (I don't use a lot of STL algorithms, or other things, precisely because implementing a class to write a functor so I can avoid writing a three line for loop seems like overkill). However, if you need a linked list, a dynamically resizing array, a simple dictionary, or a simple set, it's all there. For 90% of the things I work on, those semantics are exactly what I need. Writing your own linked list, binary tree, or resizing array is just stupid. The STL code will take your code and eat it for lunch speedwise, and when you realize you need an extra feature, it will already be implemented. Yes, it can't do circular lists.
I've seen an amazing number of people who think: "Linked lists aren't hard, I can write my own that's at least as fast as the one in the STL, and I can debug it easier". Uh huh. Sure. Right! Take the one that is already speed tested and well tuned, and really well debugged. When it comes right down to it, the only thing you'll need to customize to make it go faster is the allocator, which can make a large difference. About the only thing that drives me nuts about the STL is that it can segfault if you have a bad "operator less than".
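To show what I mean about the delimiter problem going away, here is a small sketch using Python's standard xml.etree.ElementTree. The element names and function names are invented for the example. The library does the escaping on the way out and the unescaping on the way in, so a comma, an ampersand, or an angle bracket in a field is just data.

```python
import xml.etree.ElementTree as ET


def export_record(fields):
    """Serialize a dict of field name -> text as one <record> element."""
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return ET.tostring(record, encoding="unicode")


def extract(xml_text, wanted):
    """Pull out only the fields we care about and ignore everything else."""
    record = ET.fromstring(xml_text)
    return {name: record.findtext(name) for name in wanted}


def naive_split(line):
    """What a sloppy comma delimited import does, for comparison."""
    return line.split(",")
```

For example, naive_split("Smith, John,Akron") hands you three fields where there should be two, while the XML round trip gives back "Smith, John" untouched, along with only the fields you asked for.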
Kirby
I hate to beat you with reality, but declaring "XML has nothing over a properly documented file format" is fairly akin to my above solutions. NASA, the world's most anal retentive organization, has lost large amounts of data because they lost the documentation. I deal with lots of companies that didn't originate their data export. They have no control over it, and no documentation. There's nothing anyone involved can do about it (sometimes because the vendors are out of business, other times because the vendors feel that giving out their documentation for the export gives up a competitive advantage).
XML has beauty two ways. First off, you'll never have to write a proper parser ever again in your life. Yeah! Second, the documentation should exist in the DTD. The DTD should wander around. Even if the DTD doesn't exist, the file format is still easy to parse, even the portions you have never seen before, because it is a well specified format. Unlike CSV. Unlike most of the 250 odd file formats I have had to deal with in real world situations. On top of all that, why on earth do you want everyone and their brother to make up their own silly file format? Dedicate the resources to make one good one that can cover 90% of the cases. (I know there are cases where XML won't fit).
Next, I agree with you, "XML Databases" and "XML Language Modeling" are silly constructs. Unless you know the data could end up with someone you'll never meet or do business with, it shouldn't be in XML.
Kirby
As an aside, I've seen very, very few cases where multiple inheritance made much sense that weren't of the "virtual base class that defines only methods" type (a rough description of Java's single inheritance + the "Implements" model). Multiple inheritance was fairly vile for a long time, especially before C++ was standardized. Personally, I never ever use it (for the problems I work on it's not terribly natural). Almost always, I can use templated derived classes, or implement some functionality in helper functions up and down the class hierarchy and then pick the functionality I want in the classes you can actually create instances of. That generally eliminates all the uses of multiple inheritance I see people use, and makes the code easier to maintain (I've generally found that if you can eliminate multiple inheritance you have simplified the API and programming model, which is uniformly a good thing in my experience).
Maybe I spent too much time writing Objective C, and just got used to designing code with Protocols and single inheritance in my former life.
Kirby
Then there are the 100 (it's probably closer to 250 formats, of which only 100 are still being used) other formats we accept in. Most of them had to be manually reverse engineered. We are exporting data out of a third party software. The people on the other end of the phone didn't write the software, they don't know how the software works. So we get to deduce all of that from them.
The 10th time you come across a variable-length record file where the sample didn't have all of the record types that exist, you'll be more than a touch peeved about it. Okay, so *K records are 432 bytes, *M records are 345 bytes. Aggg! I've never seen a *I record before. Whoopie, I get to go add one line of code, recompile, and deploy the software to overcome the lack of file documentation. You'll think to yourself: you know, if this was in XML, I could just skip this whole sub-tree. All the data I need, I've already found, so just clue the parser in to read to the end of the data, and I'll start on the next record.
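Here is roughly what "just skip the whole sub-tree" looks like with a stock parser (Python's `xml.etree.ElementTree`). The element names and the sample document are invented for illustration; they only mirror the K/M/I record types above:

```python
import xml.etree.ElementTree as ET

# Hypothetical export: the <I> record type never appeared in the sample files.
SAMPLE = """\
<export>
  <K><id>1</id><amount>10.50</amount></K>
  <I><note>never seen before</note><nested><deep>x</deep></nested></I>
  <M><id>2</id><amount>3.25</amount></M>
</export>
"""

KNOWN_RECORDS = {"K", "M"}


def extract_amounts(xml_text):
    """Return (record type, id, amount) for known records and skip everything else."""
    rows = []
    for record in ET.fromstring(xml_text):
        if record.tag not in KNOWN_RECORDS:
            # Unknown record: the parser already consumed the whole sub-tree,
            # so there is nothing to reverse engineer and nothing to recompile.
            continue
        rows.append(
            (record.tag, record.findtext("id"), float(record.findtext("amount")))
        )
    return rows
```

With a fixed-length binary format, the unknown record would have thrown off every offset after it. Here it costs one `continue`.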
In the end, I'd much rather throw a bit of RAM and some CPU at the problem and teach the software the semantics it needs to read XML DTDs and figure out how to extract data, instead of trying to reverse engineer a binary file format (gotta love people who use file formats for export but forget that endianness exists, so the same file format isn't portable across platforms). Figuring out the parsing of each file type is the part that is time consuming. Once you get the actual state machine to parse the data built, it's cake to actually extract the information I need.
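For anyone who hasn't been bitten by the endianness problem yet, a small sketch using Python's `struct` module. The "record length" field is a made-up example:

```python
import struct

RECORD_LENGTH = 432  # hypothetical length field written into a binary export

# The same 32-bit integer, as written by a little-endian and a big-endian machine.
little_endian_bytes = struct.pack("<I", RECORD_LENGTH)
big_endian_bytes = struct.pack(">I", RECORD_LENGTH)


def read_length(raw, byte_order):
    """Decode a 4-byte unsigned integer using the given byte order ('<' or '>')."""
    return struct.unpack(byte_order + "I", raw)[0]


# Reading with the matching byte order gives the value back.
right = read_length(little_endian_bytes, "<")

# Reading the same bytes with the wrong byte order gives garbage, silently.
wrong = read_length(little_endian_bytes, ">")
```

Nothing in the file tells the reader which order was used unless the format documents it, which is exactly the problem.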
There's a reason everyone documents a file format. Why not just use a general-case one until you see that there is in fact a speed problem? 90% of the time, the speed of parsing the damn data and the storage overhead it incurs are minimal relative to the amount of resources wasted writing efficient parsers for each file type. The documentation and validation of the data is built in to XML (yeah, no more chasing errors in newline-oriented formats where the input system accepts carriage returns in the fields, or where one particular row of a huge CSV file is just corrupted). The file format is well maintained and can express all the data desired (I've seen more than my fair share of file formats that, if you put the delimiter in a data field, won't bother to escape it. I love those).
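The unescaped-delimiter problem is easy to reproduce. A sketch, with invented field values, comparing a naive "join with commas" export against a writer that actually quotes its fields (Python's `csv` module):

```python
import csv
import io

# Hypothetical row: one field contains the delimiter, one contains a newline.
row = ["Smith, John", 'said "hi"', "line one\nline two"]

# Naive export: just glue the fields together with the delimiter.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")  # the comma inside the name splits the field

# A writer that quotes and escapes fields round-trips the same data intact.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
round_trip = next(csv.reader(io.StringIO(buffer.getvalue())))
```

The second version works only because both ends agree on the quoting rules. With a home-grown format, those rules usually live in somebody's head.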
Just like I've learned to love the STL. The STL is better, faster, more memory efficient, and better organized than any code I'll probably ever write in my life. Your arguments sound very similar to the idiots I see espousing how they can outdo the STL, right up until I make them benchmark their code against the most naive STL implementation of it you could imagine.
I'm fairly sure the same is true of the XML parsers. Sure, the files themselves are overly verbose. However, barring just ludicrous constraints (either enormous files, or really limited resources on the device), does it make any difference?
I'd never use XML internally to transport data around (in that case your arguments make some sense). My problem is that there are thousands of people who might want data from me. I'd much rather use XML and tell them to use a stock XML parser than give them the EBNF to read my format and some sample Lex/Yacc on how to process it. I would provide an XML export to any third party who wanted one. It'd be extremely simple, and it'd solve a ton of problems for me if I could just send and receive data in XML. I'd have saved several man-years here at my job if everyone who handed me data did it in an XML format. Sure, I'd have had to spend $5-10K extra on CPU and drive space (given that I've spent on the order of $250K-$500K, it would have been no big deal).
Kirby
Prior to the previous post, you said, "It can't catch zero day exploits", which it can, and has in the past. I've pointed that out, and offered to cite a source on it (it's "Honeypots" by Lance Spitzner; I don't have a page reference right offhand).
Second, you might not find that valuable data, however "valuable" is in the eye of the beholder.
So set up a completely patched honeypot, and watch that one. Christ, they haven't, but that doesn't mean it can't be done or isn't interesting. One of the more interesting things, if you track down the original paper and read it, is that 2 of the cracks didn't come via binary flaws; they were brute force password attacks (which in and of itself is interesting, to me at least). Plenty of people could set up production honeypots. I'll bet Google does. I'll bet Yahoo does. I'll bet American Express does. They have machines that are there to be attacked, and serve no other purpose.
I'll bet they have machines set up in the middle of their internal network that are specially logged via a transparent bridge (I've set one of these up before) that sits and captures all packets that cross the interface (making sure it doesn't munge the MAC address is about the only trick). It's in the DNS server. It's fully operational just like the 10 other machines just like it. It just sits there in the middle of any number of other machines. When traffic crosses that bridge that isn't ARP traffic, bells and whistles go off.
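The "anything that isn't ARP sets off the alarm" rule is simple to express. A minimal sketch that only classifies raw Ethernet II frames handed to it as bytes; how the bridge captures them is left out, VLAN-tagged frames are ignored, and the function names are my own:

```python
import struct

ETHERTYPE_ARP = 0x0806   # EtherType value for ARP
ETHERTYPE_IPV4 = 0x0800  # EtherType value for IPv4
ETHERNET_HEADER_LEN = 14  # 6 bytes dst MAC + 6 bytes src MAC + 2 bytes EtherType


def should_alarm(frame):
    """Return True for any complete frame whose EtherType is not ARP."""
    if len(frame) < ETHERNET_HEADER_LEN:
        return False  # runt frame: nothing to classify in this sketch
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return ethertype != ETHERTYPE_ARP


def build_frame(ethertype, payload=b""):
    """Hypothetical helper that fabricates a frame for trying out the rule."""
    dst_mac = b"\xff" * 6
    src_mac = b"\x02\x00\x00\x00\x00\x01"
    return dst_mac + src_mac + struct.pack("!H", ethertype) + payload
```

Since nobody has any legitimate business talking to the honeypot, there is no whitelist to maintain; the rule really can be this dumb.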
The reason they use unpatched machines is to keep the deterrence factor low, so people will easily be successful in the attack. My guess is that Amazon, AmEx, Yahoo, Google, and any number of others have machines they want to get attacked with the full security setup. Specifically, so they have machines that are safe to pull offline once they realize a hack is being attempted. I wouldn't be shocked to see that they have a network of such machines that communicate with each other. That way the entire system looks busy enough to be a live system, and doesn't give the game away too quickly to the hacker. So data is flowing through the system, just not data you really care about. It wouldn't be too incredibly hard to just replay data from yesterday through the system. In a well-designed message passing system, that's all you have to do. Treat it just like every other machine, and make sure it has load passing through it. Log all the packets via transparent bridges that have no TCP/IP configured. Just plain-jane Ethernet 802.3 repeaters (use a Linux box, it's trivial). Put in scads of hard drive space that writes really fast. Spool it to tape with a big tape drive. That's a production honeypot on a production system, indistinguishable to a blackhat from the production system until after he has broken into a larger number of machines. Honeypots are left easy to break into specifically so the attack will succeed on them first. So have an easy set, and a hard set. Geez.
This data point is interesting to me, as it clues me in that I can't just update a Windows machine over the Internet from a fresh install. I'll have to have the security patches on hand, or I'm screwed. However, with a Linux box, assuming I shut down enough services, it appears I can feel relatively safe updating it via the network even from a scratch install (generally I never ever do an install off known media, but it's a warm fuzzy to know that I have less and less to worry about a hack being available and me not having the update immediately).
Kirby
Last time I checked, SS is somewhere between 12% and 15% of my paycheck (Yes, I only pay 6% or 7% of it, but I could negotiate a better salary from my employer if they didn't have to pay the other half. I'm smart enough to know that the other half they are taking from me is mine, but they are smart enough not to put it on the paystubs of every last worker in the U.S.).
I'm fiscally responsible, and given another 12%-15% of my paycheck back, I could retire 20 years earlier than the SSA will pay me a dime, and as a bonus, because people in my family generally don't live long enough, I could even enjoy some of the benefits of the system I'll end up paying into right up until the day I die! Maybe I'd even get to pass that money on to others in my family to help them form the nest egg they could use to retire on. I can't ever use a lot of the retirement funding they set up tax shelters for. Unless I'm special in my family, I won't ever get the 401K or SS benefits. I still put money into the 401K as I can use it in other ways.
This is a simple demographics problem. The simple problem is that the boomers didn't pay in enough while they were younger (and also that a lot of people who never paid into the system got benefits when it was originally started). Some of the rest of the problem is that people are living much longer, and are consuming more resources than had been anticipated.
It was an ill-conceived system. If you think that individuals can't manage to plan for their future, take a good long hard look at what the Gov't is doing to try and plan for an entire country's future. I'd much rather leave it to individuals, and let them participate in other private sector mechanisms (like, say, insurance).
Kirby
I'm not saying they have discovered them all, but they do discover them. Any number of honeypots are intentionally put into the middle of existing production networks by people who do have valuable data, specifically so a blackhat will attack them with all his best tools and the owners can be aware. What are dead giveaways to blackhats would be avoided in those situations.
The Honeynet Project does do some things that make it fairly obvious that you are being captured. However, don't fall into the trap of believing that blackhats are all-knowing, omniscient gods of computing. Some of them are very, very good at what they do. Any number of master criminals get caught because they've been lured into doing something silly, both in the real world and in the computer crime world.
Kirby
You really should read up on the honeynet project sometime before saying silly things like this.
For starters, they have in fact found previously unknown exploits (at least one, but possibly several). I forget the exact details offhand, but it is covered in "Honeypots" (a pretty decent book). They cover it in the section about different types of honeypots and what they are good for. They discovered a previously unknown hole in a network service on Linux machines several years ago, when the project first started. I can cite it tomorrow if you really don't believe me (the book is at home, I'm not). A lot of blackhats give out zero days as a way of gaining credibility. While it wasn't a zero day, a honeypot was one of the first things used to figure out how one of the major worms worked (Code Red I think, but it might have been one of the others).
Also, black hats need a platform to mount their attack from that they can easily own without worry. So they attack home networks, knowing that they can completely own a box and wipe the logs. Meanwhile, they can mount attacks from those machines onto others that are important. They need the intermediate machines to be anonymous. They might want to attack "American Express", or "Amazon.com". Anyone with any brains doesn't attack those from the IPs known to be in their basement. They find other machines that will have no logging, or logging that can be completely compromised, to use as a base of attack. Then the trail to find them dies at these random machines on the Internet.
Besides that, anyone wanting to implement an "Andy Warhol Worm" needs to find a set of machines that have an exploit available. In order to find those, one has to start attacking random machines on the Internet. The honeypot project could accomplish that (I don't know that they have, but it would be a very good use of it).
Finally, I don't have any important machines, so information about random machines on the internet fits me to a "T". I am more interested in what the script kiddies are doing, and what sorts of attacks they are making. The honeynet project does provide details about what JRandom guy with an IP on the internet can expect to be hit with.
Kirby
You can only draw those conclusions about water because someone has done all the scientific measurements before you.
We didn't figure out gravity all at once. Some guy started dropping balls and measuring time. Some guys started measuring the time it took to roll down planks. Eventually they made lots of measurements that were a "big boiling pot of useless variables", and figured out that air resistance makes a difference. That if you measure incredibly accurately, the latitude and longitude (more specifically, your distance from the center of the earth) matter. Even more accurately, that the time of year matters (our distance from the sun changes). They sorted out the patterns in the data. What they are doing is called "basic science". It isn't sexy, and it isn't useful right away. However, to start something that is a "science", you have to start by making measurements and then explaining them. Explain to me, roughly speaking, how one makes "scientific" measurements on the Internet where you have control groups? How precisely does one set up a second worldwide Internet that is identical in all ways except that one has an extra Linux machine on it? Maybe if they continue to make such measurements, they might figure out what the variables are.
That's precisely what they are doing. I'd have to read the actual statement they made to see how well they are lying with statistics. My guess is the statement they made was accurate and accurately captured what it was they measured.
Also, I'm going to guess they used the same RedHat distributions (or at least had all of the old ones, and some new ones), and they used all the same old IPs (or at least used all the old ranges, and some new ranges). So I'd further venture to guess that your "boiling water" analogy is incorrect. I've read about these guys quite often. They are fairly "scientific" about what they do, and how they do it. The biggest problem they have is manpower to set up and analyze the machines and attacks. Which is really a function of their other big problem, a serious lack of financial resources. What they are doing, done on a large scale, would result in really useful measurements. Sure, what they are doing is on the level of "Grade School Science Projects" in terms of the scale and quality of science. However, that doesn't make it any less "scientific".
As to this:
Useful science, is called "Engineering". Useless science is all over the place. Science is about forming a hypothesis, setting up a way of measuring your hypothesis, then analyzing the data after the fact. This sure seems to fit the bill. Useless Science, is how all science started. Next you'll tell me Linux isn't at all like Unix, because it started out life as a useless terminal program.
Kirby
They have an experiment they run, and they measure the outcomes. The measurements over time have changed. They compared the measurements.
That's pretty much the textbook definition of "scientific" and "statistics".
No, this "study" might be an anecdote (I'm unaware of how many machines they have). However, it is a "fact" that N months ago an unpatched Linux system put on the Internet lasted X minutes on average. A more recent measurement shows that it now lasts M * X minutes before being compromised. I'm fairly sure these people have several measurements at several points in time (I've read similar measurements from the same people a number of times).
That's a controlled experiment (technically speaking, the old measurement is the "baseline"). It's an interesting fact. It doesn't mean "Linux is getting more secure". It means that on average it appears that a Linux machine without security patches lasts longer before being compromised. That could be because of the cost of beef in Tokyo. It could be because Linux is more secure. It could be because Linux is a low priority target for blackhats. It could be because the IP ranges used this time are known to the blackhats as honeypot addresses (which is one of the few causes that would make this "fact" useless to me).
It's not a measurement of causation. It's not a measurement of security. It's a scientific measurement of a length of time. Just like measuring the length of daylight outside. You can measure that scientifically. It won't explain seasonality. It won't explain the tilt of the earth. It won't explain the nature of quantum mechanics. However, it will be an accurate measurement of what it is: "How long the sun was up". Sure, it's not the world's most important fact that Linux machines are lasting longer before being successfully attacked, but it is novel for those of us who have Linux machines on the Internet. However, its not being the end-all, be-all theory of Linux security doesn't mean it isn't a well-defined measure.
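To be concrete about what is and isn't being measured, here is the whole computation as a sketch. The numbers are placeholders I made up purely so the code runs; they are not the Honeynet Project's data, and the function name is my own:

```python
from statistics import mean

# PLACEHOLDER values, not real measurements: minutes each unpatched machine
# survived before being compromised, in the old run and in the recent run.
baseline_minutes = [10, 20, 30]
recent_minutes = [40, 60, 80]


def survival_ratio(baseline, recent):
    """Return (X, M): the baseline mean lifetime and the recent/baseline ratio."""
    x = mean(baseline)
    m = mean(recent) / x
    return x, m


x, m = survival_ratio(baseline_minutes, recent_minutes)
```

That is all the "fact" is: X and M. Nothing in the computation says why M changed, which is the point.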
Kirby
If I was a perl-head, and that was my reference material (which it would be), I'd have known the answer in an interview. Especially if I had just started learning perl. (I know some perl, but not enough to actually read up on it. C and C++ interest me enough that I consider reading technical documentation about them "fun").
Heck, that was my guess as to what it was (I knew it had to be one of a handful of concepts: pattern matching, a section describing the difference between scalar and associative context, or the first section that mentioned CPAN, in that order of likelihood). Pattern matching is what perl was invented for, if I remember my history correctly. That page wasn't chosen at random by the questioner; it was chosen for a specific reason. They didn't pick page 12, they didn't pick page 275. They picked the page that had documentation about the essence of what Perl is, and a page that would likely be referred to early and often.
When we interview people, we have questions of that nature (not insane questions, but questions that we don't expect anyone to answer correctly, but if you do we feel that you can easily be assimilated).
Kirby