owenomalley · Slashdot Mirror

Other approaches to scalable SQL on Researchers Create Database-Hadoop Hybrid · 2009-07-21 09:32 · Score: 1

There are also two Hadoop subprojects that either support SQL or will shortly. They both translate SQL queries into map/reduce programs. They are:

http://hadoop.apache.org/pig/
http://hadoop.apache.org/hive/
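
As a rough illustration of what "translating a query into map/reduce" means, here is how something like SELECT url, COUNT(*) FROM visits GROUP BY url could be expressed as a map step and a reduce step. This is only a sketch and not how Pig or Hive actually generate their jobs; the visits table, its columns, and the run_job helper are made up for the example.

```python
from collections import defaultdict

# Hypothetical input rows for a table "visits(url, user)".
ROWS = [
    {"url": "/a", "user": "u1"},
    {"url": "/b", "user": "u2"},
    {"url": "/a", "user": "u3"},
]


def map_phase(row):
    """Emit (group key, 1) for each input row."""
    yield row["url"], 1


def reduce_phase(key, values):
    """Sum the counts collected for one group key."""
    return key, sum(values)


def run_job(rows):
    """Run map, shuffle, and reduce in a single process."""
    # Shuffle: group the mapper output by key, as the framework would.
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_phase(row):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job(ROWS))
```

The GROUP BY column becomes the key emitted by the mapper, and the aggregate becomes the reducer. In a real cluster the shuffle is done by the framework across machines rather than by a dictionary.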

Re:C++ port of Java Hadoop? on Open Source Solution Breaks World Sorting Records · 2009-05-17 10:59 · Score: 1

There isn't a C++ port of Hadoop's map/reduce, but there is a C++ interface to the Java code. It is used by Yahoo's WebMap, which is the largest Hadoop application. It lets you write your mapper and reducer code as C++ classes.

The Hadoop Distributed File System (HDFS) also has C bindings to let C programs access the system. If you want another alternative, the Kosmos File System (KFS) is also a distributed file system and was written in C++. Hadoop includes bindings for HDFS and KFS, so that the application code can transparently use either at run time depending on the path (hdfs://server/path instead of kfs://server/path).
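
The idea of picking the file system from the path can be sketched in a few lines. This is an illustration of scheme-based dispatch in general, not Hadoop's actual API; HdfsClient, KfsClient, REGISTRY, and open_filesystem are hypothetical names, and the clients do no real I/O.

```python
from urllib.parse import urlparse


class HdfsClient:
    """Stand-in for an HDFS binding (hypothetical, does no I/O)."""

    def __init__(self, host):
        self.host = host


class KfsClient:
    """Stand-in for a KFS binding (hypothetical, does no I/O)."""

    def __init__(self, host):
        self.host = host


# Map each URI scheme to the client class that handles it.
REGISTRY = {"hdfs": HdfsClient, "kfs": KfsClient}


def open_filesystem(uri):
    """Return (client, path) for the file system named by the URI scheme."""
    parts = urlparse(uri)
    try:
        client_cls = REGISTRY[parts.scheme]
    except KeyError:
        raise ValueError(f"no file system registered for {parts.scheme!r}")
    return client_cls(parts.netloc), parts.path


if __name__ == "__main__":
    for uri in ("hdfs://server/path", "kfs://server/path"):
        client, path = open_filesystem(uri)
        print(type(client).__name__, client.host, path)
```

The application only ever calls open_filesystem, so switching between the two systems is a matter of changing the path it is given.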

Re:Not quite as impressive as it sounds on Open Source Solution Breaks World Sorting Records · 2009-05-16 05:06 · Score: 4, Interesting

In sorting a terabyte, Hadoop beat Google's time (62 versus 68 seconds). For the petabyte sort, Google was faster (6 hours versus 16 hours). The hardware is of course different. (from Yahoo's blog and Google's blog)

Terabyte:
Machines: Yahoo 1,407 Google 1,000
Disks: Yahoo 5,628 Google 12,000
Petabyte:
Machines: Yahoo 3,658 Google 4,000
Disks: Yahoo 14,632 Google 48,000

Yahoo published their network specifications, but Google did not. Clearly the network speed is very relevant.
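
Since the hardware differs, one rough way to compare the runs is to normalize by machine and by disk. The sketch below uses only the times and counts quoted above and assumes decimal units (1 TB = 10^12 bytes) and the rounded 6 and 16 hour figures; it measures input bytes sorted per second and ignores the network entirely.

```python
TB = 10**12  # bytes, decimal terabyte (assumption)
PB = 10**15  # bytes, decimal petabyte (assumption)

# (bytes sorted, seconds, machines, disks) as quoted in the comment above.
RUNS = {
    "Yahoo terabyte": (TB, 62, 1407, 5628),
    "Google terabyte": (TB, 68, 1000, 12000),
    "Yahoo petabyte": (PB, 16 * 3600, 3658, 14632),
    "Google petabyte": (PB, 6 * 3600, 4000, 48000),
}


def per_unit_rates(nbytes, seconds, machines, disks):
    """Return (MB/s per machine, MB/s per disk) for one run."""
    total_mb_per_s = nbytes / seconds / 1e6
    return total_mb_per_s / machines, total_mb_per_s / disks


if __name__ == "__main__":
    for name, run in RUNS.items():
        per_machine, per_disk = per_unit_rates(*run)
        print(f"{name}: {per_machine:.1f} MB/s per machine, "
              f"{per_disk:.2f} MB/s per disk")
```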

The two takeaway points are that Hadoop is getting faster and that it is closing in on Google's performance and scalability.

Re:Perhaps a good addition to data warehousing on MapReduce Goes Commercial, Integrated With SQL · 2008-08-26 09:30 · Score: 5, Informative

The correct project name is Hadoop. It was factored out of Nutch 2.5 years ago, and Yahoo has been putting a lot of effort into making it scale up. We run 15,000 nodes with Hadoop in clusters of up to 2,000 nodes each, and soon that will be 3,000 nodes. I used 900 nodes to win Jim Gray's terabyte sort benchmark by sorting 1 TB of data (10 billion 100-byte records) in 3.5 minutes. It is also used to generate Yahoo's Web Map, which has 1 trillion edges in it.

Hadoop Distributed File System on Making Use of Terabytes of Unused Storage · 2008-02-09 17:06 · Score: 2, Interesting

You could put a Hadoop Distributed File System (HDFS) on them. HDFS allows you to use the storage as a single file system that is stable and reliable. We have multiple 2000 node clusters with petabytes of user data on them. Because the blocks are each replicated to 3 hosts, if a node goes down, your data on that node is not lost.

Re:there is no technological fix on Fighting Online Game Cheating in Hardware · 2007-07-01 06:18 · Score: 1

Actually, there is a technological fix, but it certainly has nothing to do with hardware. (The hardware approach is a really bad idea, by the way. It requires way too much dependence on the hardware and software platform, including blocking emulators.) The right fix is to support social networks and karma. It should be fine to create games that are limited to your transitive friends. Or create a game limited to people with good karma. Since it works for the most part with slashdot and ebay, I bet it would work for online games too. *smile*

they did have managed code, but pulled it out on Analysis of .NET Use in Longhorn and Vista · 2006-03-15 18:09 · Score: 5, Informative

We had someone out to interview last month who is currently at Microsoft working on Windows. He said that the major reason that Vista is so late is that they had to roll back all of the development to remove all of the managed code because performance had gone to hell. Everything that had been done in managed code had to be reimplemented from scratch. Ouch.

Common email names on Where Do Dummy Email Addresses Go? · 2004-07-11 06:21 · Score: 1

When we were moved from @home, which was extremely crowded, to @attbi, which was lightly loaded, I picked up a bunch of email names for me and my kids:

owen@attbi.com
laurel@attbi.com
hazen@attbi.com

That turned out to be a big mistake. It turns out that you have exactly this problem of other people putting your email address into the various forms. I still get noticeable amounts of spam out of these addresses... *sigh* So when we went to @comcast, I left the first names behind. I'm just looking forward to the end of 2004 when they stop forwarding the attbi email addresses.

Owen

Re:Stanford Checker on MySQL & Open Source Code Quality · 2003-12-23 06:25 · Score: 2, Interesting

> I just think it'd be horrible if they used the
> GPL'ed GCC to develop their methods (having access
> to a full portable compiler onto which to do
> research and development is hardly a "small
> thing"), and then lock these same methods away
> from the community.

Yeah, that is the way it is going to go. Dawson and his students and employees use gcc for a parser and have no intention of releasing their tool under any open source license. They claim that they modified gcc to write out Abstract Syntax Trees (ASTs) that are then read in to their tool (the Coverity/Stanford checker), which Coverity is selling commercially. Richard Stallman has long fought to keep gcc from publishing useful ASTs to prevent things like this from happening, but it is obviously impossible to stop in the long run and he should just concede the point.

We should pressure Dawson and Coverity to at least release the modified gcc parser that will dump the AST. ASTs enable all kinds of program analysis tools, such as doxygen and static analysis tools. Furthermore, we should pressure FSF to roll the changes back into the GCC mainline.

oldie, but goodies on Multiplayer Linux Games · 2003-12-15 17:18 · Score: 2, Informative

I really like xbattle and xpilot. Both of them are really old and therefore will run on very minimal equipment. They both provide a lot of hours of fun however. Over the summer we had a couple of interns working for us and they both had fun with xbattle. *smile*

Have one, done that *smile* on Ph.Ds in IT - Good or Bad for a Career? · 2003-08-19 03:44 · Score: 1

Since I have a PhD and I am working as a software engineer, I've been there and certainly done that. I was warned that getting a PhD would make me "over qualified" for most jobs. In my opinion all that happened is that I got "over qualified" for jobs that I don't want anyways and it hasn't been a problem at all.

That being said, you have to _really_ want the PhD or it won't happen. Getting a masters is very like getting a BS, but getting a PhD requires a very different level of determination. Don't start unless you want it. Certainly don't use financial arguments to justify a PhD.

The other thing is that if you intend to go back into industry, don't neglect the skills that are valued over there. Make sure to work in industry as well as a teaching or research assistant. And certainly keep programming on Open Source stuff! *laugh*

-- Owen

Getting tools open sourced from NASA on NASA Report Advocates Switch to Open Source · 2003-05-16 06:32 · Score: 5, Insightful

I work at NASA/Ames as a senior software engineer in the Automated Software Engineering group and I reviewed Patrick's report a month ago. Patrick's report is the result of his efforts to convince management that it would be a good thing to release the scientific computing software that he had written to the public.

I am in a research lab working on software engineering tools and most of us would love to release the tools that we develop as Open Source. Unfortunately, we need to get the administration's support. (We've been trying for over a year on a software model checker named Java Path Finder and haven't had any luck yet.) We have other stuff like a C++ AST language model (in XML/Java) that we are currently developing that would also be nice to release.

I can understand the administration's desire to keep the software ownership for itself, but the greater good would be for us to release the tools under GPL, especially since the opportunities for commercialization are much more limited than they were a few years ago. Releasing the tools as Open Source would make them available to many more people and dramatically increase the impact of the work. A further complication mentioned in the report is that we have a lot of contractors (~40%?) and the IP ownership is determined by the particular contract. *sigh*

We also use a lot of Open Source code, including linux, x11, xemacs, ssh, gcc, cvs, etc. and it would be nice to give something back to the community.

winning strategy on Analysis of Netflix's DVD Allocation System · 2003-04-23 04:23 · Score: 1

Since they use a fixed time window for their client history, it is easy to cheat this allocation scheme. Get 2 accounts and alternate which account is used each billing cycle. The "active" account will always have rented 0 movies the previous billing cycle and you'll be able to get your movies fast. The following month, you'll switch back. They really need to move to a finer-grained model that uses exponential decay over a longer time period if they want to prevent "cheating". Note that this form of cheating is exactly what the author of the article did to prove his hypothesis about their allocation scheme.
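
To make the difference concrete, here is a small sketch comparing a fixed-window count with an exponentially decayed one. The 30-day window, the 90-day half-life, and the function names are all assumptions picked for illustration; nothing here reflects how Netflix actually scores accounts.

```python
def window_usage(rental_ages_days, window_days=30):
    """Fixed-window score: count rentals newer than the window."""
    return sum(1 for age in rental_ages_days if age < window_days)


def decayed_usage(rental_ages_days, half_life_days=90.0):
    """Decayed score: each rental counts 0.5 ** (age / half_life)."""
    return sum(0.5 ** (age / half_life_days) for age in rental_ages_days)


# Hypothetical "resting" account: 8 rentals in the cycle before last,
# nothing in the most recent 30 days.
ALTERNATING = [35, 38, 41, 44, 47, 50, 53, 56]

if __name__ == "__main__":
    print("fixed window:", window_usage(ALTERNATING))
    print("exponential decay:", round(decayed_usage(ALTERNATING), 2))
```

With the fixed window the resting account looks like a brand new customer, while the decayed score still remembers most of the earlier rentals, so alternating accounts buys very little.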

Re:Not sure if I believe this on Open Code Has Fewer Bugs · 2003-02-20 04:39 · Score: 1

*smile* I believe it. I used to work there and know the guy that wrote the original white paper. He can't name the other companies for the obvious reasons. But they do get LOTS of code from different companies. There haven't been any problems with code falling into the wrong hands. Trust issues are one of the important things for Reasoning to deal with. Reasoning has been offering code inspection services for the last 3 years and has built up quite a database of error rates across a wide variety of companies. They can't tell you specifics, but they say that your code has X errors/kloc and other people writing this kind of code have an error rate of Y.

Re:NASA...cutting edge?? on Linux In Space: Red Hat Rides The Rocket · 2003-01-31 05:02 · Score: 1

They do use very slow computers in flight. They need to. The error rates, which are caused by solar radiation, are high enough with the slow stuff and get much worse as the feature size decreases on the chips. There are some spots over the earth (south Atlantic, I believe), where they need to reboot all of the PC's after they clear the area, because the earth's radiation shield isn't as effective there.

Most satellites, especially those in higher orbits than the shuttle's, use very slow/old processors. There are satellites being launched right now by both government and non-government organizations that have chips like the mil-spec 1750A where they measure clocks in single and double digit megahertz and memory in kilobytes. There are some very smart people writing that code...

I've also actually held the 30 pound laptops that they use on the shuttle. (I work for NASA. *grin*) On Earth, they are pretty unwieldy, but of course in 0 gravity having a lot of mass is a good thing to increase stability.

Saw his talk at FSE on Using Redundancies to Find Errors · 2003-01-22 16:53 · Score: 5, Interesting

I saw Dawson's talk at FSE (Foundations of Software Engineering). He uses static flow analysis to find problems in the code (like an advanced form of pclint). The most interesting part of his tool is in the ranking of the problem reports. He has developed a couple of heuristics that sort the problems by order of importance and they supposedly do a very good job. Static analysis tools find most of their problems in rarely run code, such as error handlers. Such bugs are troublesome and sometimes lead to non-deterministic failures, which are extremely hard to find with standard testing and debugging. (This is especially true when the program under consideration is a kernel.) Dawson also verifies configurations of the kernel that no one would compile, because he tries to include as many drivers at the same time as he can. The more code, the better the consistency checks do at finding problems.

By making assumptions about the program and checking the consistency of the program, his tool finds lots of problems. For instance, assume there is a function named foo that takes a pointer argument. His tool will notice how many of the callers of foo treat the parameter as freed versus how many treat the parameter as unfreed. The bigger the ratio, the more likely the 'bad' callers are to represent a bug. It doesn't really matter which view is correct. If the programmer is treating the parameter inconsistently, it is very likely a bug.
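
The foo example can be sketched as a toy ranking. This is only my illustration of the idea, not Dawson's tool or its actual heuristics; the observation tuples and the rank_inconsistencies function are hypothetical, and a real checker would derive the beliefs from flow analysis rather than take them as input.

```python
from collections import Counter


def rank_inconsistencies(observations):
    """Rank functions whose callers disagree about an argument.

    observations: iterable of (function, caller, belief) tuples, where
    belief is a label such as "freed" or "unfreed".
    Returns (function, minority_callers, ratio) tuples sorted by
    descending ratio, where ratio = majority count / minority count.
    """
    by_function = {}
    for function, caller, belief in observations:
        by_function.setdefault(function, []).append((caller, belief))

    reports = []
    for function, uses in by_function.items():
        counts = Counter(belief for _, belief in uses)
        if len(counts) < 2:
            continue  # all callers agree, nothing to report
        (_, n_major), (minority, n_minor) = counts.most_common(2)
        suspects = [caller for caller, belief in uses if belief == minority]
        reports.append((function, suspects, n_major / n_minor))
    return sorted(reports, key=lambda report: report[2], reverse=True)


# Hypothetical observations: three callers of foo treat the argument as
# unfreed, one treats it as freed.
OBSERVATIONS = [
    ("foo", "caller_a", "unfreed"),
    ("foo", "caller_b", "unfreed"),
    ("foo", "caller_c", "unfreed"),
    ("foo", "caller_d", "freed"),
    ("bar", "caller_x", "freed"),
    ("bar", "caller_y", "unfreed"),
]

if __name__ == "__main__":
    for function, suspects, ratio in rank_inconsistencies(OBSERVATIONS):
        print(function, suspects, ratio)
```

Note that the code never needs to know which belief is correct. It only flags the minority, and a 1:1 split like bar ends up at the bottom of the list because it gives no evidence either way.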

He also mentioned that, counter to his expectations, the most useful part of his tool was finding 'local' bugs. By local, I mean bugs that are local to a single procedure. They are easier for the tool to find, more likely to actually be bugs, and much easier for the programmer to verify if they are in fact bugs.

He analyzed a couple of the 2.2.x and 2.4.x versions of the kernel and found hundreds of bugs. Some of them were fixed promptly. Others were fixed slowly. Some were fixed by removing the code (almost always a device driver) from the kernel. For others, he couldn't find anyone who cared about the bug enough to fix it. He was surprised at the amount of abandonware in the Linux kernel.

It is extremely frustrating that Dawson won't release his tool to other researchers (or even better to the open source community at large). Without letting other people run his tool (or even better modify it), his research ultimately does little good other than finding bugs in linux device drivers. *heavy sigh* Oh well, eventually someone WILL reimplement this stuff and release it to the world.

On a snide note, if he were a company he would no doubt have been bought by Microsoft already. Intrinsa was doing some interesting stuff with static analysis and now, after they were bought a couple of years ago, their tool is only available inside of Microsoft. *sigh*

Re:There are only 9 unix problems. on SANS/FBI Release Top 20 Security Vulnerabilities · 2002-10-03 17:23 · Score: 1

And yet, it still happens. Sun still ships Solaris with the r-services client and servers running by default. telnet too. They are starting to move to ssh with Solaris 9, but they will still have the telnetd and r-service servers running by default.

Even on Sun's internal network, there are a whole lot of plain text passwords flying around.

Re:Try going to the record store on UC Irvine Cracks Down on P2P · 2002-09-29 15:52 · Score: 1

*smile* University towns in general may have a good selection of used record stores. Irvine does not. (I went to UCI for 9 years, getting my PhD.) The mall across the street from UCI seems to have a rent pricing policy that excludes every store that actual students would want to visit. (This isn't guess work, I talked to a few of the store owners.) (*grin* The one notable exception was that in '94 they put in an In-N-Out. That did well.) Irvine is an interesting city, but it is far from typical.

one way functions on Crypto with Epoxy Tokens, Glass Balls and Lasers · 2002-09-20 08:22 · Score: 2, Interesting

The article seems to be missing the point of one way functions. If you don't change the inputs to a one-way function, it is exactly the same as a constant (i.e. no good for verification of anything).

An easy application is for keys. If the lock has N input/output pairs recorded, getting in with a fixed example output would be hard.

A more advanced use of these things would be to have some standard way of encoding a bill of sale including a datestamp into bits that could drive the laser inputs. Then save the resulting pattern(s) as proof that the vob was there at the time of the transaction.

However, that leaves a major hole. If the user destroys the vob, the store can no longer check if the signature was valid. To combat this, the user needs to be identified at the time of the transaction. This works as long as the vobs are registered in a central identity server, so that the store can make sure the person is who they claim to be at that point. Additionally, users have to record lost or destroyed vobs. The central identity server could use the N known input/output pairs to authenticate the user.
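
The lock with N recorded input/output pairs can be sketched in software. This is only an analogy: the Token class uses a keyed hash (HMAC-SHA256) as a stand-in for the physical token, which is an assumption on my part, and the Token and Lock names are made up for the example.

```python
import hashlib
import hmac
import secrets


class Token:
    """Software stand-in for the physical token (hypothetical)."""

    def __init__(self):
        # Plays the role of the token's unclonable internal structure.
        self._structure = secrets.token_bytes(32)

    def respond(self, challenge: bytes) -> bytes:
        """Return the output the token produces for this input."""
        return hmac.new(self._structure, challenge, hashlib.sha256).digest()


class Lock:
    """Stores N challenge/response pairs recorded at enrollment."""

    def __init__(self, token, n_pairs=16):
        self._pairs = []
        self._current = None
        for _ in range(n_pairs):
            challenge = secrets.token_bytes(16)
            self._pairs.append((challenge, token.respond(challenge)))

    def next_challenge(self) -> bytes:
        """Hand out an unused challenge; each pair is used only once."""
        if not self._pairs:
            raise RuntimeError("no unused pairs left; re-enroll the token")
        self._current = self._pairs.pop()
        return self._current[0]

    def check(self, response: bytes) -> bool:
        """Compare the response against the one recorded at enrollment."""
        if self._current is None:
            return False
        return hmac.compare_digest(response, self._current[1])


if __name__ == "__main__":
    token = Token()
    lock = Lock(token, n_pairs=2)
    first = lock.next_challenge()
    answer = token.respond(first)
    print("genuine token accepted:", lock.check(answer))
    lock.next_challenge()
    print("replayed answer accepted:", lock.check(answer))
```

Because the lock picks a different input each time, a recording of one output is useless for the next attempt, which is the point about not feeding a one-way function a constant input. The cost is that the lock runs out of pairs after N uses and the token has to be enrolled again.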

Re:I'll be using BEEP for ... on Will BEEP Simplify Network Programming? · 2002-07-15 17:14 · Score: 1

I think Luke has an extra hand somewhere...

Current examples on Would an Ad-Sponsored OS/Desktop Work for OSS? · 2002-07-10 03:22 · Score: 1

One current example of this kind of thing is Netscape. You can tell a lot about people's reaction to corporate control and ads by whether they are using netscape or mozilla. Netscape has the familiar name, but mozilla feels less commercial and less tied to a particular organization.

Another indicator is how many people run webwasher or the equivalent. It isn't everyone, but it is a lot. People will go out of their way to avoid ads.

Re:Variable Names on What is Well-Commented Code? · 2002-05-20 10:29 · Score: 1

I used to work for a company that analyzed other companies' source code for problems. I was looking through the code and found some really insulting and offensive comments in the CVS repository. We aren't talking "Karl messed this up", but "Karl is a s*-headed m*-f*...". We also saw some comments of the form "this still doesn't work, but it is 4am so i'm going home."

Ah, the joys of looking at other people's checkin comments.

I strongly suggest that if you are tempted to put in such comments that you reconsider. CVS repositories live for a LONG time.

Re:From his faq on Tom Lord's Decentralized Revision Control System · 2002-02-05 17:43 · Score: 2, Informative

Actually, I'm team lead on a CM system where all of the metadata is in Sybase. We use Sybase replication to keep multiple servers at different sites in sync with each other. (Sybase has a nice replication model that will store changes in a stable queue until the remote server is available again.) Anyways, using a real database means that our tool scales to insane levels (we see peaks on one project of 20,000 file versions/day). We also get the ability to do live backups, etc. It is also very nice being able to write ad hoc queries against the database in sql. (i.e. in the last month, show me how many file versions were generated at each site on each day.)

While we keep all of the metadata in Sybase, we store the actual bits in the filesystem.
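
For a feel of what such an ad hoc query looks like, here is a sketch of the "file versions per site per day over the last month" example. Everything about it is an assumption for illustration: the file_versions table and its columns are invented, and it runs against an in-memory SQLite database rather than Sybase, so the date arithmetic uses SQLite's syntax.

```python
import sqlite3

# Hypothetical schema: one row per file version, with the site that
# created it and the creation date as an ISO string.
SCHEMA = "CREATE TABLE file_versions (file_path TEXT, site TEXT, created_on TEXT)"

SAMPLE_ROWS = [
    ("src/a.c", "sf", "2002-02-01"),
    ("src/b.c", "sf", "2002-02-01"),
    ("src/a.c", "nyc", "2002-02-03"),
    ("src/old.c", "sf", "2001-11-20"),  # older than a month, excluded
]

QUERY = """
SELECT site, created_on, COUNT(*) AS versions
FROM file_versions
WHERE created_on >= date(:today, '-1 month')
GROUP BY site, created_on
ORDER BY site, created_on
"""


def versions_per_site_per_day(conn, today):
    """Return (site, day, count) rows for the month ending at `today`."""
    return conn.execute(QUERY, {"today": today}).fetchall()


def make_sample_db():
    """Build an in-memory database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO file_versions VALUES (?, ?, ?)", SAMPLE_ROWS)
    return conn


if __name__ == "__main__":
    for row in versions_per_site_per_day(make_sample_db(), "2002-02-05"):
        print(row)
```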

Re:My experience with Common Lisp on Common Lisp: Inside Sabre · 2002-01-16 06:01 · Score: 1

I'm a big fan of Franz lisp; I used it extensively in my work a couple of years ago. The debugger is very nice (recompiling a function in emacs with a keystroke into the currently running executable!!). The commercial compilers also are much better optimized than the free ones. My big problem with Franz is how expensive their compilers are. $60k/user is just a little insane...

The real problem is PG&E on Power Shortages And Tech Industry · 2000-12-08 01:00 · Score: 3

This actually isn't a problem with the high tech companies in silicon valley, although the ever increasing cpu ranches at companies like www.exodus.com don't help. The problem is that PG&E has shut down 17 power plants in california because they have reached their air pollution limits for the year. This is a completely artificial "shortage". I almost laugh when they tell the customers to not turn on their christmas lights until 7:30pm. My house has 2 strands of little lights. That works out to 2 normal 40 watt light bulbs. *sigh*


User: owenomalley

Comments · 26