The Google Search Server
An anonymous reader submitted a reasonably indepth review of the Google search appliance. The guys from anandtech put it through it's paces, and included a variety of pictures and comments on one of those Google products most of us will probably never play with.
Their solution was to create a list of urls for the appliance to crawl. If they had to do that for the search appliance, there is no way that googlebot, msnbot, or yahoo slurp is going to be able to properly index their site.
Your publicly accessible URLs need to be managed and canonicalized through judicious use of robots.txt, 302 redirects, site-wide linking, and just plain thinking out the layout of your site.
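A minimal sketch of the robots.txt part of that advice, using Python's standard urllib.robotparser to check what a well-behaved crawler would be allowed to fetch. The robots.txt rules, the www.example.com host, and the paths are made up for illustration; they are not AnandTech's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of admin tools and search result pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# An ordinary article URL is crawlable; the admin and search URLs are not.
article_ok = parser.can_fetch("googlebot", "http://www.example.com/articles/1")
admin_ok = parser.can_fetch("googlebot", "http://www.example.com/admin/delete?id=3")
search_ok = parser.can_fetch("googlebot", "http://www.example.com/search?q=hardware")
print(article_ok, admin_ok, search_ok)
```

This only tells you what a polite crawler should skip. It does not protect anything from a client that ignores robots.txt.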
I gotta say, I was looking for benchmarks, usability scores, maybe some test scenarios. Even better, compare this to other products available out there.
It looked promising at the start, but when you get to the last page it leaves you wondering if they forgot the hyperlinks for the rest of the article!!
So Google subcontracted a company called GigaByte to make this box.
I was disappointed to see GigaByte didn't use MegaByte to make some subcomponent.
First, it wasn't a review. They didn't review anything.
/. blurb. But it was about Google! Gooooooooogle!
Second, it was a Google Mini.
Third, they didn't "put it through its paces" at all.
Lousy article, misleading
bp
While this is an interesting article, it really isn't much of a review of the Google Mini. All they do is take it apart, take pictures, and tell you that they set it up after a little bit of trouble. There is nothing about how well it actually works. No benchmarks. No comparisons. They just say that it worked well and leave it at that. Anandtech has had more indepth reviews of mice before.
It is more information than I have seen anywhere else though.
That's it, I gotta get me one of those just for the tee.
Yeah right, like I'm gonna write a sig.
It's really easy: It's "his", "hers", and "its". Even a flower knows!
--cycling through grammar Nazi mode. Please wait.
Did it strike anyone else as insane that this thing only had one hard drive? For $3,000, where's the raid array? Ok, sure it's a search appliance and doesn't really hold any mission critical data, but if the hard drive crashes, how long is your search functionality going to be down? You'll need to get a replacement drive and rebuild your whole database (a slow crawl process). What about your configuration settings?
Maybe it takes a while for the documents to be indexed but you'd think they would have added it manually given the nature of the article.
From the Summary: "a reasonably indepth review of the Google search appliance."
If, by "reasonably indepth review", you mean lots of pretty pictures and a narrative about opening the box and the case, then sure.
Rather than calling this a review, perhaps it could be re-titled "One man's demonstration of the Google search appliance."
That said, I'm a little concerned about how many URLs it can handle... 100,000? According to TFA, 40,000 documents overloaded this thing.
The article did not address how this could be overcome, except by eliminating some of the URLs from the crawl. How scalable is it?
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
A few months ago, we asked for a demo of the product. My main involvement was to help compare with our existing search strategy. Just to cut to the chase, we generally had a very positive experience with it. Searches would bring up what we wanted more often than not. Our existing search system, which was based around IIS and custom SQL code, was pretty good, though it couldn't beat Google for pulling up relevant pages. We did have a few quirky things happen, though.
We had a couple times when the appliance locked up and had to be rebooted. That was probably the most distressing as it had to be on 24x7 to support our organization and I wasn't looking forward to the help desk calls.
More amusing, though, was the way it crawled content. Google works like any other crawler - it goes around and clicks hyperlinks. Unfortunately it's not too bright, not paying attention to the text of the hyperlink, like if it said "delete" or something like that.
Unfortunately I had a poorly secured application that Google was able to sneak into via another link I wasn't aware of. It held the custom links for each of our departments to display a personalized set of links on the home page. Unfortunately it went through the admin tool and clicked every delete link it could find. I was paged the next morning and was fairly unhappy. My fault, though.
The irony is that the budget money evaporated and we aren't getting it after all.
This was an interesting review if you had never seen what a google appliance looks like, but it wasn't very in-depth at all.
I was certainly looking forward to some overclocking and linux installing. I mean, I'm sure they voided whatever agreement they had with google just by opening the case up, so why not go all out and give us the review we really want to read?
I didn't even realize the review was over until I realized there was no "next" button on that last page.
The Google Sandbox
;)
Who cares about the hardware, let's see the algo
Maybe CmdrTaco could use it to search for tips on apostrophe usage.
Ydco co
Just a matter of time before it's reverse engineered :)
We have secretly replaced these Slashdot mods' sense of humor with a rusty nail. Let's see if they notice!!
I heard that this google mini is using a modified Version of a linux distribution. Is the source code given by google somewhere?
I can search the 63,000 online documents with http://www.google.com/search?q=site:www.anandtech.com
to manage/limit file access? we just got one to index our company's docs. their (the files) access is managed by permissions. i've googled the web and not found a clear "how to" doc that helps with the problem of IUSR's (yes i'm using MS IIS...:(... ) permissions opening the door for anybody who clicks a link to a doc.
So what os does this thing run and why is it not mentioned anywhere?
We need people to use the google toolbar, because that is one more bite out of Microsoft. Although Google works best with Microsoft, the more accessible and usable it is, the better equipped Google will be to do battle with Microsoft. It will finally be a relatively balanced (well more balanced than others) battle between Microsoft and Linux.
Read the article in PC Magazine (I think) "Why Google scares Gates".
Not only "land of the free" but "land of the lawyers" who love a good old 1st amendment smackdown. Shihar 153932
go to http://search.anandtech.com/ and do some googling. a bug??
http://search.anandtech.com/search?q=hardware
To access the search results, you must issue a GET request to the Google Search Appliance via a search box. You can do this by copying and pasting the following HTML code into a Web page. Enter your server name and your collection name where indicated in the code.
<!-- Search My Google Search Appliance -->
<form method="get" action="http://enteryourservernamehere/search">
<table>
<tr>
<td>
<input type="text" name="q" size="25" maxlength="255" value=""/>
<input type="submit" name="btnG" value="Google Search"/>
<input type="hidden" name="site" value="ENTER_COLLECTION_NAME"/>
<input type="hidden" name="client" value="ENTER_COLLECTION_NAME"/>
<input type="hidden" name="proxystylesheet" value="ENTER_COLLECTION_NAME"/>
<input type="hidden" name="output" value="xml_no_dtd"/>
</td>
</tr>
</table>
</form>
<!-- Search My Google Search Appliance-->
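For anyone who would rather build that GET request in a script than in a form, here is a rough sketch. The parameter names (q, site, client, proxystylesheet, output) are taken from the form above; the server name search.example.com and the collection name default_collection are placeholders, and the helper function is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(server: str, collection: str, query: str) -> str:
    """Assemble the same GET request the HTML form would submit."""
    params = {
        "q": query,
        "site": collection,
        "client": collection,
        "proxystylesheet": collection,
        "output": "xml_no_dtd",
    }
    # urlencode escapes the query text (spaces become '+').
    return f"http://{server}/search?{urlencode(params)}"


# Placeholder server and collection names; substitute your own.
url = build_search_url("search.example.com", "default_collection", "google mini")
print(url)
```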
The screw is threaded - it just can't be undone with a regular screwdriver.
Right.. Only unthreaded screws can be opened by a regular screwdriver.
I thought Google used pigeons ...
"I'm never quite so stupid as when I'm being smart" (Linus van Pelt)
I admin a full blown Google Search Appliance, the Mini's big brother.
If you want the specs:
Dual Xeon 2.6GHz
12GB RAM
4 250GB HD's in RAID(something) with a hot-swap spare.
Never tried taking off the cover though, since we want to keep the warranty.
All of the money you pay is a license for the software on the box, the system itself is effectively free, so once the 2 year warranty expires, you've effectively got a nice powerful linux box for free. You can keep running the software, but without any support.
As for performance, this thing works great. We have about 250,000 pages that it can index, both public and private (and it can do searches cleverly, checking username/password to see if you should have access to certain results), and we've had nothing but positive responses from our users. The results come up quickly, they're the results people want, and the results that management think should be at the top are at the top.
What happens after the BIOS screen and before you "log in" to the web interface? Surely it runs some sort of operating system?
There are 11 types of people. Those who understand binary, those who don't and those who are sick of this lame joke.
Boy, here it is almost noon EDT, and nothing about Google yet! I was getting worried. Should we start a pool now, betting on which Internet trend the Slashdot fanboys will pick? Apple is now passe, I think. Google will fade soon. Tivo is WAY passe, at this point.
I want to delete my account but Slashdot doesn't allow it.
I am aware of what TFA said. My point is this: 100k URLs is not a lot; I was merely pointing out that 40k docs can be > 100k URLs, and this means that capacity becomes an issue very quickly.
I guess TFA being from the you-know-for-the-kids-dept explains it pretty well.
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
We evaluated one of those yellow Google search appliances (GSA) and experienced very mixed results. It is very easy to set up the appliance and launch an initial scan of a website.
The GSA will blindly search all web servers in your domain. When setting up the GSA, you give it an initial page from which to start crawling and baseline domains. For example:
Initial page: http://www.slashdot.org/
Domain(s): .slashdot.org,slashdot.org
The leading dot on the first domain entry says to search all hosts in the domain.
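To make the leading-dot rule concrete, here is a small sketch of how such a domain list could be matched against a hostname. This is an assumption about the matching behaviour based only on the description above, not the GSA's actual code, and host_in_scope is a made-up name.

```python
def host_in_scope(host: str, patterns: list[str]) -> bool:
    """Return True if host matches any domain entry.

    Assumed rule: an entry with a leading dot matches every host under
    that domain; an entry without one matches only that exact host.
    """
    host = host.lower().rstrip(".")
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern.startswith("."):
            if host.endswith(pattern):
                return True
        elif host == pattern:
            return True
    return False


domains = [".slashdot.org", "slashdot.org"]
print(host_in_scope("www.slashdot.org", domains))    # any host under the domain
print(host_in_scope("slashdot.org", domains))        # the bare domain itself
print(host_in_scope("notslashdot.org", domains))     # different domain, out of scope
```

Under that rule the second entry is needed because the bare domain has no leading label and so never matches the dotted form.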
Problem: GSA does not provide very good status of where or what it is searching. It only has a dashboard light to say it is crawling. No details.
Problem: We found that the GSA would get caught in an endless loop if it encountered a user website controlled by a database. It would endlessly follow the next and previous links to find every database entry.
Our university library subscribes to a number of electronic databases, such as EBSCO PsycINFO, etc. The GSA indexed every possible look-up.
Our eval license was limited to 1.5 million pages. Some of these databases contain hundreds of thousands of pages. Solution: Those setting up their own web server must employ proper robots.txt files or risk having their entire server blocked from indexing.
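The endless next/previous problem is easy to reproduce. Below is a generic sketch of a crawler loop with a visited set and a page budget, run against a simulated database-backed site whose pages link to "next" and "previous" forever. The example.edu URLs and both function names are invented for illustration, and nothing here reflects how the GSA is implemented internally.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int) -> list[str]:
    """Breadth-first crawl that never revisits a URL and stops at max_pages."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def database_links(url: str) -> list[str]:
    """Simulated database page: always offers a 'next' and a 'previous' link."""
    page = int(url.rsplit("=", 1)[1])
    base = "http://example.edu/db?page="
    return [f"{base}{page + 1}", f"{base}{page - 1}"]


pages = crawl("http://example.edu/db?page=0", database_links, max_pages=5)
print(pages)
```

The visited set alone does not help here, because every "next" page is a new URL. Only the page budget (or an exclusion rule for that part of the site) ends the crawl.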
signature pending slashdot approval
The pictures are pretty and I'll assume the thing works. Some folks, however, won't buy it because they don't want their intranets to work like you or I might expect. Let me explain.
I work for a large TLA govt agency. I've begged our people to get something like this. I know, from working with our folks and doing my own digging, that we have a wealth of knowledge tucked away, here and there, on local group shares and out-of-the-way internal web sites. And yet our internal search function is ludicrously bad. It works off "key words" that are simply a manually maintained (I think) list of useless, often off-the-mark descriptions of approved sites of general interest. Special-interest pages are not indexed in this way. The crawler, if you want to call it that, is terrible at doing its job. Enter a string of text and get a hit on a known, universally accessible web page containing that exact string? Not a chance. I test it occasionally and find that it remains as ridiculous as ever, with a level of functionality that would have been technologically uninteresting the better part of a decade ago but is, in this day, infuriating to users.
The reason for all this is that if our intranet were automatically crawled, well indexed, and truly searchable, people would be able to find things. People in Work Area A would be able to see how they might be impacted by something going on in Work Area B. Horrors! That would mean that management would lose much of their ability to keep employees selectively in the dark.
All this came to a head a number of years ago. At that time, our intranet content was maintained by IT. Anybody that wanted a site (literally anybody) could just get their first-line manager to approve the request and they'd get server space and some help setting up a page or two. The exchange of information that started happening was highly disruptive, so a "Communications and Liaison" office was set up that wrenched control of the intranet from IT and required (what seems to be essentially political) approval of the business case for anything that went online. No web sites unless the Communications gods approved.
Nowadays, the employees of one division are only vaguely aware that other divisions exist or have web sites. Each individual fiefdom is protected from the ravages of communications that don't strictly follow the org chart lines. I guess the executives in charge are happy in their insulated little worlds.
If you're going to sell an effective intranet search tool, you're going to have to face the fact that lots of large organization leaders (and you find the same attitudes in both the public and the private sector) would recoil in horror at the thought of having their intranet be effectively searchable. It's too threatening.
The problem is not Google, it's the way your app is designed!
Universal Resource Identifiers -- Axioms of Web Architecture : Identity, State and GET
In HTTP, GET must not have side effects.
In HTTP, anything which does not have side-effects should use GET
If somebody visits your site with a pre-fetching tool like the Google Web Accelerator, you will also find the "delete" links being followed automatically like this. Change those deletes to use POST instead.
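A bare-bones sketch of what that change looks like on the server side: the delete route refuses anything but POST, so a crawler or prefetcher that only issues GETs cannot trigger it. The /admin/delete/ path, the handle function, and the in-memory links store are hypothetical stand-ins for the real application.

```python
def handle(method: str, path: str, store: dict) -> int:
    """Tiny request dispatcher that returns an HTTP status code."""
    if path.startswith("/admin/delete/"):
        if method != "POST":
            # GET (and everything else) must not change state: 405 Method Not Allowed.
            return 405
        link_id = int(path.rsplit("/", 1)[1])
        store.pop(link_id, None)
        # 303 See Other: redirect back to the list after the deletion.
        return 303
    return 200


links = {1: "HR", 2: "Payroll"}

crawler_status = handle("GET", "/admin/delete/1", links)   # what a crawler would send
remaining_after_get = dict(links)
admin_status = handle("POST", "/admin/delete/1", links)    # a real form submission
print(crawler_status, remaining_after_get, admin_status, links)
```

This does not replace authentication on the admin tool. It only removes the specific failure where following a link deletes data.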
- sigs are for wimps.
Given the actual content of their review, I'm very surprised they didn't pull the drive and have a stroll around the filesystem. They've pretty much toasted the warranty as it is, anyway.
offtopic?
At anandtech's website,
to test the ability of their google search server,
I searched for the title of that article.
You would think it would point me to the article;
it did not.
"its paces", not "it's paces"
Would be interesting to see more info about the filesystem layout, OS and version, and the code. Apart from Google's engine, some hacker should try to piece together an open source solution ;-)
The team I manage has four of the Google appliances that are the big brother to the mini. These devices provide pretty good search results with minimal effort. They will do strange things when hitting a site that contains another search engine or pdf generation. Google refers to this as a "Search Vortex" and results so far are a death match with Google Device 1, Web Server 0. Finding the content that causes this problem and removing it from the search can be painful. Overall the boxes are solid.
A company named Thunderstone based out of Cleveland, OH makes a way better (and cheaper) search appliance than Google's. FYI, they aren't new to the search engine industry either. Up until very recently, they were the search engine for Ebay and a few other significant sites as well. www.thunderstone.com
I like this kind of review. A bit of what the packaging looks like (no one writes that, although it's quite interesting for me personally: how does packaging for a $10000 unit differ from a $300 machine), a bit of a view from the inside, a bit about the software. Nothing too complicated, because that would make the article dull to read. What the article provides is the general feel of the product.
One thing I wonder is that Google can probably use the included modem to download private company data which the server caches (if the company bought the server for internal use).
It's not clear from the article but I know that Google's server farm runs on Linux. Does the same apply for these machines and, if so, do they come with the source code to the GPL-ed parts of the server software?
Besides, nothing google does is newsworthy unless it's filing for bankruptcy or submitting to Microsoft and yielding to a hostile takeover.
I post at -1. Clearly I'm not a poster child for slashbot.
What would really be interesting is to know what it runs. Most likely it is some gnu/linux system. If it is some kind of custom distribution with a modded kernel then, according to the GPL, Google must make source code for such a modified kernel available. I am really surprised nobody actually got hold of this. I guess the indexing software is off limits since it is a separate application, not derived from anything GPL. But even a custom kernel should be interesting.
Also would like to know how do "google does no evil"-fanboys find those "custom" screws that you can't undo with a normal screwdriver?
I am currently in the midst of setting up a Google mini. I have noticed most articles mention that getting the *initial* crawl set up is quite easy. It is. Even this article mentions "The last thing that we worked on was making the Mini look like it is part of AnandTech.com. There are two ways to go about this in the Mini admin. One is to use their built-in page layout helper, which allows you to wrap the search screens with a custom header and footer. The other way (which we prefer) is to use the XSLT Stylesheet editor and modify the stylesheet to meet your needs." But neither the screen shots nor the article go on to cover this process, on which I have found very little information. Also, one pitfall is that the MINI offers only 1 collection, meaning that if you want to search multiple sites you will have to filter content by URLs, i.e. /my_site1/:* for one collection and my_site2/:*. And keyword searches are made across the whole collection. Also, having a Google mini I have access to the support site and forums. Throughout all the forums I have yet to see a Google associate reply. I have contacted Google four times stating that I needed help getting a correct XSLT sheet working aside from their default. I seem to be getting macro replies from Google stating that they do not provide support on XSLT. I think this is considered ranting. My apologies.
The EnterFind appliance (the product I helped develop last year) is cheaper, handles native Windows shares (not just HTTP) as well as databases, and has a web-services API.
Our experience in evaluating the google machine for a large (100+ Million hit/day) site was less than positive. We'd have needed over 40 of their regular boxes to supply the search results, and there is no built-in cluster management. Since there is no access to the filesystem, this means we need to write the tool to interact with their web-based gui, and if they change bits with an automatic software update, too bad for us :-(
Needless to say, we declined. Results and response times were pretty good though.
I really think that the OS is windows ... the web browser it loads is Internet Explorer .. so I guess it should be windows ... the truth is I was expecting it to be Firefox or at least Konqueror :-(
"In questions of science the authority of a thousand is not worth the humble reasoning of a single individual."
They look nice. Performance seems good. I wouldn't mind getting my hands on one. But it would be quite useless for me :)
I actually wonder what OS those boxes run.
Somebody knows?
Microsoft also sells boxes like those?
{{.sig}}
Clearly they just chose Gigabyte so it could be a G-appliance.
Mac toys and accessories blog
From TFA: We created a file to which a link to every article, news post and blog post that have been published on the site would be dumped. That file is cached for a few hours as we update the index 3 times a week.
"Don't mess with him, he taunts the happy fun ball."
We looked into testing the Google appliance for searching our printer ink site. We found that using our Google AdSense account gave our printer ink customers the ability to search our site and suited our small business needs just fine. You can see that the search box at the top of our site lets the search-happy people search away. If they go somewhere else, we felt being a directory will allow us to keep them coming back due to our printer help sections. Why buy a big Google appliance??? -- Especially with the fees. I know some techies would disagree and want better control over their pages, but so far we have had great results having clients actually find what cartridge they are looking for by model number or keyword specific terms.
Anyone else think the Anandtech server room has some lovely, lovely carpets?
it's not offtopic, it's flamebait.
Bow to your Google Overlord. ;)
I'm a bit disappointed. I would have absolutely loved to have seen what was actually on the hard drive as to get a better idea as to how Google actually thinks and organises.
I loathe the google appliance. I liked it for the price and it was supposed to be like an appliance. Plug it in, turn it on and click a few buttons and off you go.
It locked up for me way too many times even though google cites this as rare. I wasted way too much time on support for a device which should not need this level of babysitting.
When my contract ends, I'm switching to Nutch.
since the OS seems to be linux, should they not give access
to the source code etc. to comply with the GPL?
From the reviews so far it seems it is a closed system
and can be used only through the web browser.
Google has many production quality problems with its distributor. I had to return 2 units before I received a functioning unit the 3rd time. I benchmarked the functioning Google Mini the other day. I haven't published detailed results yet, but I can tell you that the performance was very poor considering the performance expectation from a brand like Google. While I think the appliance is very capable, neither the Google Mini nor the larger yellow appliance is suitable for wide enterprise deployment. I benchmarked the Mini at an average of only 3 transactions per second. Max of 7 TPS, min of 1 TPS. Load balancing with 2 boxes only improved transaction time by ~30%. My company of 100,000+ users certainly can't use a system at this performance. I don't think my workgroup of 20+ people will be able to use it productively. We bought the box, but I think it will stay in the closet for limited uses. It has potential for h4xng with processor/mem upgrades - maybe even dd to new hardware. But until Google concentrates on appliance performance, their "Google Enterprise" initiative won't be taken seriously by the target market.
A Response From Google: Thank you for your message. I apologize that there is currently some ambiguity in our documentation regarding external stylesheet behavior. Google has recently begun shipping Mini appliances with patch "google-mini-patch1.bin" pre-installed; this appears to be the case with your Mini. Documentation for this patch describes the procedure for installing this patch (which will not be necessary in your case), but also describes how to add #TRUSTED_STYLESHEET rules to allow reference to specific external stylesheets:
I seriously considered getting a Google Mini for my law office. The desktop search stuff wasn't really doing it for us, and we have boatloads of work that we reuse on a regular basis -- pleadings/contracts/settlement agreements, etc. are sort of like code in that respect -- we always want to reuse our knowledge rather than reinventing the wheel. My concern was that the regular Google appliance was too expensive. The mini seemed reasonable, but I still was resisting the idea of paying that much for search.
In any case, I had searched high and low for a decent search function when I happened upon swish-e. I am exceptionally pleased with it. It can be found at swish-e.org.
I am not an uber geek, but I was capable of spending an afternoon monkeying with it to install it, set up regular indexing as a cron job, get it to properly read and index OpenOffice documents, and to launch them from the browser. This involved some frightening security settings, but I have a small enough office (three people) that I'm not too torqued about this. The wide open settings I used were not swish-e's fault, as near as I could tell. Rather, they resulted from my laziness -- "It works well enough now, and the likelihood of malicious use is pretty low, so fuck it".
Obviously, it could be set up a bit more cleanly on my end, but I am really, really happy with it apart from that. Currently, it runs on a used SCSI-RAIDed IBM Netfinity box that I picked up for a little under $500.
The time and money I spent on the hardware plus getting it running has paid immense dividends. I have benefitted in two primary ways:
First: my office minions use the network for storage and do not store anything locally. This means that everything is indexed (and can be found!) and because they like the search so much, they also (unwittingly, perhaps) give me the peace of mind knowing that our data also gets the other benefits of being on the network (everything is backed up automatically/regularly, etc.).
They like being able to find stuff, so the search has really encouraged saving stuff on the network. I could mandate this in other ways, but I'd rather have them drinking my Kool Aid than simply imposing the idea.
Second: My minions and I have saved tons of time using the search feature. Any good search does that. The additional bonus is that I no longer have to worry about the next version of Google Desktop or Copernic or installing it on various machines, blah, blah, blah. It's all centrally saved and configured. Administration is essentially zero since I am getting good search results on all the document types that I need - some old MS Office leftovers, Open Office, and PDF.
I don't see needing to change this in any significant way for at least as long as I keep the hardware. I think that the next time I'll need to touch it will be when the index outgrows the box serving the searches.
The box I'm running has dual 1.something gig pentiums with a gig of RAM. The drives are the weak link, with only 9.1 GB of space available for storage of OS, index, etc. The box also has redundant power supplies, redundant power supplies , redundant ethernet connections (100MB), and redundant ethernet connections (100MB).
The front end to the search is just a standard, "came with it" CGI script (swish.cgi). It works just fine. It gets called up as a webpage locally, and it spits out results.
On a final note, we are pretty aggressive in enforcing standardized file naming conventions. The naming conventions typically include the client name, the matter, a date, the type of document, and the subject of the document. Swish-e has document path, title, title and body searches off the interface we use, and you'll usually find exactly what you're looking for if you're reasonably specific.
On a final note, swish-e has been unsuccessful when I have used the following search terms "nubile blonde woman" and "willing to get with me". In that respect, swish-e has been an outright failure, though it is conceivable that the fault lies with operator error.
GF.
Lots of petrified grits