Nutch: An Open Source Search Engine
Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch.
In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?
Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.
The slashdot search page could definitely use this kinda technology!
Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!
open source pimp engine
I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.
I'm pretty sure a search engine is supposed to be for whatever purpose the people making it want it to be.
My biggest concern is that the developers will simply be in a scramble patching up exploits, instead of actually making their technology better.
Last I heard, Google still doesn't accept bribes for page ranking.
Unobtrusive adverts in the right-hand column notwithstanding.
do() || do_not();
Didn't google start out that way and then realize that it is very expensive to maintain a search engine? Google also clearly differentiates between its ads and its results.
I guess the more the merrier but I wouldn't bank on this thing becoming more than a curiosity.
Also of note is that companies can still influence search engines in slimy ways - Google can be manipulated to make a page rank higher, although Google keeps an eye on this activity and works around it.
I hate liberals. If you are a liberal, do not reply.
This seems to me like the /. moderation system, with the pages being ranked based upon how the user feels about the site.
However, I could see some disadvantages to the system depending upon how it is set up, because one person could keep dinging a site to get its score to drop down.
I'm quite comfortable with how Google does this (present commercial links clearly marked to the side), and am not convinced a non-commercial (open source) alternative is needed.
It's "Business". Hope that helps.
I think that you absolutely have to have a closed source algorithm for ranking pages, because otherwise you'll get people who will simply tune their pages to be high on the list. I can see how making the majority of the search engine open source would be beneficial, but the algorithm itself? It's like saying "Here's the keys to my car" and thinking that, because everyone has access to the keys, no one's going to drive away with it. Sure, everyone has the opportunity to make your search engine better, but never underestimate the tenacity of a web-wanna-be-millionaire.
Two problems:
Here's what I expect to see on the webpage in a few months: "Currently Nutch is in the alpha stage - it doesn't index any web pages, doesn't return any results, and has no user interface. Programmers needed!" Google has WON the search engine war, probably forever. Find some other mountain to climb, guys.
To me, accuracy is the most important "Relevance".
The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.
A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.
Since it seems like Google will never fix this problem, I'm looking forward to something with all of Google's great features, plus accuracy.
Don't blame Durga. I voted for Centauri.
The only search engine I ever use is Google and it seems to find relevant data just fine. And the ads are small, discreet, and actually useful. What's the problem?
Support the First Amendment. Read at -1
i'm scho exschited to usche thisch on my blog schite. it'sch scho exscheschible, that even a noobie can hammer out a schuper-schweet nutch-hack in a couple of hoursch.
i proposche a new schite to catalog thesche hacksch called nutchhack.com.
Free and open code is good and all... but the one real cost of a search engine is RUNNING it. It requires a far from trivial amount of bandwidth and hardware, and somebody has to pay for all of it. Unless someone comes up with a novel P2P solution (and many are trying) it just won't happen.
What they should be doing is pressuring the existing search engine companies for some integrity.
---If you can't trust a nerd, who can you trust?
paid for
by the sites which you find, otherwise basic economics breaks down and it will not work (abuse etc.). Thousands of companies provide $product - free search engines simply direct all users to one supplier of $product. That's not right.
Searching for a supplier of $product is not like searching for information - it is not something that can be done outside of payment by the supplier of $product.
The FAQ doesn't explain the name.
it reads more like some strange marketing propaganda than anything.
..
That project has no releases, has nothing in cvs and very scant details on what it even "is"
There are many many projects out there with so much more info available, why is this one that has not released anything getting so much attention?
anime+manga together at last.. in real time.
google is already ideal... the weight of search results is not sold, just text ads.
people are already 'googlebombing' to try and get better rankings by signing up tons of domains and cross linking them all with the keyword that they want to be #1...
if the algorithm that determined how #1 is determined was public, then the best possible strategy to cheat the system could be devised... instead of paying for weight to the search engines you would be paying web developers to make the search engine think you were #1. and as a web developer i feel that.... oh... wait, proceed.
MARIJUANA, SHROOMS, X: ONLINE?! - E
take a look at the developers and contributors. these guys are all top notch. doug cutting, one of the developers there is the developer for lucene, one of the best libraries out there for developing application search engines in any language. not to mention overture, internet archive, and mitch kapor.. looks like an all-star team. can't wait to play the software.
Who's going to pay for them if it's a non-profit open source project? Bandwidth doesn't grow on trees you know.
And slimy adverts? Google has slimy adverts? I thought they only had relevant adverts? Oh well I guess we need another dot.com that will go bust in 6 months or so.
Mac OS X and Windows XP working side by side to fight back the night.
I think the idea is good in principle, but could it actually succeed? Google gets hit with millions of requests each day. They've got hardware that can support thousands of slashdottings a day and a fat pipe to feed all of that info out. That takes a lot of money. Financing an open source project is difficult enough, but financing an open source service such as that would seem next to impossible. Ideas?
The other major problem would be that, with the ranking criteria being available for all to see, it would be relatively simple to manipulate page rankings.
"Google has WON the search engine war, probably forever. Find some other mountain to climb, guys."
At one time, Oldsmobile won the auto company wars. Where are they now?
IBM ruled the PC roost. Hmmmm....
Command-line OS's were king. But now???
Altavista and infoseek and Lycos were search engine kings at one time. Whither this trio?
The point is, it is not over.
Don't blame Durga. I voted for Centauri.
One of those coffee out the nose moments for me.
One of the biggest issues with running a search-engine, open-source or otherwise, is that you can't eliminate bias in the results. No matter what scheme you put in place to handle rankings, someone will find a way to take advantage of it. It's a fact of any major system - there's always a way to twist it. Part of the challenge that Google and similar sites face is that they have to work constantly to protect themselves from systems designed to take advantage of their algorithm. While a completely unbiased search service would be nice, I think it would require the impossible. It would require that no one out here took advantage of it to further their own interests, be they political, commercial, or otherwise. That's fairly unlikely.
With most of the major engines today including Google, they make an effort to prevent horribly unbalanced results (recent controversy over blogs outweighing professional sites in the rankings due to linking and other factors). Some even admit (again, Google does) to manually messing with the rankings a little. If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to. That's in their own public docs. It's also discussed in Wired.
I honestly don't know if open-source could do a better job. The algorithm might be better (likely, given the manpower), but would it really be that much fairer?
"Be proud to be a fighter" - Martial Arts Adage
It must be indexing this page again.
To read the second page of this article use subscriber code 079751240X.
Go to "Magazine subscribers: Enter here", then "Sign in using the account number on your subscription label" and enter the account number above.
Courtesy of TechDirt.com
I fail to see the point of such an endeavor. Without advertising, Nutch cannot possibly hope to become a serious contender with search engines such as google or overture. Advertising provides the money that enables search engines to have lots of bandwidth to send those results quickly back to users, lots of computing power to quickly process each search, even the ability to hire people to research into new areas for better search results. Even if the search engine is selling its resources to other portals, like google does with yahoo, advertising would still be involved in the process. Yahoo would still need to be advertising on their site to bring in revenue to pay for the service. I think google's method is perfectly fine, with small text-based ads that are discreet. Why do we need to fix this?
Go Illini!!!
I think they're setting themselves up for something that will get too big and too expensive before it can get finished, and they'll have to figure out a way to (gasp) get some funding beyond donations.
I don't see a solution in one great open-source, independent search engine, but many individual specialized search engines, each mastering their own niche area of specialty stands a chance to compete, especially if run by people who focus on their areas of expertise. Alternative news search engines, music search engines, literary search engines, etc. each run by people who know what to filter in and out.
If Nutch.org could create the technology that would allow each of these search engines to exist autonomously, it could also be the hub/portal/start-page/blahblahblah that links all these engines and databases together.
Alex.
No offense to the open source community, but I'm not sure about how feasible an unbiased search engine is. The open source community does not like any bias towards commercial interests and has no problem pointing it out, but by the same token, they do enjoy plugging their own programs, which is completely understandable and normal for any community. However, that does not make it unbiased or 'publicly biased'. The merit of the site is very subjective, in my opinion. I am in favor of a project such as this, but I just want to see it for what it is.
Most open source internet based projects (*cough* Slashdot *cough*) have tended to be rather biased towards themselves. It would be very difficult to remove all subjectivity from a project of this nature. How can the ratings be controlled? If it is done entirely by the 'public bias' what's to stop bots from altering the 'public bias'? Just a few questions that still need to be answered.
For citations of most websites, some of the citing people will link to http://www.someplace.com, and some will link to http://someplace.com.
Therefore, include a comparison of the pages returned by each URL, and if the same page is returned by both, then sum the reverse citations to calculate their total rank.
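A rough sketch of that merging idea, in Python. This is only an illustration under assumptions, not how Nutch or any real engine does it: the function names, the "strip a leading www." rule, and the use of a content hash to decide that two URLs return "the same page" are all made up for the example, and page bodies are passed in directly instead of being fetched.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def alias_key(url):
    """Hypothetical rule: treat www.host and host as the same site."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # An empty path and "/" both mean the front page.
    return host + (parts.path or "/")


def merge_citations(pages):
    """pages maps url -> (body_bytes, inbound_citation_count).

    Counts are summed only when two URLs share an alias key AND
    return byte-identical content, per the suggestion above.
    """
    totals = defaultdict(int)
    for url, (body, count) in pages.items():
        digest = hashlib.sha256(body).hexdigest()
        totals[(alias_key(url), digest)] += count
    return dict(totals)
```

With this, citations to http://www.someplace.com and http://someplace.com are added together if both serve the same bytes, while two hosts that merely look alike but serve different pages keep separate totals.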
-----
Cast a Cold Eye
On Life, on Death
Horseman, pass by
--W.B. Yeats' gravestone
It'd be nice if they could make it distributed. Kinda like P2P search engines, but for the web. That way, the main searching server farm wouldn't be tied to any company in particular. That would give Google a run for their money, and would keep Microsoft at bay for another while.
Being an open search network, some peer servers could specialize in searching what they're hosting, making it possible to index otherwise dynamically generated content. These specialized hosts would act as "search plugins" for some otherwise hard-to-define content.
An authentication method (a la Freenet) would be needed, though. Some form of authority to prevent rogue peers from injecting too much crap in the results.
Overall, a good idea. If they make it, I'll run it.
-- Home is where you eat your heart out.
HTDig is written in C, configurable, and flexible.
Nutch.. written in Java. No thank you, I'd rather not have my machine slow to a crawl.
Fact is, why can't these "developers" working on Nutch work on HTDig to add the features they want?
HTDig is really nice.. Search Engine, help index tool... Really nice... You can even configure the ranking system to fit your needs.
>> In the age of weighted rankings on search
>> engines for profits, there's an obvious need
>> for an unbiased search engine.
If you tell everyone, what your page rankings are based on... that doesn't make it hard for companies to modify their page to fit what the search engine is looking for to increase rankings or hits.
There are some companies that do this for Google, as complicated as it may be.
according to http://www.nutch.org/docs/credits.html the Internet Archive is hosting nutch, and Overture has given them hardware. Sounds pretty sweet. Probably not the 20,000 strong linux cluster google has going though.
Photos.
http://www.mnogosearch.org/
Mnogosearch is viable web search engine software. It supports caching (a la google) and clusters of DBs, and easily supports external parsers... Maybe that project should enhance and help this excellent Free Software instead.
adulau
The answer: "What did Sean Connery say when he saw the reviews for 'League of Extraordinary Gentlemen'?"
Trolls lurk everywhere. Mod them down.
One thing that'll help Nutch financially is that they can use their technology for more than a single page running on their own servers(and taking on the huge loads that implies). They can use open-source business models instead, offering licenses with tech support, custom versions, etc.
I was always kind of worried that we might end up with an internet controlled by Google, anyway. But we'll have to wait and see if it actually works or not, anyway. I sure hope so.
"Google will toss the words 'to' 'be' and 'or'."
That is the problem. The reason I put such words in phrases is because I want an exact match.
" It does this to eliminate words that show up to frequently and make the searches faster"
I would hope that Google solves this by getting faster servers, instead of producing bad results. Besides, if I did not want the results to include all the words in the phrase, I would not have included them in the phrase in the first place.
" If you really want that text, then either quote the whole thing, or place a '+' in front of those words"
I did quote the whole thing, and got 70% accuracy. By putting plusses in front of the words, I still got 70% accuracy.
"So there is no problem with it's acurracy when you understand the proper way to ask it for something."
Quotes around the phrase do not work. Plus in front of all the words fails too. What is the secret of "the proper way"? More importantly, why won't it do the most intuitive thing: try to match the phrase as it is typed?
Don't blame Durga. I voted for Centauri.
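For what it's worth, the "match the phrase as typed" behavior asked for above can be sketched with a positional index that keeps every word, stop words included. This is a toy illustration in Python, not a description of how Google actually works; the function names and the sample documents are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into words; nothing is dropped as a stop word."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(docs):
    """docs maps doc_id -> text. Returns word -> doc_id -> [positions]."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return doc_ids where the words of phrase appear consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    # Only documents containing every word can possibly match.
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    hits = set()
    for doc_id in candidates:
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits
```

Under these assumptions, a page of bee cartoons with "2Bee" or a book titled "Or Not To Be" would not match the full phrase, because the words must appear in order and adjacent. The cost is a bigger index, since positions for very common words like "to" and "be" have to be stored.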
It's not the technology that prevents thousands of google clones from popping up. It's the simple fact that to initially succeed, you need either a lot of cash or heavy backers.
It's not like Google's pagerank is so unique that it's impossible to do better any other way. It's just that 1) you have to do better or equal, 2) people have to know about you.
Point 2 equals a lot of cash.
How small a thought it takes to fill a whole life
The way I understand it is that in order to operate a search engine that sorts through millions upon millions of listings thousands of times every minute, someone is going to need a whole lot of bandwidth. Not to mention the CPU resources that such a task would require. CPU cycles and bandwidth cost money, and no matter how altruistic the person's intentions, they've got to earn that money. That is where advertising comes in. If I'm not mistaken, google is pretty cool about not having slimy advertising. However, I'm not sure if they pocket any of the money received from those advertisements, or if they simply use it to cover the costs of operating the search engine.
This is great until it starts working and it is really good and someone offers a lot of money for it and it is sold.
While your first post (reply) was quite amusing, you lost a few points:
1) You should have used "nutchs" instead of "nuts"
2) You are a "FP Mastur", not a "FP Mastar"
3) ???
4) PROFIT!!1
Any useful search engine will have an algorithm for ranking page relevance. Because search engine placement is so important to business, there will always be people out there who attempt to optimize (and in some cases, abuse) their pages to boost search engine ranking.
The most useful search engine is the one whose biases match your own biases.
Could someone set up a mirror? I think they got slashdotted.
But what about the hardware and bandwidth? I read about the kind of horsepower running behind the offices at Google and find it hard to believe a competitive offering can be made.
Perhaps what is needed is a peer-to-peer style distributed search engine for the web?
What??? And nobody sent me the memo.... (Posting from Lynx from a *BSD shell)
Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
Actually, it's "Bidness". :-)
Yor momma!
The question is: "What did Sean Connery say when he saw the reviews for 'League of Extraordinary Gentlemen'?"
Trolls lurk everywhere. Mod them down.
" An unbiased search engine is completely useless."
Unbiased is fine for me. When I search, I am just looking for matches. That is all. I don't care so much about ranking decisions as long as the search produces accurate results. (that is, words or phrases found in the resulting documents).
Don't blame Durga. I voted for Centauri.
How are they going to afford the massive hardware and bandwidth costs associated with running a tier 1 search engine?
obviously no deficiencies vs. no obvious deficiencies
Check out Lucene, the indexing and search engine used by Nutch. From what I've heard, Nutch is mainly the spider/crawler used to gather documents.
The project may start out as an un-biased ranking system. But, if it gets very popular, the cost of running and maintaining a search engine that gets much traffic at all will require some sort of funding. (case in point: Google)
Maybe if the thing was intended for use only by educational institutions, then some education grants could be used to support the infrastructure required to run a popular search engine? Or maybe it could be a subscription-based service? I dunno...couple of thoughts on how to pay for it anyway.
Bottom line: somebody's gotta pay for it and (usually) the easiest way to pay for it is through advertising.....which will (unfortunately) probably lead to money-biased rankings.
Grub is another open-source search engine. I have the client running right now, it's nice and distributed, and I think this kind of idea is great.
i use linux and windows oh god how can i have an opinion
Ooh, what's this?
Overture Research has donated hardware and helped to fund development.
So, even an "open source," "unbiased" search engine is funded by a commercial search organization.
Three of these top 10 links were not accurate results. I searched on the phrase "to be or not to be", not variations or misspellings. Phrases that capitalize on it, but do not match it, are close (but not accurate matches).
Don't blame Durga. I voted for Centauri.
let's see where is the funding coming from. Project is funded by overture which is to be bought by Yahoo. More info is here. Hmm.. So i guess Yahoo needs a revival...
bin
look siG is kool
Hey Trebek, tell your mother I had a good time last night.
You suck, Trebek. I hate you and your ass.
[/Connery brogue]
I'll take the 'A's, hands up who wants to work on the 'B's...
Cheers, Paul
I use www.google.com (not www.goohle.com or gogle!).
The third result is a site with bee cartoons. It contains "2Bee", etc. Close, but not a match. (The word referring to that insect was not in my search request).
Link 9 goes to a book at Amazon called "Or Not To Be". That partial phrase appears throughout the link. However, the entire phrase that I asked for does not appear.
Link 10 is to the papermsce site. It contains no funny but false variations on the phrase, nor any fragments larger than "to be" found here and there in the text.
Don't blame Durga. I voted for Centauri.
Hello. CmdrTaco here posting as AC because I lost the password to log in with. It's silly really, but I haven't needed to log in for such a long time that I just can't remember what the danged thing was! Anyways, I am in need of some assistance. Slashdot has a special backdoor for me only. If I get an AC post modded up to +5 insightful or interesting, and it comes from my subnet, then my password will be reset and mailed to me. So if you could see your way to just modding me up it would be a HUGE favor! Honest injun! Thanks. And please keep reading Slashdot.
So you have an algorithm and software - so what? The hard part of any search engine is paying for the bandwidth and employees and hardware. Software does not make a web search engine - at least not unless the "web" is a very small bit of what's out there.
Google uses software, but google's software on its own would be useless without the massive amount of funding that keeps the lights on and the pipes open.
Nutch has four developers, one of whom is Doug Cutting who wrote several indexing engines. They count Alexa founder Brewster Kahle as a "friend" and are sponsored by Overture.
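for (i = 0; i < intMaxSearchResults; i++) {
    // Compare with ==; a single = would assign and always be true.
    if (searchResultURL.host == "www.myfavoritedomain.com")
        intSearchRanking = 1;     // my favorite domain always comes first
    else
        intSearchRanking = 1000;  // everyone else sinks to the bottom
}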
I think having an open source search engine that people can modify and deploy would be an excellent thing, and here is why. Currently, google has the complete power to highlight or censor anything on the web. So far, they have used this power wisely, but that's no guarantee that it'll always be so. If they go public, you may find this power being used to increase the shareholders' wealth, rather than in the highest standards of fairness as it is today.
With that in mind, how would this project help? It would allow webmasters to quickly & easily modify it for their needs, and deploy their own niche engines; in other words, Google would be supplemented by 10,000 niche search engines, each focusing on a specific field (microsoft propaganda, for instance). This would create a balance of power, ensuring that no single search engine accumulates an insane amount of control over the web as a whole.
I made a PHP/MySQL library that prevents SQL injection & makes coding easier!
Commercial results are biased up even though they're not marked as paid. Try a search for anything whatsoever (except open source) and you'll get your first 3 pages filled with online stores.
Nutch
E.
Never rub another man's rhubarb - The Joker
*Sealed envelope is opened*
"When Sean Connery saw the reviews for `League of Extraordinary Gentlemen`, what did he say?" is the question.
"Lost not are all who wander"
"In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine."
Bias is inevitable -- we're talking about ranking, which necessarily means bias.
The question is: what bias do you want? What bias suits your purposes?
My ideal search engine would offer a variety of biases from which to pick.
-kgj
"Neither is that sentence. One subject per verb, please?"
To boldly say that it is TWO sentences, dolt!
I can't see this OS project getting too much traction. One quickly realizes when setting out to build a search engine, that it takes a ton of computing power in the means of pipe, drive space, and database space. I found out the hard way.
It may be fun for some small intranet stuff though....
You are right: one more of these was a bad result. That's 40% error.
Don't blame Durga. I voted for Centauri.
open source is for communists
My /. page became the #1 result for "Omkar" before I posted a single journal article. Google is great, but as this illustrates, it's certainly not infallible.
While my own preference would be to use python as the spider just as google does I have no doubt that Java is up to the task itself, especially with actually skilled developers.
However I question the decision to use Tomcat. My limited experience with Tomcat showed it to be a resource hog that doesn't scale well at all. I couldn't imagine the Tomcat I played with surviving traffic anywhere near the amount that google gets regularly.
Has anyone used Tomcat in a high-load situation? How much RAM did you need? I wasn't convinced that the 512MB I had would be enough and ended up dropping Tomcat entirely. That was a year ago. Did I have a bad version? Is this normal for Java (doesn't seem to be to me)?
Why is it that when it comes to OS, everyone is bitching and screaming how bad the monoculture created by Microsoft Windows is, but otherwise feeling warm and fuzzy and swearing to god that Google is and always will be the only search engine they use?
:
The point is, are you really comfortable to have one, and only one, effective search engine? No matter how well it searches?
O'Reilly put it best
Actually, Nutch has no ambitions to dethrone Google. It's just trying to provide an open source reference implementation of search to help keep Google and other search engines honest, by letting people compare the results of an engine whose algorithms and methodologies are transparent and accessible. It also aims to give a platform for people outside of the search heavyweights to research new search algorithms.
In my experience, the Teoma search site's sponsor links (paid for linking) are easier to differentiate from search results (in a different part of the page). What's more, they are almost always directly related to what I am really looking for, and sometimes exactly what I am looking for...
I will give you an idea to start, it's something I've been thinking on lately.
When people browse the internet, a plugin will fetch the pages (like a cache), parse its contents, and send to another computer. This computer is like a "mini-server", holding a couple hundred of clients.
Searching is like a p2p search, where you send your query to this "mini-server", which we shall call a Hub from now on, and it will look up the words in its index and process the results.
A Bayesian filter determines what Hubs you connect to, clustering you with people with similar interests.
What's the benefit? Well, new pages are added automatically, you don't need a crawler and bazillions of bandwidth to keep an index, which is *the* biggest problem for any search engine. Disk space is decentralized, so storage isn't an issue.
And you can make all sort of connections since it's a browser plugin, for example, what other pages people visit when they look for certain words, what are the entry and exit links of a page, time of the day (what's the most popular Linux news site at the morning?).
Still brainstorming a lot, but hey, IAAAC.
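To make the Hub idea a bit more concrete, here is a minimal in-memory sketch in Python. Everything in it is hypothetical: the class name, the method names, and the simple "all query words must appear" matching are assumptions for illustration. It leaves out the networking, the browser plugin, and the Bayesian clustering entirely.

```python
import re
from collections import defaultdict


def words_of(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Hub:
    """Toy 'mini-server' that indexes pages its clients have visited."""

    def __init__(self):
        # word -> set of URLs whose text contained that word
        self.index = defaultdict(set)

    def submit_page(self, url, text):
        """Called by a (hypothetical) browser plugin after a page view."""
        for word in set(words_of(text)):
            self.index[word].add(url)

    def search(self, query):
        """Return URLs containing every word in the query."""
        terms = words_of(query)
        if not terms:
            return set()
        results = set(self.index.get(terms[0], set()))
        for term in terms[1:]:
            results &= self.index.get(term, set())
        return results
```

A real version would need ranking, some defense against clients submitting junk pages, and a way to forward queries between Hubs, none of which is attempted here.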
That's nice that they want to open source the engine but that's the least of a search engine. They're going to need multiple high end servers to process the searches and plenty of bandwidth to get the results to the users.
How do they plan to pay for that? Apparently advertising is out. And we just had another monephobe complaining about lack of funds for his accounting software who expected people to donate because he couldn't figure out that maybe, just maybe he should find a way to sell his product in some form while also keeping one form free. I can get RedHat for free OR pay money to get a hard copy with some bonus stuff. Net result is that RedHat makes money and everyone is happy. Those who refuse to pay don't have to and those who are willing to pay have a reason to. Most people are not going to just give you money out of the goodness of their heart and accept nothing in return if they don't have to. Why do you think PBS gives you gifts with your donations?
I'd be more impressed with such undertakings if the owners weren't convinced the bandwidth fairy was real and that money will fall from the sky like manna.
When someone comes along who recognizes that the bandwidth fairy doesn't exist and that money needs to be acquired through marketing to get any real amount then I'll think twice before laughing it off.
Free is a pretty dream but free don't pay the bills.
Ben
Work Safe Porn
Nutch - Not Understanding The Capitalist Hegemony (I am just making it up)
Without a sound revenue model they can't operate for more than a month. Google has indexed billions of pages and to operate at that level they have to spend a lot of money (Google recently leased an entire campus from SGI). To meet the Infrastructure costs alone you need some form of commercial revenue stream.
It seems like there would be a better choice than Java for the language when speed/efficiency is a must. Isn't the added overhead of the JVM going to decrease performance significantly?
Portability should be a moot point since the pages can be generated on the server, which could easily run an OS specific binary.
If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to.
Yes but if you search for: suicide +"how to", the first hit comes up is: HOW TO KILL YOURSELF
I see no code.
Since google hasn't really done anything to warrant not using it (and we really shouldn't be so paranoid as to assume they will) I could see a project like this becoming useful in terms of specialized searches.
How about a network of Linux or developer sites? Yes, there is google linux, but I have at times found it lacking (especially when I get a slew of German/Japanese sites and it doesn't always give me the language filter option).
How about sites that index restaurants, etc.? Perhaps they would benefit more from a searchable index without a visible initial ranking (so customers don't bitch). Live in Eastern LA and want to grab Greek? Use blahblahfoodsearch.com and look up "+italian +greek +take-out"
Eventually, specialized sites could cater to a niche, rather than taking on something the scope of google (with its no-doubt massive servers) straight away
ROTFL, they can't be serious! A few yahoos fresh from college talking about scalability and billions of pages and then they use the hype language with the worst performance track record of any non-scripting language?!
Man, what a sad joke this is.
Why don't they use Visual Basic? Or maybe perl? You know, for performance reasons. Muhahaha.
Search: "to be or not to be" shakespeare
10 good hits... amazing, huh.
I don't see jack shit on their site, anyone with a little HTML knowledge could produce what they have thus far.
On Google.com, it is VERY clear what are paid ads and what are "real" results. With MSN, for example, they list Featured Site (you pay MSN), followed by Overture (you pay per click), followed by the Looksmart Directory Listings (you used to just pay for submission; for the past year, Looksmart has charged $0.15/click for those results).
After the "paid" listings come the Inktomi listings. Those crawler based listings include PFI (pay for inclusion, you pay for daily spidering, but no "boost" in rankings) and the Partner Connect program, where you get free traffic for a week, then negotiate a PPC price for traffic.
If you search MSN, you would get the impression that Featured is editorial (which it kind of is), Sponsored is paid, and Directory/Page results are "real" search results, when the Directory/Page results are often actually paid results.
The paid traffic from Inktomi involves an XML feed of terms and results, and your "fake" entries are treated as real entries with a boost for being a paid player.
In addition, for various adult terms, MSN tells you to use a third party adult "search engine," which ISN'T a search engine. It is a big player in the adult space that pays MSN for all the traffic and lists their sites, but does it in a "search engine" look and feel.
That is manipulating the rankings. If Google were to, say, charge for the XML entry (either PFI or PPC) into Froogle.com, and then show the Froogle results interspersed with Google results, that would be manipulating the rankings for money.
That is the manipulation angle.
Now, are paid results any better or worse than objective results if those are manipulated by SEO professionals (so you pay an SEO to get "free traffic" instead of paying the SE for the traffic)?
It's certainly more manipulated.
With a free engine, you could tweak the rankings and sell ads on the side, but not have manipulated editorial.
It's about maintaining a wall between advertisements and editorial, and the only engine that appears to have that wall is Google, and even Google pushes the boundaries.
Alex
Simple: every other domain name was taken. And czxvb.com just didn't quite have the same ring to it.
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
M$FT wants to get into search tech. Good enough reason right there to advocate an open-source alternative.
Patents on particular parts of software do not apply in all countries around the world. I.e., there are many places where a developer could be based, develop, and say "stuff the patents"; basically they can be bypassed.
a close relative of "felch"? : To eat the semen you just ejaculated out of your partner's anus.
Without wanting to beat the "security through obscurity" dead horse any further, I don't see what's to stop the Nutch developers from doing exactly what Google does: changing the algorithm every month or so, so that pages which are "nutchbombed" in one release don't rank as high in the next one. Admittedly, nutchbombers might have a slight time advantage over googlebombers (since they'd have access to the source), but the principle that the algorithm evolves based on attempts to exploit it is independent of whether the algorithm is open or closed-source.
This would be great for large company intranets. The company I work for (60,000+ employees) has probably more than 1000 web servers spread out all over the world yet we have no way to search the content of all of them. Something like this would be great.
Prevent email address forgery. Publish SPF records for y
I was looking over the site and a number of things concerned me.
Firstly, the choice of Java: personally I have no gripe about this, and the choice to use language-independent formats is a good idea. My main concern is for the larger scaling and distribution over multiple machines.
At present I make the educated guess that a project on this scale, in Java, would still be best run on a `hardware base as uniform as possible', like UltraSparc 450's with a fibre backplane.
My second concern is that there is so much choice of indexing and searching techniques that there are sure to be some problems due to patent restrictions.
Just browsing the US patent office gave me a couple of possible patent nasties:
6,463,428 or 6,278,992. (And about 10 others I glanced at...)
Lastly, the DB: in the short time I've been looking at the code, it seems to me that a choice was made to implement a DB built specifically for the problem. Although this could be a good thing, it is usually better to reuse existing products. I found Sleepycat (DB4) to match the requirements. And if the choice is final, read this [1].
I hope these comments are useful to somebody at least.
[1] http://www.xlnt-software.com/xml_dl.html
'I am become Shiva, destroyer of worlds'
I have a few comments on this development:
An open search engine application is a nice idea, but unfortunately it's one of those applications which are essentially useless without an enormous ASP architecture behind it. An earlier poster indicated that it might be useful for searching and indexing intranets and the like, analogously to the Google Search Appliance. This is indeed a valid potential application, but then, HT://Dig exists already. Is this dramatically better?
Hmmm, well "businesses" use one of the recent variants of MS operating systems (NT,2000,XP), the latter two of which ship with an "indexing server" specifically for indexing intranet sites.
Yet said businesses are still willing to pay money for a google indexing server. Why?
My point is that maybe a "free" (as in beer, to save money) indexing server isn't what businesses require.
This is exactly like the problem the mice had one day. They couldn't come out of their mouse hole because there was a dangerous cat prowling around. One day, as food was getting scarce and everyone was afraid to leave the hole, the mice called a meeting to discuss the problem. One excited young mouse came up with the most wonderful idea: Let's put a bell around the cat's neck, so that when the cat is nearby, the mice would have advance warning and could escape! All the mice got excited at this proposal, until a very old, very wise mouse came over and asked, "And who will tie the bell around the cat's neck?"
What I'm trying to say is: If the search engine is free software and companies don't pay to increase their ranking... who will pay for the bandwidth to host the engine? I can tell you this much:
Proposed solution? Make it a distributed search engine, like SETI@home, or the DNS.
This is much easier said than done because:
- RAID-like distributed storage technology would have to be developed, so that the indexing database could be distributed among all computers worldwide that donate bandwidth and storage. This would have to guarantee statistically that all the data will be available at any point in time even if people turn off their computers for extended periods of time. However, this technology could make reliable clustered storage a reality, and the resulting free software implementation could be licensed for corporate use for an exorbitant price, which would go to the EFF, FSF and other organizations that develop free software and/or support the development thereof.
- An efficient P2P-like protocol, along with a network topology of some sort (like the DNS system has), would have to be developed to support the searching. It would have to be damn fast and, like before, very resilient to computers being shut off, chunks of data becoming lost at any moment, etc. Furthermore, changes would need to propagate at blazing speeds so that new items on the Internet could be found shortly after appearing.
- Bandwidth and disk quota would need to be managed at each participating host, so that limits set by the user are not exceeded.
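To put a rough number on the "guarantee statistically" part of the first point: here is a back-of-envelope sketch, not a design. It assumes every volunteer host is online independently with the same probability, and that each chunk of the index is copied whole to several hosts. The function names and the 50% uptime figure are made up for illustration.

```python
def chunk_availability(p_online: float, replicas: int) -> float:
    """Chance that at least one copy of a chunk is reachable.

    Assumes each host is online independently with probability p_online
    (a simplification; real outages are correlated).
    """
    if not 0.0 <= p_online <= 1.0:
        raise ValueError("p_online must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - p_online) ** replicas


def replicas_needed(p_online: float, target: float) -> int:
    """Smallest number of copies that reaches the target availability."""
    if not 0.0 < p_online < 1.0:
        raise ValueError("p_online must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    k = 1
    while chunk_availability(p_online, k) < target:
        k += 1
    return k


if __name__ == "__main__":
    # Hypothetical volunteers whose machines are on half the time.
    for target in (0.99, 0.999, 0.9999):
        print(target, replicas_needed(0.5, target))
```

Under these assumptions the number of copies grows only with the logarithm of the allowed failure rate, so plain replication is affordable even with flaky hosts. A real system would more likely use erasure coding, which this sketch does not cover.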
Governments, companies, universities and individuals would likely support an effort like this by donating some bandwidth and storage, rather than money. In the spirit of worldwide computing on the Internet, I hope this makes some amount of sense.
This is a good idea; perhaps the good people of BitTorrent and other p2p networks could lend some insight into how to get this working.
As for ranking manipulation, open source takes care of that... the one thing more important to a business than their own rank is keeping the other guys down.
I just came across that new Google feature, the calculator.
I only wish that there was better documentation; 'radius of earth' isn't exactly something you stumble upon by accident.
Maybe with a touch of distributed net?
Could be the next killer app...
google is so 5 seconds ago... we'll all have our own nutch engines and have all the data we care about on our local storage every 2 microseconds or so.
nutch will be just one of the oldest/basic utils on Linux2013.
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
Something a lot of folks are missing here is that search engines are used in applications, intranets, individual sites etc as well as Google type whole-internet portals.
When you click on 'Find Files' in Windows, or look for a song in your chosen P2P app, or look something up on your O'Reilly CD Bookshelf, or search /. for an old article, that's using a search engine just as much as Google is.
If you're interested in something for your own project, Lucene is a great application-centric search engine. It's just a bunch of Java classes that you call from your application. Or you can use a website-centric engine such as htdig if you're dealing with an intranet or website rather than an app. Both are open source (htdig is GPLed; Lucene is under the Apache license).
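For anyone wondering what "a search engine you call from your application" boils down to, here is a toy inverted index. To be clear, this is not Lucene's API (Lucene is Java and far more capable); the class and method names are invented for illustration, and there is no ranking, stemming or persistence.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        """Index a document under the given id."""
        for token in self._tokens(text):
            self._postings[token].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents containing ALL query terms."""
        terms = self._tokens(query)
        if not terms:
            return []
        hits = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            hits &= self._postings.get(term, set())
        return sorted(hits)


if __name__ == "__main__":
    index = TinyIndex()
    index.add(1, "Nutch: an open source search engine")
    index.add(2, "Lucene is a search library written in Java")
    index.add(3, "To be or not to be")
    print(index.search("search engine"))
    print(index.search("java"))
```

The application owns the documents and decides what to index and when; the "engine" is just a data structure plus a query method, which is the sense in which a library like Lucene is application-centric.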
Vino, gyno, and techno -Bruce Sterling
I thought that's what dmoz was supposed to be??
Will Nutch ever be as good as other search engines?
...
...
...
How can I stop Nutch from crawling my site?
When will Nutch search images, pdf files, etc.?
Search for "linux" and you'll get your first 3 pages filled with pages about Linux. What's your point?
I don't want a search engine that uses 80% of my CPU/RAM.
It's the storage library that keeps an index of pages. You need a display front-end and a webcrawler to go with it (there's some code around). It's GPL and it has some clever features.
The success of an open source engine would be of crucial importance, especially after the latest developments in the market: Yahoo! bought Overture and Google started working in the field of paid searches. The risk of the current trend is an indistinguishable mixture of paid and free search that no longer works as a reliable catalogue or classification for a growing, messy Web. The point is strengthened by the decline of Web directories (Yahoo! losing importance and the small success of the Open Directory Project, exploited but not popular among users). If a new open source search engine is to flourish, it is important to defend its integrity from birth. This can perhaps be achieved with copyleft protection.
Then it would probably be good for finding babes in red swimsuits.
http://www.namazu.org/
Namazu is an open source full-text search engine. It has various document filters (HTML, Mail/News, PDF, MS Word, Excel, Powerpoint, man page, TeX, DVI, PostScript, etc.), and the mharc web-based mail archiving system adopts Namazu as its search engine.
What about implementing nutch as a distributed effort, with spiders patiently covering a thousandth of the web per install, and reporting to a consortium of active servers? You could cover the web in hours, not days, the way SETI covers its search space. Also, if each node agrees to mirror its logically closest neighbor's small dataset, the odds are a search launched from Tanzania could see results posted by the colloquium from anywhere on the 'net -- Thousand Oaks, Cedar Rapids, Mexico City, Oslo -- in seconds, regardless of who is on the web or when.
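One way the "thousandth of the web per install" split could work is to hash each URL's hostname into one of 1000 buckets and have each install crawl only its own bucket. This is only a sketch under my own assumptions: the shard count, function names and the choice of SHA-1 are illustrative, and nothing here is taken from Nutch itself.

```python
import hashlib
from urllib.parse import urlsplit

NUM_SHARDS = 1000  # "a thousandth of the web per install"


def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard number in [0, num_shards).

    Hashing the hostname (not the full URL) keeps every page of a site
    on one node, so that node alone can rate-limit its requests.
    A cryptographic hash is used because Python's built-in hash() of
    strings differs between runs, and all nodes must agree.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def urls_for_node(urls, node_id: int, num_shards: int = NUM_SHARDS):
    """Filter a list of discovered URLs down to the ones this node owns."""
    return [u for u in urls if shard_for_url(u, num_shards) == node_id]


if __name__ == "__main__":
    found = [
        "http://www.namazu.org/",
        "http://www.namazu.org/doc/",
        "http://example.org/index.html",
    ]
    for url in found:
        print(shard_for_url(url), url)
```

A fixed modulus means changing the number of installs reshuffles nearly every host; consistent hashing would reduce that churn, at the cost of more bookkeeping. The neighbor-mirroring idea would sit on top of this, e.g. each node also holding the data for the next shard number.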
``Tension, apprehension & dissension have begun!'' - Duffy Wyg&, in Alfred Bester's _The Demolished Man_
*cough*
bleah. Something to ignore.
NUTCH! great idea- keep google on its toes!!!