New Search Engine Takes "Dyve" Into the Dark Web
CWmike writes "DeepDyve has launched its free search engine that can be used to access databases, scholarly journals, unstructured information and other data sources in the so-called 'Deep Web' or 'Dark Web,' where traditional search technologies don't work. The company partnered with owners of private technical publications, databases, scholarly publications and unstructured data to gain access to content overlooked by other engines. Google said earlier this month that it was adding the ability to search PDF documents. In April, Google said it was investigating how to index HTML forms such as drop-down boxes and select menus, another part of the Dark Web."
Rubber tubing, gas, saw, gloves, cuffs, razor wire, hatchet, Gladys, and my mitts.
this will help me get more porn, how?
The company partnered with owners of private technical publications, databases, scholarly publications and unstructured data to gain access to content overlooked by other engines.
I know why the other engines don't index these documents: they're behind pay walls. As the second link points out, Google already indexes (some) PDFs, but that doesn't help if the site doesn't want me to see the PDF. There are a lot of topics, such as disability rehabilitation and linguistics, that I can't search for without Google returning a bunch of results from sites that require a subscription but to which my county library doesn't subscribe. (A tip-off for these results is that "Cached" doesn't show up.)
This will certainly defeat the practice of obfuscating links with e-mail addresses in them, by using a picture link or "click here."
If the search engine can read source code, it can certainly parse out an email address.
You never expect irony, do you?
Want to be a professional wrestler? Visit www.iyfwrestling.com
@iyfwrestling
I have pondered how or if this information could be made available. Looking good for open access!
I don't know about you guys but I prefer not to have to sign up or use the "pro" version for my web searching needs.
In fact why do I have to sign up to web search anything?
Besides this thing looks like it just gets in your way.
Thanks, but it's not a google killer.
They're also looking into indexing images based on whether they contain boobies.
You mean like these boobies? What about these great tits? And would you tap that ass?
Just what I needed: 40 million NEW search results to sift through. I already have to deal with only the first 5 pages being useful, followed by 60 pages of, let's say, 'pokemon glitch' results that are really someone's blog with 500 words slapped on the bottom (nothing quite as useful as finding out that a website that came up in the search says, at the very bottom of the blog, 'boobs anal pokemon glitch asians etc').
Good news is I can finally PAY to be annoyed.
It's apparently not working right now. But give it all your personal information now, and they will get back to you.
basically it's like cavity search for the internet.
They want money for their service instead of following the mega successful Google, advert supported model? Good, they will be ignored just like the content they offer. This stuff needs to be liberated instead.
DMCA, Hollings, Palladium. What might have sounded like paranoia is now common sense.
Login? to search a "dark net".
You are fucking kidding?
I was right about Tesla crashing. I'll make another prediction.
DeepDyve out of business in 1 year.
Cheers,
Kilgore Trout
P.S. : get the Cyrillic fonts enabled. Russia is invading the U.S.S.A. Finally !!!
I shall certainly try it out.
BUT, if it is anything like how badly Cuil went on its first week, it will fail.
Instantly, just by seeing the front page, I don't have high hopes.
You have to sign up?
Yes, I will try it when I can be bothered signing up, WHICH will probably be never, as I will probably forget about it until an article is posted here in a month saying how awful it is doing.
The summary is a bit misleading. Google has been indexing the textual parts of PDFs for a long time. According to the article they have now started indexing scans inside of PDF files, which requires OCR.
Google has been doing that for catalogs for a while now, but OCRing large numbers of scans obviously requires a lot more resources.
I know you! For years Anonymous Coward has been making all sorts of predictions. You can't improve your credibility from one specific event to make me believe you! And signing the e-mail with the name of a fictional character from Vonnegut doesn't help either.
Support the 30 Hour Work Week!!!
You lost me at "Sign Up Now"
See http://slashdot.org/comments.pl?sid=1024127&cid=25708401
Either this DeepDyve thing is the best search engine ever or they are smoking crack. They have a pro version for $45 a month. http://www.deepdyve.com/why_deepdyve/deepdyve_pro that's got to be some pretty good venn diagrams to be worth $45 a month...
http://www.popularculturegaming.com -- my blog about the culture of videogame players
That was my first impression too
the last few new search engines that have been advertised here at /. all required a login/account just to search.
how F'ed up is that?
"In April, Google said it was investigating how to index HTML forms such as drop-down boxes and select menus, another part of the Dark Web."
-Great, now I can have 10,000 times more irrelevant search results to dig through!
Knowing Google's lust for data collection, the Soviet Union is still alive and well inside the psyche of Sergey Brin....
There is a difference! A "Dark" web (or more properly Dark Net) is designed to be private. The "Deep Web" simply accesses more information that has always been public, just hard to find.
There is a VERY big difference!
I'll wait for Google to assimilate DeepDyve before I'll check it out.
"There are a lot of topics, such as disability rehabilitation and linguistics, that I can't search for without Google returning a bunch of results from sites that require a subscription"
To me that's a breach of Google's own guidelines.
Here are Google's guidelines:
'Make pages primarily for users, not for search engines. Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."'
In 2006 they blacklisted BMW for breaching them:
http://news.bbc.co.uk/1/hi/technology/4685750.stm
I've actually reported some of those "subscriber only" sites to Google, but I'm not surprised that nothing much happened, since I suspect Google gets $$$ from them, and the unwritten guideline is don't deceive users unless you pay us $$$ :).
As Google's user, I very rarely want to get search results for content that I can't access. If they want that feature, at least I should be allowed to opt in/out much like their "safesearch".
So much for Google's don't be evil eh?
You should try search.yahoo.com and search.live.com once in a while to see if they are better. So far they are about as good as Google. If Google becomes worse I have no qualms about switching.
Actually you can already Convert Scanned PDF Documents to Text with Google OCR, though it is not immediate, unless you have control of the indexing frequency of your site. http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/
I'll be sure to add DeepDyve to my list of blocked search engine spiders.
The dark web is dark for a reason. Some of us on the dark web don't want our content indexed at all. Others don't want our content indexed if the search engine companies can't be bothered to adhere to the HTTP specifications and recommendations.
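For anyone wondering what opting out of a particular spider looks like, here is a minimal sketch using robots.txt rules checked with Python's standard `urllib.robotparser`. The user-agent token `DeepDyveBot` and the `example.com` URLs are invented for illustration; the real crawler's token is not given anywhere in this thread. Keep in mind that robots.txt is only advisory: it keeps out spiders that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "DeepDyveBot" is an assumed token, not a
# documented one; substitute whatever the crawler actually announces.
ROBOTS_TXT = """\
User-agent: DeepDyveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("DeepDyveBot", "SomeOtherBot"):
        for path in ("/index.html", "/private/a.html"):
            url = "http://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

A spider that ignores the file has to be blocked at the server instead, for example by matching its User-Agent or source addresses.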
Search engines like Google eat up gobs of bandwidth every day by indexing my websites, usually multiple times per day, even when nothing has changed on my websites.
The queer thing is, any updates I make to my websites never make it into the search engine results. Instead, all you see are old listings from the first time Google and the other search engines hit my websites, thus cheating the search engine users out of the time and bandwidth required to directly access my websites to look for updates.
For this reason, and others, I have blocked Google and other search engines from indexing my websites. I'm also checking the Referer header and blocking any that come from those brain-dead search engines. You might want to consider doing the same, at least until the search engine companies do the following:
1) Stop doing daily, or multiple times per day, full indexing of the exact same content. Currently, all of the search engines are ignoring the If-Modified-Since HTTP header. They are also always using GET instead of the less bandwidth-intensive HEAD method for the follow-on indexing.
The correct way to do follow-on indexing after the initial index would be to use HEAD, followed by GET if and only if the Last-Modified timestamp has changed.
2) Start reflecting updates to the indexed websites in the search results. If the search engines have already indexed your website multiple times, there is no legitimate reason for them not updating the search results.
If the search engine companies refuse to do 1) and 2), then:
3) Start paying website operators for the bandwidth their brain-dead search engines are wasting.
Google and others are making billions of dollars each year in advertisements that show up next to the search results that include your content. Much of it is completely unrelated to your content. If the search engine companies are going to waste my bandwidth for no good reason at all, I want compensation.
This is one of the real reasons why Comcast and AT&T want to charge per-gigabyte metered rates. The smokescreen lie of it being because of P2P is transparent to those of us who have a clue.
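The follow-on indexing scheme from point 1) can be sketched roughly as below. This is an illustration of the idea, not how any real crawler is known to work: the function names are made up, and it assumes the server sends a Last-Modified response header and that the crawler saved that value from its previous visit. The second helper shows the one-request alternative, a conditional GET carrying If-Modified-Since, which a cooperating server answers with 304 Not Modified and no body.

```python
from email.utils import parsedate_to_datetime
from typing import Dict, Mapping, Optional


def needs_refetch(head_headers: Mapping[str, str],
                  last_indexed: Optional[str]) -> bool:
    """Decide from HEAD response headers whether a full GET is warranted.

    last_indexed is the Last-Modified value stored from the previous
    crawl (an HTTP-date string), or None if the page is new.
    Note: the lookup is case-sensitive, so pass normalized header names.
    """
    current = head_headers.get("Last-Modified")
    if last_indexed is None or current is None:
        # Nothing to compare against: fall back to a full fetch.
        return True
    return parsedate_to_datetime(current) > parsedate_to_datetime(last_indexed)


def conditional_headers(last_indexed: Optional[str]) -> Dict[str, str]:
    """Request headers for a conditional GET (single-request variant)."""
    return {"If-Modified-Since": last_indexed} if last_indexed else {}


if __name__ == "__main__":
    stored = "Mon, 10 Nov 2008 12:00:00 GMT"
    unchanged = {"Last-Modified": "Mon, 10 Nov 2008 12:00:00 GMT"}
    updated = {"Last-Modified": "Tue, 11 Nov 2008 08:30:00 GMT"}
    print(needs_refetch(unchanged, stored))  # page not modified since last crawl
    print(needs_refetch(updated, stored))    # page modified since last crawl
    print(conditional_headers(stored))
```

Either way the body is only transferred when the page has actually changed, which is the bandwidth saving being asked for.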