MapReduce For the Masses With Common Crawl Data
New submitter happyscientist writes "This is a nice 'Hello World' for using Hadoop MapReduce on Common Crawl data. I was interested when Common Crawl announced themselves a few weeks ago, but I was hesitant to dive in. This is a good video/example that makes it clear how easy it is to start playing with the crawl data."
This will be my first (and hopefully not last) headfirst dive into MapReduce.
So you're hoping to find child porn, ghost fetish stuff, or both?
"Ubuntu" -- an African word, meaning "Slackware is too hard for me". - stolen from Dan C alt.os.linux.slackware
The problem I've got is that searches with Google and the like turn up a lot of junk that I'm not looking for, with the file search engines like FilesTube simply ignoring the numeric years specified in my search queries.
What I want to do is find PDF files of specific issues (Month and Year combinations) of certain magazine titles. But when I try these searches, the results contain a lot of years that I had not specified, with the year I did specify not falling anywhere in the resulting pages.
There are all kinds of ways I could use a regular expression to turn up the download sites for every magazine I want to find, but to the best of my knowledge none of the currently available search engines can take regular expressions. For example, a copy of an old magazine that is available for download will typically be labeled with the file size, so I could include "[0-9]* *[Mm][Bb]" in my query. That would distinguish magazines available for download from those that are merely discussed online.
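For anyone who wants to try the file-size idea against text they have already fetched, here is a minimal sketch in Python. It is not what any search engine does. One caveat on the pattern as quoted: "[0-9]*" also matches zero digits, so it would hit the "mb" in a word like "number". The sketch assumes "+" was meant, and it adds an optional decimal part, which is my assumption. The function name and the sample snippets are made up for illustration.

```python
import re

# Pattern from the post, tightened: "+" requires at least one digit.
# The optional decimal part (e.g. "12.5 MB") is an added assumption.
SIZE_LABEL = re.compile(r"[0-9]+(?:\.[0-9]+)? *[Mm][Bb]")


def mentions_file_size(text: str) -> bool:
    """Return True if the text contains a size label such as '45 MB' or '12.5mb'."""
    return SIZE_LABEL.search(text) is not None


# Hypothetical page snippets, invented for this example.
snippets = [
    "Byte Magazine 1984-06.pdf 45 MB",
    "I loved the June 1984 issue of Byte",
]

# Keep only the snippets that look like download listings.
matches = [s for s in snippets if mentions_file_size(s)]
```

A filter like this only narrows down pages you already have; it does not replace the search step itself.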
Just last night I sat up all night long looking for old magazines, only to turn up two from the era of my interest. There is no end to the pr0n that is available online, but I don't find most of it at all interesting anymore. What I do find interesting is my goal of completely recovering the magazine collection that my ex demanded that I pitch before she would agree to come visit me for the first time.
Request your free CD of my piano music.
I think any total newbie that tried to process all the crawl data would soon find that his first attempt would not terminate until after The Heat Death of the Universe.
Surely there must be some doc on how to make such jobs run faster and use less memory as well as less storage?
more than 50% of any given sentence sounds like gibberish. And yet you know someone somewhere is as excited as you were when you got your first floppy drive...
Wget is chugging away even as I speak. I'm gonna have to cough up for more storage.
Here is an SEO tip for y'all. I didn't discover it, but I stumbled across it just now:
Placing the terms "index of", "parent directory", "name", "last modified", "size" and "description" on your web pages is a real good way to attract visitors.
I wasn't able to turn up any actual Apache directory listings for Penthouse Pet of the Year Corinne Alphen. The results were all typical pr0n sites that not only weren't presenting directory listings, but also had no photos of her, scantily clad or otherwise.
Directory listings for well-known models, though, turned right up.
Hmm, similar article, so I'll ask a question of a personal nature.
I've recently created a crawler to collect certain information from a website, that would help me gather data sets for a small machine learning project.
While I've followed robots.txt and nofollow links, the site's TOU was against it. After confirming with the admin, I was told that gathering information is not allowed, as the site owns it (as written in the TOU).
The data, however, is publicly available, so you wouldn't actually have to agree to a TOU to collect it. As it's data I wanted, I still concluded I should at least get a small sample (less than 1% of the total data, around 200MB) to see whether anything can even be done with it.
What are your thoughts, /.? Should I have abandoned the attempt, did I do the right thing, or should I even disregard their plea and simply get as much as I please (over a long period of time, so as not to hammer its bandwidth)?
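For reference, the "followed robots.txt" and "don't hammer the bandwidth" parts can be sketched with Python's standard urllib.robotparser. This is only an illustration of the technical courtesy side and says nothing about whether a TOU permits the collection. The user agent name, the sample robots.txt and the default delay are all hypothetical, and the rules are parsed from a string so nothing is actually fetched.

```python
import time
import urllib.robotparser

USER_AGENT = "example-research-bot"  # hypothetical crawler name


def build_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, default_delay=5.0, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between them.

    Uses the site's Crawl-delay if one is declared, otherwise
    default_delay (an arbitrary, assumed value).
    """
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    first = True
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt, skip it
        if not first:
            sleep(delay)  # wait before every request after the first
        first = False
        yield url


# Made-up robots.txt for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""
```

The caller would fetch each yielded URL itself. Passing a different sleep function makes the pacing easy to check without actually waiting.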
I'm pretty sure the word "Traveled" was not in the original song.
Not sure if trolling (if not, well played), but it is.
Citation
add -.htm? and -.php to your searches
what this is.
This sig is not paradoxical or ironic.
Actually, entry-level EC2 is free for 1 year, and has been since Nov. 2010.
You don't need to pay for accessing it, but you still need to pay for the processing power, storage and RAM in your EC2 instance.
See here:
http://www.infoworld.com/d/cloud-computing/amazon-web-services-offers-ec2-access-no-charge-531
-- Terry