Do-It-Yourself Internet Archiving?

Posted by Cliff on Wednesday December 31, 2003 @05:20AM from the combatting-the-slashdot-effect dept.

A moron asks: "Web pages change and disappear all the time. For legal and historical purposes, I need to have accessible archives of the websites I maintain. I'm basically looking for a do-it-yourself version of the Internet Archive's Wayback Machine which provides a simple versioning system and accessibility through a web interface. Is there already software that does this? If not, what ideas does Slashdot have to make such a system possible? How should it work? What existing tools can be used together to make a workable system?"

"There are all sorts of tools out there that will archive web pages, and each has other necessary features such as making links relative. I don't always have filesystem access to pages, so tools that rely on such access won't work. There are some obvious tools that do part of the job, such as:

curl
wget
httrack

But grabbing pages is only part of my needs and, I suspect, many other people's. The other pieces include intelligently archiving the pages and making them accessible. If a page or a page element hasn't changed, there is no need to store multiple copies. The archives need to be easy for end users to navigate, search, and link."

29 comments

In five lines or less... by Mr.+Darl+McBride · 2003-12-31 05:27 · Score: 5, Informative

For a small site with complete backups, make a script:
ARCDIR=`date +%y%m%d`
cd /var/www/archives
mkdir "$ARCDIR"
cd "$ARCDIR"
wget -r http://mysite.com
Add error-checking and season to taste.
If you want to be more efficient like the poster wanted, you could easily have it always fetch to the same directory and just use cvs to check in. This eliminates duplicate storage. There are many free web-based CVS browsers out there with date searching and similar features. Might not be quite as nice as the wayback machine, but it definitely does the job for free.
A lot of folks are doing a simple version of the above to maintain SCO mirrors so there's to be no history erasing before the trial. God bless you all -- it will make the case that much stronger for us.
1. Re:In five lines or less... by Mr.+Darl+McBride · 2003-12-31 05:31 · Score: 5, Informative
  
  A quick caveat:
  If archiving SCO or other such pr0n sites, or if you have no-robots policies set on your own site that you're archiving, you'll need to tell wget to be a little rude. He needs to go where robots aren't meant to go. I figure if you were going to visit every page yourself anyway, it's not so impolite. And besides, robots.txt is for other people. You know... the ones we make ride the back of the internet.
  To accomplish this: echo "robots = off" >>~/.wgetrc
2. Re:In five lines or less... by Anonymous Coward · 2003-12-31 06:43 · Score: 2, Informative
  
  It's possible to execute wgetrc commands on the command line:
  wget -re 'robots = off' http://mysite.com
3. Re:In five lines or less... by skookum · 2003-12-31 08:36 · Score: 1
  
  For this to work well, you'd need --mirror, not -r. --mirror includes -r (recursion) but also turns on time-stamping and sets the recursion level to infinite, from the default of 5. Time-stamping is crucial, as it's what lets wget know what has changed and what doesn't need to be retrieved, assuming you're doing the cvs checkin thing.
4. Re:In five lines or less... by OrangeSpyderMan · 2004-01-01 03:31 · Score: 1
  
  Wow - +2 Informative for copy/pasting that - bet you're pissed you didn't log in and paste the whole man page now :-D
  
  --
  Try NetBSD... safe,straightforward,useful.
Use - by noselasd · 2003-12-31 05:39 · Score: 2, Informative

this
CPAN is your friend by B1LL_GAT3Z · 2003-12-31 05:42 · Score: 2, Informative

I highly recommend that you check out w3mir, which I found after a quick search on CPAN (the Comprehensive Perl Archive Network). I particularly like w3mir due to its ability to compare against existing copies of your local mirror, which is more of what you're looking for. Using this in conjunction with a simple shell script (to tar and mv files as desired, hooked to a cron job) will create your very own automated Internet Archive.

--
-- Kleptotherapy: Helping those who help themselves.
Re:wget by _iris · 2003-12-31 05:48 · Score: 1

Whoops, didn't read the extended article.

Put each site in a CVS repository. Check it out, wget the live site, check the copy of the live site back into the CVS repository.
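
The check-out, fetch, check-in cycle could be sketched roughly as below. This is only a sketch under assumptions: the function name archive_site is made up for illustration, the working directory is assumed to already be a checked-out CVS working copy, and wget's --mirror is used because it turns on time-stamping so unchanged files are not fetched again.

```shell
#!/bin/sh
# archive_site WORKDIR URL   (hypothetical helper, not from the thread)
# Refresh a CVS working copy from the live site and commit what changed.
# Runs in a subshell so the caller's current directory is left alone.
archive_site() (
    cd "$1" || exit 1
    # --mirror implies recursion plus time-stamping; --no-host-directories
    # keeps wget from creating a per-host subdirectory in the working copy.
    wget --mirror --no-host-directories "$2"
    # Commit modified files under a dated log message.
    cvs -q commit -m "Snapshot of $2 on $(date +%Y-%m-%d)"
)
```

It might be run from cron as: archive_site /var/www/archives/mysite http://mysite.com/ (both placeholders). Files that are new on the site still need a cvs add before they get committed; cvs -nq update lists them with a leading "?". An older state can be pulled back with cvs update -D and a date.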
wget is the right path by Yonder+Way · 2003-12-31 06:00 · Score: 0, Redundant

I think wget is the way to go, perhaps with the "-m -k" flags, and then check the whole directory tree into CVS using `date +%Y%m%d` as your version number.
1. Re:wget is the right path by Anonymous Coward · 2003-12-31 06:43 · Score: 2, Informative
  
  CVS with the -D option will do your date-oriented functions for you without needing special version numbering, I think.
  
  But personally I don't think wget and CVS are very helpful in this case. I think it would be better to use something like Perl or Ruby to write a custom spider, and then use cp -lR to make iterative snapshot copies of your working archive tree (you use cp -l because then your copies don't take up extra space). This way you can write hooks to test whether content has changed before writing the file. If not, you can leave your existing file as is and the hard links in all of the snapshots can efficiently point to just one real file.
  
  Also, then you'll be able to keep a list of files that were either changed (and therefore written to disk) or unchanged (and therefore left untouched). You can use that list to make sure that your current working directory doesn't have any orphaned files in it (files that you can no longer get to from any of the starting URLs and the spidered links).
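
A minimal sketch of the write-only-if-changed hook and the hard-link snapshot, in shell rather than Perl or Ruby. The function names store_if_changed and snapshot are made up for illustration, cp -l is assumed to be available (it is in GNU cp), and the freshly fetched file is assumed to sit on the same filesystem as the working tree so that mv is a plain rename.

```shell
#!/bin/sh
# store_if_changed NEWFILE DEST   (hypothetical helper)
# Install a freshly fetched NEWFILE as DEST only when the content differs.
# Prints "changed" or "unchanged" so the caller can keep a list.
store_if_changed() {
    if [ -f "$2" ] && cmp -s "$1" "$2"; then
        rm -f "$1"            # same bytes: leave the existing file alone
        echo unchanged
    else
        mkdir -p "$(dirname "$2")"
        # Rename instead of overwriting in place: DEST gets a new inode,
        # so hard links held by older snapshots keep the old content.
        mv -f "$1" "$2"
        echo changed
    fi
}

# snapshot WORKDIR SNAPDIR   (hypothetical helper)
# Hard-link copy of the working tree; unchanged files cost no extra space.
snapshot() {
    cp -lR "$1" "$2"
}
```

The rename matters: writing new content into the existing file would change every snapshot that links to it. The changed/unchanged output can be collected per run and compared against a listing of the working tree to spot orphaned files.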
  
  Interesting that I was just thinking about this problem yesterday, and someone asked it today on Slashdot. Now that we've had this discussion I'm somewhat motivated to attempt to write code to do what I've just described.
2. Re:wget is the right path by Anonymous Coward · 2003-12-31 06:48 · Score: 0
  
  I suppose I should note that someone else here linked to w3mir on CPAN, which already does all of this, except for the daily snapshotting.
this brings up an interesting point.. by Lord+Bitman · 2003-12-31 06:01 · Score: 1

I have wanted for a while now to be able to run a command "every time a file is created or updated" in a tree. I know how to do this on a per-directory basis, but I would love to run a command on the changed file itself whenever an update occurs. As far as I know, there is no way to cause a "trigger" of this sort. Due to this limitation, I've instead been running the same command on every file on the server each morning. Yes, it's a cron-controlled script and I don't need to touch it, but having it scan and change attributes on every file every day seems sub-optimal.
Any solution which works for this would probably work for this guy too (file news.php has been updated, cp to news-md5-date.php)
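
Short of a real trigger, a cron job can at least limit itself to files modified since its last run. The sketch below is one way that might look, not a tested tool: archive_changed and the stamp-file convention are invented here, md5sum is assumed to be installed, and archived copies are flattened into a single directory following the news-md5-date.php naming.

```shell
#!/bin/sh
# archive_changed TREE ARCHIVE STAMP   (hypothetical helper)
# Copy every file under TREE modified since STAMP was last touched
# into ARCHIVE as name-md5-date.ext.
archive_changed() {
    tree=$1 archive=$2 stamp=$3
    mkdir -p "$archive"
    # First run: pretend the last run was long ago so everything matches.
    [ -e "$stamp" ] || touch -t 197001020000 "$stamp"
    # Take the new stamp before scanning so edits made during the scan
    # are picked up by the next run.
    touch "$stamp.new"
    find "$tree" -type f -newer "$stamp" | while IFS= read -r f; do
        sum=$(md5sum "$f" | cut -d' ' -f1)
        base=$(basename "$f")
        case $base in
            *.*) name=${base%.*}; ext=.${base##*.} ;;
            *)   name=$base; ext= ;;
        esac
        cp -p "$f" "$archive/$name-$sum-$(date +%Y%m%d)$ext"
    done
    mv "$stamp.new" "$stamp"
}
```

This still scans directory entries on every run, but it only hashes and copies the files whose modification time moved. Files with the same name in different directories would collide in the flat archive, so a real version would need to preserve paths.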

--
-- 'The' Lord and Master Bitman On High, Master Of All
CVS? by Phleg · 2003-12-31 06:25 · Score: 2, Informative

If you have console access to the machines (or can at least make a script), CVS could be a viable solution. Just maintain a central CVS server and have the websites do CVS commits when timestamps on files change. On the other hand, this might not really work if you have dynamic content.

--
No comment.
1. Re:CVS? by osewa77 · 2004-01-01 03:00 · Score: 1
  
  Or he could use wget to download the latest copy of the page and then use CVS (or another version control system) to record the latest changes.
  
  There's no real need for console access, unless it's a dynamic site, in which case you need to store the source for your scripts as well as maintain versions of the database!
  
  At this point it's nothing more than keeping multi-versioned backups of your website and database files. Check out rdiff-backup
  
  Best of Luck.
Why make something harder than it should be by TykeClone · 2003-12-31 06:45 · Score: 1

If you're solely maintaining static sites, just keep copies of the site as published.

--
A fine is a tax you pay for doing wrong and a tax is a fine you pay for doing all right.
sourceforge by nocomment · 2003-12-31 07:18 · Score: 2, Interesting

SourceForge is open source, so why not go d/l that? You can use CVS as an easy way to switch around and do upgrades. You can develop a site, then run an upgrade via cvs, and if something unexpected breaks, downgrade via cvs. Once you get the infrastructure in place, things like that would be a breeze.

--
/* oops I accidentally made a comment, sorry */
/* http://allyourbasearebelongto.us */
1. Re:sourceforge by tf23 · 2003-12-31 11:18 · Score: 3, Informative
  
  Why not recommend gforge rather than sf? Sourceforge's code has been untouched for a few years now, right? Meanwhile gforge is open source and currently being developed.
  
  --
  http://slashdot.org/~tf23/journal
2. Re:sourceforge by Christopher+Cashell · 2003-12-31 21:08 · Score: 2, Interesting
  
  Actually, SourceForge is *not* Open Source any longer.
  
  Check the URL you referenced, and you'll notice that the last release was made on 2001-11-04. And the code released there is actually even older than that, as the release date got updated when they moved it from the original Alexandria project.
  
  SourceForge intentionally killed off public development of the SourceForge code, and then did an excellent job of convincing people that it was still an Open Source project. They kept promising and promising that the development code and CVS would be released, but it never was.
  
  Take a good look at the URL you provided. . . where's the web-accessible CVS repository? Heck, where's the CVS repository at all? Where are the development mailing lists?
  
  It was all killed.
  
  As another poster mentioned, GForge is a *much* better option.
  
  --
  Topher
Linkrot by sakusha · 2003-12-31 07:35 · Score: 2, Informative

Bloggers are acutely aware of this problem: they link to pages that change or are moved to paid archives, and they call it "linkrot." I've started to provide a .pdf capture of linked articles on my blog, as well as the original link (which I usually take down if I notice it's disappeared).
I like Adobe Acrobat for this job, you just point it at a URL, tell it how many levels you want to archive, and go. You can even archive externally linked pages if you uncheck "stay on same server," or you can select other options like "Archive Whole Site."
1. Re:Linkrot by Glonoinha · 2003-12-31 07:45 · Score: 2, Funny
  
  Just point Adobe Acrobat at http://www.google.com and back up the entire Internet to one big .PDF file.
  
  --
  Glonoinha the MebiByte Slayer
For dynamic sites... by dcocos · 2003-12-31 07:42 · Score: 2, Interesting

You may want to consider something that caches the pages as they are displayed. This adds overhead and doesn't scale, but it would let you keep a copy of the pages as they were displayed. You could at least use it for a subset. For example, say you use JSPs to serve up pages from a DB, and the resulting page differs depending on the parameters passed to it. wget isn't going to capture all of this, so when the page is generated you write it out with a timestamp (building in some intelligence so the page only gets written once a month or something). Then you archive all of the written-out pages and you have a copy of the site as it appeared. Another reason this may be a good idea is that if you upgrade some software (in this case Java), you will be able to see the site as it was rendered under Java 1.3.1 and not how it renders under 1.4.2. Finally, if you turn this feature on while debugging, you can really help the developers out, because instead of hearing from the users "My page didn't work, it was weird somehow," you actually have a copy of what was sent to their browser.
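
The "only written once a month" logic is language-independent, so here is a rough sketch of it in shell rather than JSP. Everything in it is an assumption for illustration: save_once_per_month is a made-up name, the rendered page is taken from standard input, and the page key is whatever string identifies the page plus its parameters.

```shell
#!/bin/sh
# save_once_per_month ARCHIVE_DIR PAGE_KEY   (hypothetical helper)
# Reads a rendered page on stdin and keeps the first copy seen each month.
save_once_per_month() {
    dir=$1/$2
    out=$dir/$(date +%Y%m).html
    mkdir -p "$dir"
    if [ -e "$out" ]; then
        cat >/dev/null        # already captured this month: discard
    else
        cat >"$out"           # first render this month: keep it
    fi
}
```

The existence check keeps the overhead after the first hit each month down to a single file test per page. For debugging, the same idea with a full timestamp in the file name would keep every rendering instead.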
downloading webpages by schnits0r · 2003-12-31 08:08 · Score: 1

I use teleport pro.

--

-------
Support Indy Music. Buy
1. Re:downloading webpages by exhilaration · 2004-01-01 08:54 · Score: 1
  
  Teleport Pro: $40
Use the Archive's crawler by Danton · 2003-12-31 12:22 · Score: 2, Informative

How about using Heritrix, the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler?

--
"Web Users Should Not Engage in Promiscuous Browsing" --CERT
1. Re:Use the Archive's crawler by gabe · 2004-01-01 06:50 · Score: 1
  
  From the FAQ:
  
  I need to crawl/archive a set of websites, can I use Heritrix?
  
  Eventually. For now, the crawler is still in early development, and only if you are comfortable grabbing code directly from CVS, wrestling with incomplete documentation, and running into undocumented limitations, would you want to use the current software.
  
  --
  Gabriel Ricard
webpage versioning by thatchman1 · 2004-01-01 16:48 · Score: 1

You could always extract the site with Adobe Acrobat and have a 'distilled', intact, usable site in a single file.