Do-It-Yourself Internet Archiving?
A moron asks: "Web pages change and disappear all the time. For legal and historical purposes, I need to have accessible archives of the websites I maintain. I'm basically looking for a do-it-yourself version of Internet Archive's Way Back Machine which provides a simple versioning system and accessibility through web interface. Is there already software that does this? If not, what ideas does Slashdot have to make such a system possible? How should it work? What existing tools can be used together to make a workable system?"
"There are all sorts of tools out there that will archive web pages, and each have other necessary features such as making links relative. I don't always have filesystem access to pages, so tools that rely on such access won't work. There are some obvious tools that do part of the job such as:
But grabbing pages is only part of my, and I suspect many other peoples needs. The other pieces include intelligently archiving the pages, and making them accessible. If a page or a page element hasn't changed, there is no need to store multiple copies. The archives need to be easy for end users to navigate, search, and link."
ARCDIR = `date +%y%m%d` /var/www/archives
cd
mkdir $ARCDIR
cd $ARCDIR
wget -r http://mysite.com
Add error-checking and season to taste.
If you want to be more efficient like the poster wanted, you could easily have it always fetch to the same directory and just use cvs to check in. This eliminates duplicate storage. There are many free web-based CVS browsers out there with date searching and similar features. Might not be quite as nice as the wayback machine, but it definitely does the job for free.
A lot of folks are doing a simple version of the above to maintain SCO mirrors so there's to be no history erasing before the trial. God bless you all -- it will make the case that much stronger for us.