Do-It-Yourself Internet Archiving?


Posted by Cliff on Wednesday December 31, 2003 @05:20AM from the combatting-the-slashdot-effect dept.

A moron asks: "Web pages change and disappear all the time. For legal and historical purposes, I need to have accessible archives of the websites I maintain. I'm basically looking for a do-it-yourself version of the Internet Archive's Wayback Machine, which provides a simple versioning system and accessibility through a web interface. Is there already software that does this? If not, what ideas does Slashdot have to make such a system possible? How should it work? What existing tools can be used together to make a workable system?"

"There are all sorts of tools out there that will archive web pages, and each has other necessary features, such as making links relative. I don't always have filesystem access to pages, so tools that rely on such access won't work. There are some obvious tools that do part of the job, such as:

curl
wget
httrack

But grabbing pages is only part of my needs and, I suspect, many other people's. The other pieces include intelligently archiving the pages and making them accessible. If a page or a page element hasn't changed, there is no need to store multiple copies. The archives need to be easy for end users to navigate, search, and link."
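The "no need to store multiple copies" requirement could be sketched roughly as below. This is only an illustration, not an existing tool: the function names `store_snapshot` and `history`, the `blobs/` directory, and the `index.jsonl` file are all made up for the example. It assumes the page body has already been fetched (with curl, wget, or anything else) and stores each distinct body once under its SHA-256 hash, while an append-only index records every capture.

```python
import hashlib
import json
import time
from pathlib import Path


def store_snapshot(archive_dir, url, body, timestamp=None):
    """Record one capture of `url`; return True if the body was new to the archive.

    `body` is the fetched page as bytes. All names here are hypothetical.
    """
    archive = Path(archive_dir)
    blobs = archive / "blobs"
    blobs.mkdir(parents=True, exist_ok=True)

    # Identical content always hashes to the same name, so it is stored once.
    digest = hashlib.sha256(body).hexdigest()
    blob_path = blobs / digest
    is_new = not blob_path.exists()
    if is_new:
        blob_path.write_bytes(body)

    # The index keeps one line per capture, even when the content is unchanged,
    # so the history of a URL can still be listed by date.
    record = {
        "url": url,
        "time": timestamp if timestamp is not None else time.time(),
        "sha256": digest,
    }
    with open(archive / "index.jsonl", "a", encoding="utf-8") as index:
        index.write(json.dumps(record) + "\n")
    return is_new


def history(archive_dir, url):
    """Return every recorded capture of `url`, oldest first."""
    index_path = Path(archive_dir) / "index.jsonl"
    if not index_path.exists():
        return []
    with open(index_path, encoding="utf-8") as index:
        records = [json.loads(line) for line in index if line.strip()]
    return [r for r in records if r["url"] == url]
```

A web front end for browsing could then read the index and serve the matching blob for a chosen date. Images and stylesheets could go through the same function, which would cover the "page element" case as well.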



sourceforge by nocomment · 2003-12-31 07:18 · Score: 2, Interesting

SourceForge is open source, so why not go download that? You can use CVS as an easy way to switch around and do upgrades. You can develop a site, then run an upgrade via CVS, and if something unexpected breaks, downgrade via CVS. Once you get the infrastructure in place, things like that would be a breeze.

--
/* oops I accidentally made a comment, sorry */
/* http://allyourbasearebelongto.us */
1. Re:sourceforge by Christopher Cashell · 2003-12-31 21:08 · Score: 2, Interesting
  
  Actually, SourceForge is *not* Open Source any longer.
  
  Check the URL you referenced, and you'll notice that the last release was made on 2001-11-04. And the code released there is actually even older than that, as the release date got updated when they moved it from the original Alexandria project.
  
  SourceForge intentionally killed off public development of the SourceForge code, and then did an excellent job of convincing people that it was still an Open Source project. They kept promising and promising that the development code and CVS would be released, but it never was.
  
  Take a good look at the URL you provided... where's the web-accessible CVS repository? Heck, where's the CVS repository at all? Where are the development mailing lists?
  
  It was all killed.
  
  As another poster mentioned, GForge is a *much* better option.
  
  --
  Topher
For dynamic sites... by dcocos · 2003-12-31 07:42 · Score: 2, Interesting

You may want to consider something that caches the pages as they are displayed. This will add overhead and doesn't scale, but it would allow you to keep a copy of the pages as they were displayed. You could at least use it for a subset. For example, say you use JSPs to serve up pages from a DB, but the resulting page is different depending on the parameters passed to the page. wget isn't going to capture all of this, so when the page is generated you write it out with a timestamp (you build in some intelligence so the page only gets written once a month or something), then you archive all of the written-out pages, and you have a copy of the site as it appeared. Another reason this may be a good idea is that if you upgrade some software (in this case Java), you will be able to see the site as it was rendered under Java 1.3.1 and not how it renders under 1.4.2. Finally, if you turn this feature on while debugging, you can really help the developers out, because now, instead of hearing from the users "My page didn't work, it was weird somehow," you actually have a copy of what was sent to their browser.
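A rough sketch of that write-out step, in Python rather than JSP to keep it short. The function name `archive_rendered` and the one-directory-per-month layout are assumptions made for the example, not part of any framework. The idea is that whatever generates the page calls this with the finished HTML just before sending it. A copy is written only if none exists yet for that path, those parameters, and the current month.

```python
import time
from pathlib import Path
from urllib.parse import quote


def archive_rendered(archive_dir, path, query, html, now=None):
    """Save the rendered page at most once per calendar month per path+query.

    Returns the Path written, or None if this month's copy already exists.
    `now` is a Unix timestamp; it defaults to the current time.
    """
    moment = time.gmtime(now if now is not None else time.time())
    month = time.strftime("%Y-%m", moment)  # e.g. "2003-12"

    # Different parameters produce different pages, so the query string is
    # part of the key. quote(..., safe="") escapes "/" and "?" so the key
    # can be used as a single file name.
    key = quote(path + ("?" + query if query else ""), safe="")
    target = Path(archive_dir) / month / (key + ".html")

    if target.exists():
        return None  # already captured this month; skip the write
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target
```

For the debugging case, the month check could be swapped for a full timestamp in the file name so that every response is kept, at the cost of a lot more disk.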