What's in Your HTML Toolbox?
Milo_Mindbender asks: "I've just ended up in charge of cleaning up an old and rather large website created by some non technical people. It has all the usual problems: paragraph tags with no ending tag; mixed case file names that work on Windows but not on a Linux webserver; files with mixed Windows/Linux/Mac line endings; duplicates or partial duplicates of files created when working on pages; and the list goes on. I'm wondering what tools you guys keep in your HTML/website toolboxes that work good for cleaning up this sort of mess. Things like pretty-printers, HTML 'lint' programs, dead file detectors, batch renamers (that change links and the files they point to into OS neutral names), and 'diff' programs that ignore HTML whitespace. I'm particularly interested in batch processing tools that actually fix problems (not just report them) because I've got a lot of files to deal with and don't have the time to edit every one by hand. So what's in YOUR toolbox?"
CAPITAL ONE!
[...]
Wait, what was the question again?
Javascript + Nintendo DSi = DSiCade
I know many of the geeks out there have forsaken Perl, but it is still, in my opinion, an indispensable tool. I am currently fixing up a website similar to the one you described, especially in terms of the HTML problems. Write a Perl script to fix capitalization, closing of tags, etc. But understand that if code is not written well to begin with, then in many cases it is impossible to automate the process of fixing it. You are going to have to do some things by hand.
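For the capitalization part, here is a rough sketch of what such a script could look like, written in Python rather than Perl. It is an illustration, not a finished tool: the function names are made up for this example, it assumes a case-sensitive filesystem (run it on the Linux side), and it only understands href and src attributes, so links buried in scripts or stylesheets would still need attention by hand.

```python
import os
import re

# Matches href="..." or src="..." with single or double quotes, in any letter case.
LINK_RE = re.compile(
    r'''(?P<attr>\b(?:href|src)\s*=\s*)(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)

def is_local(url):
    """True for relative links; False for scheme: URLs, //host URLs and bare #fragments."""
    return not re.match(r'^(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//|#)', url)

def lowercase_local_links(html):
    """Lowercase the path part of local href/src values; leave ?query and #fragment alone."""
    def fix(match):
        url = match.group('url')
        if not is_local(url):
            return match.group(0)
        cut = len(url)
        for sep in '?#':
            pos = url.find(sep)
            if pos != -1:
                cut = min(cut, pos)
        return (match.group('attr') + match.group('q')
                + url[:cut].lower() + url[cut:] + match.group('q'))
    return LINK_RE.sub(fix, html)

def lowercase_tree(root):
    """Rename files and directories under root to lowercase, then fix links in HTML files."""
    # Bottom-up, so a directory's contents are renamed before the directory itself.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            low = name.lower()
            if name != low:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dirpath, low)
                if os.path.exists(dst):
                    print('skipped (name collision):', src)
                else:
                    os.rename(src, dst)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(('.html', '.htm')):
                path = os.path.join(dirpath, name)
                # latin-1 round-trips every byte; newline='' keeps line endings untouched.
                with open(path, encoding='latin-1', newline='') as f:
                    text = f.read()
                fixed = lowercase_local_links(text)
                if fixed != text:
                    with open(path, 'w', encoding='latin-1', newline='') as f:
                        f.write(fixed)
```

Try it on a copy of the site first. Name collisions (Index.html next to index.html) are only reported, because deciding which one wins is exactly the kind of thing you have to do by hand.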
Depending on how bad it is, consider rewriting the HTML and CSS part of the website from scratch. It may be easier than fixing old code.
There are two approaches: live with it and make as few changes as possible, or bite the bullet and do a complete rebuild. To do a cleanup, check out tidy - it does a good analysis of the existing pages and can generate CSS that is OK, but not beautiful. If you want the final pages to look the same, but be standards compliant, see meyerweb.com and read his books on rebuilding pages ("Eric Meyer On CSS" and "More Eric Meyer on CSS"). Pragmatic is his keyword: lots of examples and he makes sense.
Good luck. You're going to need it.
Been there, try this
You can also do batch file processing with vim. Start it with: vim *.match.files.* then, once in vim, run: :argdo %s/^M//ge | w
This removes the funky Windows line endings (mind you, ^M here is a single control character, typed as ctrl-v ctrl-m in vim, not a caret followed by an m, and it must not be wrapped in square brackets).
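If you would rather not open an editor at all, the same line-ending cleanup can be sketched in a few lines of Python. This is a minimal example, not a replacement for a proper tool: the function names are invented here, and it rewrites files in place, so run it on a copy first.

```python
import sys

def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    # CRLF must be handled first, otherwise it would become two newlines.
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

def fix_file(path: str) -> bool:
    """Rewrite path in place if its line endings change; return True if it was rewritten."""
    with open(path, 'rb') as f:
        original = f.read()
    fixed = normalize_newlines(original)
    if fixed != original:
        with open(path, 'wb') as f:
            f.write(fixed)
    return fixed != original

if __name__ == '__main__':
    # Usage: python fix_newlines.py *.html   (the script name is just an example)
    for path in sys.argv[1:]:
        if fix_file(path):
            print('fixed', path)
```

It works on raw bytes, so it does not care what character encoding the pages use. Only feed it text files, though: it would happily mangle a GIF.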
Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
HTMLKit has a lot of great options for developers, and a good plugin system.
"Better to be vulgar than non-existent" -Bev Henson
> :argdo:%s/[^m]//ge | w this would remove the funky windows line endings (mind you, ^m = ctrl-v ctrl-m in vim).
:-)
Or, in emacs
M-% (AKA Meta(usually Alt)-Shift-5)
Query Replace: ^M with [nothing]
P.S. Note that ^M is not Caret-M. It is a single character. I usually just copy it out of the file, and then do it in emacs.
My toolbox has a little white pill that I take every time I get a hankering to work with HTML. It fixes me up right quick.
I got my Linux laptop at System76.
The disaster that was "s.gif" (or "trans.gif" in some circles) used as a layout tool was horribly over-used - and the 'net is a worse place because of it. In most projects now, I seek to replace all instances with a "compatible" approach.
I create a class:
.spacer{
line-height:0;
font-size:0;
}
Then I replace all those hundreds (and sometimes THOUSANDS) of references to s.gif with the following:
I use a span sometimes, as required - if the DIVs alone cause layout issues.
Say hello to faster web pages instantly!
How many escape pods are there? "NONE,SIR!" You counted them? "TWICE, SIR!"
Oops Sorry!
<div class="spacer" style="width:Xpx; height:Ypx;"></div>
In that situation, server side includes are just as useful, but faster and more secure.
If what you need is very simple (including footers would count as simple), here's more information about server side includes (SSI). Either rename your pages .shtml, or keep the .html name but set the files as executable (chmod a+x *.html) using the XBitHack.
If you want something more complex, you can use SSI to include a mini-CGI script into the middle of your HTML. CGI scripts can be written in any language, even a shell script:
#!/bin/sh
echo Content-type: text/html
echo
echo "(insert HTML here)"
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
Firefox with the IE Tab (or IE View), Web Developer, View Formatted Source, and HTML Validator extensions.
No, really, stop laughing.
Frontpage, once you convince it to stop the WYSIWTG crap, has three tools that will make fixing a non-technical user's webpage easy. (Never, ever, let a non-technical user use Frontpage without supervision. It's worse than Word.)
I'd be shocked if there aren't better tools out there -- but by and large either they don't do as much, or they cost a significant chunk of change.
(Hey, you, with the laughing -- point me to an app that can do #1 with compatible replacements for #2 and #3, and, er, you'll get good karma for being so mean and laughing.)
Tidy is great, as others mentioned. If you feel confident, its cleaned-up output will even let you cherry-pick the data you want to scavenge with XSLT.
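To run Tidy over a whole pile of files, a small wrapper along these lines might do. Treat it as a sketch under assumptions: it expects a tidy executable on your PATH, the function names are made up for this example, and you should check tidy -help on your version before trusting the flags.

```python
import subprocess
import sys

def tidy_command(path):
    """Build the tidy command line that rewrites one file in place as XHTML."""
    # -m modifies the file in place, -q suppresses non-essential output,
    # -asxhtml converts the markup to well-formed XHTML.
    return ['tidy', '-m', '-q', '-asxhtml', path]

def tidy_file(path):
    """Run tidy on path and return its exit status (0 clean, 1 warnings, 2 errors)."""
    return subprocess.run(tidy_command(path)).returncode

if __name__ == '__main__':
    # Usage: python tidy_all.py *.html   (the script name is just an example)
    failed = [p for p in sys.argv[1:] if tidy_file(p) >= 2]
    for path in failed:
        print('needs fixing by hand:', path)
```

Files that come back with status 2 are the ones Tidy could not repair on its own. Once the rest are well-formed XHTML, they can be fed to an XSLT processor.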
Separating grain from chaff
A static HTML project has numerous index2.old.html, index2.html, index_2.html, project2.html.old and so on - files that you just aren't sure are useful?
Copy the project directory (touch all the files so their timestamps start out the same) and run wget -r against the tree; afterwards, the access times tell you which files are referenced internally. Alternatively, scan the webserver log files to see which files are actually being requested.
Be sure your filesystem is configured to register access times if you pick the first method...
(As a bonus, a close peek on the 404s might give you some answers on mis-used capitalization of filenames.)
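If access times are not an option, a static scan gets you part of the way there. The sketch below swaps in a different technique from the wget trick above: it starts at one page, follows every local href and src it can parse, and lists the files it never reached. The names are invented for this example, and it assumes index.html is the entry point. It does not see links built by JavaScript or hidden in CSS, and it does not map a link to a directory onto that directory's index page, so treat the result as a list of suspects, not a delete list.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlsplit, unquote

class LinkCollector(HTMLParser):
    """Collects every href and src attribute value found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ('href', 'src') and value:
                self.links.append(value)

def local_targets(html, page_path, root):
    """Resolve a page's local links to normalized file paths under root."""
    collector = LinkCollector()
    collector.feed(html)
    targets = set()
    for link in collector.links:
        parts = urlsplit(link)
        if parts.scheme or parts.netloc or not parts.path:
            continue  # external link, mailto:, javascript:, or a bare #fragment
        path = unquote(parts.path)
        if path.startswith('/'):
            base, path = root, path.lstrip('/')   # site-absolute link
        else:
            base = os.path.dirname(page_path)     # relative to the current page
        targets.add(os.path.normpath(os.path.join(base, path)))
    return targets

def unreferenced_files(root, start='index.html'):
    """Return the files under root that cannot be reached by following links from start."""
    seen = set()
    queue = [os.path.normpath(os.path.join(root, start))]
    while queue:
        page = queue.pop()
        if page in seen or not os.path.isfile(page):
            continue
        seen.add(page)
        if page.lower().endswith(('.html', '.htm')):
            with open(page, encoding='latin-1') as f:
                queue.extend(local_targets(f.read(), page, root))
    everything = {
        os.path.normpath(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    }
    return sorted(everything - seen)
```

Because the lookup is case-sensitive on Linux, a file that is linked with the wrong capitalization shows up as unreferenced too, which ties in nicely with the 404 check.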
Lynx / Links / ELinks
Can be used to dump the text data of old and unmaintainable HTML documents; most useful when trying to scavenge only the text contents to put in a database or so.
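If you want the same kind of text dump from inside a script, Python's standard library can approximate it. This is a much cruder sketch than what lynx -dump produces (no layout, no link list), and the class and function names are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style contents."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text runs, collapsing internal whitespace.
        if not self._skip_depth and data.strip():
            self.chunks.append(' '.join(data.split()))

def html_to_text(html):
    """Return the text runs of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.chunks)
```

The parser is forgiving about unclosed tags, which matters for pages like these. Each run of text between tags ends up on its own line, so expect to do some joining before loading the result into a database.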
Agreed. I dismissed people who kept suggesting vim as "crazy UNIX people." I still felt that way about a week into playing with it, but soon after, I realized how powerful it is once you've figured out how the keystrokes work. Since then, I've used vim on every computer I've worked with and gvim (the GUI-enhanced version of vim) is my primary editor on my Windows box.
vim has excellent syntax highlighting, predictive typing, line numbers, search and replace (with regular expressions), code folding, spell-check, built-in help, and more.
Give yourself two weeks with an open-mind, and you might be surprised about it. The easiest way to get started is to type vimtutor from almost any shell account.
Nope.
br is not now br /; one must simply write well-formed documents. Well-formed HTML (with all tags closed) also uses br /.
em and strong are still alive and well as of XHTML 2.0.
b and i are still available in XHTML 1.0.
There is no HTML 4.1. Presumably you meant 4.01 strict, which is pretty much XHTML 1.0 Strict.
Tidy, as others have already mentioned, will be your very best new friend.
Install the 'Web Developer' extension for Firefox, and use some of the HTML/CSS validators in the Tools submenu.
Get a good handle on regex searching & replacing (if you're doing this from Windows, I suggest Funduc's "Search & Replace").
If you're migrating your GIFs to PNG (which I would recommend), then you need to get yourself pngout, to compress them to their smallest possible size (Photoshop SUCKS at this).
And as someone else said, make an empty new standards compliant template, and get to cutting and pasting; it can be a *brutal* initial process, but you'll probably save yourself time in the long run, depending on how clean you want to eventually get the code. If you just want it to be standards compliant, then you can just do a clean up job. If you want to do it 'right,' you'll want to develop a new template and coding style to properly integrate the HTML and CSS. Things like not putting everything in a DIV (a sure sign you're a newbie to CSS), just to style something. Figure out why you should be using H1, H2 tags (& TBODY & TH tags if you're using tables for outer layout), etc, without having to use a lot of unnecessary DIVs all over the place. Inline styles = bad.
Figure out why XHTML may not be the best choice over HTML. Know which DTDs to specify. Know the difference in IE6 between standards mode and quirks mode, and which DTD to use to make IE6 behave. Know that IE7's quirks mode is supposedly identical to IE6's; you supposedly won't get the new 'more-standards compliancy' in IE7 without a DTD.
Oh yeah - the guy who posted about replacing spacer gifs with 'spacer DIVs'? Don't do that to yourself, okay? Yikes.
Learn about usability and readability. Learn about typography, and how light-on-black text should be sized differently from black-on-light. Thinking about grey text on black or grey text on white? Don't be stupid. Make the stuff readable! Learn that sans serif fonts are more easily read at screen density (opposite of print). Learn why Verdana is usually not your friend (go for Trebuchet MS or even Arial).
Oh, and learn to indent your freaking HTML!
Some nice resources:
Activating the Right Layout Mode Using the Doctype Declaration
Quirksmode - a GREAT resource. Awesome info here. Memorize it.
This is worse than image spacer, please go die in a fire
"The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
The great thing about web standards is... there's so many of them!
Or you could just use the padding / margin features provided by CSS.
margin-top: 1px;
margin-right: 2px;
margin-bottom: 3px;
margin-left: 4px;
or margin: 1px 2px 3px 4px;
padding-top: 1px;
padding-right: 2px;
padding-bottom: 3px;
padding-left: 4px;
or padding: 1px 2px 3px 4px;
q<register> to record a macro, q to finish recording. Execute the macro with @<register>, then you can execute it again with @@. Obviously the @ commands can be prefixed with a number to repeat them that many times, 5@@ would repeat the last macro 5 times, for example.
Game! - Where the stick is mightier than the sword!
Err.. this approach just doesn't work. Images are inline elements, you can't replace them with an equivalently sized block element and expect the page layout to be the same. And setting the CSS 'width' attribute of an inline element doesn't work in Explorer, so the entire approach is flawed. Sorry.
As much as WYSIWYG editors sometimes suck, Dreamweaver is alright. I like that it helps with the organization but also lets me get as geeky as I'd like.
Pretty Pictures!
How is this better than an image spacer? Elements have padding and margin properties, use them!
My biggest web devel tool is Firefox, with the Web Developer extension and the HTML Validator extension. The former does all sorts of amazingly neat things, like letting me get precise info about any element within a page (using "Display Element Information" under the "Information" menu, CTRL+SHIFT+F for short), showing me the HTTP response headers for any given page, adding custom styles to a page, validating links, checking for Section 508 accessibility compliance, resizing the window to simulate lower screen resolutions, and on and on and on!
The latter does instantaneous HTML validation using Tidy and displays any errors or warnings on the "view source" page. It also gives me LINE NUMBERS in the view source window, which is a blessing. The beta version (which I prefer) lets you pick between the Tidy algorithm and the W3C's SGML parser. The SGML parser version gives the same errors as the W3C's own online validator, but without any need to submit the page through an online form.
As for editing HTML, I generally use SciTE or one of its derivatives (eg Notepad2). Sadly, those aren't available under Mac OS X, so when I need to work on a Mac box I use Smultron. THAT, however, is just an editor. People get religious about their editors, so my advice is just to pick one that suits you and ignore anybody what sniggers at you.
There is a small utility called dos2unix which changes MS-style line endings in text files to Unix style. /usr/bin/mac2unix is symlinked to dos2unix on my Gentoo box, so I guess it can fix Mac OS line endings too.
# cat
Damn, my RAM is full of llamas.