Good Web Development Environments with UTF-8 Support?
A Pride of Lyings asks: "I'm having a devil of a time finding a good editor or IDE aimed at HTML/XHTML/CSS/JavaScript/JSP/XML that meets the following criteria: CVS integration (VSS integration would be nice but not required); stellar UTF-8 support (internationalization is a big big deal now); correctly recognizes and highlights HTML, JSP, JS, and CSS within a single file; does some rudimentary auto-completion; is easily configurable; runs on Win2k (oy vey); supports bookmarks of various kinds; supports code collapsing; and affordable. I'm at a loss and rather fatigued from kicking all of these tires, so I'm throwing it open to you: what do you use for your front-end work? What makes it good?"
Eclipse, found at www.eclipse.org, does much of what you want. First-class CVS integration, plus VSS integration. It has all kinds of plugins for editors, XML and JS for starters. It is primarily a Java editor, but it is very extensible. I consider it the new Emacs. It is also getting better all the time, with widespread developer support.
My company is just about to implement a full UTF8 i18n web development environment, so I have fought a lot of the wars!!
1. Do a hard-line review of every underlying application layer and/or piece of middleware for UTF8 compatibility. We discovered the DB was UTF8 compliant, but the local client DB drivers sometimes had issues. Do individual tests with all types of UTF8 data.
2. Always test with some multi-byte charsets, like Chinese or Japanese, whichever one you might support soonest. Using multi-byte Japanese helped flush out some problems that might have killed us later. It also helps flush out applications that say they are UTF8, but wind up doing some UTF8->ISO8859->UTF8 conversions. You may never implement Japanese/Chinese/etc, but you will definitely be UTF8 compliant.
3. Arial Unicode MS is a great Unicode font which covers all sorts of languages. Other Unicode fonts may only have European languages. Warning: approx 22MB in size. Yes, it is MS Windows centric, but if you are going to pass around "please translate this" documents, everyone using Arial Unicode MS will pay major dividends later. You won't get translations in other random charsets to convert (or discover during testing).
4. If you can, try to store translations in a database and retrieve them as needed. If all translations are in a database, it's easily transferable to your next development environment, without having to parse through gobs and gobs of dictionary/translation files. If you are worried about the performance of hitting the DB all the time, do a nightly "pre-processing" of "static" content and partially generate each language's content pages in HTML.
4a. As for translation documents, I have built Excel spreadsheets with columns for language name and translation. Using Excel, you can create SQL scripts that will insert the data into your DB.
5. Forget what I said about translation documents, and try to build interfaces into your code to update text on the web application for different languages. It will eliminate the document passing around, and someone can see the results of their translation pretty quickly.
6. If you HAVE to store data in another charset, always show the user UTF8 as much as possible, and only convert at the backend. We used a charset converter helper application (Chilkat Charset.Net) to devolve our UTF8 text to ISO8859 for one recalcitrant CRM.
7. Be wary of the "Byte Order Mark" (BOM) in UTF8 text files, which shows up as "ï»¿" when the file is read as ISO8859-1. It's a three-byte sequence (EF BB BF) at the beginning of SOME UTF8 files. UTF8-aware versions of Notepad save it, but don't display it. You can see it in vi, but it may not look like that. Use of it is inconsistent, but you will run into it every so often. In our testing, we noticed IE liked to see it at the beginning of the HTML file when you have Auto-Select for Encoding turned on in your IE client (even if you explicitly set charset to UTF8 in your meta tags).
Email me at ckmehta +at+ hotmail DOT com if you have any embarrassing questions you don't want to post.
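The checks in tips 2 and 7 can be sketched in a few lines of Python. This is only an illustration, not what the poster's team actually ran: the function names are made up for the example, and ISO8859-1 is assumed as the legacy charset in the middle of the pipeline.

```python
import codecs


def survives_round_trip(text, legacy="iso-8859-1"):
    """Return True if text can pass through a legacy charset unchanged.

    Models a layer that claims UTF8 support but secretly converts
    UTF8 -> legacy -> UTF8 (tip 2). The legacy charset is an assumption.
    """
    try:
        return text.encode(legacy).decode(legacy) == text
    except UnicodeEncodeError:
        # At least one character has no representation in the legacy charset.
        return False


def simulate_lossy_hop(text, legacy="iso-8859-1"):
    """Show what a user would see after a silent lossy conversion."""
    return text.encode(legacy, errors="replace").decode(legacy)


def split_bom(data):
    """Return (had_bom, payload) for raw bytes read from a UTF8 file (tip 7)."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


if __name__ == "__main__":
    # Western European text hides the problem; Japanese exposes it.
    print(survives_round_trip("café"))    # True
    print(survives_round_trip("日本語"))   # False
    print(simulate_lossy_hop("日本語"))    # ???

    raw = codecs.BOM_UTF8 + "<html>".encode("utf-8")
    print(split_bom(raw))                 # (True, b'<html>')
    # The same three bytes, misread as ISO8859-1, are the visible "ï»¿".
    print(codecs.BOM_UTF8.decode("iso-8859-1"))
```

If you would rather not handle the BOM by hand, Python's built-in "utf-8-sig" codec strips it on decode when present and leaves BOM-less input alone.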
I guess my point was that it's not automatic. And it also applies only to syntax highlighting, not :set filetype=. This means you can't use filetype-specific mappings and auto-template insertions, e.g., p{key} can't insert a paragraph tag in the HTML portion and a bracket full of paragraph styles in the CSS portion.
There is no need to use a SlashDot sig for SEO...
I develop in Japan, mostly for a Japanese audience. We use Apache -> Weblogic & Oracle 9iAS -> Oracle 9i, all glued together with a pile of other stuff. Never quite found an editor that solves every problem, but Eclipse and jEdit are both pretty good as a start, as is Oracle jDeveloper if you're an Oracle shop. Half the time I end up putting HTML pages together using Visual Studio (*gulp*), as the HTML editor's predictable in Japanese. I predict that whatever you choose, you'll end up running something else alongside it, and then something else alongside that, etc. Only advice I have is that encoding standards are great in theory, but the implementation of them is uniformly appalling, no matter where you look. One hint is to get a native speaker to proofread pages in non-Western scripts, as they can look perfect to you but still be garbage. Another tip is, if you're doing Japanese, develop on a native Japanese OS. I guess that applies to Chinese too. Don't trust a localized Windows 2000 to behave exactly how native Japanese Windows 2000 would. Welcome to the party...