Ed+Avis · Slashdot Mirror

Re:It's about tools, libraries on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 03:46 · Score: 1

Oops, Slashdot ate the XML I typed. (Why can't it convert characters to 'lt' and 'gt' entities?) But I hope you get the idea.

Re:It's about tools, libraries on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 03:44 · Score: 1

Using regexps to parse XML is okay for a one-off, but it is, if not silly, then _unwise_ for larger projects.

For example if your input XML looks like this

One
Two

you might create regexps to parse it. But it would be equally valid for the XML to say

OneTwo

or even

OneTwo

These bits of XML are equivalent to the first, modulo whitespace (and the third example is exactly equivalent to the second, since whitespace after a tag name is ignored). Will your regexp parsing code handle them just the same?

Making your regexps general enough to handle all of these cases is a real pain, even if you know that elements will never nest inside each other. And if you are trying to match a nested data structure to arbitrary depth, this can't be done at all with just regexps.

Better surely to use an existing parsing library which has already been debugged and can smooth over all the syntactic variations, and which won't stop working when the line breaks come in different places or someone adds attributes to one of the elements.

Trying to parse XML with regular expressions is like trying to parse C source code with regular expressions. It's okay for quick tasks, like grepping through your source tree for a particular variable, but shouldn't be used for the task of reading in a whole document.
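
A minimal sketch of the point above, in Python. The element names `list` and `item` are hypothetical (the markup in the original comment was lost when it was posted), and the three strings only mimic the kinds of whitespace variation described. A naive line-oriented regexp picks up the first layout only, while the standard library's `xml.etree.ElementTree` parser returns the same items for all three.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical documents: same content, different whitespace.
DOC_LINES = "<list>\n<item>One</item>\n<item>Two</item>\n</list>"
DOC_JOINED = "<list><item>One</item><item>Two</item></list>"
DOC_SPACED = "<list ><item >One</item ><item >Two</item ></list >"

# A naive regexp that expects each item alone on its own line.
NAIVE = re.compile(r"^<item>(.*)</item>$", re.MULTILINE)

def items_by_regexp(doc):
    """Return the item texts the naive regexp manages to find."""
    return NAIVE.findall(doc)

def items_by_parser(doc):
    """Return the item texts found by a real XML parser."""
    return [el.text for el in ET.fromstring(doc).iter("item")]

for doc in (DOC_LINES, DOC_JOINED, DOC_SPACED):
    print(items_by_regexp(doc), items_by_parser(doc))
```

The regexp could of course be loosened to cope with these particular cases, but each new variation (attributes, comments, nesting) needs another patch, which is the maintenance burden the comment is warning about.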

Re:It's about tools, libraries on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 03:32 · Score: 2, Interesting

There are two more methods: interfaces like SAX where you read individual tokens, and callback interfaces like Perl's XML::Twig where you can efficiently scan the whole file and only construct in-memory trees for the parts you're interested in.

The best method might be a lazy programming language where you can say

tree.a[4].b[6].contents

and only when this expression is evaluated will the necessary bit of the tree be parsed.
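
A rough Python analogue of the callback style mentioned above, using the standard library's `xml.etree.ElementTree.iterparse` rather than XML::Twig itself. The document layout and the element names `db`, `record` and `name` are hypothetical. Each `record` subtree is built, handled, and then emptied, so the whole file never has to sit in memory as one tree.

```python
import io
import xml.etree.ElementTree as ET

def record_names(source, tag="record"):
    """Yield the <name> text of each <record>, emptying each subtree after use."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.findtext("name")
            elem.clear()  # drop the children we no longer need

# Hypothetical sample input.
SAMPLE = b"<db><record><name>One</name></record><record><name>Two</name></record></db>"
print(list(record_names(io.BytesIO(SAMPLE))))
```

This is not lazy evaluation in the sense of the `tree.a[4].b[6].contents` idea; it still scans the file from the start, it just avoids keeping what it has already passed.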

Re:Meta XML on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 03:02 · Score: 1

Yes, I don't see why editing XML documents should always require you to look at the element names and characters. That view is often useful, of course, but sometimes a more concise representation of the document could be better for editing. So elements might be shown as green boxes while elements are shown as red boxes. Or whatever.

Re:Maybe he should have read Knuth on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 03:00 · Score: 1

Hello? If you don't put a closing XML tag then the parser will run to "infinity". Of course the end of file will terminate it, but still, in theory this requires infinite resources.

I don't know what you mean by 'in theory'. A finite input file requires finite resources. Period.

An infinite input file could require infinite memory to parse it. So what?

Say you have a 1 TB XML data file and all you want is some header information. Well if the entire data set is enclosed in a tag, then the WHOLE 1 TB data file must be read into memory, or somehow indexed.

Not so, it depends on the parser you are using. Some implementations such as DOM will try to read the whole file. But equally you can use a token-based parser such as SAX, read tokens from the file (start tag, end tag, content, attributes) until you get the information you want, then stop processing.
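
A sketch of the "read tokens until you have what you want, then stop" approach, using Python's standard `xml.sax` module. The element name `title` and the sample document are hypothetical, and raising a private exception from the handler is just one common way to abandon a SAX parse early, not something the SAX interface prescribes. The sample input is deliberately truncated, to show that the rest of the file is never examined.

```python
import io
import xml.sax

class Found(Exception):
    """Raised to abandon the parse once the wanted text has been read."""

class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True

    def characters(self, content):
        # Text may arrive in several pieces, so collect them.
        if self.in_title:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "title":
            raise Found("".join(self.chunks))

def read_title(stream):
    """Return the first <title> text, or None if the stream has none."""
    try:
        xml.sax.parse(stream, TitleHandler())
    except Found as found:
        return found.args[0]
    return None

# Hypothetical input: header first, then a body that is cut off.
print(read_title(io.BytesIO(b"<data><header><title>Big file</title></header><body>")))
```

The parser reads its input in buffered chunks, so a little more than the header may be pulled from disk, but nothing close to the whole file.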

Re:Maybe he should have read Knuth on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 02:57 · Score: 1

Well of course, if you have a big document it will use memory. I'm not arguing with the proposition that XML is memory-hungry (at least for some parsers and some documents). I was disputing the original poster's claim that parsing an XML document can take _infinite_ memory or _infinite_ time, which is clearly not the case.

(It's possible the original poster did not mean to imply this, his statement was not very clear, talking about 'handling errors'.)

Re:Maybe he should have read Knuth on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 02:55 · Score: 1

TeX's grammar is odd, because you can redefine character classes and how to parse particular tokens. There are whole chapters in The TeXbook about how TeX processes input characters and the various mechanisms you can use to alter its behaviour.

So the grammar really is bound up with the language. You can't parse TeX code without also evaluating it, as far as I know. I didn't express myself clearly by saying 'TeX is a Turing-complete language'; after all, C is Turing-complete but you can write code to parse C. But there are some languages like TeX and Perl where you can redefine bits of grammar on the fly, and the full power of the language is available to do this. So you cannot in general parse code in these languages without risking non-termination.

I don't know what Chomsky's Type 1 grammars are (you described the others in your message but not Type 1), but I'm pretty sure TeX's grammar is not decidable. To see whether a TeX document is syntactically valid you sometimes have to execute it. So there isn't really a line between 'syntactically correct' and the meaning of the program.

All this AFAIK, I am not a hardcore TeX hacker.

Re:But XML is great for computers... on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 02:46 · Score: 2, Interesting

(Replying to AC post, please mod it up if you can.)

I admit that interfaces like DOM are rather clunky. But your regexps would break if a new field were added to /etc/passwd, or probably even if the format were changed to allow comments. So files like /etc/passwd become fossilized over time.

The answer is a better interface for reading XML files, one that knows about the format (which is described in a DTD or other grammar) and can present a neat interface like

passwd.user["abc01"].real_name

(or whatever the syntax of your preferred language looks like). DOM is so awkward because it knows nothing about whether a element would be present, or whether there might be more than one of them, or whether whitespace before and after the element is significant, so it has to provide an API to explicitly wade through all that just in case you want it. A tool like FleXML which knows that must appear exactly once and in a particular place can put it into a single field.

(Actually FleXML isn't ideal for this example because the parsing code it generates will stop working when the file format is extended, if new elements started appearing inside . But if you made the generated code only a little bit slower it could skip over these extensions to the file format, so existing apps would continue to work when new things were added to the DTD.)

The answer, I think, is programming languages which better support XML, which can read a document and put it into the language's native data structures. Libraries like Perl's XML::Simple try to do this, but they do so without any knowledge of what the legal documents are, so the resulting interface is still rather awkward.
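
To make the `passwd.user["abc01"].real_name` idea concrete, here is a small Python sketch. Everything about the XML layout is an assumption made up for illustration: the `passwd`, `user`, `real_name` and `shell` elements, the `login` attribute, and the sample data. The loader is written with the knowledge that each user has exactly one `real_name`, so it can expose it as a plain field, and it ignores child elements it does not know about so that later additions to the format do not break it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class User:
    login: str
    real_name: str

class Passwd:
    """Load a (hypothetical) XML passwd file into native Python objects."""

    def __init__(self, xml_text):
        self.user = {}
        for el in ET.fromstring(xml_text).iter("user"):
            # Assumption: exactly one <real_name> per <user>; other
            # children (including future additions) are simply ignored.
            login = el.get("login")
            self.user[login] = User(login, el.findtext("real_name").strip())

# Hypothetical sample data.
SAMPLE = """<passwd>
  <user login="abc01">
    <real_name> Example User </real_name>
    <shell>/bin/sh</shell>
  </user>
</passwd>"""

passwd = Passwd(SAMPLE)
print(passwd.user["abc01"].real_name)
```

Here the knowledge of the format is hand-written into the loader; the suggestion in the comment is that a tool could derive the same thing from the DTD.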

Re:But XML is great for computers... on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 01:48 · Score: 5, Insightful

You mean like most other non-xml config files in /etc, like say hosts, DNS zone files, named.conf, passwd/shadow, hosts.allow/deny, sendmail.mc or resolv.conf (etc. etc.)? These have standard layouts, text-based, can be edited by hand and can be easily parsed.

You just gave the best argument for adopting XML as widely as possible. Yes, all these can be parsed (with the possible exception of sendmail's config files which may be Turing-complete) but they all require *different* code for each config file. If they were in XML you'd still need different semantic code, of course, but a whole wodge of syntax issues (how do I quote strings, how do I escape newlines, how do I mark nested scopes, what happens when the string delimiter character occurs inside a string, how do I deal with comments, what is the character set, is there a formal grammar for the document, etc etc) would be dealt with. Maybe not in the way that you or I think is perfect - IMHO XML is a little bit verbose compared to say Lisp- or Tcl-style encodings. But they would be dealt with *once*. No need to learn a new or almost-the-same-but-slightly-different set of syntactic conventions for every single config file.

Maybe XML is over-used for a lot of things, but making up your own file format is definitely over-used a lot more. Simple line-oriented files are reasonable to have as plain text, for everything else please avoid the temptation to reinvent the wheel by devising a new syntax and block structure.

Re:Maybe he should have read Knuth on XML Co-Creator says XML Is Too Hard For Programmers · 2003-03-18 01:40 · Score: 5, Informative

XML parsing (just like the TeX language) has a problem that when there are problems in the input files, the situation diverges into two different cases, one requires an infinite memory and the other infinite time to deal gracefully with errors.

WTF? Perhaps you could explain more about these two cases. As far as I know, general XML parsers such as Expat do not require unlimited memory to parse any finite input document, nor do they require infinite time.

The Document Type Definition (DTD) system is equivalent to a BNF grammar for XML documents. It's not quite as flexible as a full BNF because it enforces that elements are correctly nested, but I don't see this as a bad thing.

And yes, DTDs are machine readable. Other grammars for XML documents such as DSD, XML Schema or Relax-NG are also machine readable.

Just as with BNF grammars and flex(1), you can take a DTD and generate an efficient parser from it using FleXML.

Comparisons with TeX aren't really appropriate because TeX is a Turing-complete language, and so impossible to parse automatically in 100% of cases (unless you want to allow that your program will sometimes fail to terminate, i.e. hang, on particular input files). I don't know what you mean by your subject line 'Maybe he should have read Knuth'...

Re:"Definitive"? on The Definite Desktop Environment Comparison · 2003-03-18 00:13 · Score: 1

Blackbox is not a desktop environment, it is a window manager. There is not a corresponding file manager, web browser, email client, control panel and set of applications. You could mix and match these and review a combination of different programs used together, but such a combination would be marked down for inconsistency in UI. And rightly so.

Window Maker might count if you used it together with GNUstep applications. However it doesn't seem that GNUstep has the momentum or application base of GNOME or KDE.

Re:A good idea, with some problems on O-STEP In The Limelight · 2003-03-17 03:52 · Score: 1

What happens if they set the escrow value at $50M but only sell $40M of software? Then the early adopters lose out, since now they are stuck with a product as proprietary as MS Word, but not as popularly supported.

In other words, just the same as if they bought proprietary software that was not escrowed. It certainly doesn't make the software a worse proposition.

Re:Audience on CIOs Looking At OSS · 2003-03-17 03:40 · Score: 3, Interesting

Maybe CIO Magazine is not read by real CIOs but by wannabes, similarly to how Just Seventeen is not read by seventeen-year-olds.

Re:What's wrong with PowerPoint? on Using Memory Errors to Attack a Virtual Machine · 2003-03-17 00:28 · Score: 1

You're quite right, no message is effectively delivered unless with PowerPoint. Like this.

Re:Design "Consultants" on Design Guru Critiques Apple Retail Store · 2003-03-16 22:35 · Score: 1

If traffic drives on the right then pedestrians should surely walk on the left, to face the oncoming traffic. Conversely if traffic drives on the left then pedestrians should keep right.

Re:Random Programming on Analysis of SCO vs. IBM · 2003-03-14 05:37 · Score: 1

The decisions of programmers are indeed mostly random, that's why we have testing. As an insightful Slashdot contributor pointed out, you can consider programming as random mutation and then the test suite as natural selection - so software evolves rather than being designed. This is not quite the whole truth but it is pretty close, at least for any developer who has the comfort of a good set of regression tests.

Re:SuSE... on SuSE 8.2 Announced · 2003-03-13 23:44 · Score: 1

That's surely the most sucky thing about Mandrake - inability to report bugs. Even once you've managed to find the bug tracker at http://qa.mandrakesoft.com/ (it doesn't seem to be publicized much), you can't enter any bugs against the current Mandrake release. If you do report a bug in the current version you get told that the Bugzilla is for Cooker only.

Red Hat aren't always that responsive with their Bugzilla reports, but at least they provide some way to report bugs. Mandrake give the impression of not caring whether their current release is buggy or not. (And no, online forums are not an adequate substitute for a bug tracker.)

Re:Outlook Spam Control on Forty Percent of All Email is Spam · 2003-03-13 08:25 · Score: 1

My point is that there may be legitimate messages which contain the words 'penis', 'adult' etc. Not that your filter isn't a good 99% solution, but it's ironic that many messages discussing email filtering solutions would get blocked by it.

Re:Neato on Red Hat Announces Enterprise Linux · 2003-03-13 07:15 · Score: 3, Interesting

I've usually found the word 'Enterprise' in the title to be a sure indication of a crap product. It sounds so 1999.

Re:Outlook Spam Control on Forty Percent of All Email is Spam · 2003-03-13 06:58 · Score: 1

By the system you describe, the message you just posted would be discarded as spam if you sent it to a mailing list rather than Slashdot.

Re:Technological solutions will be easiest on Forty Percent of All Email is Spam · 2003-03-13 06:48 · Score: 1

I'm not suggesting paying real money for sending messages, only 'payment' of a few seconds of CPU time using Hash Cash or a similar system. This will not be a problem for anyone except spammers, who will no longer be able to send millions of messages. If each message requires about five seconds of computing time to generate the postage, a spammer couldn't manage more than 20k messages per day from a single machine, which is not enough to make it profitable.
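
A simplified sketch of the proof-of-work idea, in Python. This is not the real Hash Cash stamp format: the `recipient:counter` layout, the `mint` and `verify` names, and the 12-bit difficulty (chosen so the example runs quickly) are all assumptions for illustration. The point it shows is the asymmetry: minting a stamp takes many hash attempts, checking one takes a single hash.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Count the zero bits at the start of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def mint(recipient, bits):
    """Search for a counter whose stamp hashes to `bits` leading zero bits."""
    for n in count():
        stamp = f"{recipient}:{n}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp

def verify(stamp, recipient, bits):
    """Checking a stamp costs one hash, however long minting took."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits

# Hypothetical recipient address.
stamp = mint("someone@example.org", 12)
print(stamp, verify(stamp, "someone@example.org", 12))

# The arithmetic from the post: at five seconds per stamp, one machine
# running all day can mint 86400 / 5 stamps.
print(24 * 60 * 60 // 5)  # 17280, under the 20k ceiling mentioned
```

Because the recipient's address is part of the stamp, a stamp minted for one address is useless for another, so the cost has to be paid per recipient.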

(Although I don't see how wanting payment for use of a resource belonging to me - my time and my network bandwidth - is particularly communist.)

Re:Technological solutions will be easiest on Forty Percent of All Email is Spam · 2003-03-13 05:47 · Score: 2, Interesting

Yes, a faster relay server can send more messages than a slower one. Spammers with access to fast machines can send more messages. But even if you have a very fast machine the number of messages you can send per second is far, far less than currently possible.

All this depends on the existence of open relay servers which take messages and compute the postage for them, presumably to support legacy email clients which don't add postage for themselves, and moreover are misconfigured to accept incoming messages from anywhere. Presumably these servers would not be any more numerous than open SMTP relays are now.

You're right that mailing lists are a problem. Such addresses would have to be explicitly whitelisted by their subscribers - or maybe if you tell your mail program 'I am subscribed to misc-discuss@goatse.cx' then it would accept messages which had valid postage for that mailing list address as well as those with valid postage for your own address.

For systems like AOL there is no extra load on the server because the postage can be added at client machines - you see the hourglass for a few seconds after pressing 'send', or more likely, the postage is computed in the background while the message is in the outbox. At least, this is how I think it is intended to work: the Hash Cash site doesn't say specifically whether postage should be computed at the client or on the mail server. But IMHO doing it end-to-end is better.

Re:What say you "just hit delete" crowd? on Forty Percent of All Email is Spam · 2003-03-13 05:28 · Score: 1

It's not practical to keep my email address secret because I want people to be able to contact me, and I cannot decide in advance what this set of people will be.

However, at the moment some senders take advantage of this to send spam messages which are of no interest to me. So I would like to have some limited 'barrier' such as a small postage charge (in cash or, more practically, in CPU cycles) for each message. Keeping my address secret is much too high a barrier because then not even real people would be able to contact me.

We want a system that allows worthwhile, useful messages to get through, even if they are from previously unknown senders, while blocking junk mail. Keeping your address secret errs on the side of blocking too much; the current system errs on the side of blocking too little. A good solution IMHO is to make the sender assert that his message is useful and not junk and put his money where his mouth is by making a small payment for each message.

Surely you are not suggesting that every email address should be kept secret?

Re:What say you "just hit delete" crowd? on Forty Percent of All Email is Spam · 2003-03-13 05:22 · Score: 1

But you seemed to be saying that fixes for the spam problem were not necessary because people could just keep addresses secret instead, and that would fix it.

'There's a solution, it's in using email intelligently.' Well that is a solution for you personally but I'm sure you agree it is not a decent answer to the spam problem for the Internet as a whole. The fact that some people get by, managing with some effort to keep addresses secret and avoiding spam, is not a reason to stop looking for a more robust and permanent answer to spam.

Re:Technological solutions will be easiest on Forty Percent of All Email is Spam · 2003-03-13 05:19 · Score: 2, Interesting

Even if there are thousands of open, postage-adding relays, this will be an order of magnitude less spam than the current situation of thousands of open relays that don't need to add postage. Really, which is worse: spammers abusing a host to send hundreds of messages a second, or spammers abusing a host to send one message every five seconds? Whichever way you look at it, open relays or no open relays, requiring computationally expensive postage will greatly limit the number of spams that can be sent.

You are right that mailing lists would be a problem, but most non-technical users don't subscribe to mailing lists surely? They use web discussion forums or whatever. I don't see customer notifications as a problem, surely each customer doesn't get more than three or four notifications each month and that is certainly manageable. Sending out huge numbers of messages to _all_ your customers isn't feasible, and that is the point.

In a perfect world we would have real cash payments for mail (IMHO); one cent per message or something like that, with the possibility to waive payment for known senders. But that is hard to implement so hash cash is a compromise solution. In any case you have to compare the disadvantages of a hashcash-based system with the current spam-ridden Internet mail system, unless you have an alternative to propose.


Comments · 4,579