Help/Opinions on Parsing OFX Files?
innerweb asks: "I am looking for help and advice on using and parsing the OFX (Open Financial Exchange) file spec using C/C++ and/or Perl. I have read the standards, downloaded the DTD (ofx version 2), and tried to parse several files from different banks. They have all failed in my normal parsers (commercial and OSS), yet they load fine in Microsoft Money. It is not so complicated that I can not hand roll my own, and I have much of it working that way as a proof, but I would rather stick with something that is standards based, as this is a standard that in my opinion ought to work with standards based tools. Am I missing something here, or is this truly a file format that is broken as a feature?"
"I know the files are malformed when they come down, as they are missing the normal XML and SGML file headers (?XML or !DOCTYPE) that define the DTD to use to parse the file. I know that the document is not 'well formed' as I understand it, as most of the tags in the datafile are not closed (open tag, but no corresponding closing tag). When I fix these errors, the files seem to parse. Yet from what I have seen, MS Money takes in the same raw data and parses it. Microsoft lists the OFX file format as XML in some places and SGML in others. The OFX website seems to be saying this is SGML, not XML (XML is a subset of SGML in most cases, but the way it is *used* sometimes it is not really SGML at all.)
I have been reading like mad for a few weeks on OFX format files and usage, but not getting much useful information. I have worked with SGML and XML in the past, so I am at least familiar with these *conventions*. I need to be pointed in the right direction, and/or told what I am doing wrong or overlooking. I know it is probably something obvious, but somehow I am not getting it.
Thanks in advance for any help that you can throw my way."
What makes you think that CheckFree, Intuit and Microsoft would make it easy for a bunch of OSS developers to work with "their" standard?
Fix the XML, then parse. Microsoft's parser is probably broken in a way that it doesn't look for closing tags (think, regular expression matching for the open tag, then extracting data until it hits the next tag). It's not beyond Microsoft to break their own software just to make it difficult for others to use the format, but it does make it harder on us developers to keep going.
So, I'd try finding a way to rebuild the XML file before parsing it, but only after detecting if it needs it.
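A minimal sketch of that "detect, then rebuild" idea in Python, using only the standard library. The function names (`is_well_formed`, `close_leaf_tags`, `repair_if_needed`) are made up for illustration, and it assumes the payload starts at an `<OFX>` tag and that any element carrying data is a leaf whose closing tag may have been omitted. It does not escape stray `&` characters, which would still trip an XML parser.

```python
import re
import xml.etree.ElementTree as ET

# An opening tag, the data that follows it, and an optional matching close.
LEAF = re.compile(r"<([A-Za-z0-9_.]+)>([^<]+)(?:</\1>)?")


def is_well_formed(text):
    """Return True if the text already parses as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def close_leaf_tags(body):
    """Add the missing closing tag to every element that carries data."""
    def fix(match):
        name, value = match.group(1), match.group(2).strip()
        if not value:
            # Only whitespace followed the tag: an aggregate, leave it alone.
            return match.group(0)
        return f"<{name}>{value}</{name}>"
    return LEAF.sub(fix, body)


def repair_if_needed(text):
    """Drop anything before <OFX>, then rebuild only if parsing fails."""
    start = text.find("<OFX>")
    body = text[start:] if start >= 0 else text
    return body if is_well_formed(body) else close_leaf_tags(body)


SAMPLE = """OFXHEADER:100
DATA:OFXSGML

<OFX>
<STMTTRN>
<TRNTYPE>DEBIT
<TRNAMT>-12.50
</STMTTRN>
</OFX>
"""

if __name__ == "__main__":
    root = ET.fromstring(repair_if_needed(SAMPLE))
    print(root.find("STMTTRN/TRNAMT").text)
```

Files that are already well-formed pass through untouched, so the same entry point can be used for both kinds of download.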
"Victory means exit strategy, and it's important for the President to explain to us what the exit strategy is." G.W.Bush
You did check GnuCash's importers to see how they did it, right?
If you have written code that works better than the open source code you have tried, but you'd rather use said open source code, isn't it obvious that you should send some patches? That's how open source works, you know.
Sincerely,
Pan Tarhei Hosé, PhD.
"Homo sum et cogito ergo odi profanum vulgus et libido."
http://freshmeat.net/search/?q=ofx&section=projects&Go.x=0&Go.y=0
Also, have you tried any SGML parsers? Despite XML being a 'subset' of SGML, SGML allows many constructs not allowed in XML.
It's SGML, not XML. Unless you insist on doing it the hard way with a real SGML parser, I can't see what's wrong with using your own hand-rolled one. As you've already recognized yourself, it shouldn't be too hard.
Other alternatives would be to have a preprocessor that converts it to XML, or maybe use some too-tolerant XML parser. On the other hand, if the file format isn't XML, I can't see why it would be easier to treat it as if it were.
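For the hand-rolled route, here is a rough Python sketch that reads the SGML directly instead of pretending it is XML. Everything in it (`parse_ofx_sgml`, `_add`, the sample data) is hypothetical. It assumes a tag followed by data is a leaf, a tag followed by nothing opens an aggregate, and only aggregates need their closing tags. A leaf with an empty value would be misread as an aggregate, so treat this as a starting point rather than a conforming parser.

```python
import re

# One token = optional "/", a tag name, and whatever text follows the tag.
TOKEN = re.compile(r"<(/?)([A-Za-z0-9_.]+)>([^<]*)")


def _add(parent, name, value):
    """Store value under name; repeated names are collected into a list."""
    if name not in parent:
        parent[name] = value
    elif isinstance(parent[name], list):
        parent[name].append(value)
    else:
        parent[name] = [parent[name], value]


def parse_ofx_sgml(text):
    """Build a nested dict from OFX-style SGML with unclosed leaf tags."""
    root = {}
    stack = [("", root)]  # (aggregate name, dict being filled)
    for closing, name, data in TOKEN.findall(text):
        data = data.strip()
        if closing:
            # Unwind to the matching aggregate; ignore closes for leaves.
            if name in [n for n, _ in stack]:
                while stack.pop()[0] != name:
                    pass
        elif data:
            _add(stack[-1][1], name, data)      # leaf element
        else:
            child = {}
            _add(stack[-1][1], name, child)     # aggregate element
            stack.append((name, child))
    return root


SAMPLE = """OFXHEADER:100
<OFX>
<BANKTRANLIST>
<STMTTRN><TRNAMT>-1.00<NAME>A</STMTTRN>
<STMTTRN><TRNAMT>2.00<NAME>B</STMTTRN>
</BANKTRANLIST>
</OFX>
"""

if __name__ == "__main__":
    tree = parse_ofx_sgml(SAMPLE)
    for txn in tree["OFX"]["BANKTRANLIST"]["STMTTRN"]:
        print(txn["NAME"], txn["TRNAMT"])
```

Header lines before the first tag contain no `<`, so the tokenizer skips them without special handling.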
Am I missing something here, or is this truly a file format that is broken as a feature?
Mu. Yes, you are missing the distinction between XML and SGML. No, it's not broken as a feature, it just predates XML.
...would do anything straightforwardly, if they had a chance to "improve" it?
Got time? Spend some of it coding or testing
You've got me slashdot... Next time I'll be having trouble parsing some XML file and/or have programming question kind of "why this program doesn't print Hello World", I'll make sure that all the geek world knows that, by posting the question on slashdot... it's kind of that usenet thingy, right?
.. it may just be for his own personal use .. he may not release it to the wild, under any terms ..
or, to put it your way "who cares if you have herpes if you're a wanker, anyway?"
; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
Get someone to show you how to use google.com while you're at it.
They use LibOFX, available under the GPL
It's spelled 'OSX'
Based on that, I'm not even sure if it's even an SGML application, or if it's just something that looks like SGML?
Open Financial Exchange Specification 1.0.2, Section 2.3.1: But regardless, I'm surprised your SGML parser chokes so badly on this. It can't be worse than the crap that passes for HTML on most sites. What parser are you using?
And, um, you do know tags don't always need to be closed in SGML, depending on the DTD, right? Your writeup makes me wonder if you really understand SGML...
Indeed. The same thing happens with QIF. The spec is open, and after reading the first page you find out that most banks produce files that don't adhere to the spec. Solution: adjust the parser-- it's a simple fix with QIF. The problem is probably that they all use the same software to create the slightly defective QIF, and MS is aware of this (more likely is that they use MS software to produce the QIFs).
I've already started on a really bad OFX parser. You are welcome to my code. It has hooks to my budget program in it, but you can pull that out easily enough.
xmlparse.py
The site recommended:
GigsVT responded:
I think you missed the point. There is something the Slash devs can do as with every other web developer. Once more, it might even be wise if the Moz devs do this or an extension is created for the purpose. In the case of the Slash devs, they could use Javascript to automatically save the contents of all open forms automatically to cookies every timed period or every so many key strokes, whichever comes first. The contents could be encrypted with one's password he uses to log into the site, so that the form is still tied to the user, and even if you're on a public machine, your form data is both saved from loss, and secure from prying eyes. The data is saved, but /. doesn't need any special access to your HD, and you WON'T lose that entire post, even if your browser crashes due to some other issue, such as a plugin it spawned corrupting something it shouldn't have. When you return to the form, the cookie is recognized, and you're prompted that you can recover form data if you like.
Now, Mozilla doing this, for a secure environment, would likely require using the master password feature it already uses for saved password data. This wouldn't work as well in kiosk mode unless everyone who came up to the kiosk was asked to create a one-time, there-only password and username so they could recover their form data, should such a problem arise. (The usernames would expire after a short period of time, and their cookies/form data would be thrown away as appropriate.) If you don't remember your old username and password, just make up a new one. This might be a bit more troublesome and time-consuming than an average user would want, so this could be implemented in an extension to Mozilla which would only be activated if someone went to the trouble of going under the Tools menu and checking this addition to the Form Manager.
And, back on topic, Money isn't being proprietary. The format is SGML, not XML, as others have stated. You're jumping on the Microsoft-bashing bandwagon that was started by the article writer's initial ignorance. Closing tags can be mandatory, optional, or forbidden in SGML.
You like splinters in your crotch? -Jon Caldara
It uses the GPL, so it's worthless.
I built CheckFree's first OFX parser, and yes, we rolled our own. After looking at the DTDs and finding numerous mistakes, we decided that building our own would be easier, and IF the DTDs ever got cleaned up, we could use an SGML parser in a later version.
Sounds like the DTDs haven't gotten cleaned up.
You could try running it through the HTML Tidy program and see what happens.
Graham