Slashdot Mirror


Help/Opinions on Parsing OFX Files?

innerweb asks: "I am looking for help and advice on using and parsing the OFX (Open Financial Exchange) file spec using C/C++ and/or Perl. I have read the standards, downloaded the DTD (OFX version 2), and tried to parse several files from different banks. They have all failed in my normal parsers (commercial and OSS), yet they load fine in Microsoft Money. It is not so complicated that I cannot hand-roll my own, and I have much of it working that way as a proof of concept, but I would rather stick with something that is standards based, as this is a standard that in my opinion ought to work with standards-based tools. Am I missing something here, or is this truly a file format that is broken as a feature?"

"I know the files are malformed when they come down, as they are missing the normal XML and SGML file headers (?XML or !DOCTYPE) that define the DTD to use to parse the file. I know that the document is not 'well formed' as I understand it, as most of the tags in the data file are not closed (open tag, but no corresponding closing tag). When I fix these errors, the files seem to parse. Yet from what I have seen, MS Money takes in the same raw data and parses it. Microsoft lists the OFX file format as XML in some places and SGML in others. The OFX website seems to be saying this is SGML, not XML (XML is a subset of SGML in most cases, but the way it is *used* sometimes it is not really SGML at all.)

I have been reading like mad for a few weeks on OFX format files and usage, but not getting much useful information. I have worked with SGML in the past and XML, so I am at least familiar with these *conventions*. I need to be pointed in the right direction, and/or told what I am doing wrong or overlooking. I know it is probably something obvious, but somehow I am not getting it.

Thanks in advance for any help that you can throw my way."

49 comments

  1. You're surprised? by Smallpond · · Score: 0, Troll

    What makes you think that CheckFree, Intuit and Microsoft would make it easy for a bunch of OSS developers to work with "their" standard?

  2. I think you found your answer by ciroknight · · Score: 2, Interesting

    Fix the XML, then parse. Microsoft's parser is probably broken in a way that it doesn't look for closing tags (think, regular expression matching for the open tag, then extracting data until it hits the next tag). It's not beyond Microsoft to break their own software just to make it difficult for others to use the format, but it does make it harder on us developers to keep going.

    So, I'd try finding a way to rebuild the XML file before parsing it, but only after detecting if it needs it.
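A minimal sketch of the "detect, rebuild, then parse" idea, in Python. It assumes the convention described elsewhere in this thread: data-carrying elements omit their end tags, while container elements are closed explicitly. The sample payload, the header lines in it, and the helper names `needs_repair`, `close_leaf_tags`, and `parse_ofx` are illustrative only and are not taken from any bank's real output.

```python
import re
import xml.etree.ElementTree as ET

# An opening tag followed by text data, e.g. "<TRNAMT>-42.50", that is not
# already followed by its own end tag. Containers such as "<STMTTRN>" carry
# no text on the same line and are left untouched.
LEAF = re.compile(
    r"<([A-Za-z0-9.]+)>[ \t]*([^<\s][^<\r\n]*)(?=[<\r\n]|$)(?!</\1>)"
)


def needs_repair(body: str) -> bool:
    """Guess whether the payload is SGML-style (no XML declaration)."""
    return not body.lstrip().startswith("<?xml")


def close_leaf_tags(body: str) -> str:
    """Append the missing end tag to every data-carrying element."""
    def fix(match):
        tag, value = match.group(1), match.group(2).strip()
        return f"<{tag}>{value}</{tag}>"
    return LEAF.sub(fix, body)


def parse_ofx(text: str) -> ET.Element:
    """Skip any plain-text header, repair if needed, then parse as XML."""
    body = text[text.index("<"):]
    if needs_repair(body):
        body = close_leaf_tags(body)
    return ET.fromstring(body)


# Hand-written sample, assumed to resemble an SGML-style download.
SAMPLE = """OFXHEADER:100
DATA:OFXSGML

<OFX>
<STMTTRN>
<TRNTYPE>DEBIT
<TRNAMT>-42.50
<NAME>GROCERY STORE
</STMTTRN>
</OFX>
"""

root = parse_ofx(SAMPLE)
print(root.find("STMTTRN/TRNAMT").text)
```

This only handles the missing end tags. Other problems, such as an unescaped `&` in a payee name, would still trip an XML parser and need their own handling.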

    --
    "Victory means exit strategy, and it's important for the President to explain to us what the exit strategy is." G.W.Bush
    1. Re:I think you found your answer by brunes69 · · Score: 2, Insightful

      Microsoft's parser is probably broken in a way that it doesn't look for closing tags

      Or, you could say that Microsoft's parser is much more robust in how it deals with malformed documents.

      Seriously, you can't blame MS for writing a better XML parser. Just because a parser knows that

      <foo>this<bar>that</foo>
      ...is not valid XML, does not mean it needs to choke and die. MSXML can easily validate a document as well as parse invalid ones.

    2. Re:I think you found your answer by GigsVT · · Score: 1

      Well, yeah you can blame them.

      When a company with monopoly power writes software that accepts broken, non-standards-compliant input, it's creating a new standard, one that isn't published, since people will generally only work on their software until it produces an output that works with the monopoly's software.

      It's just a trick to take what was an open standard and turn it into one with secret formats and rules only MS knows.

      Robust would mean it wouldn't crash on broken input. That's fine. It should output an error or a warning and try to recover. Silently accepting broken input creates a new standard, one that is not well defined.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    3. Re:I think you found your answer by brunes69 · · Score: 1

      When a company with monopoly power writes software that accepts broken, non-standards-compliant input, it's creating a new standard, one that isn't published, since people will generally only work on their software until it produces an output that works with the monopoly's software.

      This is a non sequitur. Forget the fact that probably 60%+ of the internet is still using HTML 4.0 (which is non-compliant) - people are going to create invalid input regardless.

      Also, I don't really even get what you are asking - you want the *browser* to throw an error to tell the user that the HTML compliance *on a website he doesn't own or control* is invalid? How is that supposed to help anything?

      People *always* create invalid input. It is the job of the engineer to handle this input the best he can. No, this does not always include spitting out an error and saying "give me the input THIS way".

      I suggest you read this (specifically, see number 6).

    4. Re:I think you found your answer by ciroknight · · Score: 1

      I don't want a browser to throw an error when it can't parse a page, I would have liked it if way back when, it wouldn't have rendered a page that was broken. That way, the web developers could instantly tell something was wrong, and fix their code.

      MS embraces broken standards by making an XML parser that parses XML files that are not standards compliant, and in this case (MS Money), I think it should throw an error instead of parsing. That would have instantly made people call up their financial institutions and say "Hey, I got this error message, what do I do?", provoking them to fix their code. It's very simple.

      As for people always creating invalid input, that's bullshit. People create it because "it works" and not just to spite the engineer. Software, and computers in general, operate with a very simple contract: "I, the computer, will do anything you tell me to do, just as long as I can understand what you're telling me to do." If the user wants to let down his side of the contract, so should the computer.

      --
      "Victory means exit strategy, and it's important for the President to explain to us what the exit strategy is." G.W.Bush
    5. Re:I think you found your answer by GigsVT · · Score: 1

      This is all about context and intent.

      The browser should output a warning or error when fed bad HTML, but it could be in some debug window that isn't normally shown. I guarantee if browsers had this function, the web would be much more compliant than it is today, since instead of just "playing with the tag soup until it looks right on IE", they could actually see where they misnested an element and fix it, or whatever they need to do to make it valid.

      I'll use a company that is not much higher than MS in most people's view as an example.

      Adobe arguably has a monopoly in some areas. Yet, their software doesn't accept totally broken files generally (at least nowhere near the extent that MS does).

      As a producer of files, Adobe generally produces files that follow the PDF/PS standards that they have laid out. Their software does accept some broken files, but it doesn't seem to be their goal to create a new standard with hidden specs ("ebook" silliness aside).

      The intent is the key difference here. Adobe wants to encourage the use of an open standard that they have defined and given full documentation on, and even licensed for anyone to use, free of charge.

      MS wants to create broken standards that only they know the true definition of. The way they have done this without making a PR mess is to take existing standards and "extend" them in undocumented ways. Their monopoly power then encourages producers of said files to only work to the MS standard instead of the true one.

      About your URL:

      First to skip ahead to number 6, since that's what you pointed me at. My point is that with a structured language, there's no such thing as malformed unambiguous input. If you misnest a tag, how the hell am I supposed to guess what you meant? Should compilers also guess the 50 different ways you could screw up a "for" loop, and attempt to produce output anyway?

      Besides, his comments are aimed at things you get the user to enter, not things programmers do. Programmers should be held to a higher standard of creating unambiguous input, the input we create is many times more complex than the user's. That includes web developers.

      This also caught my eye (unrelated):

      Store (encrypted) information in cookies even before transfer to the server, so information is preserved from all but the most serious "melt-downs."

      This seems useless to me. If my computer loses power right now, I will lose this very long message I have typed into this very small box on this form on slashdot, and there's nothing the Slash devs can do to help that.

      Their point about local applications saving data every few seconds is more reasonable. A lot already do this... vim for example. I'll assume they didn't really mean "continuous save"... I don't want to wait 10 seconds for the text I type to appear because my hard disk happens to be bogged down. Hard disk write caches would also have to all become battery backed for this to work effectively.

      The comments on electrolytic capacitors to keep the entire computer up are unreasonable. 1 farad is one amp-second per volt. You'd need at least 60 or 70 1F 5V supercapacitors. Each one is $2 in bulk. You'd also need a special section of the power supply to step up the constantly falling capacitor voltage. Have fun.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    6. Re:I think you found your answer by Knights+who+say+'INT · · Score: 3, Insightful

      Tsk. It's called the "be tolerant in what you accept and strict in what you send" rule.

    7. Re:I think you found your answer by innerweb · · Score: 1
      <foo>this<bar>that</foo>

      ...is not a problem, as that is guessable from a generic parser's perspective, but something like:

      <foo>this<bar>that<bar2>thatagain</foo>

      presents problems with where to place the bar2. Is it a sibling to bar, or a child? By eyeballing the data, you can make an intuitive guess, and probably be right. Putting code into play that must always make the right decision on the other hand can not rely on intuition. Especially when it comes to money.

      BTW, MS's own XML parser chokes on these docs as well. Money can parse it, but, the msxml parser chokes on it.
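The sibling-or-child question above can be made concrete with a small sketch. A generic parser has no basis for choosing, but a parser that knows which children each element may contain can infer where the omitted end tags belong. The `CONTENT_MODEL` tables below are hypothetical stand-ins for information that would come from the DTD; they are not taken from the OFX DTD.

```python
import re

# Hypothetical content models: for each element, the children it may contain.
SIBLING_MODEL = {"foo": {"bar", "bar2"}, "bar": set(), "bar2": set()}
CHILD_MODEL = {"foo": {"bar"}, "bar": {"bar2"}, "bar2": set()}

# A token is an opening tag, a closing tag, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z0-9.]+)>|([^<]+)")


def parse(text, model):
    """Build nested [name, child, ...] lists, inferring omitted end tags."""
    root = ["#document"]
    stack = [root]
    for closing, name, data in TOKEN.findall(text):
        if data:
            if data.strip():
                stack[-1].append(data.strip())
        elif closing:
            # Implicitly close everything opened inside the named element.
            while len(stack) > 1 and stack[-1][0] != name:
                stack.pop()
            if len(stack) > 1:
                stack.pop()
        else:
            # Close open elements that are not allowed to contain this one.
            while len(stack) > 1 and name not in model.get(stack[-1][0], set()):
                stack.pop()
            node = [name]
            stack[-1].append(node)
            stack.append(node)
    return root[1:]


DOC = "<foo>this<bar>that<bar2>thatagain</foo>"
as_sibling = parse(DOC, SIBLING_MODEL)
as_child = parse(DOC, CHILD_MODEL)
print(as_sibling)
print(as_child)
```

The same input yields two different trees depending on the model supplied, which is why guessing without the DTD is risky for financial data.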

      InnerWeb

      --
      Freud might say that Intelligent Design is religion's ID.
    8. Re:I think you found your answer by innerweb · · Score: 1

      Yeah, I was kind of hoping that the obvious answer was the wrong answer. The stuff is not too complicated, there is just a lot of it (DTD). I have a parser now that does not choke on the data. It needs much testing and time to make sure it really always works right. There is no room for error when it comes to people's money.

      It is also another piece of code that has to be maintained and updated as updates on the OFX standard come out.

      InnerWeb

      --
      Freud might say that Intelligent Design is religion's ID.
    9. Re:I think you found your answer by jrumney · · Score: 1
      Seriously, you can't blame MS for writing a better XML parser.

      It is not better to silently ignore the fact that you are dealing with what looks like corrupt data.

    10. Re:I think you found your answer by jrumney · · Score: 1
      BTW, MS's own XML parser chokes on these docs as well. Money can parse it, but, the msxml parser chokes on it.

      I suspect that if you use the MSXML SAX parser, or any other SAX parser for that matter, you can ignore the exceptions it throws up and carry on parsing. Perhaps that is what MS Money is doing.

      I definitely wouldn't expect a DOM parser to handle such a broken document.
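To illustrate the event-based approach, here is a small sketch using Python's standard `html.parser.HTMLParser` as a stand-in for a tolerant, SAX-style parser. This is a substitution for illustration: it is not MSXML, and nothing here is known about how MS Money actually reads these files. The class name `EventCollector` and the helper `events_for` are made up for the example.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record start/end/data events without building a document tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data.strip()))


def events_for(text):
    """Feed the whole text through the parser and return the event list."""
    collector = EventCollector()
    collector.feed(text)
    collector.close()
    return collector.events


# The unclosed <bar> does not stop the event stream.
events = events_for("<foo>this<bar>that</foo>")
for event in events:
    print(event)
```

Because no tree is built, the unclosed `<bar>` never has to be resolved by the parser itself; the application decides what each event means. Note that `HTMLParser` lowercases tag names.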

    11. Re:I think you found your answer by JohnFluxx · · Score: 1

      It was meant for general user input.

      It's all well and good except when you have a feedback loop where the more tolerant you are, the worse the input gets as users get more and more sloppy.
      I don't want a 'tolerant' xhtml parser for instance. Otherwise we end up with the stupidity of html again.

    12. Re:I think you found your answer by arkanes · · Score: 1

      This is probably because these aren't XML documents, and Money doesn't use the XML parser to read them.

    13. Re:I think you found your answer by Anonymous Coward · · Score: 0

      In order to do that you would have to make your software bug for bug compatible with Microsoft's software, which is a nearly impossible task. What would you say if MSIE's better parser displayed a broken HTML page one way but Firefox's better parser displayed it a different way? That's right, Firefox would be considered to have the bad parser because it can't perfectly emulate MSIE including all the bugs. Even Microsoft's own security fixes can sometimes break websites that use broken output because those sites depend on specific behavior in MSIE.

  3. Gnucash. by Anonymous Coward · · Score: 2, Informative

    You did check Gnucash's importers to see how they did it, Right?

    1. Re:Gnucash. by ophix · · Score: 1

      worst. analogy. ever.

    2. Re:Gnucash. by ComputerSlicer23 · · Score: 1
      Are you just being silly now or what? There are two things to consider. First, you could use the GPL'ed code as a filter: take in said crappy file, output a good working file. More than likely, you could take the GNUcash code and write some "plumbing" or "glue" code to get this accomplished, where 99% of the real work is done by the already-written GPL'ed code. You just write the 1% to give you access to the interface the way you want it. Thus you wouldn't integrate all of that into your application; you'd just filter your files through it before passing them to your code written under the license of your choice.

      Second. The GPL isn't a patent license, it's a copyright license. US Patents protect the concept of how something is done. Copyright protects a particular expression of a concept.

      You can take a look at a GPL'ed implementation, and then re-implement your own from scratch using the same concepts. If you want to be really good about it, have someone else look at the GPL'ed implementation. They write a specification for how it works, and then you write an implementation from that specification. Thank you Compaq for setting that legal precedent!

      Kirby

    3. Re:Gnucash. by GigsVT · · Score: 3, Informative

      In addition to the other excellent replies, it's a misconception that code ever unwittingly "becomes GPL".

      If you use GPL code in your application in a way that violates the GPL, you have violated copyright law. That's it. The GPL doesn't control the remedies, the legal system does. Generally the remedy would be in the form of a cash settlement to the copyright owner of the software you violated the copyright on.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
  4. Isn’t it obvious? by Pan+T.+Hose · · Score: 2, Insightful

    If you have written code that works better than the open source code you have tried, but you'd rather use said open source code, isn't it obvious that you should send some patches? That's how open source works, you know.

    --
    Sincerely,
    Pan Tarhei Hosé, PhD.
    "Homo sum et cogito ergo odi profanum vulgus et libido."
  5. Did you bother searching? by rrsipov · · Score: 1

    http://freshmeat.net/search/?q=ofx&section=projects&Go.x=0&Go.y=0 Also, have you tried any SGML parsers? Despite XML being a 'subset' of SGML, SGML allows many constructs not allowed in XML.

    1. Re:Did you bother searching? by Rick+the+Red · · Score: 3, Funny
      Despite XML being a 'subset' of SGML, SGML allows many constructs not allowed in XML.
      I don't think that word means what you think it means.

      Or is this word the one you don't understand?

      --
      If all this should have a reason, we would be the last to know.
    2. Re:Did you bother searching? by ricochet81 · · Score: 1

      inconceivable

      --
      Error: Id10t detected
  6. Um, the answer is in the link you posted. by joto · · Score: 5, Informative
    OFX is based upon SGML and, like XML, it is an attempt to take the best features of SGML and remove much of the associated complexity. OFX is not technically an XML application. The syntax of OFX differs from that set out for XML applications in that OFX omits end-tags

    It's SGML, not XML. Unless you insist on doing it the hard way with a real SGML parser, I can't see what's wrong with using your own hand-rolled one. As you've already recognized yourself, it shouldn't be too hard.

    Other alternatives would be to have a preprocessor that converts it to XML, or maybe use some too-tolerant XML parser. On the other hand, if the file format isn't XML, I can't see why it would be easier to treat it as if it were.

    Am I missing something here, or is this truly a file format that is broken as a feature?

    Mu. Yes, you are missing the distinction between XML and SGML. No, it's not broken as a feature, it just predates XML.

    1. Re:Um, the answer is in the link you posted. by arkanes · · Score: 1

      The link the asker posted actually specifically states this - it's not XML, it's similar to XML but omits end tags to conserve space. Roll your own parser, buddy.

    2. Re:Um, the answer is in the link you posted. by innerweb · · Score: 1

      That is true. And on MS's site, it is called XML. No matter which it is, by XML standards it is broken, and by SGML standards it is broken. When you create a DTD for SGML, you have to actually note in the DTD whether or not a closing (or opening) tag is optional. They have not marked any tag in the DTD I have (from www.ofx.org) as optional.

      If they had marked tag closings as optional, then in the DTD you should see something like this:

      <!ELEMENT SEVERITY - O %SEVERITYENUM;>

      The - means the opening tag is not optional and the O means the closing tag is optional. In their DTD, they do have:
      <!ELEMENT SEVERITY %SEVERITYENUM;>

      which would mean that neither the opening nor closing tags are optional.

      Since a proper SGML document does not exist separate from its DTD, the data in the file must be marked up according to the associated DTD (which they have gone to the trouble of creating) or the document is broken. At least, that is how I learned SGML while working on legal documents and TEI. It may be yet that what I learned is bunk, but I hope not.
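The tag-minimization flags described above can be checked mechanically. The sketch below scans element declarations and reports whether each start and end tag is marked optional. It follows the reading given in this comment, treating absent flags as "both tags required", which is an assumption of the example. The regular expression is a simplification that does not expand parameter entities or handle name groups, and `tag_rules` is a made-up helper name.

```python
import re

# <!ELEMENT name [start-flag end-flag] content-model>
# Each flag is "-" (tag required) or "O" (tag may be omitted).
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+"
    r"(?:(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+)?"
    r"(?P<model>[^>]*?)\s*>"
)


def tag_rules(dtd_text):
    """Map each declared element to its omission flags and content model."""
    rules = {}
    for m in ELEMENT_DECL.finditer(dtd_text):
        rules[m.group("name")] = {
            # Absent flags are treated as "-" (required); see the lead-in.
            "start_optional": (m.group("start") or "-").upper() == "O",
            "end_optional": (m.group("end") or "-").upper() == "O",
            "model": m.group("model"),
        }
    return rules


# The two declarations quoted in the comment above.
with_flags = tag_rules("<!ELEMENT SEVERITY - O %SEVERITYENUM;>")
without_flags = tag_rules("<!ELEMENT SEVERITY %SEVERITYENUM;>")
print(with_flags["SEVERITY"])
print(without_flags["SEVERITY"])
```

Running something like this over the published DTD would show at a glance whether any element is declared with an omissible end tag.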

      InnerWeb

      --
      Freud might say that Intelligent Design is religion's ID.
    3. Re:Um, the answer is in the link you posted. by yason · · Score: 1
      It's SGML, not XML. Unless you insist on doing it the hard way with a real SGML parser, I can't see what's wrong with using your own hand-rolled one. As you've already recognized yourself, it shouldn't be too hard.

      A lot gets wrong with a hand-rolled one: SGML is a huge standard and there are a number of constructs that can be used while still adhering to the DTD.

      A better suggestion would be to use e.g. J. Clark's superb and complete SP SGML parser (nsgmls) to read the DTDs and parse the SGML files: nsgmls is able to dump the resulting document tree as an easily parsable linear stream (conceptually in SAX fashion).
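A sketch of consuming that linear stream from Python. It assumes the line-oriented output style in which each record starts with a one-character code: `(` for a start tag, `)` for an end tag, `-` for character data, and `C` when the document was found conforming. The sample below is written by hand for illustration rather than captured from a real nsgmls run, and escape sequences inside data records are deliberately left undecoded.

```python
def read_esis(lines):
    """Yield (event, value) pairs from ESIS-style output lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(":
            yield ("start", rest)
        elif code == ")":
            yield ("end", rest)
        elif code == "-":
            yield ("data", rest)  # escapes such as \n are not decoded here
        elif code == "C":
            yield ("conforming", "")
        # Other record types (attributes, processing instructions) are skipped.


def leaf_values(events):
    """Collect character data keyed by the path of enclosing elements."""
    path, values = [], {}
    for kind, value in events:
        if kind == "start":
            path.append(value)
        elif kind == "end":
            path.pop()
        elif kind == "data":
            values["/".join(path)] = value
    return values


# Hand-written sample in the assumed output format.
SAMPLE_ESIS = """\
(OFX
(STMTTRN
(TRNAMT
--42.50
)TRNAMT
(NAME
-GROCERY STORE
)NAME
)STMTTRN
)OFX
C
"""

values = leaf_values(read_esis(SAMPLE_ESIS.splitlines()))
print(values)
```

Keying by path means repeated transactions would overwrite each other; a real importer would collect a list per transaction container instead.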

    4. Re:Um, the answer is in the link you posted. by dschuetz · · Score: 1
      It's SGML, not XML

      Really? I was browsing what I thought was the official OFX site and ran across this:
      Since 2000, with the 2.0 specification, OFX has become XML 1.0 compliant and has added 1098, 1099 and W2 tax form download capabilities. ( http://www.ofx.net/ofx/ab_main.asp)

      Did it move from SGML to XML with 2.0 (5 years ago), or has it always been "XML" in spirit only?
    5. Re:Um, the answer is in the link you posted. by gstoddart · · Score: 1


      All very excellent points, but if you need an SGML parser to 'do it the hard way' (?) just grab a copy of James Clark's SP parser from here.

      You could probably use that as a starting point if your XML parsers don't like the doc format.

      Cheers.

      --
      Lost at C:>. Found at C.
  7. Or that mainstream banks... by leonbrooks · · Score: 1

    ...would do anything straightforwardly, if they had a chance to "improve" it?

    --
    Got time? Spend some of it coding or testing
  8. Heh by Anonymous Coward · · Score: 0

    You've got me, slashdot... Next time I have trouble parsing some XML file and/or have a programming question of the "why doesn't this program print Hello World" kind, I'll make sure the whole geek world knows by posting the question on slashdot... it's kind of like that usenet thingy, right?

    1. Re:Heh by HotNeedleOfInquiry · · Score: 1

      To which I would add "Welcome to the Real World"...

      --
      "Eve of Destruction", it's not just for old hippies anymore...
  9. He may not want to release his software .. by torpor · · Score: 1

    .. it may just be for his own personal use .. he may not release it to the wild, under any terms ..

    or, to put it your way "who cares if you have herpes if you're a wanker, anyway?"

    --
    ; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
  10. check out libofx by UncleBoy · · Score: 2, Informative

    Get someone to show you how to use google.com while you're at it.

    1. Re:check out libofx by innerweb · · Score: 1

      I did check it out. It looks promising, but it is GPL. The businesses that want the solution to include OFX do not want to have their internal code mixed with anything GPL.

      If it was just for me, or something I could release without worry, I would not hesitate to use it, but it would not be right to use it, distribute it and then not supply source code (which would be a breach of contract for me).

      Get someone to show you how to use google.com while you're at it

      Hmm. Nice.

      InnerWeb

      --
      Freud might say that Intelligent Design is religion's ID.
  11. LibOFX by Noksagt · · Score: 3, Informative

    They use LibOFX, available under the GPL

  12. Must be a typo... by Anonymous Coward · · Score: 0

    It's spelled 'OSX'

  13. Is it an SGML application? by Karma+Farmer · · Score: 1
    The link included with the article says, "OFX is based upon SGML and, like XML, it is an attempt to take the best features of SGML and remove much of the associated complexity."

    Based on that, I'm not even sure if it's even an SGML application, or if it's just something that looks like SGML?

    Open Financial Exchange Specification 1.0.2, Section 2.3.1:
    SGML is the basis for Open Financial Exchange. A DTD formally defines the SGML wire format for Open Financial Exchange. However, Open Financial Exchange is not completely SGML-compliant because the specification allows unrecognized tags to be present... Although SGML is the basis for the specification, and the specification is largely compliant with SGML, do not assume Open Financial Exchange supports any SGML features not documented in this specification.
    But regardless, I'm surprised your SGML parser chokes so badly on this. It can't be worse than the crap that passes for HTML on most sites. What parser are you using?

    And, um, you do know tags don't always need to be closed in SGML, depending on the DTD, right? Your writeup makes me wonder if you really understand SGML...
    1. Re:Is it an SGML application? by innerweb · · Score: 3, Informative
      I will be the first to admit it has been a while since I worked with SGML, but IIRC, in the DTD for an SGML doc, you mark optional tags with an O, so that it would look like this:
      <!ELEMENT elemname - O (#PCDATA) >
      where the - means the opening tag is required, and the O means the closing tag is not required.

      Whether or not I think I remember it, OFX has sent me back to books I have had in boxes for almost a decade now. Normally when I pull old books out like that, they are picture albums for the family, not old programming and data manuals.

      InnerWeb

      --
      Freud might say that Intelligent Design is religion's ID.
  14. QIF by twistedcubic · · Score: 1

    Indeed. The same thing happens with QIF. The spec is open, and after reading the first page you find out that most banks produce files that don't adhere to the spec. Solution: adjust the parser-- it's a simple fix with QIF. The problem is probably that they all use the same software to create the slightly defective QIF, and MS is aware of this (more likely is that they use MS software to produce the QIFs).

  15. I've started on one already by Anonymous Coward · · Score: 0

    I've already started on a really bad OFX parser. You are welcome to my code. It has hooks to my budget program in it, but you can pull that out easily enough.

    xmlparse.py

  16. Quick save (OT) by OldMiner · · Score: 1

    The site recommended:

    Store (encrypted) information in cookies even before transfer to the server, so information is preserved from all but the most serious "melt-downs."

    GigsVT responded:

    This seems useless to me. If my computer loses power right now, I will lose this very long message I have typed into this very small box on this form on slashdot, and there's nothing the Slash devs can do to help that.

    I think you missed the point. There is something the Slash devs can do, as can every other web developer. What's more, it might even be wise if the Moz devs do this or an extension is created for the purpose. In the case of the Slash devs, they could use Javascript to save the contents of all open forms automatically to cookies every timed period or every so many key strokes, whichever comes first. The contents could be encrypted with the password one uses to log into the site, so that the form is still tied to the user, and even if you're on a public machine, your form data is both saved from loss, and secure from prying eyes. The data is saved, but /. doesn't need any special access to your HD, and you WON'T lose that entire post, even if your browser crashes due to some other issue, such as a plugin it spawned corrupting something it shouldn't have. When you return to the form, the cookie is recognized, and you're prompted that you can recover form data if you like.

    Now, Mozilla doing this, for a secure environment, would likely require either using the master password feature it uses for saved password data already. This wouldn't work as well in kiosk mode unless everyone who came up to the kiosk was asked to create a one-time, there-only password and username so they could recover their form data, should such a problem arise. (The usernames would expire after a short period of time, and their cookies/form data thrown away as appropriate.) If you don't remember your old username and password, just make up a new one. This might be a bit more troublesome and consuming than an average user would want so this could be implemented in an extension to Mozilla which would only be activated if someone went to the trouble of going under the Tools and checking this addition to the Form Manager.

    And, back on topic, Money isn't being proprietary. The format is SGML, not XML, as others have stated. You're jumping on the Microsoft-bashing bandwagon that was started by the article writer's initial ignorance. Closing tags can be mandatory, optional, or forbidden in SGML.

    --
    You like splinters in your crotch? -Jon Caldara
    1. Re:Quick save (OT) by GigsVT · · Score: 1

      Regarding your last comment, after more people replied I did see that the article was misleading, but since this thread already went down this path I decided to continue.

      There are other places MS has done crap like the article accuses them of, even if this isn't one of them.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
  17. License sucks by Anonymous Coward · · Score: 0

    It uses the GPL, so it's worthless.

    1. Re:License sucks by Anonymous Coward · · Score: 0

      you're worthless, dweeb.

  18. Not surprising by Undertaker43017 · · Score: 1

    I built CheckFree's first OFX parser, and yes, we rolled our own. After looking at the DTDs and finding numerous mistakes, we decided that building our own would be easier, and IF the DTDs ever got cleaned up, we could use an SGML parser in a later version.

    Sounds like the DTDs haven't gotten cleaned up.

  19. HTML tidy by gfim · · Score: 1

    You could try running it through the HTML tidy program and see what happens.

    Graham

    --
    Graham