Slashdot Mirror


20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report (winbeta.org)

An anonymous reader writes from a report via WinBeta: A new report from scientists Mark Ziemann, Yotam Eren, and Assam El-Osta says that 20% of scientific papers on genes contain gene name conversion errors caused by Excel. In the scientific article, titled "Gene name errors are widespread in the scientific literature," article's abstract section, the scientists explain: "The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions."

It's easy to see why Excel might have problems with certain gene names when you see the "gene symbols" that the scientists use as examples: "For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep' and '1-Mar', respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession '2310009E13' to '2.31E+13'). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. 'SEPT2' converted to '2006/09/02'). This suggests that gene name errors continue to be a problem in supplementary files accompanying articles. Inadvertent gene symbol conversion is problematic because these supplementary files are an important resource in the genomics community that are frequently reused. Our aim here is to raise awareness of the problem."
You can view the scientific paper in its entirety here.

9 of 349 comments (clear)

  1. Wait, what? by Cinnamon+Beige · · Score: 4, Insightful

    I think the thing that really surprises me is that all my professors told me it was hard to get published, when failing to make sure your data was correctly entered into whatever spreadsheet program you used to for number crunching (and creating graphs) was one of the dead basics of working there. Yet 1 in 5 papers has notable failures here? And nobody noticed before publishing them? What kinds of major errands have gotten in, then, if basic spot checks are getting failed?

    1. Re:Wait, what? by AJWM · · Score: 5, Insightful

      I don't know how much number crunching was actually involved here. I suspect the problem comes from using a spreadsheet as a database.

      Because databases are, you know, hard.

      And "errands" creep in all over the place. ;)

      --
      -- Alastair
    2. Re:Wait, what? by AmiMoJo · · Score: 2, Insightful

      It's a shame that when defining open document formats they didn't bother to define an open macro format too. We have Excel using VBA macros, and OpenOffice using Python and BASIC. Google Docs uses Javascript. Wouldn't macro portability be great?

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  2. LaTeX by 0100010001010011 · · Score: 4, Insightful

    Why in God's name are you using a Microsoft product for scientific documents?

    1. Re:LaTeX by Anonymous Coward · · Score: 2, Insightful

      LaTex and/or TeX take time to learn. Time that can be used to play with genes.

      Excel is a GUI, crunches numbers, makes pretty graphs and keeps all your data in nice grids and is easy to learn (until you hit the details like these format errors). (La)Tex is great if you want to write scientific papers that are nicely laid out and can typeset mathematical formulas (Knuth's itch). However, you'll most likely still use Excel to do the crunching/graphing/saving.

      How many excel sheets have you seen where it's only purpose is to keep lists/whatever in grid format?

  3. Re:It was user error, not a spreadsheet problem .. by dcollins · · Score: 5, Insightful

    It could rather be a conversion error. For example, if you have the original in a CSV file (possibly output from one program) where strings have no lead colon, and then load into Excel or LibreOffice, it will (by default) turn everything it can into a numeric format. One needs to be aware of that and ask that the column be converted to text -- which is easy to overlook if you have a column that's mostly non-ambiguous, but somewhere far below is a single date-like name.

    I've gotten hit by this many times with CSVs coming out of our school's learning management system, with long numeric student IDs that get turned into scientific notation in the spreadsheet application. In some sense that's easier to catch, because it will hit a whole column of data at once; but even so it's distressing how often I need to backtrack to resolve that.

    --
    We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
  4. Software trying to be too smart by WaffleMonster · · Score: 4, Insightful

    Type inference in excel has wasted countless hours of my time trying to make sense of corruption caused by third parties using excel. Has gotten to the point where we actively recommend people avoid excel when handling any data they care about. I do fault excel itself because these errors are pervasive. They could have better structured the data imports or made them less creative or asked users for more feedback or have the import do a pass over the entire datasets checking for outliers that may suggest a different type.

    When a critical mass is "doing it wrong" becomes pointless and counterproductive in the real world to continue to point fingers at users. Tools are supposed to be useful and if they tend not to be then that's on them.

  5. A minor ephiphany by vtcodger · · Score: 5, Insightful

    Without diminishing the other comments, it crossed my mind that the issue here is probably not whether Excel was used in the research. It's one of getting backup (supplementary) data into publishable form. That'd occur after the authors(s) had written their paper using their normal toolset for their work and gotten the paper through review. At this point, they are supposed to package up their data in some format dictated by the journal they are publishing in. Apparently .xls is an acceptable format -- which is not irrational. The format is documented and widely supported.

    Anyway, the authors are just cleaning up and getting on with their lives -- cleaning the glassware (if any), paying any bills, archiving their data and scripts, returning borrowed equipment, etc. They are going to convert their data to .xls using whatever quick and dirty tool they can find. I doubt they are going to type tens of thousands of genome codes in manually. They'll use some tool they got from a buddy or write something themselves in Perl or Python or whatever scripting language they know. And they'll check the output to make sure that Excel loads it and that it's about the right length and that the first page or so and the last page look reasonable. And off it goes.

    I don't think most folks outside of IT (and probably most in IT) are all that aware of Excel's flaky and sometimes bizarre data conversions. And, assuming that there's an unambiguous one to one translation between gene codes and excel mangled gene codes, this probably isn't a big deal. Anyone using the archived data will scratch their heads, maybe ask around, figure out what's happened, fix the data, and get on with THEIR research.

    --
    You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
  6. Re:It was user error, not a spreadsheet problem .. by OneSmartFellow · · Score: 5, Insightful

    Oh, yeah, the single quote - so obvious that the solution to the problem of Excel's stupid, stupid, stupid default behavior of *silently modifying your data* is to put a freaking Single Quote character in front of every cell data element.

    Why can't the default behavior be that the data is just not modified, at all, unless I tell Excel I want it to be.