Slashdot Mirror


20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report (winbeta.org)

An anonymous reader writes from a report via WinBeta: A new report from scientists Mark Ziemann, Yotam Eren, and Assam El-Osta says that 20% of scientific papers on genes contain gene name conversion errors caused by Excel. In the scientific article, titled "Gene name errors are widespread in the scientific literature," article's abstract section, the scientists explain: "The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions."

It's easy to see why Excel might have problems with certain gene names when you see the "gene symbols" that the scientists use as examples: "For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep' and '1-Mar', respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession '2310009E13' to '2.31E+13'). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. 'SEPT2' converted to '2006/09/02'). This suggests that gene name errors continue to be a problem in supplementary files accompanying articles. Inadvertent gene symbol conversion is problematic because these supplementary files are an important resource in the genomics community that are frequently reused. Our aim here is to raise awareness of the problem."
You can view the scientific paper in its entirety here.

41 of 349 comments (clear)

  1. Wait, what? by Cinnamon+Beige · · Score: 4, Insightful

    I think the thing that really surprises me is that all my professors told me it was hard to get published, when failing to make sure your data was correctly entered into whatever spreadsheet program you used to for number crunching (and creating graphs) was one of the dead basics of working there. Yet 1 in 5 papers has notable failures here? And nobody noticed before publishing them? What kinds of major errands have gotten in, then, if basic spot checks are getting failed?

    1. Re:Wait, what? by AJWM · · Score: 5, Insightful

      I don't know how much number crunching was actually involved here. I suspect the problem comes from using a spreadsheet as a database.

      Because databases are, you know, hard.

      And "errands" creep in all over the place. ;)

      --
      -- Alastair
    2. Re:Wait, what? by pem · · Score: 5, Funny
      > Spreadsheets are wonderful things...

      Citation needed.

    3. Re:Wait, what? by MightyMartian · · Score: 2

      I use them a lot for doing budgets and cash flow forecasts. That's the kind of job they were originally designed for, and within that milieu they can't really be beat. But going much beyond, where you have to use more and more query-like functionality is where they start to break.

      --
      The world's burning. Moped Jesus spotted on I50. Details at 11.
    4. Re:Wait, what? by rtb61 · · Score: 2

      Spreadsheets are really useful tools but have limitations. One thing I did enjoy was using that spreadsheet layout for coding, it seems much more functional. One sheet with programming cells, another bunch of cells on the same sheet for output, various continuous variable checks, distributed where ever those checkpoints are required, multiple sheets with similar outputs in similar locations and notes all over the place. You could really turn a spreadsheet into a great custom coding interface, than once it works, clear out the spawl from the production version and leave it all in the development versions.

      When it comes to errors, gees, load up a custom dictionary and you are done. Switch to Libre Office for the spell check and share your new free open source custom dictionary, done and finished. Whether or not you have M$ Office installed, you should still install Libre Office for flexibility, it's free, does not take up much space and most definitely does not spy on you. Also leaves that easy option for private dual booting, Libre Office on Linux for private work away from the old digital one eye prying up where is most definitely does not belong, M$ backorifice on top of windows probe 10, ugh, the buggers don't even wear protection as their compulsory upgrades screw up people's systems, to load a new bunch of compulsory ads, wow, just wow.

      --
      Chaos - everything, everywhere, everywhen
    5. Re:Wait, what? by Dog-Cow · · Score: 3, Funny

      What kinds of major errands have gotten in...

      Grocery shopping, filling the gas tank and picking up the dry cleaning.

    6. Re:Wait, what? by AmiMoJo · · Score: 2, Insightful

      It's a shame that when defining open document formats they didn't bother to define an open macro format too. We have Excel using VBA macros, and OpenOffice using Python and BASIC. Google Docs uses Javascript. Wouldn't macro portability be great?

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  2. LaTeX by 0100010001010011 · · Score: 4, Insightful

    Why in God's name are you using a Microsoft product for scientific documents?

    1. Re:LaTeX by Anonymous Coward · · Score: 2, Insightful

      LaTex and/or TeX take time to learn. Time that can be used to play with genes.

      Excel is a GUI, crunches numbers, makes pretty graphs and keeps all your data in nice grids and is easy to learn (until you hit the details like these format errors). (La)Tex is great if you want to write scientific papers that are nicely laid out and can typeset mathematical formulas (Knuth's itch). However, you'll most likely still use Excel to do the crunching/graphing/saving.

      How many excel sheets have you seen where it's only purpose is to keep lists/whatever in grid format?

    2. Re:LaTeX by NotInHere · · Score: 4, Funny

      LaTeX is not free of problems either. They are just different. If you care, you take the time to fix them, if you don't you don't fix them. Simple as that.

      https://pbs.twimg.com/media/Ci...

    3. Re:LaTeX by GodelEscherBlecch · · Score: 3, Funny

      still use Excel to do the crunching

      Good god, I hope not. When I saw the title for this article I thought for sure it was referring to errors caused by the aggregation of questionable digits resulting from machine precision floating point operations, not something as simple as type conversions. Excel has been the bane of my existence for years because testers keep trying to use it to verify results from a data processing framework I wrote where the operations for some use cases involve 20+ digit decimals. No matter how many times I explain to them the concepts of machine vs. arbitrary precision, decimal precision vs. accuracy, rational vs. decimal representations of numbers, etc. the spurious 'rounding error / does not match the XLS' bug reports just keep coming. Drives me nuts. The idea that scientists may be making the same mistakes with important research is kind of scary.

      Then again, I am usually shocked by the amount of error considered tolerable in the scientific / EE applications of the framework. The real anal retentives are the financial use cases, which tend to include 'penny allocation' algorithms for distributing fractions of pennies left as remainders from dollar amounts in the 10s of millions, and they absolutely will file a critical severity issue over a .00000000001 discrepancy.

    4. Re:LaTeX by teakillsnoopy · · Score: 2

      As someone who proofreads engineering and medical papers for a living, MS Word is unfortunately the standard format (authors insert little images of complex typography into the main text; it's as bad as it sounds). Some equation-heavy papers use LaTeX, but they account for maybe 1% of the total.

    5. Re:LaTeX by dcollins · · Score: 2

      Non sequitor. Calculations need to be made somewhere in the workflow process, and LaTeX does not help with that.

      --
      We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
    6. Re:LaTeX by dcollins · · Score: 2

      As I said upthread, it's probably not a data-entry problem. More like a file conversion problem (importing text/CSV to the spreadsheet application).

      --
      We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
    7. Re: LaTeX by GodelEscherBlecch · · Score: 2

      No shit. I don't. What do you think 'arbitrary precision' means?

  3. Not strictly Excel's fault by Todd+Knarr · · Score: 5, Informative

    Those conversions look like cases where the column type during import was left at "General" instead of being set to "Text" as it should have been, telling Excel to try and infer the actual type from the format of the column's contents. It's an awkward situation where the user should be telling Excel what the data type for each column is, but it's not strictly Excel's fault for doing what the user told it to do. IMO Excel should be either changed to not have a default type and to not allow an import until the user's selected a type for each column, or it should throw up an error if it infers different data types for a column for different rows.

    1. Re:Not strictly Excel's fault by Anonymous Coward · · Score: 2, Interesting

      Just because the stupid default behavior can be changed doesn't excuse it from being stupid in the first place.

      (See also: ads built into Windows 10)

    2. Re:Not strictly Excel's fault by sexconker · · Score: 2

      Select top left cell, format as table, make sure "my data has headers" is checked.
      Click just below the header row to select and entire column of data (excluding the header row). Format as you please, validate as you please. You can even right click, insert a new column, then create a formula that verifies the data and outputs nothing when it's good and "OHSHIT" when it fails verification. Then you can non-destructively filter your table on "OHSHIT".

      If you're using Excel at least learn to use it. This shit is up there with people who hit Enter/Return twice to generate a blank line in Word.

    3. Re:Not strictly Excel's fault by blindseer · · Score: 2

      That's easy to say but people need to understand what they are looking for. As a Specialist in the Army I happened to be in the ops office while one of the sergeants was working on a spreadsheet that handled some sort of inventory. He knew I knew something about computers and so he asked me to come over to look at the funny formatting that Excel was doing on him. What I saw was a number that could have been a date, price, part number, or something else. I started asking the sergeant what he was trying to do with the number and what kind of a number it was supposed to be. It didn't take long before he became frustrated with me and I was told I had somewhere else I needed to be.

      This was probably something better managed by a database but databases are hard, and does Office even include a database product any more?

      --
      I am armed because I am free. I am free because I am armed.
  4. Further proof by ArchieBunker · · Score: 2

    That nobody reads all that shit. The only people reading scientific papers are the people writing scientific papers.

    --
    Only the State obtains its revenue by coercion. - Murray Rothbard
  5. Excel is (almost) a great grid oriented editor by Anonymous Coward · · Score: 2, Informative

    Many, many people use Excel as a grid oriented editor to capture a mix of text and numeric data. It's massive overkill, and the helpful features like automatic date conversion DO get in the way.

    But it's ubiquitously available (either in MS form or various and sundry clones)...so it gets used.

  6. Re:Excel can kiss my 5" wide anus! by FunkSoulBrother · · Score: 3, Funny

    The VisiCalc Song
     
    [ala' "Let's Get Physical", made popular by Olivia Newton-John]
     
    I'm savin' all of those back issues of "Byte"
    Making the micro conversion
    I gotta handle text just right
    Ya know what I mean?
     
    I took you to a local computer store
    Then to a compu-fair shopping spree
    There's nothing left to purchase now
    'less it's, programmability...
     
    [BEGIN Chorus (invoked later)]
    Let's get VisiCalc*, VisiCalc
    I wanna get Visi-Calc, let's invoke VisiCalc
    Let me hear your modem talk, your floppies squawk
    Let me hear your I/O rock...
    [END Chorus]
     
    I've used paper, I've used wood
    Tried to keep my pen on the table
    It's getting hard, this hardware stuff
    Ya know what I mean?
     
    I'm sure you understand what eleven's* do
    You know the software intimately
    You gotta know, you're bringing out
    the VisiPlot* for me...
     
    [Invoke Chorus]

  7. No surprise - same erorrs in finance & ops by RichPowers · · Score: 2

    In the year 2016, a disturbing amount of human activity is run through Excel instead of proper databases.

    A similar study from 2009 tested for errors in various operational spreadsheets and concluded, "Our results confirm the general belief among those who have studied spreadsheets that errors are commonplace." The Financial Times commented on the prevalence of spreadsheet errors in business, saying it's probably a function of training and organizational culture.

    I've heard from a few salespeople in the software industry that their biggest competitor in the SMB space isn't $BigCRMCorp, but Excel spreadsheets that have acreted over the years.

  8. It was user error, not a spreadsheet problem ... by perpenso · · Score: 2

    I don't know how much number crunching was actually involved here. I suspect the problem comes from using a spreadsheet as a database. Because databases are, you know, hard.

    It was user error, not a spreadsheet problem. Prefacing the names with a ' would have identified them as a string not to be interpreted as numeric or date. If the researchers couldn't manage this do you really think they could have used a database?

    That said, yes spreadsheet are overused and abused.

  9. Not even in top 10 mistakes by burtosis · · Score: 3, Funny

    Just do a google scholar search for large hardon collider. None of them will ever live that down, doubly so when it's the title

  10. Re:It was user error, not a spreadsheet problem .. by dcollins · · Score: 5, Insightful

    It could rather be a conversion error. For example, if you have the original in a CSV file (possibly output from one program) where strings have no lead colon, and then load into Excel or LibreOffice, it will (by default) turn everything it can into a numeric format. One needs to be aware of that and ask that the column be converted to text -- which is easy to overlook if you have a column that's mostly non-ambiguous, but somewhere far below is a single date-like name.

    I've gotten hit by this many times with CSVs coming out of our school's learning management system, with long numeric student IDs that get turned into scientific notation in the spreadsheet application. In some sense that's easier to catch, because it will hit a whole column of data at once; but even so it's distressing how often I need to backtrack to resolve that.

    --
    We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
  11. Software trying to be too smart by WaffleMonster · · Score: 4, Insightful

    Type inference in excel has wasted countless hours of my time trying to make sense of corruption caused by third parties using excel. Has gotten to the point where we actively recommend people avoid excel when handling any data they care about. I do fault excel itself because these errors are pervasive. They could have better structured the data imports or made them less creative or asked users for more feedback or have the import do a pass over the entire datasets checking for outliers that may suggest a different type.

    When a critical mass is "doing it wrong" becomes pointless and counterproductive in the real world to continue to point fingers at users. Tools are supposed to be useful and if they tend not to be then that's on them.

  12. A minor ephiphany by vtcodger · · Score: 5, Insightful

    Without diminishing the other comments, it crossed my mind that the issue here is probably not whether Excel was used in the research. It's one of getting backup (supplementary) data into publishable form. That'd occur after the authors(s) had written their paper using their normal toolset for their work and gotten the paper through review. At this point, they are supposed to package up their data in some format dictated by the journal they are publishing in. Apparently .xls is an acceptable format -- which is not irrational. The format is documented and widely supported.

    Anyway, the authors are just cleaning up and getting on with their lives -- cleaning the glassware (if any), paying any bills, archiving their data and scripts, returning borrowed equipment, etc. They are going to convert their data to .xls using whatever quick and dirty tool they can find. I doubt they are going to type tens of thousands of genome codes in manually. They'll use some tool they got from a buddy or write something themselves in Perl or Python or whatever scripting language they know. And they'll check the output to make sure that Excel loads it and that it's about the right length and that the first page or so and the last page look reasonable. And off it goes.

    I don't think most folks outside of IT (and probably most in IT) are all that aware of Excel's flaky and sometimes bizarre data conversions. And, assuming that there's an unambiguous one to one translation between gene codes and excel mangled gene codes, this probably isn't a big deal. Anyone using the archived data will scratch their heads, maybe ask around, figure out what's happened, fix the data, and get on with THEIR research.

    --
    You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    1. Re:A minor ephiphany by staalmannen · · Score: 2

      Since I am in biomedical sciences I probably can answer this : most are pretty computer illiterate and actually use MS Office for everything. Entering large data-sets in the spreadsheet is done by copy-paste and not by fancy scripts. In my lab, I am the only linux user, which sort of makes me have to resort to LibreOffice/Zotero for collaborative writing instead of doing LaTeX.

    2. Re:A minor ephiphany by Applehu+Akbar · · Score: 2

      What's wrong with simply preformatting the Name column as text?

  13. Re:It was user error, not a spreadsheet problem .. by Phillip2 · · Score: 2

    The problem is that the not all tools require a ' to stop this behaviour. And, in fact, adding this may well break these tools.

    Ultimately, this is not a new problem. It was first noted about a decade ago in yeast (which uses a lot of very date like gene names). It's a bit depressing it's still happening.

    I'm unconvinced that this can be classed as a user error, though. Excel is using a heuristic to determine the data type of a field (probably on a per cell, not per column basis). And that heuristic is failing.

  14. Re: It was user error, not a spreadsheet problem . by Phillip2 · · Score: 2, Interesting

    No, this would be a disaster. The acronyms have the advantage that they are, relatively, semantics-free., If we turned them into long hand, then they would describe the gene in some way. Which means that the descriptions in the knowledge would go out of date, or would have to be changed. It's a recipe for instability.

    It is very easy to laugh at biologists and think that you know how to manage data better than they do. In some cases, you may be right, but in this case it is not so. Identifiers are there to identify and not describe. This is something we learned with Linneaus, and have stuck with since.

  15. Financial 'Industry' and generally too. by hughbar · · Score: 2

    See http://www.forbes.com/sites/sa... for example. It's endemic. They'll probably derail civilisation as we know it eventually, bridges falling, weapons launched. Then, when we're back to the caves/trees and eating nuts and berries (not a bad life outside the cubicles, really), we will curse the evil god Ex-cel and provide blood sacrifices on altars (inscribed with A1, C2 etc.) to keep him away?

    --
    On y va, qui mal y pense!
  16. Re: Including this one? by Maritz · · Score: 5, Funny

    I can see how it helps with people who don't spell very well. But for people who can, it's an outright hinderance. Also, they should grow up and add profanity, The puritan dictionary is a ducking disgrace, utterly shot.

    --
    I do not want your cheap brainburning drugs. They are useless for work. And I am a working man today.
  17. Re:It was user error, not a spreadsheet problem .. by bickerdyke · · Score: 2

    Which is too late if the damage has already been done. And if the first 2000 lines look ok, no one is going to do that because, hey, the auto import worked!

    --
    bickerdyke
  18. Re:It was user error, not a spreadsheet problem .. by OneSmartFellow · · Score: 5, Insightful

    Oh, yeah, the single quote - so obvious that the solution to the problem of Excel's stupid, stupid, stupid default behavior of *silently modifying your data* is to put a freaking Single Quote character in front of every cell data element.

    Why can't the default behavior be that the data is just not modified, at all, unless I tell Excel I want it to be.

  19. Auto conversion is the bane scientists by Bohnanza · · Score: 2

    Once we ran all our British chemical patent info through software to convert it to good old-fashioned American English. All instances of Tungsten Carbide (WC) were converted to "Toilet".

    --

    -----

    Sorry, I'm only a 1336 h4x0r.

  20. It's actually 19.4% by radish · · Score: 2

    But they used the wrong rounding mode.

    --

    ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"

  21. Re:It was user error, not a spreadsheet problem .. by jbengt · · Score: 2

    This is just a display problem, not even a conversion problem.

    No, it is a conversion problem.
    True, if you make the column text in the import .csv "wizard" , it won't convert.
    But if Excel does convert text to a date or to a number, you lose the original text and have to import again to get it back.
    Numbers displayed in scientific notation (or any other format) and dates displayed however Excel is set up to display them, are stored internally as just numbers, and the original text is not saved anywhere.

  22. Re:It was user error, not a spreadsheet problem .. by jbmartin6 · · Score: 2

    You are right, I stand corrected. I thought that the converted cell would revert back to the raw data if the column type was changed, or if one exported as CSV. A little testing shows I was wrong.

    --
    This posting is provided 'AS IS' without warranty of any kind, implied or otherwise.
  23. self-response addendum by epine · · Score: 2

    Penn Jillette on Donald Trump, Hillary Clinton, And Why He's All in on Gary Johnson — 2 August 2016

    I watched this video yesterday. There a fabulous exchange 24:30–30:00 on truth and naievity.

    You go through a period when you're sixteen, seventeen, eighteen when truth really obsesses everybody. And then I think you're supposed to kind of sort of grow out of it. And I didn't. It really remains of complete interest to me. ... I'm not bothered at all by people being wrong. ... I have such a naive point of view, to almost not believing it, that people can have information and represent the opposite of that. I just find that so appalling and, in a certain way, fascinating.

    Once upon a time I would have ventured that most Slashdot readers would want to view this. It had me thinking about my own life 1985–1995 where I watched the software industry turning into a train wreck, where every seventh train car is painted bright orange and lettered in an ominous Area 51 black stencil font "patch Tuesday", with sparks flying off wheels seized (and reseized) for so long they resemble lopsided pentagons.

    I used to think to myself "surely these are just temporary conditions due to the extreme rate of expansion of the software industry, and it will all settle back down to sanity as we crest the exponential growth phase". But no. Like Jillette, I was a die-hard naievitarian. Lesson learned.