20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report (winbeta.org)
An anonymous reader writes from a report via WinBeta: A new report from scientists Mark Ziemann, Yotam Eren, and Assam El-Osta says that 20% of scientific papers on genes contain gene name conversion errors caused by Excel. In the scientific article, titled "Gene name errors are widespread in the scientific literature," article's abstract section, the scientists explain: "The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions."
It's easy to see why Excel might have problems with certain gene names when you see the "gene symbols" that the scientists use as examples: "For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep' and '1-Mar', respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession '2310009E13' to '2.31E+13'). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. 'SEPT2' converted to '2006/09/02'). This suggests that gene name errors continue to be a problem in supplementary files accompanying articles. Inadvertent gene symbol conversion is problematic because these supplementary files are an important resource in the genomics community that are frequently reused. Our aim here is to raise awareness of the problem." You can view the scientific paper in its entirety here.
It's easy to see why Excel might have problems with certain gene names when you see the "gene symbols" that the scientists use as examples: "For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep' and '1-Mar', respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession '2310009E13' to '2.31E+13'). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. 'SEPT2' converted to '2006/09/02'). This suggests that gene name errors continue to be a problem in supplementary files accompanying articles. Inadvertent gene symbol conversion is problematic because these supplementary files are an important resource in the genomics community that are frequently reused. Our aim here is to raise awareness of the problem." You can view the scientific paper in its entirety here.
Those conversions look like cases where the column type during import was left at "General" instead of being set to "Text" as it should have been, telling Excel to try and infer the actual type from the format of the column's contents. It's an awkward situation where the user should be telling Excel what the data type for each column is, but it's not strictly Excel's fault for doing what the user told it to do. IMO Excel should be either changed to not have a default type and to not allow an import until the user's selected a type for each column, or it should throw up an error if it infers different data types for a column for different rows.
Many, many people use Excel as a grid oriented editor to capture a mix of text and numeric data. It's massive overkill, and the helpful features like automatic date conversion DO get in the way.
But it's ubiquitously available (either in MS form or various and sundry clones)...so it gets used.