20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report (winbeta.org)
An anonymous reader writes from a report via WinBeta: A new report from scientists Mark Ziemann, Yotam Eren, and Assam El-Osta says that 20% of scientific papers on genes contain gene name conversion errors caused by Excel. In the scientific article, titled "Gene name errors are widespread in the scientific literature," article's abstract section, the scientists explain: "The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions."
It's easy to see why Excel might have problems with certain gene names when you see the "gene symbols" that the scientists use as examples: "For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep' and '1-Mar', respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession '2310009E13' to '2.31E+13'). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. 'SEPT2' converted to '2006/09/02'). This suggests that gene name errors continue to be a problem in supplementary files accompanying articles. Inadvertent gene symbol conversion is problematic because these supplementary files are an important resource in the genomics community that are frequently reused. Our aim here is to raise awareness of the problem." You can view the scientific paper in its entirety here.
It's easy to see why Excel might have problems with certain gene names when you see the "gene symbols" that the scientists use as examples: "For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep' and '1-Mar', respectively. Furthermore, RIKEN identifiers were described to be automatically converted to floating point numbers (i.e. from accession '2310009E13' to '2.31E+13'). Since that report, we have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. 'SEPT2' converted to '2006/09/02'). This suggests that gene name errors continue to be a problem in supplementary files accompanying articles. Inadvertent gene symbol conversion is problematic because these supplementary files are an important resource in the genomics community that are frequently reused. Our aim here is to raise awareness of the problem." You can view the scientific paper in its entirety here.
Just wondering
I think the thing that really surprises me is that all my professors told me it was hard to get published, when failing to make sure your data was correctly entered into whatever spreadsheet program you used to for number crunching (and creating graphs) was one of the dead basics of working there. Yet 1 in 5 papers has notable failures here? And nobody noticed before publishing them? What kinds of major errands have gotten in, then, if basic spot checks are getting failed?
Why in God's name are you using a Microsoft product for scientific documents?
due to the EULA MS owns your genes now!
Scientists are rewarded by "impact", so there's an incentive to publish early, publish often. Failing to do so leads to career death.
The "impact" does not count integrity, conservancy, or maintainability of data/code/other digital products. You don't get credit for your time and labor for improving the maintainability of code or the type-correctness of data fields in a table.
No wonder scientists often publish shit code and generate shit data.
Those conversions look like cases where the column type during import was left at "General" instead of being set to "Text" as it should have been, telling Excel to try and infer the actual type from the format of the column's contents. It's an awkward situation where the user should be telling Excel what the data type for each column is, but it's not strictly Excel's fault for doing what the user told it to do. IMO Excel should be either changed to not have a default type and to not allow an import until the user's selected a type for each column, or it should throw up an error if it infers different data types for a column for different rows.
http://www.treehugger.com/clean-technology/cigarette-butts-make-better-bricks.html
"There are about 6 trillion cigarette butts produced every year, creating about 1.2 trillion tons of cigarette butt waste. "
So 5 butts weigh 1 ton?
Selectively omitted non-factual click-bait Headline: "20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report"
Proper headline: "20% of Scientific Papers On Genes Contain NAME Conversion Errors Caused By Excel, Says Report" /. - we made you famous by coming here. There was a good reason for that. You are clearly getting worse. We don't need - or want - glitter and sparkles.
That nobody reads all that shit. The only people reading scientific papers are the people writing scientific papers.
Only the State obtains its revenue by coercion. - Murray Rothbard
Many, many people use Excel as a grid oriented editor to capture a mix of text and numeric data. It's massive overkill, and the helpful features like automatic date conversion DO get in the way.
But it's ubiquitously available (either in MS form or various and sundry clones)...so it gets used.
so basically any gene researching using shit tools like excel is wrong.
nice.
In the year 2016, a disturbing amount of human activity is run through Excel instead of proper databases.
A similar study from 2009 tested for errors in various operational spreadsheets and concluded, "Our results confirm the general belief among those who have studied spreadsheets that errors are commonplace." The Financial Times commented on the prevalence of spreadsheet errors in business, saying it's probably a function of training and organizational culture.
I've heard from a few salespeople in the software industry that their biggest competitor in the SMB space isn't $BigCRMCorp, but Excel spreadsheets that have acreted over the years.
...scientists are too fucking stupid to use Excel properly. Format your cells correctly. It's not Excel's fault. It's no different than any other occupation that needs to use the tool correctly.
"A plan fiendishly clever in its intricacies"- Homer Simpson
I don't know how much number crunching was actually involved here. I suspect the problem comes from using a spreadsheet as a database. Because databases are, you know, hard.
It was user error, not a spreadsheet problem. Prefacing the names with a ' would have identified them as a string not to be interpreted as numeric or date. If the researchers couldn't manage this do you really think they could have used a database?
That said, yes spreadsheet are overused and abused.
I think the thing that really surprises me is that all my professors told me it was hard to get published, ...
In well regarded publications. However at the other end of the spectrum there are publications that will print just about anything. How else do all the professors keep their jobs in a publish or perish world? There are not enough slots at the well regarded end of the spectrum to keep everyone employed.
Spreadsheets should be used to present data in the form of a view. Use databases to store data and spreadsheets to create views and statistics of it.
-SR
I think the thing that really surprises me is that all my professors told me it was hard to get published, when failing to make sure your data was correctly entered into whatever spreadsheet program you used to for number crunching (and creating graphs) was one of the dead basics of working there. Yet 1 in 5 papers has notable failures here? And nobody noticed before publishing them? What kinds of major errands have gotten in, then, if basic spot checks are getting failed?
It is hard-ish, at least in a number of fields, unless you are either a big name or doing something of popular interest at the moment.
Ultimately, though, this kind of error comes down to laziness more than vetting for quality--because a very minor editing mistake should not affect paper publication, but should be caught before publication. All it takes to spot the error is a careful reading of the paper by one of the professor, the grad student, a reviewer, an assistant editor, or an editor. And while hopefully they will add automated checks after this, the fact that that read is not taking place is discouraging. Scientists in particular should be reading every image or graph in their paper closely and asking "what exactly does this mean." The level of precision that goes into editing legal academic articles is insane (italicizing something wrong is a stoning offense), but you don't have to get to nearly that level of detail before at least three different people would spot an error like this.
Real lawyers write in C++
I think the real source of the problem is many scientists' insistence on naming everything using (what they think is) a clever acronym. it's become a virtual plague in most technical disciplines over the past two or three decades. I think it's directly related to the desire by scientists to somehow become famous beyond their little niches.
It's gotten so bad that, much of the time, the spelled-out names don't really even make sense.
#DeleteChrome
... instead of accepting that you are too incompetent to verify the results.
...mutations propelled life.
Table-ized A.I.
Just do a google scholar search for large hardon collider. None of them will ever live that down, doubly so when it's the title
The project managers who choose excel to enter the data are at fault here. They picked the app AND they did not do the proper quality control. How else can you really see it? It would be obvious at 20%, if that is true, that there were significant errors and the process itself needed to be changed. Instead of you have untold tens of thousands of lost labor hours of data collection for unreliable data in an industry where you certainly cannot settle for data variation. Blaming Excel is entirely lame and it's the same BS lack of accountability that everyone has these days, the same they complain about in their politicians and corporations, while offering none themselves. I see TONS of people use spreadsheets because it formats the data for them. They use it as Word Processor template more than a spreadsheet. This may not be the case here, but the point is you have to use the app within the limits of the app, not expect your needs to magically define the app without an investment of developer time and money to make that happen. I'm sure MS would be happy to work with the needs of an industry that large, wealthy and holding so much potential. Staring the conversation out by blaming their product seems lame. This is what happens when news is all for profit... the comment become the only thing worth read.. and then you realize half of them are just witty bots..........crapception complete
And is limited to how well you use it. As the article noted, OpenOffice and LibreOffice will do they same thing as well. They noted Sheets doesn't, but I can't get Google Sheets to handle dates consistently at all sometimes.
This is just an excellent example of what works for a large population of users can be a bad thing for a small set of users. After this paper, I expect the error rate to drop dramatically, given how easy the fix is.
Some scientists are fucking idiots.
It could rather be a conversion error. For example, if you have the original in a CSV file (possibly output from one program) where strings have no lead colon, and then load into Excel or LibreOffice, it will (by default) turn everything it can into a numeric format. One needs to be aware of that and ask that the column be converted to text -- which is easy to overlook if you have a column that's mostly non-ambiguous, but somewhere far below is a single date-like name.
I've gotten hit by this many times with CSVs coming out of our school's learning management system, with long numeric student IDs that get turned into scientific notation in the spreadsheet application. In some sense that's easier to catch, because it will hit a whole column of data at once; but even so it's distressing how often I need to backtrack to resolve that.
We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
Its never been good at math, now we know its not good at presenting data sets either.
Yes, it's quite annoying that acronyms are used, especially in an age where both random access memory and hard disk drive space is virtually unlimited. Even my solid state disk had many gigabytes more space than I need. Furthermore, bandwidth is not at a premium; my internet connection provides over 20 megabits per second through my asynchronous digital subscriber line modulator/demodulator.
That means genetic scientists can't be arsed to format the relevant columns as text?
I think you somehow managed to bolster the GP's point through some weirdly-inverted way. Acronyms are a form of compression to make communication easier. Cutesy acronyms whose long forms are twisted to fit the acronym do not ease communication.
.. fucks up data handled in program. Film at eleven.
Honestly, this isn't really news. I consider quite a few pieces of contemporary software on the brink of unusable.
That also includes modern Word processors, Excel and the Facebook UI. (I use facebook sporadically with a spoof account)
A software that tries to think for me without communicating this, that tries to babysit me (remember Clippy?) is bound to be somewhere between extremely annoying and dangerous, depending on the situation you want to use it in.
The first thing you do in Excel is turn off autocorrect entirely. I don't know if that works, but it should, otherwise Excel is more than worthless.
And, of course, in a spreadsheet autocorrect should be disabled by default. In fact, autocorrect should only run when users specifically tell it to, wether in a word processor, spreadsheet or whatnot - that's my opinion anyway.
This is not the first time that a modern Spreadsheet - or shall we just call it by it's name? - Excel has screwed up scientific data or other more-or-less critical data big time.
Excel is a powerful tool often used incorrectly or in situations where it simply doesn't fit. It's a highly complex tool and if it's got rounding errors or similar problems and doesn't tell you how to handle them when they arise with big numbers then somebody has built a shit product. Which with MS would not really a surprise to be honest.
I'd say a modern spreadsheet should warn you if you're using numbers that are to large for the programms math engine to handle without rounding errors or such. And you probably agree.
Conclusion: ... But I do also see the fault with MS and their mediocre software that has only been banking on it's quasi-monopoly for the last 18 years.
I do see fault in the scientists. After all, they should know the tools that they are using and not mess around like some n00b. Although it was probably some unpaid and overworked doctorate who entered the data and there's little to blaim those.
We suffer more in our imagination than in reality. - Seneca
Excel's auto-formatting causes me no-end of grief.
I receive technical sheets of tools from clients, and if you don't know how to process them, Excel will trim leading zeros from manufacturer part numbers and abbreviate long numbers to scientific notation and will convert sizes marked in fractions into dates . . . and in general turn data into meaningless mess.
And before you jump and yell at me that Excel isn't the right tool for data management - I know . . . There's no avoiding it, everyone out there uses excel.
I know all the moves and switches and tricks to avoid that, but for the life of me, I don't understand why isn't there a simple switch in the settings that will disable all that auto-formatting crap once and for all.
My favorite saying at work: If you're using Excel you're doing it wrong!
It's a gross generalization, but 98% of the Excel usage I've seen in in appropriate. I've been in a number of work places where XL has been used for stage gating designs, keeping track of stock, staff time sheets etc. Everyone of those usages are better served by a tool built for the job.
The other 2% has been used for plotting/analysing recorded data. I'll accept that Excel is probably the best/easiest tool for that!
Nobody reads them anyway
Type inference in excel has wasted countless hours of my time trying to make sense of corruption caused by third parties using excel. Has gotten to the point where we actively recommend people avoid excel when handling any data they care about. I do fault excel itself because these errors are pervasive. They could have better structured the data imports or made them less creative or asked users for more feedback or have the import do a pass over the entire datasets checking for outliers that may suggest a different type.
When a critical mass is "doing it wrong" becomes pointless and counterproductive in the real world to continue to point fingers at users. Tools are supposed to be useful and if they tend not to be then that's on them.
Here is why for those without the background.
Typed variables in a database completely eliminate this issue.
If the users had used a database, even one of the MS ones (one of which newbies can deal with within a week), that would have solved it which was the poster above's point that you were unable to grasp.
Of course they fucking could - how condescending can you get? These are scientists not the "beige box is a hard drive" crowd.
That was the best post of the thread! Combined appropriate experience and practical advice on an often overlooked element of data management.
Or maybe ctrl-A, right-click, Format Cells, "text". Then Excel won't change anything around, and will treat all cells as plain "text" and won't convert anything. Takes maybe 5-10 seconds.
Without diminishing the other comments, it crossed my mind that the issue here is probably not whether Excel was used in the research. It's one of getting backup (supplementary) data into publishable form. That'd occur after the authors(s) had written their paper using their normal toolset for their work and gotten the paper through review. At this point, they are supposed to package up their data in some format dictated by the journal they are publishing in. Apparently .xls is an acceptable format -- which is not irrational. The format is documented and widely supported.
Anyway, the authors are just cleaning up and getting on with their lives -- cleaning the glassware (if any), paying any bills, archiving their data and scripts, returning borrowed equipment, etc. They are going to convert their data to .xls using whatever quick and dirty tool they can find. I doubt they are going to type tens of thousands of genome codes in manually. They'll use some tool they got from a buddy or write something themselves in Perl or Python or whatever scripting language they know. And they'll check the output to make sure that Excel loads it and that it's about the right length and that the first page or so and the last page look reasonable. And off it goes.
I don't think most folks outside of IT (and probably most in IT) are all that aware of Excel's flaky and sometimes bizarre data conversions. And, assuming that there's an unambiguous one to one translation between gene codes and excel mangled gene codes, this probably isn't a big deal. Anyone using the archived data will scratch their heads, maybe ask around, figure out what's happened, fix the data, and get on with THEIR research.
You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
But really, if you've been using Excel that extensively, haven't you learned its quirks by now?
1. If entering text data, select the whole column and set its type. Or type a single quote (') before each text value.
2. If importing a CSV file, just make sure each text value in the file is ended with a tab (ASCII 9) character.
The problem is that the not all tools require a ' to stop this behaviour. And, in fact, adding this may well break these tools.
Ultimately, this is not a new problem. It was first noted about a decade ago in yeast (which uses a lot of very date like gene names). It's a bit depressing it's still happening.
I'm unconvinced that this can be classed as a user error, though. Excel is using a heuristic to determine the data type of a field (probably on a per cell, not per column basis). And that heuristic is failing.
No, this would be a disaster. The acronyms have the advantage that they are, relatively, semantics-free., If we turned them into long hand, then they would describe the gene in some way. Which means that the descriptions in the knowledge would go out of date, or would have to be changed. It's a recipe for instability.
It is very easy to laugh at biologists and think that you know how to manage data better than they do. In some cases, you may be right, but in this case it is not so. Identifiers are there to identify and not describe. This is something we learned with Linneaus, and have stuck with since.
Climatologists don't use Excel.
"The average reporter we talk to is 27 years old......They literally know nothing." - Ben Rhodes
See http://www.forbes.com/sites/sa... for example. It's endemic. They'll probably derail civilisation as we know it eventually, bridges falling, weapons launched. Then, when we're back to the caves/trees and eating nuts and berries (not a bad life outside the cubicles, really), we will curse the evil god Ex-cel and provide blood sacrifices on altars (inscribed with A1, C2 etc.) to keep him away?
On y va, qui mal y pense!
You'd be surprised how much overlap there is. Expertise in one domain is absolutely not correlated with expertise in another domain.
See also: (Medical) doctors. I'm sure just about everyone has been to a doctor that is gallingly close to computer illiterate.
Which is too late if the damage has already been done. And if the first 2000 lines look ok, no one is going to do that because, hey, the auto import worked!
bickerdyke
Oh, yeah, the single quote - so obvious that the solution to the problem of Excel's stupid, stupid, stupid default behavior of *silently modifying your data* is to put a freaking Single Quote character in front of every cell data element.
Why can't the default behavior be that the data is just not modified, at all, unless I tell Excel I want it to be.
This is all your fault clippy!
Here is why for those without the background.
Typed variables in a database completely eliminate this issue.
If the users had used a database, even one of the MS ones (one of which newbies can deal with within a week), that would have solved it which was the poster above's point that you were unable to grasp
And how would that helped when publishers requested their supplementary data in Excel format?
They are scientists and more likely than not the did use some tools for scientific number crunching ("R") or similar that kept and processed data in typed variables or even kept the raw data in an actual database. But you can't ship your Oracle server to reviewers by email and in lack of a document-like database format, peope are turning to Excel.
bickerdyke
Did they do that? Obviously you do not know and are making things up. Why bother to lie over something so trivial?
Given that Microsoft's target audience in every case is composed of business fucks, the issue is unsurprising.
then you only need a way to attach that database to that email with the excel file, too....
But yes, that would be the proper way. But we would still need a lightweight database-as-simple-document format. Heck, that even COULD BE Excel with a special type of data-worksheets in a document that enforce data structure and do not allow formatting. For all teh use cases where you don't need the performance or multi-user or transactions/data integrity/replication of an actual database server.
The task of mailing out a file with database-data should be much simpler.
bickerdyke
It sounds like Excel is introducing random gene mutations into the genome. It would be interesting to see how these mutations are manifested and passed along from generation to generation.
Oooh! I think I may have the topic of my next research project!!!!
It is well known for FAR longer than 10 years ago.
Back in 1998/99 inventory lists were corrupted the SAME WAY.
Misusing a tool is an old failure. This one may not be the researchers - though they are supposed to review the pre-publication documents, that still doesn't mean the publishers didn't screw it up for the final print.
The single quote won't help when pasting in multiple cells of unformatted data, however. One should always change the column's data type from something other than "General", such as to "Text", before pasting in unformatted data.
I'm a relative newbie to Excel but the first thing I learned was never to trust any of the cells where any calculations are performed.
As my spreadsheets got more and more complex I quickly realised small errors in one worksheet could manifest themselves in really ugly - but very subtle - ways. One simple-looking calculation on one worksheet could blow out an entire model if it there was even a small typo.
I suspect many people using Excel haven't learned this lesson yet. I was lucky that I noticed it myself before learning the hard way.
My solution was to have entirely separate worksheets where I would basically apply rough/simple TDD principles - have some known quantities and results in any complex calculation sections and make sure they were clearly visible at all time. That way as the spreadsheet evolves it can help you catch small errors before they ruin their day.
I'm sure pro Excel people have many more useful tricks.
So a fourth of all the papers, huh?
(I used Excel to convert to a fraction, for those people who don't understand percentages.)
Proper use of Schema.ini or renaming the file to .txt importing it and manually specifying which columns are text, date & numeric. Basically as far as the csv imports concerned it's the ODBC driver that's being too helpful
Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.
This is user error. The failure to format a cell or column is not a failure of the application, it is a failure of the user.
The exact same issue occurs if the data is entered into Open/Libre Office Calc. If you don't format the column and tell the application what type of data it is, then the application tries to be helpful and formats what are most often date abbreviations as dates.
SUrely, on a site full of techies and programmers, the concept of type declaration isn't so surprising,
One needs to be aware of that and ask that the column be converted to text -- which is easy to overlook if you have a column that's mostly non-ambiguous, but somewhere far below is a single date-like name.
My process for importing into a spreadsheet includes always making sure to check the data type for each column. If you do that, then it is very difficult to overlook.
The basic problem here is not that the tools function as designed, which is actually very useful. I use spreadsheets fairly regularly for mundane list-mangling tasks. The problem is people not knowing how to use the tools. You see this everywhere you go in academia. Literally every position on a college campus typically now requires familiarity with Microsoft Office, and just about every college campus has got a class in Office, yet an IT pro working at a college will spend a significant amount of time answering stupid questions about Office that anyone with even passing familiarity could answer, and which can trivially be answered with the help system. And you can identify the problem, and communicate it to their superior, and they still won't wind up enrolled in the Office class. They got the job on a fraudulent basis (claiming knowledge they didn't have) and the school would rather not educate instructors to bring them up to the requirements for some reason, probably because they'd have to pay them to attend the class which would qualify them to have the job they've already been given.
Perhaps before you're allowed to make a scientific paper, you should be forced to learn how to use a computer.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
...because it's not really Excel that is the problem. It's the user. They don't format the cell properly to accommodate the data.
Why can't the default behavior be that the data is just not modified, at all, unless I tell Excel I want it to be.
The default behavior is to treat the field as whatever you've told the spreadsheet that it is. By default, every cell is set up for numeric data types. The same is true on a CSV import. In either case there is a simple way to prevent it happening; select a text format for cells in the sheet, or select a text type for the column while importing. The problem is misuse of tools, not a problem with the tool. You wouldn't complain that a band saw is capable of removing fingers, would you?
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
The problem is people not knowing how to use the tools.
I disagree. The tool in question, Excel, used to be a tool for accounting, but it is not so anymore, since at least 2007.
The defaults "smart" conversions are harmful to general usage. Watch how regular users go around MS word auto-formatting and you'll see what I mean.
I personally turn off all of the automatic features since they ruin my flow and hamper my productivity. What I see from people who never bother to go to the preferences, is that they just OCD themselves out of the features, by already expecting to delete the automatic text that comes next...
Anyway: wrong tool for the job.
With things in the IT world ever-changing, it's difficult even for IT professionals to keep up sometimes. I imagine that scientists are only using the tools available to them in the best way they know how. I seriously doubt that much of their time is spent keeping up with whatever new shit excel implements. In fact, when I think of who gets shit on the most by the constant changes to the programs that we all use, it's scientists. We all know that probably most scientists have the brain-power to understand/adjust to changes in the programs that they use, but they probably don't have the time.
Politics; n. : A religion whereby man is god.
You aren't going to send a publisher your raw data in its original format... Excel is a super common way to transmit a subset of data from a database. There are new standards coming out next year for file formats for some sorts of studies. It should alleviate the issue some.
Wow! I'm amazed that an organization could make the mistake of calling the hadron collider a "hardon collider".
And why aren't the graduate students ruining their eyesight checking the data attachments for errors? That's their job.
Calculator ( a program ) has normal, programming, and scientific modes.... why can't Excel?
"Excel Is Mutating Our Genes!"
Yes! My thought on the article was, what is the problem? It's not like Excel alters the underlying data, all you have to do is correctly change the column type. This is just a display problem, not even a conversion problem. If someone is taking the data into some other tool it will be correct.
This posting is provided 'AS IS' without warranty of any kind, implied or otherwise.
Machines will rule the world, resistance is futile.
Once we ran all our British chemical patent info through software to convert it to good old-fashioned American English. All instances of Tungsten Carbide (WC) were converted to "Toilet".
-----
Sorry, I'm only a 1336 h4x0r.
If I open a new spreadsheet (without a template) the default behavior is to convert data types to randomness. Some numbers are interpreted as dates, and so on.
Not only that, but frequently, having converted it, I am unable to reset. Instead Excel will display #ERROR or some such nonsense, even though clicking on the cell will show the original data in the toolbar area.
Please tell me you understand why this is a design flaw. The default behavior for an application like Excel should be to preserve data, not convert it. Just as the default behavior for a Word Processor should not include making decisions about how I want text formatted without me explicitly formatting it - changing font, indentation, line spacing, etc. Yet, Microsoft Word does this too, and then doesn't even include an option to show me what formatting codes are responsible, so that I can remove them, and restore the text to un-formatted.
This is simply crap programming from MS, as per usual.
Since the beginning of Lotus 123 I've been entering "just text" into cells, either by formatting or originally by adding an apostrophe in front of the text. Just use the tools you've been given, properly.
It's not a conversion or a user error. The Excel sheet is applying random mutations on the data. It's called evolution. Look it up, buddy.
But they used the wrong rounding mode.
---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
While this is not the best example, for a long time my colleagues in computational biology and bioinformatics have found researchers that use spreadsheets as databases, or as a sort of hybrid database-data exploration environment, the bane of our existence. There's not just issues going from CSV to Excel and back again, but auto-corrections that make subtle changes, peculiar date behavior, not to mention what becomes awkward habits of structuring information. The fact that so much information is locked up in a crude assembly of semi-formed spreadsheets distributed across hundreds of desktops with cryptic titles and versions is a nightmare.
Excel is where good data goes to die.
grabble grabble climate change grabble rabble corrupt datasets fnerrr, I told ya it wasn't a real effect
No, it is a conversion problem. .csv "wizard" , it won't convert.
True, if you make the column text in the import
But if Excel does convert text to a date or to a number, you lose the original text and have to import again to get it back.
Numbers displayed in scientific notation (or any other format) and dates displayed however Excel is set up to display them, are stored internally as just numbers, and the original text is not saved anywhere.
Auto-correct is a vast conspiracy fueled by the Illuminati to repress scientific advancement (or any other productivity) by the common folk. Hopefully this is the beginning of the end for auto-correct! Let's build a wall around auto-correct, and make typing great again.
The worst one I had (partly due to auto-correct, partly due to other issues) was when I tried to text my wife "Me and the children are fine. My phone is almost dead so I'll call you later" and what she got was "the children are dead"
So SCIENTISTS don't proof read their scientific findings before publishing? Don't blame the software, where's the due diligence to make sure you know wtf you are publishing?! And no one noticed this before a study?? Is anyone even reading these things? I have to assume there isn't both a
2310009E13 and 2.31E+13 RIKEN identifier. No one said "wow, this doesn't even make sense, there is no 2.31E+13!
Sounds like a bunch of idiots just throw up on paper and no one (writers or readers) really thinks about it and now they want blame software for their mistakes.
Back in the day, old versions of Word only had "co-operation" in its dictionary, and would autocorrect "cooperation" to "Cupertino". This sounded like one of those urban legend things until I searched Google Books for some common constructs. You'll find hundreds of examples like the cupertino of and Cupertino Between (has some real uses but most aren't)
You should have seen the researcher's face when I told them Excel was (and had been) corrupting his data input (of yes genes). I forget what year this was, somewhere in the 1999-2002 range.
You are right, I stand corrected. I thought that the converted cell would revert back to the raw data if the column type was changed, or if one exported as CSV. A little testing shows I was wrong.
This posting is provided 'AS IS' without warranty of any kind, implied or otherwise.
I work in libraries and this is a common problem with tables of books along with their corresponding ISBNs.
Should've learned TeX, bitches!
Why can't the default behavior be that the data is just not modified, at all, unless I tell Excel I want it to be.
Flag as Inappropriate
Because 99% of the time that's not what everybody else wants.
I'm unconvinced that this can be classed as a user error, though. Excel is using a heuristic to determine the data type of a field
I disagree.
If you're using typed data, you have to cast it as something when it's stored. As far as I can tell, the Excel heuristic hasn't changed much since the 1990s.
With a 20+ year history, it's user error if they do not account for it.
It is dead simple to select a column and set its data type to Text, which is appropriate for gene names and will prevent any alteration of the input.
If there is an option that accomplishes exactly what the user wants and the user does not select that option, well, that's basically the definition of user error.
---
According to the latest ruleset, this post should be modded as Vorpal Flamebait +5.
Please tell me you understand why this is a design flaw.
No, because it isn't.
This is simply crap programming from MS, as per usual.
Nope, it's pretty good programming. They're just not programming for you.
20% is a full 1 in 7 (via Excel 2016)...
Try entering your credit card number into an Excel sheet. Yup, it's unexpected. You Excel mavens know this already, so move on. This is Excel goofiness for dummies.
First, you'll see it in scientific notation. Go ahead, set it Number format, and remove the decimals.
Now, see what ya got, the whole number, right?
No. Your credit card doesn't end in a zero, does it?
Oh, re-enter it, since the zero is there forever. I double-dog-dare you to get that last digit back without re-entering it.
Try a custom format: Make Type 0000000000000000 (that's 16 zeros for those of you in Rio Linda).
Yeah, that fixed it, right? No?
Ok, last time, re-enter the card again, but put an apostrophe in first...
Success?
Microsoft can't yet imagine why anyone would need 16 significant digits. Or more likely the binary needed to do that is more than 8 bits, and so there we are, do we go all 16-bit and drive storage? Or something.
Oh, and that apostrophe? Save the sheet as CSV. Open it in a text editor, replace '4 or '5 with just 4 or 5. Hope you didn't have a string '4 or '5 for some other purpose. Enjoy.
Researchers probably don't proof their spreadsheets for publication.
deleting the extra space after periods so i can stay relevant, yeah.
And not just data formatting.
It would be nice to be able to define a data type and some rules and limits of progression.
I could see the value in defining an arbitrary data type that was comprised of a fixed set ("Apples", "Pears", "Oranges", "Bananas") with no progression (ie, no set member has precedence or rank) or perhaps some with progression or rank (fetus, infant, toddler, child, adolescent, adult, senior). Cells formatted as belonging to a data type would only accept those values as valid entries, and sorting would apply the set's rules of simple progression if there were any.
It might help for other numeric-based data types, such as IP addresses, where it would be helpful to define rules of progression around some kind of delimiter. If they could only add one new data type, I wish it was IP addresses.
There's probably complex ways of doing this with macro/scripting, but, they end up being complex and one of the main reasons so many people use Excel because it makes it trivial to manage lists. Trivial tasks that get made complex end up being done sloppy.
Is this problem need for alarm by Microsoft or is this a case where more scientists should be double-checking their work?
No.
The data isn't lost. Excel just displays it in an inconvenient way. Highlight column - right click - Format cell - Text - OK. Done. Sheesh.
A 20% error rate on 35,000 files isn't entirely a user problem. Yes, the user ultimately has complete control and the issue could have been corrected if the user had carefully verified the data. In that sense, it's a user problem. However, if the tool is so counter-intuitive that roughly 20% of a large sample people make the same mistake, it's Excel's problem too.
I wonder if an aluminum extension ladder analogy is a first on /. ?
Consider an aluminum extension ladder... :-)
Suppose the locking mechanism on the ladder worked properly when the user carefully verified that it was engaged. The user has complete control. If the mechanism was so counter-intuitive that 20% of the users ended up making the same error and falling off, it wouldn't be brushed off as a problem with stupid or careless users. There's no question whatsoever that the manufacturer would be held partially responsible. Hell, if they sold 35,000 ladders and found out that there had been 100 accidents because of confusion about the lock, they'd yank the product off the market immediately and probably face lawsuits.
Exactly. Why not assume all cells are text and never change anything user entered, unless requested by user?
Cell type could be only evaluated at runtime, for example when there is a formula that assumes a cell to be numeric,
then if user enters letters and presses Enter or clicks to another cell
then an error popup could tell that there is a formula that assumes that cell to be numeric etc.
This is classic MS. Add a feature that is marginally useful to few people but screws up millions and don't make it easy to turn off.
I think even if you find the setting to turn it off it only applies to current Excel sheet and not saved as a general preference.
This is just another example of the seemingly complete absence of any Editor at any publication in the modern world. It is the Editor's job to read and select submissions, and have said submissions vetted for a laundry list of errors. The more obvious ones, such as spelling and basic grammar, seem to be absent from a huge number of publishers (online and in print).
If you are a Scientific or Medical Journal, whose business it is to publish Papers and collect advertising dollars or subscription fees (or both), then it's part of the Editor's job to vet errors that are common in Scientific or Medical Papers. Such as this one identified by the article.
If there is but one Editor, then that person is required to complete all the tasks expected of an Editor. If there is money for more staff, then it's the job of the Editor to assign duties to junior employees to do such grunt work, and then it's the Editor's job to monitor these employees and their work. Again, the ultimate responsibility is the Editor.
If you publish online or in print, you need someone, maybe it's you, maybe it's an employee for hire, but someone none the less, to perform this task. It is not optional. Yet for some reason various "publishers" show their lack of skill at their chosen profession.
There has always been bad magazines, journals, and newspapers. Generally they went broke because people would punish this behaviour by not buying whatever it was they were selling (advertising, library subscriptions, individual subscriptions, etc) and they would mercifully dissapear.
Today we can add blogs and online news or science sites to the list. The average internet user has shown themselves to be less discerning than the print reading public of the past, when the dull and stupid simply didn't read anything.
Today everyone from moron to genius reads online, and there is money to be made from serving ads to the idiots of the world, as they can be reached at an economically viable number.
Maybe we will someday return to a publishing world where the poor examples of the art flounder and die. Regardless, if you don't have an Editor or someone performing an Editor's function, you are putting yourself firmly in the lot of the un-dererving of respect. Journals who would like to instead be respected should be vetting submissions for these well-known (amongst actual Science and Journal Editors of competence) errors and correcting them before accepting the paper for publication.
It really is that simple.
You have hit it spot-on. I'm a grad student in immunology and I see this happening a lot.
"Big Data" experiments such as RNA-seq or proteomic mass spec typically result in raw data in a .csv or tab-delimited format. Importing into Excel under default settings will convert some gene symbols into dates or numbers. It's easy to miss this happening, because only a handful out of hundreds or thousands of gene symbols are affected.
Stupid problem that could be fixed with a little awareness. Data management training is unfortunately not a usual part of academic biology curricula.
"you are using it wrong" is applicable
I mean, really, defined the damn column as TEXT when creating the spreadsheet.
The default behavior is to treat the field as whatever you've told the spreadsheet that it is.
This is false. The default behaviour is what happens when you haven't specified anything else. To say "the default behaviour is whatever non-default you've set it to" is nonsensical.
By default, every cell is set up for numeric data types.
This is true, but incomplete. The complete statement is: By default, every cell is set up for numeric data types, which means that entered data may be silently modified.
This is the job of editors and peer review. In fact, it is their primary job.
A peer reviewed paper need not be factually true (editors and peer reviewers can't determine that), but it should be largely free of obvious formal errors, like using a date instead of a technical term.
Automatic formatting of spreadsheet data was a mistake. Furthermore, it was a mistake Microsoft could have fixed (and still can fix) by flagging and highlighting autoformatted cells that seem inconsistent with the cells around them.
Data> import text> select csv, choose columns to import as text. Voila no conversion bs. It's ridiculous how many of these "problems" have clearly documented solutions and that a two second search on Google could have easily remedied user ignorance. Jesus Christmas this whole article is basically 20 percent of the scientific community is fucking retarded and can't be bothered to spend a minute learning the software they use. It's not about tech savvy it's just laziness. If you have a problem, do you just say oh well, that's the way it is? Or does the goddamn scientific method just mystically not apply when it involves a mouse and keyboard? I could understand this article applying to government workers at the social security office, but gene researchers? What. The. Fuck.
A process of "five whys" applied to the present discussion immediately reveals "default numeric" as bad policy in academic research.
A sane default would be "untyped" or "exactly as entered" which shifts sins of omission into sins of commission, this being far more compatible with the culture and standards of scientific journal publication than what Microsoft originally chose, mainly for the convenience of boutique-reseller power demos. Also, the more collaborative the environment, the more important it becomes to enforce a strong-typed, sin-of-commission data model.
This is all covered in the first week of Graybeard 101 as taught with slate tablets back in the stone age. I was there in 1985. Microsoft has had wool in its ears since forever. Still doesn't make it right, does it?
Furthermore, anyone who really cares about data pipeline integrity writes an export function from the derived format back to the raw input format, until they come out exact, or every difference is adjudicated and signed off, which is incorporated into an automatic validation task which can be repeated at any point in time for the life of the project.
CRAN Task View: Reproducible Research
Leslie Lamport won the Turing Award in 2013 for his uber graybeard rectitude, if anyone cares to notice. Douglas McIlroy made his seminal contributions in 1968 (Bill Gates was thirteen, but perhaps he was already set in his ways). John Backus delivered his Turing Award lecture "Can programming be liberated from the von Neumann style?" in 1977, which inaugurated the modern tradition in functional languages (Bill Gates was then twenty-three).
Competence is hard. Sloth is eternal. We continue to seek a third way.
Penn Jillette on Donald Trump, Hillary Clinton, And Why He's All in on Gary Johnson — 2 August 2016
I watched this video yesterday. There a fabulous exchange 24:30–30:00 on truth and naievity.
Once upon a time I would have ventured that most Slashdot readers would want to view this. It had me thinking about my own life 1985–1995 where I watched the software industry turning into a train wreck, where every seventh train car is painted bright orange and lettered in an ominous Area 51 black stencil font "patch Tuesday", with sparks flying off wheels seized (and reseized) for so long they resemble lopsided pentagons.
I used to think to myself "surely these are just temporary conditions due to the extreme rate of expansion of the software industry, and it will all settle back down to sanity as we crest the exponential growth phase". But no. Like Jillette, I was a die-hard naievitarian. Lesson learned.
It's not like Excel alters the underlying data, all you have to do is correctly change the column type.
Oh! but it does - once you save it.
If you open a CSV with Excel by default, it will simply read in the values and format it how it sees fit.
Then if you save it, even as a csv, it will give you a warning saying something like "some of the features are not compatible with this format type"
If you proceed, your file is now changed. I have seen scientific notation changed like this. Many columns and rows, you may miss a malformatting and save it as csv. Boom, your data is now toast.
It is why I always look at my CSV files with a text editor first, and only open copies in Excel.
And if you use a real editor like vi, even opening files with millions of rows isn't an issue.
My beliefs do not require that you agree with them.
Awesome.
I remember those apps from the 1980's.
The best part of working in a computer store at that time was constantly explaining that there was no printer on the planet that could output anything readable from VisiPlot. It was an amazing software product that was better than the resolution of the monitors or the printers.
Sorry to hear about your anus though. Must be hard to find comfortable chairs.
The weird thing is why the geneticitsts tried to put genes into Excel rather than into test tubes or cell nuclei. It seems a rather demented kind of CS research to just take scientific data and put into whatever computer program is avaialable and see which one produces the most interesting garbage.
When you already know that every model looks best when pimped up with Photoshop....
Yip, its pretty dumb. If you have a CSV file with phone numbers formatted as +1-234-567-8901 (not uncommon if you're an international company and you want the numbers stored in the global address book to be directly dial-able from phones anywhere!!).
Excel makes a right mess of that. And so far there is no way to stop it that I can find.
The acronyms have the advantage that they are, relatively, semantics-free., If we turned them into long hand....
This is a false choice. Just because shorthand names are preferable doesn't mean they have to be acronyms, whose tortured derivations often distance the "shorthand name" from its derivation.
Disease researchers, for example, seemingly don't fall into the trap - H1N1 is nice and short, is descriptive of the virus, easy to remember AND is not an acronym.
#DeleteChrome
Failing to format the cells (naming irony) is really the problem. Excel just makes assumptions about the data when the cell is undefined.
Seriously, we have Terabytes of data storage, even in collapsed form, So we use shell scripts written by code in Perl and CGI to run against massive data files spit out by large databases.
Just do the Math.
Now, if you want to say small labs doing biological research have occasional false hits, due to open access searches against common repositories, I might believe you. To be frank, though, you should have read all the notes yourself.
I'm far more worried about a tired research assistant or grad student doing stuff like this and it not being caught before publication.
-- Tigger warning: This post may contain tiggers! --
Spreadsheet novices often get bit. I am an Excel user and have often seen these kinds of 'errors'.
Excel, for instance, converts almost every piece of data you enter. From your keyboard you enter 10, which is literally the CHARACTERS ONE AND ZERO. Excel analyses the characters you've entered, determined that since you entered only numeric characters you intend the entry to be the numeric value 10, which is binary 1010, which is a different binary value than the characters 10. So in storage is 1010, but what is displayed is the characters 10, since 1010 is a non-display value.
But that conversion is what we expect and want. The examples in this article are unexpected and unwanted.
If you were to enter 1/2 in a cell, Excel assumes that you're entering a DATE. Excel parses the entry, assuming month 1, day 2, year current year and converts characters 1/2 to number 42371, and changes the Number Format to Date d-mmm.
If you were to enter 6: in a cell, Excel assumes that you're entering a TIME. Excel parses the entry assuming hour 6, minute 0, second 0, and converts characters 6: to number 0.25, and changes the Number Format to Time h:mm.
If you were to enter 171e2 in a cell, Excel assumes that you're entering a NUMBER in scientific notation. Excel parses the entry assuming coefficient 171, exponent 2, and converts the characters 171e2 to number 17100, and changes the Number Format to Scientific 2 decimal places.
So what a savvy Spreadsheet user does is FIRST, before any entry is made, change the Number Format to TEXT, or prefix the entry with an APOSTROPHE, as '1/2, or '6: or '171e2 if you intend to prevent any conversion of the data you enter. It's that simple!
This is why I use PowerQuery to import CSV files or anything really into Excel. It's easier to make sure your data is typed correctly on the way in.
I guess proof reading papers is now a lost art.
-- I ignore anonymous replies to my comments and postings.
"If you make a tool foolproof then fools will use it." And the remainder of that sentence could be "and then complain when they get the wrong answer."
=-2^2
then you only need a way to attach that database to that email with the excel file, too....
Well, not really. If you consider a spreadsheet as a view of a normal relational database, unlike database views, a spreadsheet is a permanent representation of the original data and can be for example emailed and viewed easily. In most circumstances the recipient would likely not need the database nor know how to use it.
This way the data is strongly typed in the database and data operations are a lot safer. Software often used in scientific studies, such as IBM's SPSS, are pretty much a database anyway, so it's not a big leap to use an actual relational database.
But I don't see this happening. Ever. So back to broken Exceling. :-(
-SR
Format the column as text and it will disable calculation in that column. You can do the same thing for individual cells, but phone numbers tend to be stored in columns, so you probably just want to do the entire column at one time.
If you are importing a file, read the inport text wizard dialogue box that opens and set the offending column to text then.
I call bullshit and point you at CSV if they are going to use MS Excel.
just use the import tool for importing csv.
bettet yet don't use csv! provide a .txt. if they care about the data, they can parse it back.
No surprise. Most MS stuff tries to guess what you want and gets it wrong more often than right. Excel's guessing at data types is horrible and causes a lot more problems than just this. What's more, the guessing at what you want repeated is sometimes unreliable and unpredictable, leaving people with a repeated date when they wanted an incremented one or more often the other way around. While I'm at it... Excel made spreadsheet output "pretty", but still doesn't offer a true third dimension (that can be easily referenced within formulae).
This is a consequence of trying to make things easy enough for anyone to use. The usual attempt is to make it automatic, but making the computer smarter than the user is to lose control over what is happening.
"Make a machine even a fool can use, and only a fool would want to use it."
But computers are too new, for the actual knowledge to have propogated through humanity, yet...
Maybe another hundred years?