On the accuracy of statistical procedures in Microsoft Excel 2007 [pdf] (drexel.edu)
52 points by adulau on April 21, 2013 | 33 comments



Best part: "Gnumeric was originally such a good Excel clone that it even reproduced many Excel errors. When informed that these were errors, the handful of part-time programmers who maintain Gnumeric managed to fix them correctly in a matter of months"


And the linked paper for this claim: http://www.csdassn.org/software_reports/gnumeric.pdf


I think that in jobs where statistics is the main function, you will rarely find people using Excel. The business intelligence folks at my company use it for cursory analyses, but I often question the rigor of their practices, and using Excel to do statistics only compounds the problem. I would trust the results of SAS or R or, hell, even Minitab over the results I get out of Excel. Microsoft seems more content making its product shinier than improving the underlying numerical accuracy of its methods.

The sooner people realize it's not designed for real statistics, the sooner they can stop hearing statisticians rant upon receiving "Report_Analysis_Feb1-2012_Update_Final-v3.xlsx".


If it's not designed for real statistics, what is it designed for then, fake statistics? I see what you're saying, but I think the real distinction you're drawing is between heavy-duty work and run-of-the-mill descriptive stats. Even the latter need to be accurate.


Thanks for pointing out the distinction. You're right that this is what I was going for, but the articles I've seen thus far about the inaccuracies in Excel deal mainly with the poor capabilities of the optimization routines, RNG, and distribution functions. If you're balancing a budget with numbers that have two decimal places, or taking measures of central tendency on similarly scaled values, Excel isn't woefully inadequate.


That makes sense. Apologies if I was being flippant.


Ah. Andrew Gelman just complained about this a few days ago on his blog:

In response to the latest controversy, a statistics professor writes:

It’s somewhat surprising to see Very Serious Researchers (apologies to Paul Krugman) using Excel. Some years ago, I was consulting on a trademark infringement case and was trying (unsuccessfully) to replicate another expert’s regression analysis. It wasn’t until I had the brainstorm to use Excel that I was able to reproduce his results – it may be better now, but at the time, Excel could propagate round-off error and catastrophically cancel like no other software!

Microsoft has lots of top researchers so it’s hard for me to understand how Excel can remain so crappy. I mean, sure, I understand in some general way that they have a large user base, it’s hard to maintain backward compatibility, there’s feature creep, and, besides all that, lots of people have different preferences in data analysis than I do. But still, it’s such a joke. Word has problems too, but I can see how these problems arise from its desirable features. The disaster that is Excel seems like more of a mystery.


Excel's main limitations (disclaimer: I haven't upgraded to 2013 yet):

- you can't easily audit a spreadsheet unless each cell only refers to cells immediately above or to the left. Otherwise it's GOTO and COMEBACK programming. "referentially opaque probabilistic graph programming"

- Control-] lets you see where a cell is referenced in other formulas, but only if the formula is on the same tab. Control-[ does take you to other tabs. Control-` shows you a wall of formulas.

- there's no easy way to track significant digits and sources of floating-point error, e.g. adding numbers that are orders of magnitude apart (see the sketch after this list)

- in the past, underflow has even been a security issue

http://www.checkpoint.com/defense/advisories/public/2011/cpa...

- you can't report bugs, or see their tracker, as far as I know

http://connect.microsoft.com/ (I'll save you the time: Excel isn't on either list)

- you can't switch to extended precision, rationals, or arbitrary-precision integers or floats when you need them.
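
To make the floating-point bullet above concrete, here's the effect in plain Python; any IEEE-754 doubles behave the same way, Excel's included (the numbers are purely illustrative):

    # Adding numbers of very different magnitudes silently drops low-order digits.
    big = 1e16
    small = 1.0
    print(big + small == big)   # True: the 1.0 is absorbed entirely

    # Subtracting nearly equal values (catastrophic cancellation) leaves mostly
    # rounding noise where significant digits used to be.
    print(1.0000001 - 1.0)      # ~1.00000001e-07, not the exact 1e-07 you'd expect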



Apparently 2010 isn't much better.


One problem of particular concern for many budding statisticians is the difference in the way Student's t-distribution and the Normal distribution are handled. The Excel functions for the t-distribution split the probability into two tails, but the functions for the Normal distribution did not. This made it quite a pain when doing calculations with both large and small sample sizes. Newer versions of Excel have "fixed" this by adding a flag to the t-distribution functions that lets you choose between one-tailed and two-tailed probability. But this flag has not appeared on the functions for the Normal distribution, so the functions for these two similar distributions remain inconsistent.
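
For contrast, here's the uniform two-tailed treatment you get outside Excel; a quick sketch in Python with scipy, where the test statistic and degrees of freedom are made-up numbers for illustration:

    from scipy import stats

    t_stat, df = 2.1, 15
    p_t = 2 * stats.t.sf(abs(t_stat), df)     # Student's t, two-tailed
    p_norm = 2 * stats.norm.sf(abs(t_stat))   # standard normal, two-tailed
    print(p_t, p_norm)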

Another annoying issue was the autocorrelation in the random number generator (a very bad thing for a random number generator). The main RAND() function is now corrected, but the problem lives on in the RANDBETWEEN() function, which suggests these two related functions use separate RNGs instead of both tapping into a single good one. It's very frustrating if you want to do any real work in Excel, because you can't trust the built-in functions.
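
If you want to check a generator yourself, a lag-1 autocorrelation estimate takes a few lines; a rough sketch in Python, run here against Python's own RNG purely as a stand-in for whatever stream of values you'd pull out of a spreadsheet:

    import random

    xs = [random.random() for _ in range(100_000)]
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    print(num / den)   # lag-1 autocorrelation; a good RNG gives roughly 0 +/- 1/sqrt(n)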


This is a funny bit of satire on the topic:

http://www.bbc.co.uk/news/magazine-22213219

They're saying that Excel, the established data tool, is in fact dead for big data. And what else matters?


The problem is that most data is still not big; even worse, "big data" is often either small data presented in a very sparse way (logs, for instance) or gets subsetted or summarised until it fits in Excel :(


For Excel 2010, a review of the statistical accuracy of functions and related work was done. Here is a summary of the work: http://office.microsoft.com/en-us/excel-help/what-s-new-chan...

It's also worth noting that Excel only guarantees a certain level of accuracy (15 digits, if I'm not mistaken). In cases where this is highly important, SPSS, MATLAB, etc. should be used, since they offer a higher degree of accuracy.
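
That 15-digit figure matches IEEE-754 double precision, which Excel is documented to use; a tiny Python illustration:

    import sys

    print(sys.float_info.dig)   # 15: decimal digits a double can carry reliably
    print(0.1 + 0.2)            # 0.30000000000000004: digits past that are noise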


I don't even know if this matters that much within a corporate environment. Those dealing with the numbers are, most of the time, aware that these numbers already have inherent flaws and only approximate the reality out there. As long as the results sound somewhat plausible, chances are slim that anyone is going to ask questions. At the same time, those requesting the numbers are happy "to have the facts on their side".


Excel 2007 was released in late 2006. Six and a half years have passed and two new versions have been released. It would be much more interesting to see the accuracy of the current version than of this very old one.


They do one of these for every iteration of Excel, and the results are usually similar. One day Microsoft will make a good version of Excel, but I'm still waiting. http://homepages.ulb.ac.be/~gmelard/rech/gmelard_csda23.pdf


Microsoft is likely to preserve some errors in the name of backwards compatibility. Spreadsheets that calculate different values in different versions are going to cause them more headaches from their customers than these reports of errors, as the people who care are already using other methods.


This is not far from the truth. In some cases, large solutions have already been developed to take into account variations in implementation, accuracy, or behavior.

With millions of custom solutions based on Excel floating around, "fixing" issues where calculations suddenly give different results would be highly counterproductive and potentially dangerous.


Sit down, have a drink, and think about what you are saying.


And yet, in your heart of hearts, you know mutagen's point to be true.


Have to say mutagen is spot on.


I am still using 2003 at my company...


They are just now moving people off MSO 2003 at ours as well, following complaints from several people that they can't open the newer xlsx, docx, and accdb file formats. IT and Clinical Analytics are the first wave to move to "brand new" MS Office 2010, as of Jan '13.

Sure hope you aren't doing any life-threatening numerical computations in your version of Excel!


The version doesn't matter; whether 2003 or 2020, I would never do life-threatening computations in Excel. It's just too painful and dangerous (look at how it treats dates). But, unfortunately, I don't do this much; I do mostly finance and accounting. Yet if I did, you can be sure I would have to put the results in a nicely formatted Excel file.


We got around the inaccuracy problems where I'm at by doing all the sensitive calculations in a more capable program (SAS where I'm at now--R when I consult), then exporting the results to Excel. That avoids most of the problems posed by Excel's poor statistical tools. It's frustrating to think how difficult it is to cut Excel from the pipeline, but such is the dilemma with Excel being as virulent as it is.


IIRC there's a plugin that lets you call R functions within Excel.


I always found it much easier to do all the work in R or Python and then put the results in Excel. Excel is really not a good tool for things that go beyond additions and VLOOKUPs.
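
A minimal sketch of that pipeline in Python, assuming pandas with openpyxl installed (the file name and numbers are just placeholders):

    import numpy as np
    import pandas as pd

    # Do the numerics with a trustworthy library...
    data = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=1_000)
    summary = pd.DataFrame({
        "mean": [data.mean()],
        "stdev": [data.std(ddof=1)],
        "n": [len(data)],
    })

    # ...and hand Excel only the finished numbers to format and display.
    summary.to_excel("report.xlsx", index=False)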


Does anyone know how LibreOffice compares?


Calc (from OpenOffice) was analyzed, among others, in 2010: "On the numerical accuracy of spreadsheets", Journal of Statistical Software, 34(4), 1-29.

http://www.jstatsoft.org/v34/i04

" This paper discusses the numerical precision of five spreadsheets (Calc, Excel, Gnumeric, NeoOffice and Oleo) running on two hardware platforms (i386 and amd64) and on three operating systems (Windows Vista, Ubuntu Intrepid and Mac OS Leopard). The methodology consists of checking the number of correct significant digits returned by each spreadsheet when computing the sample mean, standard deviation, first-order autocorrelation, F statistic in ANOVA tests, linear and nonlinear regression and distribution functions. A discussion about the algorithms for pseudorandom number generation provided by these platforms is also conducted. We conclude that there is no safe choice among the spreadsheets here assessed: they all fail in nonlinear regression and they are not suited for Monte Carlo experiments."


I'm curious as to why anyone would use a spreadsheet for statistical functions if they cannot be trusted.


Because they claim to support statistical functions, their users expect them to be able to do it, their users already have the spreadsheet software and know how to use it, and knowledge that they can be dangerously broken is not widespread.


How does SPSS compare? I've mostly used Excel and SPSS back in college.



