Excel bashing is easy, but where's a truly comparable alternative?
- Alteryx at $5000/year/user?
- A database that requires specialized skill or going through a DBA for every little need?
- Salesforce? It's hell to configure if you don't have the skill.
- Sharepoint? Again. If you're a small business or nonprofit, you've gotta hire or train someone before this is a serious conversation.
I would in no way suggest that an enterprise can run on Excel with centralized data and multiple users. And that's where I believe a lot of criticism of Excel is unfair: the comparisons are not apples to apples.
And let's be honest. EVERYTHING comes with a level of human error.
- Enterprise level databases can get so big that the corrupt data in them can be impossible to cleanse.
- Reports made by SQL people are often flawed.
Over my 7 years in an international company, I discovered report after report that didn't accurately match business rules, or that was based on loads of incomplete records, duplicates, inconsistencies, incorrect math, etc.
How about fixing the reports? Often the decision was "no." Fixing the reports wasn't something a department head wanted to pay a consultant for.
Eventually, I had the DBA stop sending me reports. I started exporting the raw data into Excel and made my own reports.
- Brand new reports can be impossible to get if they don't justify the cost of a consultant to create the report. (Ah, but there's Excel.)
- Databases get hacked and they crash.
So. Where's the tool that's perfect, free and easy to use?
It doesn't exist.
And how does KNIME (or anything else) compare with PowerBI, given how aggressively it's being pushed by Microsoft, and the fact that it comes bundled as part of most organizations' Office 365 deployments?
When you see it, you understand it. The KNIME interface is designed around "data pipelines", if you will, which is completely different from PowerBI, which seems more oriented toward dashboards.
It's pretty useful for data manipulation. I use it for very heavy lifting (millions of rows with A LOT of text manipulation, regex, and even some ML) on a dual Xeon + 32 GB RAM machine and it's a beast. It takes some time to master, but I encourage you to play with it.
Take a look and you'll understand it right away. It's organized with nodes and links. Nodes perform operations, and links transport data between nodes. You can arrange them visually as you please. There's even stuff like "metanodes" that allow you to put nodes inside, so you don't have to have a messy playground.
Most nodes (there are official and 3rd party nodes) do have error output, and there's a console to inspect what's going on.
You can also inspect data visually in most nodes.
Even someone who is not used to playing with data, who doesn't understand consistency and yada yada, will understand the advantages.
It uses a visual data flow approach to transform data step by step from one form to another. This approach is much better for data transformation. It is also less error prone as operations take place on columns or rows of data, not cells. But you wouldn't try to create a set of company accounts with it.
Instead of only considering a consultant, the company should have a developer on staff to work on small tasks like this. The cost gradually decreases as they learn the business and don't need everything explained from scratch.
See. You've got answers to something that you haven't asked a single question about. But in your judgement the company should have had the right developer on staff.
That's what's frustrating about these anti-Excel discussions.
Ok. What happens when I say that the company bought another company and it had a product line that the new company wanted to pay sales people commissions on? And the compensation plan around it was complex.
For whatever reasons, that's not a report the DBA was up for developing from scratch (or, someone decided his time wasn't to be used that way). That's just the reality.
We can talk about what should have been or what ought to have been, but that doesn't change the reality.
The reality is that I ended up doing most of it ACCURATELY in Excel.
> But in your judgement the company should have had the right developer on staff.
Yes, like a janitor for example.
> The reality is that I ended up doing most of it ACCURATELY in Excel.
Yes, it's the hammer for all your nails. You could also use Python+Jupyter. There are self-hosted and managed versions.
Because of guys like you, municipal installations run their sewage systems with an Excel sheet, because some guy implemented a VBA macro to talk to the sewage API endpoint.
I think the company would quickly realize that they need a different solution if they have no Excel wizard on staff, and the team spends weeks just calculating sales commissions.
Or maybe not, and they’d just hire more people solely to calculate commission :/
I would say that you probably do not know what you did wrong having done it in Excel. Stuff gets complicated really fast in Excel, so it's much easier to build something that you can't possibly debug and thus live in blissful ignorance.
It's just unbelievable ... how people KNOW about situations that they weren't in and don't ask any questions. But since you brought it up I can tell you about my accuracy in Excel:
- The process involved 2 people checking my work before it was sent to payroll. And we did catch and correct mistakes. We were able to trace back where and how the mistakes were made, and build in ways to prevent or flag future mistakes.
- The job was necessary because of all the garbage in the database. Excel's data-checking, data-cleansing, and flexibility were superior to the database's.
- After 1 year all of my work was audited. For several dozen sales people, there was ONE person I wasn't 100% accurate with. And that one was off by 0.5%. SUCCESS!
- If we'd gone with the report coming out of the database, things would have been such a confused mess.
It has nothing to do with your particular situation. This is a reasonable assumption to make if somebody comes and wants a solution in Excel.
But ok. So what you are saying is that your Excel was reviewed by somebody else. How is this any different from reviewing somebody else's code?
I work in a company where even the least technical analysts are capable of using numpy. We have libraries that use Google Sheets for input and output data. So you get the best of both worlds.
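For anyone curious what that Sheets-as-I/O setup looks like, here's a minimal sketch assuming the third-party gspread library, a Google service-account credentials file, and invented sheet/column names:

```python
import gspread
import pandas as pd

# Authenticate with a service account (credentials path is hypothetical).
gc = gspread.service_account(filename="credentials.json")
ws = gc.open("Monthly Numbers").sheet1  # hypothetical spreadsheet name

# Pull the sheet into a DataFrame and do the real work in code.
df = pd.DataFrame(ws.get_all_records())
df["total"] = df["units"] * df["price"]  # hypothetical columns

# Push the result back so non-technical users see it in Sheets.
ws.update([df.columns.tolist()] + df.values.tolist())
```

The sheet stays the familiar UI for everyone else; the logic lives in reviewable code.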
I am not giving you advice on how to run a company, I am saying that your anecdotal opinion will not affect my decision-making process. And I am clearly stating that it should not affect the decision-making of other people here either.
Also, I am going to underline that what you did was relevant 16 years ago, which you did not state in your initial message about the usefulness of Excel.
Which is kinda funny: if I hadn't asked, you would have omitted a lot of relevant information, which makes your opinion even less important.
It only makes me question why you take this whole thread so personally. Probably because by now your skills are way less relevant than back in 2005.
Not just a developer. A developer that's clued up on business needs, can communicate (listen, clarify, translate technical concepts), and has a smattering of finance to promote the cost, time, and/or quality savings they make. And the company needs a succession plan should such a person not turn up.
Excel is the proverbial hammer that makes everything look like a nail. It is very flexible, powerful and beginner-friendly. And while you can solve a lot of problems with it, some of those should be handled by other programs (or other people). The problem is that the average Excel user doesn't have the right knowledge to judge what the best solution is to their problem. And sometimes it only becomes clear after long use that they have created an unwieldy tool which causes many unforeseen problems.
This is so frustrating because I've worked in small departments in an international company. Our problems were too small to get consistent help from IT when we needed it, and we weren't important enough to get one of the limited licenses for the fancy reporting system.
Ultimately, there's all this bad talk about evil Excel, but we get no help or other resources.
And what about all the junk reports that were coming out of the company database that no one wanted to spend the money to fix?
Real life doesn't always provide ideal solutions. In your case, using Excel is probably the easiest or even only way to do what's needed. That doesn't mean it's the best, or even a good solution. If it causes organization-wide problems, whoever runs your organization should make sure better processes are instilled. The only thing you can do is make it clear to the person above you and hope things improve.
We're closer in agreement. And I'd never suggest that Excel was the best solution or even a good solution at times. (I hated the tedious error-prone things that I used to do in Excel.)
I'm just offering the context from which weird stuff happens. And the answer isn't always as easy as: put it in a database.
The context, the processes (or lack of processes), the decision-makers, budgets, timelines, they all play a part in the outcomes.
I think the question is what a viable alternative is if the company does not want to spend any more effort than the effort of the singular employee that wants the report.
They can do it by themselves with Excel. Can they do that with any other software?
Not many people would be able to do complex calculations in an RDBMS, especially if they require joins or grouping. But many more are able to use a GUI tool to create a single database table and import some CSV data.
The recent UK disaster wouldn't have happened if the data had been imported into Access or SQLite instead of an Excel sheet.
Of course it's not a panacea. Badly designed databases can be just as error prone as Excel, but they have different error modes.
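To make the comparison concrete, here is a minimal sketch of the SQLite route using only Python's standard library (file, table, and column names are invented):

```python
import csv
import sqlite3

con = sqlite3.connect("cases.db")
con.execute("CREATE TABLE IF NOT EXISTS cases (case_id TEXT PRIMARY KEY, reported TEXT)")

with open("cases.csv", newline="") as f:
    rows = [(r["case_id"], r["reported"]) for r in csv.DictReader(f)]

# A primary key plus INSERT OR IGNORE means re-importing a file can't
# silently duplicate rows -- one of the failure modes a plain sheet invites.
con.executemany("INSERT OR IGNORE INTO cases VALUES (?, ?)", rows)
con.commit()
print(con.execute("SELECT COUNT(*) FROM cases").fetchone()[0], "rows total")
```

And unlike the old .xls format at the center of the UK incident, the table won't silently truncate at a 65,536-row limit.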
Exactly. Because he does not see a problem with this, since people like you have been telling him it's the only solution since the dawn of Excel.
But when developers complain that they have to learn domain-specific things, people say they shall keep up with the times. But suddenly the analysts and the others are exempt from this?
Your position is more reasonably explained than most of the Excel Haters' (I am a card carrying member of the group, btw) but there is an additional problem.
Even if/when the "average Excel user" finally sees the Light ... s/he still has to work in an environment where excel files (including .csv) are the lingua franca.
It's a bit like working for the German (or French or Japanese etc.) branch of a US large company. Of course the official language is English, and all documents should be written in English and emails should be in English and so on.
Even if everybody (down to the janitor?) speaks acceptable English and is perfectly able to talk business in English... everyone will still use German unless they have to communicate with someone like me, whose German fluency is rated in Imaginary Numbers.
Sometimes that database feels more like a jet plane when all I need is a hammer for my nail. Excel (or LibreOffice Calc) is perfect for a lot of stuff, especially those small things that you won't ever do again.
Excel is a UI over a database table. With a bunch of other UI driven math/stats functions built in. It really comes down to a difference between UI driven folks and folks that are code literate.
To me it seems Python or Julia notebook stuff like ipython or jupyter could (and did in some circles) displace Excel in almost all cases. The self-sustained inertia to change is real though.
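As a sense of what that displacement looks like, here's roughly the notebook equivalent of everyday spreadsheet work, sketched with pandas and invented file/column names:

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical input file

# Notebook equivalents of everyday spreadsheet operations:
df["margin"] = df["revenue"] - df["cost"]                   # calculated column
by_region = df.groupby("region")["margin"].sum()            # pivot-table-style rollup
top10 = df.sort_values("margin", ascending=False).head(10)  # sort + top-N filter

print(by_region)
print(top10)
```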
Who's going to pay to train a Customer Service Rep, Admin Assistant, Sales Assistant or Warehouse Manager to do Python when there's Excel?
Get into the world of the Excel user. Some of us do learn coding. But generally speaking? No. If it happens, it's on our own.
And we can agree that spreadsheets have problems, and Excel users typically are not data people.
BUT!
When we agree on that, it becomes evident that a lot of suggested solutions are non-starters. Not all Excel users are in an enterprise that's throwing money around for training and dedicated software.
I'd like to see the Warehouse Manager who goes in Monday morning and says to the company vice-president, "I've been looking into this Python class ... "
The problem isn't businesses using Excel, the problem is businesses using Excel for critical processes.
The Warehouse Manager or Sales Assistant doesn't need to learn an alternative, but e.g. the Finance people should. PowerBI/Tableau for reporting and R/Python for modelling should be a thing in those worlds (even if Excel continues to be an amazing tool for entry level stuff and some quick number crunching).
How do you have so many answers to situations you haven't been in?
Do I need to say it again? My Excel skills were necessary because of garbage in databases.
I could sit at my desk; extract a data-dump of 50k rows of data; locate the fuzzy matches; identify which records needed to be kept and which ones deleted; merge it with the client's spreadsheet that they emailed to me and compare them; identify problems and report back to the angry client.
Why was the client angry? Because of inflexible garbage reports coming out of the database that people were blindly trusting.
Why was there garbage in the database? Not because of Excel.
With Jupyter, you get the problem that editing cells does not automatically make later cells get recalculated. That alone, in my opinion, is worse than any of Excel's failings -- imagine a spreadsheet which could get out of sync like that?
Try deploying python apps in corporate environments with fairly locked down machines to non-programmers and get them to perform simple operations, and then compare that with just emailing someone an excel spreadsheet.
>, and doesn't do secret magic guessing of whether a value is a date or not.
This LibreOffice question & answer[1] from 2020 says it guesses that users want leading zeros removed from numbers. This behavior is similar to other software tools I looked at.[2]
This means that if a csv has "02142021" as a date for Valentine's Day with a leading zero, it will be imported as number "2142021" -- unless one takes extra steps to prevent the default behavior.
Ok, it might try to deduce the data type, but you can set it in the import options that appear when you open a CSV file. Excel will automatically convert a string like "2-4" to "02-Feb" upon opening a CSV file. No import options appear when you open the file. If you then save the file, it keeps that new string. As far as I can tell there's no way to stop it doing this.
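For comparison, code-based importers have the same guessing default but let you override it per column. A small pandas sketch using the Valentine's Day example from the earlier comment:

```python
import io
import pandas as pd

csv_data = "id,date_code\n1,02142021\n"

# Default type inference strips the leading zero, exactly as described above:
print(pd.read_csv(io.StringIO(csv_data)))  # date_code comes out as 2142021

# Declaring the column as text preserves the value exactly:
print(pd.read_csv(io.StringIO(csv_data), dtype={"date_code": str}))  # 02142021
```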
Metabase (for reports rather than transformations) was useful for filling all the "can't we just have, like, a spreadsheet" requests while maintaining some quality.
I used to professionally audit spreadsheets that were used to back up million dollar decisions. I say “back up” because it was fairly clear the decisions were made independently. But they needed to tick a box saying they’d had it audited so they paid us for a two day audit. In that limited time, which included writing the report, you’ll appreciate we rarely had time to audit anything more than the top-level summary sheet.
There was always at least one mistake that put the numbers out by 8 digits.
I assume you fix the math errors and then whatever the result is turns out to be the justification for the decision. The result doesn't matter, so you don't need to aim at a particular one.
Neither. You report what you found. You make recommendations. You state which areas you haven't had the time to look at. Under no circumstances do you provide any guarantees that might put you at the wrong end of a multi-million dollar lawsuit.
This, incidentally, is every audit ever. I've been audited many times; every audit finds stuff and makes recommendations. Then management gets to pick which recommendations they adopt and which they'll derogate. I'm not saying it's a completely useless process, but it doesn't provide the cast-iron guarantees people think it does.
Then, of course, there's financial audits, which are meant to provide a cast-iron guarantee the firm is still a viable entity. But nearly every big firm that went bankrupt had passed such an audit in the previous year.
QOE drove me nuts in a past life. M&A of small companies with crappy financial data. Biggest caveat is basically garbage in and garbage out. But often they already had the deal done and would just feel better if I told them the garbage didn’t stink.
One of my best behaviours for actually writing one of these things is to have a single page that only has inputs/assumptions on it and put an annotation/note on every last data item/series saying where you got it from, because someone on the warpath will want to know where you got that BS number from two months later, and it's very satisfying to show it was them.
Agree. That's a must. You have to cover your ass when it comes to this type of work. It's high-stakes enough that I save off copies of emails (.msg) and create a Sources folder for these and the spreadsheets, and I'll even do a screen capture if I have to log in to some system and pull the data myself. I am usually given the parameters on what to pull, and it's very easy for someone to accuse me of pulling the data wrong a month or more after the fact. I did this on every analysis. I can relate to what you're saying. Two months later, something like this always happens:
"What was your assumption of growth?"
"Infinity."
"What?! That's ridiculous. Why would you assume that?"
(I forward back the email you sent to me when I pointed out the growth rate seemed high but you insisted it was a slam dunk -- aka the only way the deal would go through and we'd get that bonus.)
At times I feel like I put so many footnotes in the resulting presentation later the information should be interpreted as meaningless.
Excel is a great case study. Turns out, if you give people an accessible and widely available Turing-complete environment, they will manage to do pretty much anything with it.
On another note, it's interesting to observe that the problems spreadsheets have are the same problems that plague poorly designed software everywhere - lack of documentation, lack of testing, lack of input validation, among others.
One of the big issues with spreadsheets is the lack of visibility. It is very hard to tell what is going on in any non-trivial spreadsheet. That's something that visual data flow tools such as Alteryx, Knime and (my own) Easy Data Transform try to address.
The other problem is specific to Excel. It just loves to mangle your data! The problem is so bad that geneticists have renamed some genes to stop Excel reinterpreting them as dates.
With a well-designed, high-complexity spreadsheet, it's pretty easy to tell what's going on. The problem is most people don't put thought into how they design their spreadsheets. That's not their fault, because there are very few resources that teach people how to design a complex spreadsheet well.
If you have not seen a spreadsheet with, say, dedicated data-input tabs, calculation tabs where you can see and audit the work being done, output tabs, revision notes, and a summary sheet, you haven't seen a good spreadsheet.
Not just Excel; Google Sheets too. Our school system has student codes in the format ABC0001, where ABC is the first three letters of the surname and the number is sequential. So a student with the surname Martin might get the code MAR0004. Which Sheets, in its wisdom, decides to interpret as the date March 2004. Go figure.
Not really Python as such, it seems, but rather some third party library[1] which promises "generic parsing of dates in almost any string format". This is impossible to do correctly (what is 01/02/03?) and usually a bad idea to attempt.
There are actually some standards here, at least. 01/02/03 has one order, 01-02-03 another. The character between them tells you how to read them. Not that most people know that, though...
Can't say I blame python or even the library; why is the code even trying to convert every string in an object, instead of just the keys that are supposed to contain dates?! Seems like a poor hack, imho.
The age-old workaround for this used to be to type an apostrophe in front of the value, which forces Excel to interpret as text. IIRC that used to work in Google Sheets as well. Doesn't that trick work anymore?
It does, but people are uneducated and think that just because something runs in the browser, it doesn't require training and knowledge to use properly...
More problematic tbh is that excel doesn't really let you configure it for your setting: ideally you should be able to set it up with defaults suitable for genetic scientists, and then spread that configuration throughout the group.
Or if it does have this feature, no one knows how to access it
It's not even just that Excel is Turing-complete. It's that Excel is widely used and is its own programming language with a comprehensive standard library, and I don't mean VBA - entering the = character in a cell opens up a massive library of functions which can be used in a fashion sort of similar to functional languages, since everything is nested functions.
There's branching. There's Boolean operations. There's variables (cells, which is ultimately the point of the entire exercise). There's even a switch construct. There's even really simple web requests! All of these are of course functions. The only thing that's missing is function definitions (which can be done with VBA, but I excluded that specifically for the sake of this exercise).
It's not particularly pretty, but is usable by almost anyone, has good documentation, and a huge community of like-minded power users. As you mentioned, Excel doesn't have any of the constructs which help developers for other languages to avoid mistakes. The simple syntax, users which are not necessarily aware of good practices, and a lack of methods to implement such practices produces a perfect storm of factors to introduce bugs.
It’s an extreme end of the spectrum where static, restrictive, but harder to screw up is one end, and weak, dynamic, permissive, but easy to screw up is on the other.
For what it’s good for, it’s great. I love it because you can do many things faster with it than anything else, and the immediate feedback and visual data flow aren’t matched in any language’s IDE I’ve seen. But like all systems, when they grow they need checks. Like the trope about when your code grows to too much complexity, you start needing static types.
This is pretty much the same reason that PHP is much maligned - it certainly has gotchas, but they're not so much dangerous as, say, the C++ gotchas are... When everyone is able to "have a go" at a tool, they'll all abuse it in their special way to get things done, and good on them for doing it, since if every data problem required an engineering team to solve, everyone's business would grind to a halt.
You are right about PHP vs C++. But C++ is too easy a target.
I dislike PHP just as much as the next guy, but I am quite impressed with how Hack turned out. It's Facebook's improved PHP. See eg https://hacklang.org/
It's not that Hack is a good language in absolute terms. But I think it's about the best language they could have come up with when starting from PHP.
(For a better comparison instead of C++, but still in similar spirit to C++, perhaps take D.)
I work at a large firm which has shifted itself from using Excel, to significantly promoting data analytics tools such as Alteryx in place of Excel.
One of the MAJOR improvements from using Alteryx is that it is easier to audit and review the process. With Excel you can be down the rabbit hole of data manipulation or formula driven outputs and realise you have made a mistake but cannot find where or how. Or cannot find where you went wrong without recreating the steps followed. The workflow style interface of Alteryx and other data analytics tools allows everyone to see the process from start to finish of how you are getting to your outputs. It can also ingest and handle much larger datasets than default Excel.
I honestly believe that we will see a large shift from using Excel to using better-suited analytics tools. I cannot think of anything I can do in Excel that I cannot do in a GUI-based data analytics tool. Not to write Excel off completely, as we will certainly still be using it, but to a lesser extent and not where complex data manipulation is required.
As someone who has personally had to use Alteryx in multiple projects, fuck Alteryx. What an absolute garbage piece of software.
I've already written a rant about it before and I don't have the energy to repeat myself, but do not fool yourself into thinking switching from excel to Alteryx is doing yourself any favors. You're just trading one monster for another.
Save yourself a huge amount of money, not to mention your sanity, and just take the time to learn some Python, Julia, literally anything else to get the same results faster, more reliably, and without being locked into that nouveau Oracle-esque nightmare.
Alteryx was my first step into analytics and it helped me get 2 heads funded for my team. But yeah, at some point I arrived at your same conclusion and moved my team to Python; we never looked back. One thing people don't realize with these UI tools is that there is no way to use a version control system, and nowadays I cannot think of doing any analytics (including basic BI) without git.
Alteryx files are XML files, but the point is change management is hard to do. You need to compare each single element in the UI to understand what was changed between versions. It's a lot like Excel in that regard.
Technically, you can diff 2 XML files, but Alteryx will make slight alterations to the XML literally every time you open it, even if you make no changes, and it's a huge amount of spaghetti XML that is absolutely not meant to be parsed by human eyes.
My point is about moving from Excel to Alteryx: the average population of people that uses Excel would not have the skills to stop using Excel and move to Python to do the same job. I'm focusing on non-technical people, not programmers.
Excel has similar capabilities (since Excel 2010) in the form of PowerQuery[1].
It can pull in 3rd party data sources (including custom HTTP calls if you want to reach out to an API), has data types, can both process and store substantially larger data volumes (data is cached/stored separately from the spreadsheet data in a compressed, columnar format and processed by a separate data processing engine), and every step of the data processing pipeline is shown and easily referenced/reviewed (including the ability to leave comments in the code).
That said, PowerQuery is a graft onto Excel, originating in the SQL Server world. So "Excel" skills aren't very helpful/transferable; it leverages Microsoft's M language[2] instead of Excel functions, and the usage paradigm is far more natural for someone with a programming background than an Excel background.
But as someone else mentioned in another comment, Alteryx is anything but cheap. PowerQuery is already widely distributed and available to anyone with Excel installed[3]. So for anyone looking for this type of operational discipline around data, it's a handy tool to pick up. Bonus points that PowerQuery/M-code is portable between a variety of Microsoft tools (SQL Server, PowerBI, Excel, and a variety of Azure's data services).
From the M-code link:
“A core capability of Power Query is to filter and combine, that is, to mash-up data from one or more of a rich collection of supported data sources. Any such data mashup is expressed using the Power Query M Formula Language. It's a functional, case sensitive language similar to F#.”
Does M-code serve as a replacement for (the abysmal) VBA? And how similar is it to F#?
> Does M-code serve as a replacement for (the abysmal) VBA?
Sort of depends. M (and PowerQuery in general) is fundamentally about data processing. If you were only dropping into VBA for ETL purposes, then chances are you can use PowerQuery/M instead. With the advantage that it exists in a standard xlsx file, rather than the macro-enabled xlsm files that are more heavily restricted in many environments. It's not intended for more general purpose tasks, nor can it do stuff like interact with the Excel object model directly, so you'd still need to use VBA (or if targeting Excel 2016 and above, Javascript[1]) for that.
PowerQuery/M is basically a sidecar app within Excel, and doesn't have access to, nor is it aware of, the Excel object model. Excel.CurrentWorkbook[2] is just a convenience function available when used within Excel, and it treats data from the current workbook the same as if it were pulling from an external workbook or csv file or web API or any other source. So much so that even when solely accessing data from within the same workbook and not calling out externally at all, it prompts users with an "Enable External Connections" warning. You can then either have the output sync to a Table object within Excel (easily/generally available for use), or make it available from within the data model (only available via the object model or pivot tables).
That said, the data access functions[3] are fairly expansive, and you can get pretty creative in how you (ab)use them. For example, Web.Action is one I've abused often: pinging a web service to track usage, triggering external jobs, and exporting/syncing data to an external service. Although those last two uses require a bit of complexity in architecting, to account for cases when it gets manually refreshed/triggered in quick succession or with duplicate data.
> And how similar is it to F#?
I'm not familiar enough with F# to be able to answer that. But if you are, you can probably find out pretty quickly via this[4] article series. PowerQuery/M is used across a variety of Microsoft products, and most resources around it are akin to product-specific "how-tos". Ben's primer series is the only source I've come across that approaches it from a software engineering perspective.
Thanks for this great explainer. I discovered PowerQuery almost by accident (trying to just do a simple join).
The capabilities were nice (and, according to your posts, more extensive than I knew!), but the interface was beyond counterintuitive: no discoverability, unlike any of the variety of environments I've encountered in 10 years of programming, and the performance was painfully slow (compared to an RDBMS).
I have hopes for MS moving Excel to a less unique interface (I heard smooth scrolling is coming!) and I will definitely be reading your links.
Yea, discoverability is a pain. Although importing data from a CSV file has migrated from their obtuse "Text Import Wizard" to PowerQuery in the most recent versions of Excel, so more people will be stumbling on it.
And also yes, the Power Query Editor interface sucks. It's intended to be intuitive for Excel users, which inversely makes it a bit counterintuitive for someone coming from a dev background. The GUI functionality is also pretty limiting and only exposes a fraction of the overall capabilities. That said, there's an "Advanced Editor" button in the Ribbon that'll open up a modal that's far more familiar for a dev; shows you all of the code/Steps for the query all at once and has IntelliSense for code completion. Still crude as far as a dev environment goes, but far more productive than flailing around in the GUI the whole time.
Since you're coming from a dev background, I'd highly encourage that M primer[1] I referenced. It's a crash course on everything that's relevant to a developer (type system, error handling, patterns, antipatterns, custom functions, execution logic, etc). Between that and glancing over the standard library[2], you can avoid using the GUI entirely except for when it's convenient (such as scaffolding out some tedious stuff to then clean up in the Advanced Editor).
> and the performance was painfully slow (compared to an RDBMS).
Also of note is that Excel is only one of the data sources PowerQuery supports. There are a ton of others[3], including many RDBMSs natively supported and generic ODBC support if you have the appropriate driver installed.
It's not really comparable to VBA, as it's not used to automate/script the Excel object model. There is a JavaScript scripting model in recent releases which at least overlaps in use cases with VBA but I'm not sure if you can use it to do everything that VBA can.
M code is the query language mostly used for getting the data into power query and transforming it/cleaning it. The replacement for VBA for power pivot (as well as analysis services and power bi) is DAX (Data Analysis expressions). I've never used F# before.
> PowerQuery is already widely distributed and available to anyone with Excel installed
Looks like you need to install it separately as a module, it's not included with base Excel. So every user needs to download it, and have admin rights to install it. That makes it a non-starter in many of the corporate use-cases.
If you're pulling data from an external source that supports query folding[1] such as a database, it'll try to push those joins to the original source.
If you're pulling in data from a source that doesn't (or the join occurs after query folding is no longer possible), there are a host of performance considerations/optimizations.
By default PowerQuery knows nothing about your source data and makes no assumptions. If you're joining data using the GUI, the generated code will likely use the NestedJoin[2] function. A single "step" in the processing will have its ram capped at 256MB[3], so depending on the size of the data you're joining and absent of any data processing steps that would give it guarantees about your data, you may or may not be paging massive amounts of data to disk as it loops through the entire dataset you're joining for each row that is being joined.
That said, PowerQuery has 6-7 join algorithms available, and a variety of techniques that can be used to optimize the processing. For example, if you add a primary key to the table (via Table.AddKey or Table.Distinct or Table.Group), it can short-circuit the data processing the moment it finds a match. This[4] article series is really helpful for optimizing joins in particular and the link-outs and tidbits scattered within the series are fantastic for getting a better understanding of PowerQuery's inner workings and performance considerations.
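The short-circuit effect isn't PowerQuery-specific; it's the general win of a keyed lookup over a scan. A toy Python illustration of the underlying idea (not M semantics, just the principle):

```python
orders = [{"cust": i % 500, "amt": i} for i in range(20_000)]
customers = [{"cust": i, "name": f"c{i}"} for i in range(500)]

# Without a key: each order scans the whole customer list -- O(n*m).
slow = [(o["amt"], c["name"]) for o in orders
        for c in customers if c["cust"] == o["cust"]]

# With a key: build an index once, then every lookup is O(1) --
# analogous to declaring a primary key so the join can short-circuit.
index = {c["cust"]: c for c in customers}
fast = [(o["amt"], index[o["cust"]]["name"]) for o in orders]

assert sorted(slow) == sorted(fast)
```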
The problem is that everyone has Excel and knows how to use it. And enough people in finance and consulting see Excel skills as a mark of status that there's going to be some resistance to change there. A new tool has to be equally universal, easy to use for basic purposes while having advanced capabilities, etc. There are courses that teach supply chain management with Solver and the like - having people shift over from the only software they were trained to use is not trivial. There's a reason it was easier, in the end, for geneticists to change gene names than to change Excel's behavior: Excel is powerful and everywhere.
I have seen spreadsheets used for functions they were never intended for, like serving as a text notepad. But the main problem is that since spreadsheets are so easy to set up, people create them without adding even the most basic safety checks.
Testing the range of inputs or outputs is not done enough, but worse, using the wrong functions can give a sheet that looks good but is very wrong.
I'll stop you right there when I say that the firm I work at is a global finance and consulting firm and has globally started shifting to using data analytics tools such as Alteryx over Excel.
The barrier to entry for Alteryx is nowhere near as high as, say, Python or R for finance teams and consultants.
It is also seen as a mark of status to master a data analytics tool which improves a particular process either within the company or at a client.
Alteryx appears to be considerably more expensive, though. Shouldn’t it take a tool of comparable cost to replace Excel? Especially in self-serve use.
I’m not disputing the quality of the tool, to be clear. It seems like a good change and a totally reasonable line item. Just questioning what it will take to actually replace Excel in a widespread way.
My own Easy Data Transform software (https://www.easydatatransform.com) does a fair amount of what Alteryx does and is just $99 (one time fee). So cheaper than Excel!
Apart from the visual UI (which is a big deal, especially if you have lots of transforms):
-wider range of input and output file formats supported (xls/xlsx, fixed width, json, xml, csv, tsv, yaml, html, markdown, vcard)
-much faster
-45 transforms to choose from
-written from scratch for the purpose, rather than grafted on to an existing product
-doesn't mangle your data like Excel does!
Also, Excel is just buggy enough to screw you over anyway, even if you’re trying to use it safely. I can’t count how many times valid formulas have failed on me across various instances of spreadsheets. Or I have to save, reload, or restart just to make things work. I’m constantly amazed by how much it’s used for business critical functions while being so routinely unreliable.
Most business people don't value and appreciate correct software. It's odd but shouldn't be surprising that most people are unable to effectively evaluate the risk of a software defect, especially in a tool like excel.
I don't think it's that odd. Marketing in general is the art of ignoring edge cases, focusing on majority tailoring. Why would I care about losing 1 customer due to changing our branding, if it gains me 1000?
This is why business is not considered a form of engineering, while software is (just barely). Imprecision is a tool in business, while in engineering it is an enemy.
Same where I work. Excel was replaced completely with R a long time ago. Even automated pdf reports are made using R programs. Yet people are often shocked to hear it at first.
Having consulted in large enterprises, the main reason spreadsheets exist: A knowledge worker needs some form of automation support to do the job.
Choices are:
(a) spend a day drafting a rough spreadsheet, then over the next months tinker with it, making it a bit better and more efficient, and adding more relevant functionality.
(b) put in a formal project request to IT. This will be put into the backlog, and most probably classified as non-priority unless there is some lobbying from people with clout to 'get this done'. When accepted, usually months or even years later, it will be analyzed at length by a business analyst who is unfamiliar with the problems that need to be solved, then handed over to a programmer (or worse, an external contractor) who will implement something with 30 assumptions filling the holes in the analysis. It will be accepted by the analyst and then thrown over to the poor user, who will have to figure out if something, anything relevant to the job can be done with it. The user can start asking for corrections or rewrites, but since most of the operation is CYA from start to finish, they have a high chance of being berated for 'wasting the company's resources' (after all, this project that could have been done in a day with a spreadsheet has now already cost the org 200k or more at the least, a significant amount of which went to the IT dept's internal platform rewrites that they are desperate to do but never get budget for from the business), and will with high probability land on the 'difficult/nagging' people list of the IT dept.
Every few years the enterprise will launch an initiative to 'clean up all our processes', typically when a new key manager moves in with 'friendly' IT contractors that he was pals with at his previous place, and they'll start with an inventory of all the 'shadow IT' (spreadsheets, Access DB solutions, homebrew VB programs, some SaaS stuff ...) in order to purge and consolidate it all onto the shiny new ERP/BPM/... platform that was sold to the CxO. The new platform, a few dozen million dollars later, will (if successful) deliver maybe 33% of the required functionality, and the users can either 'let the business fail to show the knobheads that it is an impossible system to run the business on' or, more likely, keep using their spreadsheets and other quick-fix solutions to keep the business running despite it.
I work in Business Intelligence (BI) and the joke is that Excel is the best BI tool. My job starts after the new platform has delivered 33% of the functionality required. The rest of the functionality is in the Cobol system. Standard practice in BI is for business users to export spreadsheets/CSV from the different systems then use Excel to create the reports they want. No new BI software is going to magically solve the issues with data.
Hear! Hear! It's another mystery of capitalism! People screw up majestically, band-aid it, and still they make money. I'm not complaining, btw, having made plenty on the phenomenon...One might suppose all those people laboring mightily to band-aid all those problems must have some kind of incentive to do it that way, eh? Magical!
Excel is a triumph of software. Never before has there been a REPL so powerful, a programming language so widely used, an environment so simple even mere mortals can produce magic - without it, what would we have? The power of software would remain closed behind the ivory gates of the self-titled engineers, with development cycles, sprints and countless hours to make a single formula change.
Fear the excel less world, for without it society is a shadow of its potential!
I don't actually like Excel, and think it could be significantly better, but in general, I agree with you. Excel has allowed users with domain specific knowledge to create real honest to god software cheaply and quickly.
The biggest triumph of Excel is that you can incrementally learn new things, slowly turning a basic worksheet into a Turing complete masterpiece. In contrast, I believe that programming languages have a much higher barrier to entry.
I would like to argue that Excel (i.e. the spreadsheet) is a form of programming language, albeit in tabular format. Each cell is a variable; some cells are constants, some cells are computed, cells can be conditional, etc.
A programming language need not necessarily come in the form of written text. We can say that the concept of the spreadsheet is the programming language, and Excel, Google Sheets, etc. are just implementations of the same concept.
The success of the spreadsheet is in the fact that it's visual and intuitive, similar to what made Visual Basic and Delphi very successful.
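That claim can be made literal in a few lines: a sheet is a set of named cells, where computed cells are functions of other cells and values are recalculated on demand. A toy sketch in Python (no cycle detection or caching):

```python
# Constants are plain values; formulas are functions of a lookup callback.
sheet = {
    "B1": 100,
    "B2": 250,
    "B3": lambda get: get("B1") + get("B2"),                    # =B1+B2
    "B4": lambda get: get("B3") * 1.2 if get("B3") > 0 else 0,  # =IF(B3>0, B3*1.2, 0)
}

def get(cell):
    v = sheet[cell]
    return v(get) if callable(v) else v  # recompute dependencies on demand

print(get("B4"))  # 420.0 -- change B1 or B2 and the result follows
```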
In an ideal world, we would have psql (the PostgreSQL front-end) and AWK, and people who realize that thinking about the model before you start playing with layouts and copy-pasting formulas is a good idea.
Edit: Excel is a Dunning-Kruger syndrome enabler. People think they can do data analysis because they can type "=sum(B1:B4)" into a spreadsheet.
Just coincidentally I spent most of the last week converting a spreadsheet to Postgres. It is interesting how similar some of the formula functions are to SQL functions. All the functions are standalone and there are no methods on top of objects/types.
Excel took a step in the right direction when it introduced tables, which autofill formulas for all its rows (seriously, use them!), but it should've made deviations more obvious in them. Right now it's the usual green corner, which is extremely easy to miss in a large data set. This, and explicit column types would prevent maybe 80% of spreadsheet errors.
Technically you do have column types already[1], although only available via the data model[2] (another wholly underutilized feature set).
Although it only helps to an extent, since it doesn't prevent bad data from being input into a table, and supports so many implicit conversions (listed out in [1]) that undesired behavior is still possible. Plus it enters into a little known area of Excel rather than baked into the customary workflows.
That said, it still comes in handy. A typical pattern of mine for worksheets I have to release into the wild (i.e. no telling who will end up (ab)using it and when it'll eventually come back to me in a bastardized form to salvage) is:
1. Create table(s) for raw/user-inputted data
2. Use that table as a source for PowerQuery
3. Leverage PowerQuery for type checking, data normalization/validation, and data joining.
4. Output to a new table on the worksheet, referencing this instead of the original table for all formulas.
Also optionally putting complex logic into the PowerQuery steps, that way the "output" table is ready for end-use and doesn't require any complex and error-prone formulas or anything.
Even without leveraging the data model and DAX functions, this works out super well. More sophisticated users can easily modify the PowerQuery steps, and less sophisticated users are protected from accidental mistakes by either using it as-is or reaching out for help earlier on, before there's a crisis for assistance.
You can still be exposed to data mangling on input/import, due to all sorts of formatting issues (like zip codes losing their leading zeroes or invalid phone numbers due to misaligned formatting).
Very true, and that's why I prefaced my comment with "only helpful to an extent".
In some cases, data validation[1] on the cells directly can help. Particularly if the data is going to be manually entered by the end user, rather than a bulk copy/paste job from elsewhere.
In other cases, you apply defensive checks within the PowerQuery processing. If a field is supposed to contain US zip codes, you can explicitly set the column type to text, split at dashes and keep the first part, verify it only contains numerical content, optionally check for nonsense values (all zeros, all nines, 12345, etc), then left-pad it with zeros to ensure they're all 5 digits. You can also package up all of those steps into a function that can easily be added into future worksheets via copy/paste, saving you from reinventing the wheel. And depending on the data itself, it may be handy to call out to a third-party API for validation/normalization.
To maintain sanity, you can also create a helper column for validation. If any of the columns can't be "safely" normalized or validated, you update this field with details on what failed. At the end, split out any records where that column is not null into its own table. That way you both filter those records out of the "final" table, and can also easily add in an Exceptions Report section to the spreadsheet, clearly calling out the rows that had issues and providing the error details to the end user.
It sounds complicated, and it is compared to the alternative (Excel's undefined behavior silently swallowing it). But it's actually incredibly straightforward to implement, and a fairly natural design pattern for someone coming from a software development background. You just start treating Excel as a UI and PowerQuery as a backend. All of the business/processing logic gets decoupled into PowerQuery, and formulas become primarily tools for presentation logic (such as adding an Exceptions notice to the primary worksheet if COUNT(EXCEPTION_TABLE)>0).
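For readers who don't know PowerQuery, the validate/normalize/split-exceptions pattern described above translates to a few lines in any data tool. A rough pandas analogy using the zip-code example (column names and rules invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"zip": ["02142-6789", "98101", "0", "9x101"]})

def check(raw):
    """Return an error message, or None if the value normalizes cleanly."""
    first = str(raw).split("-")[0]  # keep the part before any dash
    if not first.isdigit():
        return "non-numeric"
    if first.zfill(5) in {"00000", "99999", "12345"}:
        return "nonsense value"
    return None

df["error"] = df["zip"].map(check)
clean = df[df["error"].isna()].assign(
    zip=lambda d: d["zip"].str.split("-").str[0].str.zfill(5)
)
exceptions = df[df["error"].notna()]  # surfaced in an exceptions report, not swallowed
print(clean, exceptions, sep="\n\n")
```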
Excel suffers from being accessible to "everyone", and having a low barrier of entry.
While I'm not an auditor, I've been the go-to Excel person at all my internships and in full-time roles. I've reviewed, enhanced, and fixed hundreds of spreadsheets for tons of different functions (HR, supply chain, manufacturing production reporting). You can build a respectable system in Excel with built-in error flags that are obvious to the end user. Here are some of my tips for people who deal with spreadsheets:
* Cell styles, much like headings in Word, are really great but nobody uses them. They clue end users about what cells do. If my end users see an orange cell, they know it's an input they can modify.
* Lock sheets, even without a password. 90% of the time this will prevent people from breaking the spreadsheet in the first place.
* Stay up on new functions! =UNIQUE, FILTER, SORT, SWITCH, TEXTJOIN, and {SUM|AVERAGE|MIN|MAX}IFS have been added in the last ~18 months and can greatly clean up messy formulas.
* Document stuff like code should be documented. The Explanatory Text style is excellent for more in depth explanations but not visually clogging things up.
* Use tables. Name the tables. =Average(tbl_Production[Widgets Produced]) is self documenting, and the range expands with the table. Range references are notoriously obtuse.
--
The European Spreadsheet Risks Interest Group has some great information on best practices.
I am just completely amazed, and yet not surprised, that there is an entire organization dedicated to spreadsheet risks and best practices. What a world we live in. This just goes to show what the killer app is for computers, it seems.
We have these wonderful machines, and 80% of people use them for browsing the web/using web apps and producing spreadsheets. That's the reality I tend to forget sometimes.
The spreadsheet and the word processor were THE killer apps of microcomputers. As long as humanity uses letters and numbers, we will have spreadsheets and word processors in some shape or form.
>* Cell styles, much like headings in Word, are really great but nobody uses them. They clue end users about what cells do. If my end users see an orange cell, they know it's an input they can modify.
It would be interesting if excel had some sort of internal markup language perhaps. So those orange cells could be semantically represented as <input> or some such.
The problem is that someone creates a sheet, locks all the cells, forces a bunch of people to use it via some process, and then either a) never touches it again or b) leaves the organisation and doesn’t tell anyone the unlock password.
Example: a lookup list for “hardware models” which is protected. So you can’t add a new model, but some cretin forces you to keep using the sheet.
Yes. Go to cell properties and you will find there is a "locked cell" property that is on by default.
When you Protect a spreadsheet, all Locked Cells become... locked.
So a typical use case is to only un-lock the cells you wish people to enter variables into, (color the cell to make it clear), and then protect the sheet so nothing else can be tampered with.
I'm reluctant to pimp my own product on HN, but one of the features of Spreadspeed [1] (my Excel add-in) is a Quick Protect tool which protects the sheet but will unlock individual cells based on styling (e.g., Input style).
The European Spreadsheet Risks Interest Group (EUSpRig) runs an annual conference [...] — and there’s also a Yahoo Groups mailing list, where members offer tips and tricks, share links to resources and pick apart press coverage of the contact tracing debacle
When Yahoo Groups shuts down, no more excel repair for governments?
Also related, his recent video about the UK government's loss of COVID data (featuring a fantastic parody ad for Spreadsheets™): https://www.youtube.com/watch?v=zUp8pkoeMss
In my former life I worked in Finance departments of Fortune 500 companies, responsible for process and report automation, and making Excel-based systems less unwieldy -- often by converting them to Access. I have an Accounting & Info Systems degree. I'm well-versed w/ relational databases, SQL, and VBA. In more recent years, to distract myself from wanting to die, I've become proficient with Python and R. Unfortunately, a gap in my resume a few years ago became the reason recruiters started treating me like a leper.
What went wrong, besides bad luck? I was good enough at what I did that the perception formed that I'm not a real Finance person. So, instead of being used in roles with longevity, where writing code was an adjunct to domain expertise, I became The Tech Guy: a 'hired gun', used exclusively to build tools for the real Finance pros to use in perpetuity.
Reading these comments is upsetting because they suggest that in 2020, someone w/ my background should be able to support themselves. To those who understand the power of databases and automation, the inanity of using Excel to do almost everything except wash your dishes is painfully obvious. Trust me: there's a better way.
Many technical people -- W/OUT a business background -- are prepared to do whatever is necessary to improve their understanding of The Business Side. Why is it that asking finance types to write code is akin to asking them to perform brain surgery? Perhaps grasping math and accounting principles doesn't imply creativity, or competency w/ language and logic? I've worked with people who can do all of these things. In 20 years, those who can't may struggle to find work. Somehow, in spite of my background, I ended up in the same boat.
Sorry to sound bitter, but people with skills who can't find work are sometimes more miserable than those without any skills, who have resigned themselves to a life of earning minimum wage. I miss the good old days, when I could flush money down the toilet for sport.
About Excel visual grid performance...I have tried many data table libraries over the years in various languages. Almost all have performance issues when reaching 100-1000+ row range (even with virtualization and other tricks enabled). I would like to know how the Excel team made their grids so fast. Scrolling is so smooth.
MS Office was updated to use GPU accelerated rendering and I guess this has something to do with it. I don't know how this works (which API, which portions of the software, which algorithms) as details on that seem to be pretty sparse.
Same here. From around Excel 2007 - 2013... activating the GPU acceleration option slowed Excel way down. I remember turning off the setting many times during those years to get better performance.
Fast forward to 2016-today...Microsoft fixed the GPU acceleration bugs.
I recently opened a 9000 row sheet in LibreOffice Calc on an x220 and had no performance degradation at all. In fact I was impressed at how responsive the application felt compared to Excel. Might be interesting to try bigger and bigger files until it looks like it's slowing down.
This has been my profession for the last 15 years in a large consulting firm. We do this work for financial institutions (some other industries too). Despite how this group feels about banks, US banks have set up internal groups that literally review (validate) and test quantitative models, many of which live in Excel. This was driven by SR11-7 and OCC2000-16 both regulatory guidance. The term for this group / department / function is Model Risk Management (or model validation). Most banks now hold an inventory of models that exist in the entire bank (or at least the ones that have been declared) and they run a battery of testing procedures on them annually. It’s pretty exhaustive and sometimes too much but it certainly finds errors both in implementation and in logic.
You’d be amazed at the types of errors we’ve found over the years. Formulas referencing blank cells, relative references formulas that were meant to be static when the formula was copy and pasted to other cells. These were in some pretty serious models (e.g. trading models, risk monitoring models).
One point on Excel, it’s wildly powerful. We’ve done things in Excel that would impress most of this group. Things like 3D formulas (formulas that go through tabs), array functions (which are very infrequently used but are very powerful). I’d argue that most pieces of code (with the exception of recursion and other advanced things) can be replicated in Excel.
Half of the time when validating complex models implemented outside of Excel, we’d use Excel to replicate the logic to benchmark the result.
My big question to this group is, does this type of practice exist in non-banks? Do other industries do this?
I used Excel professionally for 20+ years as a quant in major banks, mostly as a front end to mathematical models in C++. I've always been amazed how a 40-year-old UX paradigm is still being used virtually unchanged in modern systems.
Spreadsheets are truly ubiquitous in the financial industry, and for 20 years I've been struggling with exactly the same issues mentioned in the article and in other comments here.
The problem is that neither Excel nor most alternatives treat a spreadsheet as the code that it essentially is. And being code, it needs to:
- be source control friendly
- be testable
- have business logic (formulas) decoupled from presentation
Only then one can seriously consider using a spreadsheet as part of any production grade system.
For the past couple of years I've been working on https://ellx.io - a platform for visual reactive programming and a new generation spreadsheet/notebook designed to solve exactly this pain point.
Here are some examples of what you can do with it:
Why do I even have Excel skills? Because of the garbage reports and data that I was pulling out of databases. I spent YEARS exporting data-dumps and cleansing them in Excel to help customers whose lives were turned inside out because of garbage in databases.
Not a negative word should be uttered about Excel until you can guarantee that databases and software are pristine and never fail due to human error.
I agree that Excel is great for cleansing or processing small/medium-sized data sets. It's great for comparing two or more data sets too. The thing that makes it great for these tasks is visual editing, plus the ease of setting up formulas. This also makes it dangerous in business-critical processes, because it allows too great a surface for mistakes.
There's also the issue that they're often set up as an ad hoc analysis only to become a process carried out on a schedule. This gets particularly bad when macros are involved - especially when they're password protected.
Jocelyn Ireson-Paine (JIP) et al developed software years ago that converts spreadsheets to language and vice versa, so you can write in code and then generate a corresponding spreadsheet. This two-way capability aids, among other things, in debugging spreadsheets. JIP is in euSPRIG (see http://eusprig.org/best-practice.htm ). Here are some links:
Give me a break. 2nd sentence of the article: "It had the number 40,335 in a random box, and payroll wasn’t clear why it was there."
If you work with Excel with any regularity at all, it's immediately 100% clear that this is a Date issue. I can't emphasize enough how obvious and elementary this "mystery" is.
>I don't use Excel - could you explain what date this is and why it's obvious?
MS Excel internally represents dates as a floating-point number counting days from an epoch of Jan 1, 1900.
So "40335" or "40335.0" is +40,335 days from that epoch -- which works out to June 6, 2010.
An experienced Excel user often runs into "mysterious 5 digit numbers" from 3xxxx to 4xxxx and will instinctively think it's probably a date that's inadvertently formatted/interpreted as a number. By switching the formatting of that cell to "Date" (Excel right-click "Format cells..." choose Category "Date"), the mystery number becomes a readable mm/dd/yyyy.
Same idea as experienced programmers coming across a 10-digit number like "1602930678" in a JSON or XML data file and instinctively assume it's probably a UNIX timestamp rather than a phone number with area code (160).
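The conversion is a one-liner to sanity-check in code. A quick Python sketch using the conventional 1899-12-30 base date, which absorbs Excel's phantom Feb 29, 1900 for modern serials:

```python
from datetime import date, timedelta

def excel_serial_to_date(serial: int) -> date:
    # Valid for serials >= 61, i.e. dates after Excel's phantom 1900 leap day.
    return date(1899, 12, 30) + timedelta(days=serial)

print(excel_serial_to_date(40335))  # 2010-06-06
```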
The article is written for a general audience that probably doesn’t use Excel regularly. It would be counter-productive to use an example the reader needs paragraphs of explanation to understand.
That describes thousands of SMEs and non-technical business units who use Excel to manage their accounts and business operations. For example, a small law practice might use it to manage salaries, but, in spite of being full of very smart people, employ no one who has a clue about the quirks of how Excel handles dates.
I'm not criticizing the consultant, hell, I'll happily take someone's money to "solve" these types of problems. But nobody from "payroll" (as described in the article) should be authorized to cut checks, especially for $40,000+, if they haven't figured out a pretty basic aspect of the primary tool they use -- and in addition, don't have critical thinking skills to determine that, hey, the person's start date seems to be missing. If Excel is too challenging, either outsource payroll or go ahead and spend for a dedicated application. Being a non-technical business is no excuse to hire incompetent people.
Good thing Excel doesn't use Unix time... payroll would likely have just written the guy a $1.6 billion check. Look at that, Excel just reduced the company's expenses by 99.997%!
> there’s also a Yahoo Groups mailing list, where members offer tips and tricks, share links to resources and pick apart press coverage of the contact tracing debacle.
I do hope the group is aware that Yahoo! is closing down Yahoo Groups and have moved their mailing list
One of the interesting takeaways from Emery Berger's talk is that he was able to replicate, to 2 decimal places, a published spreadsheet transcription error rate: 5.26% of cells have errors.