I've seen a lot of products similar to this, and it looks like one of the nicer ones. But the use case in the video and screenshots makes me immediately think it wouldn't be viable in any of the organizations I've written SQL for in the past 10 years.
At least 95% of the challenge in querying data doesn't come from the difficulty of writing SQL statements IMO, but from the complexity/brokenness of the data itself. Especially with the move to unstructured data streams, most data warehouses seem to have become pretty bad when it comes to extracting truth out of the information within. Many business users know some SQL but blanch at creating coherent reports from what's available to them. What BI people are paid for seems to be having knowledge about the problems particular to a domain.
If your data is relatively clean and follows a good model, this would be a great way to help someone join, group, and aggregate their data without knowing SQL. I think a lot of people would use Tableau for this if it didn't also have a steep learning curve / high expense. However as soon as you get into data where you're writing case statements, coalescing null fields, matching on different data types, decoding, partitioning over streaming data etc., it doesn't help someone without knowledge of the caveats within the data sources themselves. Show me someone who doesn't know SQL using this to produce insight out of compromised data and I will be impressed.
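To make that concrete, here's a rough sketch of the kind of query I mean - nothing exotic, just the everyday cleanup work (all table and column names are made up):

    SELECT
        o.order_id,
        -- decode a magic status code into something readable
        CASE o.status_cd
            WHEN 'A' THEN 'Active'
            WHEN 'C' THEN 'Cancelled'
            ELSE 'Unknown'
        END AS status,
        -- nulls mean different things depending on which system wrote the row
        COALESCE(o.ship_date, o.est_ship_date, o.created_at) AS effective_ship_date,
        c.customer_name
    FROM orders o
    -- the customer reference is text in one system and an integer in the other
    LEFT JOIN customers c
        ON CAST(o.customer_ref AS INTEGER) = c.customer_id
    WHERE COALESCE(o.is_test, 0) = 0;  -- exclude test rows that were never flagged consistently

None of that is hard SQL; all of it requires knowing the quirks of the source systems.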
I fully agree with your experience here. We have a super hard mission at Chartio - it's not just about the interface but also how the data is set up. The interface, being as flexible as it is and enabling full schema browsing (instead of dataset browsing), is a pretty big part of allowing a more agile style of data modeling, though. We had that very much in mind, and we've written a book (soon to be published with Wiley) on proper modern data governance techniques.
Our next phase is to help people get to that cleaner source of truth much more quickly than traditional dimensional modeling approaches. Tools like Visual SQL and DBT (https://www.getdbt.com) are really changing the complexities here.
I'd love to see the data modeling book. I spend a lot of energy shouting the virtues of The Data Warehouse Toolkit into the void. You are right that it is outdated, but it isn't entirely (or even mostly) wrong.
My coworkers seem much more interested in making a bigger EMR or adding nodes to Redshift than designing a reasonable data mart because "star schemas don't scale". I'm interested to see what you come up with, it is a huge gap in the current literature.
I would love to be involved with that book. I'm a data engineer myself and I have built SQLBucket, a python library I have been told is similar to getdbt.com (although I'm not familiar with DBT, the similarities have been mentioned to me on various occasions).
Oh awesome! I haven't heard of SQLBucket but I'll check it out, as I love anything that encourages SQL-based modeling. I was writing my own and then DBT came out, and we push people there primarily. Send me a note - dave-at-chartio - and I'd love to chat sometime.
Very interesting article. The most interesting part is how, while designing your product, you realized that the spreadsheet surface is the most intuitive for users. They also like the pre-baked results you present quickly, so you can really see your customer's problem.
What is really good is assembling a library of visual queries for the customer. This is a good idea because many users have the same fundamental types of queries on their data. Once you have enough of the basic queries that the user can do useful work without programming, you can then find a way to let them customize.
Do you have data on how many similar queries customers use? That would tell you how to define the basic set of important operations.
I wish I could say that we had that insight from the beginning and that it worked right away, but we ended up failing into that realization after a lot of designs and prototypes.
So what you describe is somewhat built into what we have now. Users still have to choose what columns they want to look at (there's no real way for us to guess that), and then we do apply some knowledge about what type of data they're looking at to help them get to what they're likely looking for.
We've also tried at times to make default dashboards for data sources when people connect. We can do this to some extent with known schemas like GA, Salesforce, HubSpot, etc., but for databases that's proven to be a largely impossible task so far. Everyone's data is so different, and has such odd conditions to consider filtering by, that the auto dashboards end up being quite useless.
Yeah, doing ETL, everyone wants me to take their data and perform magic with it, but I really want to see the stored procedures and queries running on top of it before I can understand the best way to "auto dashboard" anything.
Finding out the distinct set of values in any column helps a lot, and referential integrity helps a lot, but without those queries it's pretty dang hard.
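For what it's worth, these are the kinds of profiling queries I end up running first (table and column names are hypothetical):

    -- what values actually show up in this column, and how often?
    SELECT status_code, COUNT(*) FROM orders GROUP BY status_code ORDER BY 2 DESC;

    -- quick referential-integrity check: orders pointing at customers that don't exist
    SELECT COUNT(*)
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL;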
If you aren't investing the work up front to make your data useful then no amount of tooling or magic beans will make writing meaningful SQL easy. If your reports (SQL) are hard to write it's because you have bad data.
Having said that, making data exploration easier is always a worthwhile exercise. The more business people can self-serve their questions the better. It falls to data people to make that possible, not just with tools but with the data itself.
SQL is easy like solving differential equations is easy. If you know how, you just work through it, but that's only because you already learned all the prerequisites and then spent a year getting good at it. Anyways, the point is that either way, you'll never have an entire organization working at this level with SQL or solving equations, and that's fine. The CEO doesn't need it; sales don't need it. They do need numbers though, so having a dumbed-down interface that can show some curves after clicking "net sales" and "country" lets them move forward with what matters to them, without having to learn the difference between an inner and outer join, which doesn't matter to them.
Not everyone needs to be a programmer even if they do need the value typically buried in programmer interfaces.
Questions to be asked of data are always an afterthought, coming after the systems that gather and produce the data have already been designed.
The idea that it’s ok to just give a BI tool to business people later, and not involve them at the beginning to inform the system and data model design, is maybe why we’re in the tar pit.
A really excellent process/tool would help business figure out what kind of questions they need answered, and work backwards to the data model and implementation.
Sometimes it's not for lack of trying, though. I've done a decent amount of data model design in my career and as I got more experienced, I learned to make a point of asking the business people what specific questions they'd need the data to answer. Very often I got only very vague answers that were of little to no use in informing my design. I've seen this repeatedly at a variety of companies over the years. The business people are often winging it as much as the engineers are.
What do you guys think about a system that would 'guide' the user through the process of creating a star schema data warehouse (Kimball style)?
The advantage of this approach is that you do the dirty laundry upfront, during the modeling and population phases. The end result is a data structure that is fool-proof, i.e. there is only one way to join facts/dimensions, and it's self-documenting. In fact, pre-joined views could automatically be created (and persisted if need be), giving the business user a clean structure to interact with.
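Something like this, where the only sensible join paths are already baked in (a hypothetical star schema, just to illustrate the idea):

    CREATE VIEW sales_flat AS
    SELECT
        d.calendar_date,
        p.product_name,
        s.store_region,
        f.units_sold,
        f.net_revenue
    FROM fact_sales f
    JOIN dim_date    d ON f.date_key    = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store   s ON f.store_key   = s.store_key;

The business user only ever sees sales_flat; the join logic is decided once, by whoever did the modeling.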
The payoff from having some engineers do the T of ETL (transforming the data to be more queryable, denormalizing values, etc.) is insane if you have problems democratizing data analysis/reporting.
The dumb-simple version, for people who don't know what this looks like: make a separate database. Create tables based on what you often want to query (but don't try to keep the same shape as the base data!). Then figure out how to fill in those tables from the base data.
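Spelled out, a minimal sketch (Postgres-flavored, all names made up): a reporting table shaped for the questions rather than the source systems, refreshed from the base data.

    CREATE TABLE report_monthly_sales (
        month        DATE,
        region       TEXT,
        product_line TEXT,
        units_sold   BIGINT,
        net_revenue  NUMERIC(18,2)
    );

    INSERT INTO report_monthly_sales
    SELECT
        DATE_TRUNC('month', o.order_date) AS month,
        c.region,
        p.product_line,
        SUM(oi.quantity),
        SUM(oi.quantity * oi.unit_price)
    FROM orders o
    JOIN order_items oi ON oi.order_id   = o.order_id
    JOIN customers   c  ON c.customer_id = o.customer_id
    JOIN products    p  ON p.product_id  = oi.product_id
    GROUP BY 1, 2, 3;

Rerun the load on whatever schedule makes sense; the point is that the reporting table's shape is driven by the questions, not by the operational schema.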
We use SAS Enterprise Guide in my org. It has a visual SQL generator similar to what is previewed here; it is quite good for simple queries, but I find it falls down when you introduce complexity.
Joins between tables are one of the big pain points for me. I am not sure how the demoed product does joins, but in SAS EG you join tables by clicking and dragging a line between the two columns you want to join, and it gives you a pop-up to select Inner, Outer, Left, Right, etc. (although it uses plain English, i.e. rather than saying "left join" it says something like "All rows from Table 1 and matching rows from Table 2"), along with a visual Venn-diagram-style preview. This is frustrating for a few reasons. If a table is large, i.e. several hundred columns, you have to scroll for an eternity to find the columns you want to join on (a drop-down box you can navigate with keyboard shortcuts to jump to a column name would be a much better UI IMO). It is also difficult to do complex joins, e.g. "WHERE A.Col1 BETWEEN B.Col1 AND B.Col2". Maybe it's just my org, but this is a common feature of our data: we have a lot of event-driven stuff with start and end dates, plus raw time series data you need to aggregate between the event frames. SAS's time series stuff is very good, but the SQL side not so much...
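For context, the kind of join I mean looks like this when written by hand (hypothetical event and time-series tables, just to show the shape of it):

    SELECT
        e.event_id,
        AVG(t.reading) AS avg_reading,
        MAX(t.reading) AS max_reading
    FROM events e
    JOIN sensor_readings t
        ON t.reading_time BETWEEN e.start_time AND e.end_time
    GROUP BY e.event_id;

Trivial to type, but there's no obvious line to drag between two columns in a GUI to express that range condition.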
It is also very easy for unsuspecting users to join on unindexed columns, which leads to very poor database performance. In general there are a lot of performance footguns with generated SQL: a user can have a working query, change something in the GUI, and suddenly a query that ran in a subsecond takes several minutes. I have developed the habit that for complex queries I use the GUI to generate the SQL and then hand-edit it before running it to make sure it will be performant. Or I just hand-write it to begin with.
Pattern matching and filtering is another pain point. The query editor lets you use "LIKE" but does a poor job of explaining to the user how it works. People who only use the GUI are surprised when I explain that you use "%" and "_" to do character expansion; a lot of people blindly assume "*" is used and then complain their query isn't working.
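The thing I end up explaining over and over, on a hypothetical table:

    -- % matches any run of characters, _ matches exactly one character
    SELECT * FROM patients WHERE last_name LIKE 'Sm_th%';  -- Smith, Smyth, Smithson, ...

    -- what people try first: this only matches a name that is literally "Sm*",
    -- because * has no special meaning in standard SQL LIKE
    SELECT * FROM patients WHERE last_name LIKE 'Sm*';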
Microsoft Power BI also has a query editor. I've used it a little bit but do not have as much experience with it. From what I saw, it seemed a bit less flexible; it was easier to do the data extraction and transformation first and then load the result into Power BI.
Absolutely. I've been employed in healthcare IT for the last 5 years, doing data extracts for researchers and reporting. Many people don't quite appreciate the underlying complexities of extracting data from not only large EMR systems such as Cerner (with over 6000 schema-less tables), but integrating that with the countless other systems the hospital employs for things such as radiology, pathology, ICD coding, etc. A lot of the data isn't entered as you would expect, documentation is often lacking, and there are very few people who can tell you where exactly the data sits and how it is actually entered by medical staff doing various work all across the campuses. A simple research request becomes immensely complex when multiple systems are involved and you're using something as complex and as evolving as health data.
So I agree that while this tool may be handy for some, the real challenge with this sort of work is knowing where the (often unstructured) data sits, how it's entered and by whom, and how to extract meaning from it.
I think of other use cases for this product than the ones you mentioned. I don't see this as an end-user product for analysis. IMO a product like this can be used for iterating over backend design with other developers when discussing a new feature or model change. Imagine walking through a complex data model when discussing a new feature and needing to quickly visualize stuff. I write SQL by hand pretty quickly, but I would love to use a product like this in a team design discussion where we want to visualize what we would be pulling from the DB. Instead of having to hand-write all the joins and filters as you walk 5-6 tables, you can quickly and easily see what data is going to be provided to your DAO layer. You can spend way more time focusing on the discussion and less time waiting around for data access changes to be visualized. Then, in the end, the developer can take the auto-generated SQL and use it as a base for the DAO layer, and do the optimizations and cleanup to the query that you mentioned in your post.
So the agile world of data we're looking to bring about is one where your DAOs or marts are still regular schemas - not just flat tables or cubes. The developer teams would be creating these mart schemas, and then business users could do much of their own self-serve exploration and visualization with Visual SQL. I wrote about some of these modern architecture approaches here: https://chartio.com/blog/cloud-data-management-book-launch/
First, I want to applaud the effort here; this looks like a great tool for BI/dashboarding and I've signed up to demo it.
I'm not here to poo-poo your product at all, but I want to explain why I think this can't work in all cases - and that is totally fine!
The application that is my bread and butter is about 25 years old and started as a mid-90s PowerBuilder application. It's since migrated to a 3-tier .NET enterprise app, and the data model has been dragged along with it largely unchanged. The data model is full of cruft and unnatural keys that are the result of a ton of technical debt. The newer tables in the database are more sane and follow most SQL best practices, but there is a large portion that was written (and I know the original data architect, who SWEARS this is true) in no-vowel notation - unless the table starts with a vowel, or maybe throw a random e in there too. A few examples:
Sane column name -> actual column name
AccountDetailAccountCode -> AcctDtlAcctCdeID
AccountingPeriodID -> AccntngPrdID
PlannedMovementBatchID -> PlnndMvtBtchID
Provision -> prvsn
There are about 1000 tables in this database, so part of the problem with SQL is that you have to know about the table structure to know about the database. Another is that a lot of this data is constructed in stored procedures, functions or views.
Most people need at least 12-18 months to totally grasp the application and the data model in a real way. I still have to create database diagrams to find a path to the data I want, and I've been at it for 10+ years. What I would love to see is a way to show table relationships that is filterable in some way - i.e. not just because there is a key relationship. To me it's much more useful to know that table A is related to table B and that both tables have some threshold of row counts. I don't care that there is a key joining table A and table B if table B has 0 rows. Building that intelligent metadata about the database would be hugely helpful.
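In Postgres, for example, you can get a crude starting point for that from the system catalogs (a rough sketch; the row counts are planner estimates from pg_class.reltuples, so they're approximate):

    -- foreign-key relationships, dropping targets with ~0 rows
    SELECT
        con.conrelid::regclass  AS from_table,
        con.confrelid::regclass AS to_table,
        c.reltuples::bigint     AS to_table_estimated_rows
    FROM pg_constraint con
    JOIN pg_class c ON c.oid = con.confrelid
    WHERE con.contype = 'f'
      AND c.reltuples > 0
    ORDER BY to_table_estimated_rows DESC;

That still only covers declared keys, though; the relationships that live in stored procedures and application code are the ones a tool would really have to dig up.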
> At least 95% of the challenge in querying data doesn't come from the difficulty of writing SQL statements IMO, but from the complexity/brokenness of the data itself.
More and more of our consulting gigs are this - collating and cleaning crappy data, and then putting BI reporting on top of it.
I agree with this as well. Products like this seem to be designed primarily for well-structured and lovingly maintained databases. If you are lucky enough to have one of those, congrats!