SQL is syntactic sugar for relational algebra (scattered-thoughts.net)
216 points by dmarto on March 23, 2024 | hide | past | favorite | 145 comments


As someone who learned mathematics first and programming later, I think it took me about 10 years of working in data-intensive programming before I could write really "good" SQL from scratch.

I completely attribute this to SQL being difficult or "backwards" to parse. I mean backwards in the way that in SQL you start with what you want first (the SELECT) rather than what you have and widdling it down. Also in SQL (as the author states) you often need to read and understand the structure of the database before you can be 100% sure what the query is doing. SQL is very difficult to parse into a consistent symbolic language.

The turning point for me was to just accept SQL for what it is. It feels overly flexible in some areas (and then comparatively rigid in other areas), but instead of fighting against this or trying to understand it as a consistent, precise language, I instead just go "oh SQL - you are not like the other programming languages I use but you can do some pretty neat stuff so we can be on good terms".

Writing good SQL involves understanding the database, understanding exactly the end result you want, and only then constructing the subqueries or building blocks you need to get to your result. (then followed by some trial and error of course)


I feel like a foreigner in another land when I read your comment and others like it. For as long as I can remember using SQL, I can't remember ever finding it more difficult or backwards than anything else I use.

That difference might go some way towards explaining why I prefer a much more database heavy/thick approach to writing apps than my peers.


I agree. I never even thought about "select what you want first" as a problem until someone else pointed it out.

Programmers seem far too sensitive about wanting everything to work one way. SQL is a very powerful DSL. It has its quirks but nothing that ever enraged me. I don't really care that it doesn't work like some other stuff I use, I just accept that I'm learning the language of a particular domain. This doesn't mean that I don't think there is always room for improvement. Of course I think FROM first would be a little nicer, but so much nicer that I think it's worth changing a whole battle-tested standard? Not at all. The pain is so minimal I don't even feel it.


> I never even thought about "select what you want first" as a problem until someone else pointed out.

I thought it was a problem as soon as IDEs had good SQL autocomplete. I got so used to depending on just being able to "tab my way through" autocompleting in other languages (e.g. if you do <objectVariable>.<propertyName>, it's obvious the set of property names can be narrowed down based on the type of the variable), that it immediately becomes apparent that doing select first sucks, because autocomplete has no good information until you get to the from clause. A lot of times with a good SQL editor like Datagrip I just do "SELECT * FROM foo" first, and then go back and edit the select columns because it can now autocomplete them quickly.

I now notice this in other places too, like I hate how in JavaScript you do `import { foo } from "moduleBar"`. I'd much rather do `from "moduleBar" import { foo }`.


Hey, that's fair! I'm not a big autocomplete user so I never thought of this, but it's a good argument.

> I now notice this in other places too, like I hate how in JavaScript you do `import { foo } from "moduleBar"`. I'd much rather do `from "moduleBar" import { foo }`.

Personally I prefer languages that don't make you import at all ;)


The actual ISO standard falls well short of being useful/sufficient to anyone who isn't an incumbent player. It's effectively a moat and therefore a direct impediment to competition from teams who have novel technical ideas but don't have access to significant capital - building a SQL implementation is a long, expensive journey. This is why many startups resort to building Postgres extensions, or using Calcite or DataFusion.

If SQL weren't so (needlessly) complex we would see much more competition across the database space.


> If SQL weren't so (needlessly) complex we would see much more competition across the database space.

I think there is more competition across the database space now than back when the SQL spec was less complex (say, in 1989 with SQL-89).

Also, much of the complexity in the spec comes from complex features; I really like grouping sets and window functions, and sure, those add complexity; but they allow users to express concepts that let the database process data more efficiently than sending everything to the user and letting the user do the computations.


LINQ runs with FROM being first. Definitely trivial difference but a bit easier.


Ya, I use Ecto, which is the de facto Elixir SQL abstraction. It's heavily inspired by LINQ, though it only works with SQL. In any event, it also starts with FROM and I always end up writing my selects last. I've just never felt particularly annoyed writing them first in SQL (and I've written a LOT of raw SQL); I'd just do it without thinking about it. Never thought of it as a big deal.

The big problem with SQL AFAIC is its poor (really complete lack of) composability.


> I feel like a foreigner in another land when I read your comment and others like it. For as long as I can remember using SQL, I can't remember ever finding it more difficult or backwards than anything else I use

Learn linq or query/list comprehensions and then you'll easily see why SQL is backwards.


I've been using Django almost as long as I've been using SQL and I prefer the SQL ordering more: it matches the rest of the code, making it faster/easier to read. As a crude example:

  SELECT results FROM source WHERE criteria

  results = source(criteria)
It's rare to see someone want to change assignments in code to be like:

  source(criteria) -> results
Where I see it as the same thing: the SELECT columns, like the variable assignment, are the interaction point with the following lines of code.

And yes, CTE ordering does annoy me because of this. Putting it in the middle is pretty much the worst order.


> Where I see it as the same thing: the SELECT columns, like the variable assignment, are the interaction point with the following lines of code.

Indeed, which is why source(criteria) -> results makes more sense: the results definition is right next to the code that's going to be using that definition. If you put the results definition first as with SQL, then you have to scroll up to find the context (although perhaps Python's indentation sensitivity is the tripping point in this case). Not even mentioning the fact that the SQL way completely destroys any chance of code completion.

I'm going to boldly state that the SQL way is literally objectively wrong, in that there is no world in which SQL's choice is superior for general querying.


Then why are you advocating for it?

> or query/list comprehensions

List comprehensions are column first.


Right, here's the nuance: list comprehensions are intended to be concise one-liners, so having the results definition far off to the right defeats the principle I was outlining. Most SQL queries are not like this; they are almost always multiline, of the form:

    select x, y, z
    from Foo
    where a or b
Here the opposite is the case: selection-first moves the return definition far from the subsequent code that uses it.

So if you're going to support list comprehensions, I'd rather have a monadic do-style notation, which lets you chain them and again places select last:

https://wiki.haskell.org/List_comprehension#List_monad
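A rough Python sketch of the ordering point (the data here is made up for illustration): a comprehension puts the result expression first, like SELECT, while a chained pipeline reads source-first and selects last:

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

# Comprehension: result expression first (SELECT-like), fine as a one-liner.
comp = [name for (id_, name) in rows if id_ > 1]

# Pipeline: source first, filter next, selection last (FROM ... WHERE ... SELECT).
filtered = ((id_, name) for (id_, name) in rows if id_ > 1)
selected = [name for (_, name) in filtered]

assert comp == selected == ["b", "c"]
```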


Your first example should be more like

  return source(criteria).results
In your SQL, `results` isn't the variable you're assigning to, it's the column you're reading from source.


I learned SQL before I learned set theory. While learning set theory I remember thinking "oh this notation is just SQL backwards." Afterwards I began to find SQL much harder because I realized there are so many ways to mathematically ask for the same data, but SQL servers will computationally arrive at the end differently and with very different performance. This is a minor deal if you're just doing small transactions on the database, because if you are dealing with pages of 100 objects it's trivial to hit good-enough performance benchmarks, even with a few joins.

I was first introduced to the issue of needing hyper-optimized SQL in ETL-type tasks, dealing with very large relational databases. The company switched to a non-relational database shortly after I left, and it was the first time I professionally witnessed someone make the switch and agreed that it was obviously required for them. We were dealing with very large batch operations every night, and our Fortune 500 customers expected to have the newest data and to be able to do Business Intelligence operations on the data every morning. After acquiring bigger and bigger customers, and collecting longer and longer histories of data, our DBA team had exhausted every trick to get maximum performance from SQL. I was writing BI SQL scripts against this large pool of SQL data to white-glove some high-value customers, and constantly had to ask people for help optimizing the SQL. I did this for a year at the beginning of my career, before deciding to move cities for better opportunities.

Lately, I've begun seeing the requirements of high-performance SQL again with the wave of microservice architectures. The internal dependency chain, even of what would have been a mid-size monolith project a decade ago, can be huge. If your upstream sets a KPI for response time, it's likely you'll get asked to reduce your response time if your microservice takes up more than a few percentage points of the total end-to-end time. Often, if you are using relational SQL with an ORM, you can find performance increases in your slowest queries by hand-writing the SQL. Many ORMs have a really good library for generating SQL queries that they expose to users, but almost all ORMs will allow you to write a direct SQL query or call a stored procedure. The trick to getting performance gains is to capture the SQL your ORM is generating and show it to the best SQL expert that will agree to help you. If they can write better SQL than the ORM generated, then incorporate it into your app and have the SQL expert and a security expert on the PR. You might also need to do a SQL migration to modify indexes.
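One hedged sketch of capturing the SQL a layer above actually sends, using Python's sqlite3 trace hook as a stand-in (real ORMs have their own logging or echo settings; the schema here is invented):

```python
import sqlite3

captured = []
conn = sqlite3.connect(":memory:")
# Record every statement the connection executes, whoever generated it.
conn.set_trace_callback(captured.append)

conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "ada"))
conn.commit()

# The captured statements can now be handed to someone to optimize.
assert any("CREATE TABLE users" in stmt for stmt in captured)
assert any("INSERT INTO users" in stmt for stmt in captured)
```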

So in summary, I think your experience with SQL depends heavily on your mathematical background and your professional experience. It's important to look at SQL as computational steps to reach your required data and not simply as a way to describe the data you would like the SQL server to give you.


Was this before BigQuery/Presto/Trino? To me it seems like those technologies would have been a good fit.

They don't really work with indexes but instead regular files stored in partitions (where date is typically one of them).

This means that they only have to worry about the data (e.g. dates) that you are actually querying. And they scale up to the number of CPUs that particular calculation needs. They rarely choke on big query sizes. And big tables are not really an issue as long as you query only the partitions you need.


Those technologies were brand new at the time; the discussions about the problem started in 2013. The company (I had zero input) chose a more established vendor with an older product. Given the time and the institutional customers that were trusting us with their data, I suspect any cloud-based offerings were a nonstarter, and open source felt like a liability.

Of course with 20/20 hindsight that decision is easy to criticize. I suspect their primary concerns were to minimize risk and costs while meeting our customer's requirements. Even today, making a brand new Google product or Facebook backed open source project a hard dependency would be too much risk for an established business.


> I can't remember ever finding it more difficult or backwards than anything else I use

This is the major problem. SQL looks like it is not "difficult". You don't see (as a user) all its MASSIVE, HUGE problems.

That is why:

- People barely do more than basic SQL

- People can't imagine SQL can be used for more than that, which leads to:

- Doing a lot of hacky, complex, unnecessary stuff on app code (despite the RDBMS being capable of it)

- Trying to layer something "better" in the forms of ORM

- Refusing to use advanced stuff like views, stored procedures, custom types, and the like

- Or using advanced stuff like views, stored procedures, custom types, and the like, but wrongly

- Thinking that SQL means RDBMS

- So when the RDBMS fails, it is because the RDBMS is inferior. But in fact, it is SQL that has failed (you bet the internals of the RDBMS are far more powerful than any NoSQL engine; unfortunately, they are buried forever because SQL is a bad programming interface for the true potential of the engine!)

- So dropping SQL/RDBMS for something better, like JS (seriously?)

- And they are happier with their "cloud scale" NoSQL that rarely performs better, needs major, massive hacks for queries, or reimplements ACID again, poorly, is more prone to data issues, etc.

And this is not even the start of it. If you think "it is bad to make a full app, all of its code, in the relational model", that is how much brain damage SQL has caused.

---

I can count on my fingers the number of semi-proper DBs/SQL usages in my niche (ERPs), and that is mostly mine! (For example: I use dates for dates, not strings, like many of my peers do!) And that is taking into account that I actually learned what the heck that "relational" thingy is after 20+ years of professional use.

Go figure!

P.S.: And then I go to my code and see "what the heck, I could have done this in a few lines of SQL" and "what the heck, if only SQL were well designed I could do this dozen lines of SQL in 3!"


The trial and error is the worst part.

In traditional languages, you can print iteration by iteration the intermediate result and understand if there is something wrong.

In SQL you sample output, and you keep changing the query until you think you get it right. And then 2 years later someone else finds that the query was wrong all this time.


Common Table Expressions (CTEs) do help a little, as you can query each “table” and inspect the output. Debugging a giant query with deeply nested subqueries is very painful indeed.
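A small sketch of that debugging style (Python with sqlite3; the schema and numbers are invented): name each step as a CTE, then re-point the final SELECT at whichever step you want to inspect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, region TEXT);
    INSERT INTO orders VALUES (1, 10.0, 'eu'), (2, 25.0, 'us'), (3, 5.0, 'eu');
""")

# The full query, built from named steps:
query = """
    WITH eu_orders AS (
        SELECT * FROM orders WHERE region = 'eu'
    ),
    totals AS (
        SELECT region, SUM(amount) AS total FROM eu_orders GROUP BY region
    )
    SELECT total FROM totals
"""
assert conn.execute(query).fetchall() == [(15.0,)]

# To debug, point the final SELECT at an intermediate "table" instead:
debug = query.replace("SELECT total FROM totals", "SELECT * FROM eu_orders")
assert conn.execute(debug).fetchall() == [(1, 10.0, 'eu'), (3, 5.0, 'eu')]
```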


So do table variables and temp tables.


> The trial and error is the worst part.

I don't know about anyone else, but I do this kinda naturally when writing SQL queries. Usually start with a base table, query the first 100 rows to see what the data looks like, start joining on other tables to get info I need, querying as I go to check join conditions, perhaps build out some CTEs if I need to do some more complex work, query those to check the format of the data ... And so on.

It doesn't feel that different to any other programming in that sense. Querying is printing.


> you can print iteration by iteration the intermediate result

You would not be able to do that with a multi-threaded/multi-process application.

And this is the reason why e.g. Trino/Presto is so powerful together with SQL.

Instead of telling the computer how to go by to get your result, you tell it what result you want and let it do it in the best way.

The most up-front way of telling a computer "how" is a for-loop. And SQL does not have it. It may seem limiting, but avoiding explicit for loops gives the freedom to the computer. If it sees it fit to distribute that calculation over 200 distributed CPUs it can do that. With an imperative language you need to tell the computer exactly how it should distribute it. And from there it gets really hairy.
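A toy contrast between the two styles (Python with sqlite3; tables invented): the hand-written nested loop pins down one execution strategy, while the SQL version only states the result and leaves the strategy to the engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 2.0);
""")

# Imperative: we spell out the loops, fixing one execution strategy.
users = conn.execute("SELECT id, name FROM users").fetchall()
orders = conn.execute("SELECT user_id, amount FROM orders").fetchall()
totals_loop = {}
for uid, name in users:
    for ouid, amount in orders:       # hand-written nested-loop join
        if ouid == uid:
            totals_loop[name] = totals_loop.get(name, 0) + amount

# Declarative: we state the result; the engine picks the join strategy.
totals_sql = dict(conn.execute("""
    SELECT u.name, SUM(o.amount)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
"""))

assert totals_loop == totals_sql == {'ada': 15.0, 'bob': 2.0}
```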


In development I don't need it to be multi-threaded. 1 thread is fine, as long as I can explain, step-by-step, how the calculations produced the output.


If you don't need threads in development OR production, you might as well do SELECT * from users and do the join in your imperative code.

If you need threads in production I think you will end up getting rid of your for loops anyway (or possibly, if you really want to, end up in a mutex/semaphore quagmire).

I must say, though, that there are other benefits to a declarative approach than just avoiding threading issues. But I guess it takes some getting used to.

I would say that the same "I can't step through my code" argument also goes for functional-style code.


> If you don't need threads in development OR production, you might as well do SELECT * from users and do the join in your imperative code.

Except that it most likely will be orders of magnitude slower. Most databases are very good at what they are doing.


Yes. Kind of my point too. But the OP missed the possibility to step through the code.


Sure you can: set the concurrency limit to 1. If you're debugging the logic and not some race condition then this works perfectly fine. Remember to profile afterwards though.


Trial and error is usually a bad idea in all kinds of programming.


I mean, I never build a query from front to back. Usually I build it FROM -> JOIN -> WHERE -> SELECT.


Start off with SELECT * then once the joins are working, filter * down to the essentials.


> widdling it down

Whittling. It means to carve something out of wood, with a metaphorical extension, as here, to gradually making something smaller by shaving bits of it away.


Important distinction. "Widdling" is urination.


I always thought writing SQL from scratch was the easy part. The hard part for me was coming back to my query a few weeks later


This is true for most programming languages.


That's why I try (but sometimes forget) to extensively comment my queries that have any kind of complexity :)


This doesn’t totally solve the issue of SELECT’ing first then filtering, but for complex queries I’ve found CTEs very useful (whenever the database/SQL dialect supports it).


What I usually do is start with "select *", get the joins and where clause down, then refine the select.


> I completely attribute this to SQL being difficult or "backwards" to parse. I mean backwards in the way that in SQL you start with what you want first (the SELECT) rather than what you have and widdling it down.

> The turning point for me was to just accept SQL for what it is.

Or just write PRQL and compile it to SQL

https://github.com/PRQL/prql


You may like PRQL, which gives a more composable-atoms based approach. I find it far easier than SQL.


Saying what you want first rather than what you have is evidence of the von Neumann bottleneck, or it was a sign of the times when SQL was being developed on 1970s machines.

Either way, point taken that it is not like a proof.


Covey’s “start with the end in mind” is not bad advice when building something complex. With procedural languages you do the same: you first write the signature, the parameters expected to go in and out, and then you start writing the way to achieve this.


I'm glad that the article concluded "No" to its own headline. Calling SQL "syntactic sugar" is an insult to sugar. The "helpful diagram explaining how the scoping rules work" alone should make people blanch. The language is a syntactic disaster that we've been saddled with out of habit and inertia.


Call me a masochist, but I love me some good old SQL. I find most problems with systems I have worked with are the result of management putting people who never paid attention in database classes in charge. Or maybe nobody really cares about doing a good job. In any case, my professors would have given all the legacy SQL databases that I have come across an F-.

The worst part is nobody dares change anything because it would require application code changes. Don’t blame SQL. Blame incompetent people doing SQL badly.


I didn’t “get” SQL for the first 6-8 years of professional full-stack engineering. I was chasing, along with many in my cohort, the ORM abstraction dragon we thought we needed to work on j2ee, django, or sqlalchemy teams quickly.

It’s perplexing to me now why SQL was perplexing. Even knowing about relational theory and enough DBA to pass interviews, it somehow always seemed like we should have been able to treat databases like our OOP’s.

To think of the hours I wasted. Maybe I heard in a meeting “it’s the join table” and another colleague said “just use the association table,” and I was trying the n^2 debugging approach to solve it with model classes.

The best advice I got on this made sense later: “I wouldn’t even try to manage that [dataset] myself. Just figure out how to tell a database to do it—that’s its only job.”


Understanding relational algebra helps you write better SQL the way understanding lambda calculus helps you write better JavaScript. The language is warty and but a shadow of the pristine purity underneath, but it gets the job done and the theory guides your reasoning on how to use the language.


In particular, I find the closure property of SQL very helpful and always think in terms of queries getting tables and producing tables (derived, most of the time). https://en.wikipedia.org/wiki/Closure_(mathematics)
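A minimal illustration of that closure property (Python with sqlite3; the table is invented): the result of one query is itself a table, so it can be queried again directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (1), (2), (3), (4);
""")

# A query produces a table, so it can serve as the FROM of another query:
inner = "SELECT x * 2 AS doubled FROM t"
outer = f"SELECT SUM(doubled) FROM ({inner}) AS d WHERE doubled > 4"
assert conn.execute(outer).fetchone() == (14,)   # 6 + 8
```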


I really "got" SQL about 15 years after I started using it. The explanation is very simple: until that moment, SQL was a minor component for me in a big pie. Then one day I got into the position to troubleshoot some apps with >1 TB database behind them and I figured that SQL is more than simple CRUD operations. For most full stack developers that I know, SQL is still that minor component in the back of their mind and they are usually right.


Agreed, though disagree about the database class thing. I got a C in my class. I was a DBA at the time and definitely knew my shit. That was kind of the issue. He was teaching by the book with no regard to the database engine used. I was using SQL Server and MySQL daily.

There was a fundamental disconnect. I approached a class project differently not only because I was a DBA, but I am neurodivergent. Because my design didn’t align with his expectations, I got a C.

I had a 4.0 GPA prior to that, but that made me realize I wasn’t a fit for college.

Good thing as well because I am debt free today.


I also got a C in my cs relational databases class. I also got more out of that class than just about any other in CS. I'm still kind of mad about it though. We had a group project assignment and our group had far-and-away the best project, of which I did 90% of the work. My bad grade came down to just one thing: the class required us to use Oracle and I refused on moral grounds.


I failed COBOL in college in a similar situation: I was professionally writing COBOL, my project was too divergent from what the teacher expected. I took that exam 3 times to barely pass. Good times, good times.


i took compilers and prof wanted us to write lexer code by hand for a specific language. instead i wrote a generic lexer generator and fed in the language definition to generate the lexer code, got a bad grade for not following instructions. it seems kind of funny now but at the time i was rather cross.


In my experience few CS students seem to take a relational databases course. It was optional in my program. Unfortunately, that means many of those databases were created by folks with no formal grounding in databases. Some folks pick it up on their own, but likely not as thoroughly as those who've taken a good course on the topic.


> Blame incompetent people doing SQL badly.

Everything you spoke to is around people doing relational algebra badly. That has little to do with SQL. The same mistakes would lead to the same problems if the database used, say, QUEL instead.

Perhaps it's not SQL that you love at all?


Yes, people who don't understand SQL databases can easily create "F-" situations. The problem is that people who are great at SQL databases can't do any better than a C+ because that's as good as SQL gets.


Strongly disagree, complete newbies can get up and running with SQL very quickly and it's expressive enough that experienced people can do things that no ORM would ever hope of being able to do. One criticism I have though is that whoever thought "SELECT" should come before "FROM" should be fired.


I sort of feel the same… but on the other hand, if you consider that ”delete from” also exists, it’s not completely unreasonable to first state what operation you’re about to perform on the data. Would be nice to start with the source entity name for sure. Dunno what ”select 1” would look like; I guess the ”from foo” would be optional.

Random saturday ramblings, sorry about that :-D


DELETE FROM is even worse. Accidentally/mindlessly press cmd+enter before you wrote the WHERE? Poof, data gone. Make it FROM … DELETE!

I also wish we needed to explicitly take locks. In PostgreSQL at least (I think other dialects/engines too), figuring out what operation will take what lock and potentially mean downtime is left as an exercise to the reader. Force me to write WITH EXCLUSIVE TABLE LOCK when I add a non nullable column with a default!


If you're going to run commands that modify data directly on the cli, do it in a transaction so you can roll back. Also, start with `--` to make it a comment. Once you have the correct query and someone's checked your work, go back to the beginning of the line and remove the `--` so you can run it. It's also a good idea to write a select * or count first to check that things make sense, and then start your transaction, go up and modify your select into a delete/update, check affected rows looks good again, maybe do another select or two to check that data looks like you expect, commit.
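A sketch of that workflow (Python with sqlite3 in manual-transaction mode; the schema is invented): count first, delete inside a transaction, compare affected rows, and roll back if anything looks off:

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode,
# so we manage the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE users (id INTEGER, active INTEGER);
    INSERT INTO users VALUES (1, 1), (2, 0), (3, 0);
""")

# Sanity check first: how many rows would the DELETE touch?
(count,) = conn.execute("SELECT COUNT(*) FROM users WHERE active = 0").fetchone()
assert count == 2

conn.execute("BEGIN")
cur = conn.execute("DELETE FROM users WHERE active = 0")
assert cur.rowcount == count      # affected rows match the earlier count

conn.execute("ROLLBACK")          # wrong WHERE clause? nothing is lost
assert conn.execute("SELECT COUNT(*) FROM users").fetchone() == (3,)
```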


Fun fact: if you still have a MySQL (MyISAM) database, the transaction commands return success but don't actually function, so you can't roll back.


Well, to be fair ”from foo delete” would do the same I suppose :-D Unless there’d be an explicit end to the statement to designate you really want to delete everything. Which might not be a bad idea. Or make ”where …” mandatory and bring in ”delete_all” or ”delete everything from foo” as a syntactic guardrail. This is equally implementable, whichever the order of ”delete” and ”from” would be.


The MySQL client has an --i-am-a-dummy flag that won't let you run a DELETE (or UPDATE, I think) without a WHERE.


the Google SQL variant has where required for deletes. on my first encounter of it I was just like "huh, neat", you can always use 1=1 as the predicate if needed.


If you don't SELECT first before running the DELETE query, you shouldn't be anywhere near an IT job, let alone a production database.


Good tools prevent mistakes from happening rather than blaming the user for using it wrong


And yet I agree with both. SQL is at the same time easy to get started, incredibly powerful and yet also a complete disaster. Just like JavaScript but with standards you have to pay to read and implementers take their sweet time to implement. ORMs (with the possible exception of JOOQ, whose open source version is unfortunately also very limited) are hopelessly limited and have been a drag on the development of the field ever since.


i think “disaster” is hyperbole. my brain thinks in concert with sql so it feels natural, though i will concede that the parsing/syntax order differences was an odd choice.

programmers and language users all have a mental translation layers that they use to go from conceptual->concrete using the syntax and idioms of their language of choice.

javascript doesn’t make sense in a lot of ways. calling it a disaster is hyperbole. go doesn’t make sense in a lot of ways. it’s not a disaster. unfamiliarity or disagreement with choices does not necessarily mean it’s a disaster

the only language with which i’m familiar that is a complete, unmitigated disaster is the english language, but i suppose that’s outside of the scope of this conversation


> the only language with which i’m familiar that is a complete, unmitigated disaster is the english language

Out of curiosity, are you a native speaker of english or did you learn it as a second language?

I’m a native speaker and feel sorry for anyone who has to learn all its irregularities, etc. But the few times I’ve mentioned this to anyone who had to learn it as a second language, they’d correct me and tell me it was easy. Which baffled me.

Obviously as a native speaker, I’m in no place to judge it.


SQL syntax seems like how humans communicate to me. “Can you pass me that cup from the cupboard?”


SQL syntax is only similar to natural language on the surface level. Besides, if we actually considered similarity to English to be an important criterion for a programming language, we would all still be using Cobol.


this totally depends on which language the human speaks. Chinese grammar is kind of like backwards English grammar.


native english speaker. learned german (fluent) and italian (conversational). english is an insane amalgamation which may be why it’s become near lingua franca


> (with the possible exception of JOOQ, whose open source version is unfortunately also very limited)

How is it "very" limited?


It just occurred to me, do you have wide denormalized table queries where tables represent entities more than relations?

I think if so that might explain the difference in query structure.


Yeah I maintain several BI reports and the codebase is about 20 years old, so multiple teams have taken creative liberties with the naming conventions for tables. Without intellisense I'd probably have gone insane by now.


I like it starting with the desired output columns, often when I work with a db quite a bit and have common naming conventions that’s all I need to know.

Even writing queries I think in terms of outputs then build the query from there.

I am blanking about the advantages of starting with from, people clearly share your view, but I have no idea why.

Maybe it comes from my C++ background where return type comes first, then function name, then inputs.


Ability to tab complete columns in an interactive settings.

If I’m in psql I can tab complete columns in the where clause but not in the select because I haven’t actually given any information about what I’m selecting from yet.


This feels like a tradeoff where a convenience while writing harms long-term readability?

But that is a valid usecase!


Imagine if methods on your C++ classes had to be called like method(object) instead of object.method(). While it's a meaningless syntactic difference, the latter makes autocomplete in an IDE possible.


Sure, in that very small example, if you don't care about internal visibility.

But method(object1, object2, object3, object4) is different than object.method().


I don't think visibility is important, since you want autocomplete to work both in public and private methods. It would be a crappy IDE experience if you couldn't autocomplete on private methods within other private methods.

And sure, multimethods are too complex for naive autocompletion. But it's useful for people to have tooling that can say "given an input tell me what outputs are possible" because that's how we write code.


Completely agree with everything you said

Reminds me of reading music actually… many first-time noobs, mostly out of frustration, realize how suboptimal staff notation is and seek to write something better. In the end it’s ‘good enough’ and they accept its flaws because no rewrite has proven to be a 10x better solution.


SQL seems to be the QWERTY of the data query world. It's far too entrenched.


... unless you're playing guitar, in which case tab notation works way better ;)


More like tab + already knowing the song. :)


I agree that SQL syntax is terrible, but for some potentially masochistic reason I still like writing it. It could certainly be done better though and I don’t understand how there hasn’t been more innovation in this space.


> I don’t understand how there hasn’t been more innovation in this space

I think it's simply that most businesses and investors don't register SQL as having any real problems, and especially now with a resurgent interest in SQL the idea of attempting anything novel feels too risky.

Shameless plug of one recent attempt to offer something different: XTQL https://docs.xtdb.com/intro/what-is-xtql.html


Having been forced to write queries as JSON in elastic I definitely prefer writing SQL. I would actively avoid utilizing any query language where I have to count brackets.


Handcrafting JSON is undoubtedly always a pain, but the idea with XTQL is rather that it can be easily generated from any regular programming language.

> I would actively avoid utilizing any query language where I have to count brackets

That's really an editor/tooling problem, solvable in many ways, but I guess a Python-like/Parinfer approach would be your preference? (where whitespace/indentation is significant)


True enough, tooling is key. I'm not a developer but I think SQL is a decent language that's easy to pick up. But I also like xpath, regex, and jq syntax too, so maybe I'm the weird one.


There's PRQL which is really nice. But it has pretty big downsides: it only supports querying data, not modifying it; and since it's not database-specific it doesn't support all the features of whatever database you're using. E.g. it can't access complex data types like structs/arrays in DuckDB.


PG’s blub essay isn’t especially charitable. There’s a good reason to want to stick with what you know and are comfortable with: climbing endless learning curves means not getting things done, and it isn’t simple to know ex ante what is worth the time.


Which is fine when everything is handcrafted and human-scale, but as soon as you start going down the path of machine generated SQL (both the generation and analysis thereof) the tax is non-trivial.


It's not bad enough that it's worth actually trying to overturn, and the inertia from everything speaking SQL means there's a lot less pushback if you also do that.

SQL actually killed alternatives back in the 80s (and 90s, Postgres used to use a dialect of Ingres's QUEL, the "SQL" bit of PostgreSQL specifically denotes the switch over from QUEL to SQL).


Agreed, I wish that SQL was more concise and organized more left-to-right, like shell pipelines or the F# pipe operator. But what are the alternatives? The only one I know is PRQL, and that doesn't seem to be getting popular (correct me if I'm wrong).


It is organized, especially if you write it correctly (split lines and indent). Minified, it is a nightmare, but with a proper layout it is a pleasure to read it. I just did a code review for a 20,000 LoC piece and it was quite a breeze after I improved the layout a bit.


"Disaster" may be an exaggeration, but I think the syntax could be better, proper layout or not. In particular, I find that complex SQL queries often need to be read in a "spiral" order due to SQL's attempt to resemble natural language. I find PRQL's consistent left-to-right/top-to-bottom reading order a clear improvement.


I can understand that the statement order can be awkward, such as choosing what to select before the source.

But help me with "more concise". SQL has no boilerplate, its keywords are brief, and you express the exact logic of the result set you're seeking. There are tons of SQL overlays or replacements embedded in programming languages or BI/reporting tools, and they are universally more difficult to work with than straight SQL.


There is a paper at the end of this link trail and that paper discusses everything that is wrong with SQL including concision.

https://news.ycombinator.com/item?id=39798528

et voila: https://www.cidrdb.org/cidr2024/papers/p48-neumann.pdf

See their comments on WITHIN GROUP and the redundant declaration of identifiers. As for boilerplate, SQL's turf war over plain English words as reserved names creates this interesting situation of boilerplate identifiers in SQL to disambiguate from reserved keywords.


This is nothing more than one guy's opinion dressed up to look like an academic study. He actually admits a few times that SQL features are concise. SQL definitely isn't perfect, and some vendors' implementations are better than others, but things like JOINs are pretty perfect. He considers specifying an INNER JOIN as syntactic sugar (over using WHERE equality), which seems backwards, as opposed to a more consistent practice of always using the JOIN keyword regardless of the type of join.


That one was more of a nitpick, so maybe I shouldn't have included it. What I had in mind was that I would like a few more symbols instead of words, for example value IN [1, 10] instead of value BETWEEN 1 AND 10.

Some operations on groups are also too verbose, like getting top 5 items in each group.


IN already has a different meaning. BETWEEN 1 AND 10 is exactly how a human would describe it to another person.
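For what it's worth, BETWEEN is itself sugar over a pair of comparisons, which is easy to check. A minimal sqlite sketch (the `t` table and its values are invented here):

```python
import sqlite3

# BETWEEN 1 AND 10 desugars to value >= 1 AND value <= 10 (bounds inclusive).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(0,), (1,), (5,), (10,), (11,)])

sugar = conn.execute(
    "SELECT value FROM t WHERE value BETWEEN 1 AND 10 ORDER BY value"
).fetchall()
desugared = conn.execute(
    "SELECT value FROM t WHERE value >= 1 AND value <= 10 ORDER BY value"
).fetchall()
```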



Pandas and now Polars can replace a lot of SQL. I find that many of my students find these easier to build up and debug.


Syntactic sugar, in the sense that it is a syntax that is bad for you.


My take is that SQL is a bit like C - very nice to know, but move off of it to a more "managed" environment for safety and sanity.


Definitely disagree on this one. ORMs are a leaky abstraction, they hide the wall you’re going to bump into anyway, and they bring their own gotchas on top


I worked with two frameworks so far - SQLalchemy core and a much bigger "orchestration" one (dagster / airflow equivalent). SQLalchemy got me type checking, column name checking, and DB migrations versioning with Alembic.

The bigger framework got me more static analysis and dependency graphs on top of that. Those features saved me lots of time and headaches (type checking, dependency graphs, etc). Mind you, the frameworks I worked with still allowed for raw SQL for those edge cases that still pop up.



As someone who has implemented a composable SQL generator from user-defined algebras of (arbitrary SQL) queries using relational algebra, I understand the shortcomings of SQL when viewed from an angle of a relational query language.

However, SQL is a language with many facets (DML, DDL, DCL) other than 'relational' querying. Putting on a less mathematical and more engineering mindset, SQL ingratiates me by its interface to incredibly powerful primitives difficult to find anywhere else. (I've primarily worked with Postgres SQL)

Consider the humble function; in SQL https://www.postgresql.org/docs/current/sql-createfunction.h..., one can declare the function as `stable` or `immutable` to let the runtime optimise repeated calls; as `parallel` to let the runtime consider parallelisation, as `cost ...` and `rows ...` to aid optimiser cost estimation. Imagine if one could do that in C or Javascript!
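A sketch of those markers (Postgres syntax; the function body here is invented for illustration):

```sql
-- Declaring optimiser-relevant properties on an ordinary function.
CREATE FUNCTION slugify(title text) RETURNS text AS $$
  SELECT lower(replace(title, ' ', '-'))
$$ LANGUAGE sql
  IMMUTABLE      -- same input always yields same output: calls may be folded
  PARALLEL SAFE  -- may run in parallel worker processes
  COST 5;        -- cheap, per the planner's cost model
```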

Another facet which regularly puts me in awe is the transaction isolation primitives and locking primitives offered by SQL.

I understand that as a database language, SQL necessarily has these within its specialised niche, but it seems to me these aspects of SQL as an interface to a language runtime would be equally useful in the everyday program; in all these areas of functionality, SQL is so much more advanced than nearly every other general purpose programming language.


Codd's seminal paper, A Relational Model of Data for Large Shared Data Banks, states that a language based on applied predicate calculus "would provide a yard-stick of linguistic power for all other proposed data languages". Quoting from https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf:

"1.5 Some linguistic aspects

The adoption of a relational model of data, as described above, permits the development of a universal data sub-language based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relations is in normal form. Such a language would provide a yard-stick of linguistic power for all other proposed data languages, and would itself be a strong candidate for embedding (with appropriate syntactic modification) in a variety of host languages (programming, command- or problem-oriented)."

Languages based on predicate calculus indeed seem extremely suitable for reasoning about relational data. Datalog is a well-known example. It is more directly based on predicate logic, and much simpler than SQL.
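A small sketch of that simplicity (Datalog; the facts and relation names are invented for illustration):

```prolog
% A join is just a shared variable; recursion is natural.
parent(alice, bob).
parent(bob, carol).

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).

% ?- ancestor(alice, carol).   % derivable through bob
```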


> Lest you think this is just one weird corner of the sql spec, I found this helpful diagram explaining how the scoping rules work (from Neumann and Leis, 2023)

It's an excellent diagram, it really conveys the dissonance. Incidentally I interviewed Viktor Leis on a podcast last week about the paper where it's from: https://juxt.pro/blog/sane-query-languages-podcast/

A lot of people seem to believe that LLMs or other ML methods can overcome the complexity challenges of generating SQL accurately, but I'm yet to be convinced that a database-powered AI revolution can happen without somehow bypassing SQL.


Everything is just syntactic sugar for something else. I'm syntactic sugar for the hydrogen atoms in my body.


Your comment is both needlessly dismissive, and worse, incorrect. You cannot get a human being, much less the person that is you, by recursively applying a constant set of rewrite rules on unstructured hydrogen atoms.

If you think to counter my assertion with "The standard model of particle physics and the big bang already did that, I'm here after all", then spare us both the trouble and don't reply. The particular arrangement of all known matter and energy in the universe at t=0 is not a repeatable initial condition.

Some rewriting systems are in fact Turing complete[1], and that's an interesting digression. However, it's far afield from the article's discussion of untangling the syntactic mess that is the SQL standard and bringing it closer in line with the standard expression of its semantics.

[1]: https://www.sciencedirect.com/science/article/pii/0304397592...


"constant set of rewrite rules"

We just don't know the rewrite rules. And I didn't say unstructured. And you need Carbon atoms etc. - hydrogen was just a shortcut.


And in some cases it's sugar sugar, not syntactic sugar.


Thanks for this interesting post.

Intuitively, relational algebra compresses enumeration over data: thanks to indexes, a CPU executing billions of cycles a second can feasibly and efficiently traverse and query collections of millions or billions of records in human-perceivable time.

I've been trying to think of systems communicating with each other as parts of a relational model, in the sense that we can model system behaviour as a series of events and a join is a communication between components.

I would love to talk about this with people.


Some years back I spent a weekend and built a "good enough to prove to myself" version of this in Haskell that implemented the main relational operations of projection, selection and cartesian product.

The basic idea was based on the "stream fusion" papers. So the relation was a stream in the stream fusion sense so it was pretty trivial to implement the normal relational operators in that paradigm. Changing this type of system to work on "events" as input would be pretty trivial.

The one thing I never managed to get to work was the actual "fusion" compiler hint. I kept trying variants of what he did in the paper but ghc just refused to optimise my stream/unstream and unstream/stream pairs away because it had already done some rewriting to them. I couldn't figure out how to apply the optimisation early enough to be effective.

[1] Which are a fantastic read if you're into CS whatever you think of my idea https://www.cs.tufts.edu/~nr/cs257/archive/duncan-coutts/str... and https://www.researchgate.net/publication/220802863_Rewriting...


Thank you for sharing your thoughts and ideas Sean, appreciated.

I am unfamiliar with the source material and I have recorded the paper you linked to my reading list.

From a description of "stream fusion", it reminds me of Clojure's "transducers".

It also reminds me of Kafka's Table/Stream duality.

Term rewriting is something that is really interesting to me.

Communication, protocols and Communicating sequential processes, session types are all ideas I am thinking about and trying to understand.


When working on a system that was overly split into microservices, I once wrote a leftJoin stream operation for Akka, partly to demonstrate how stupid it was that we had split our database apart and now needed to implement a database in our application using http calls, with 1000x worse performance.


> a join is a communication between components

Makes me wonder just how far people have pushed Foreign Data Wrappers in practice.


I’d much rather deal with the peculiarities of SQL than any of the attempted replacements (the ones I’ve seen in my limited experience, anyway). Elastic, for instance, and other JSON-based languages are absolutely terrible. We lost something when we stopped writing ANSI standards.

We’ve even stayed on InfluxDB’s OG versions _because of_ the SQL-like syntax; their “improved” languages are a nuclear disaster area.

SQL, despite its flaws (null != null), is pretty good enough!


Not totally convinced by the ORDER BY obstacles that the author raises..

    table('test').project('a').orderBy('b')
> That's an error, because we can't order by a column that we just projected away. Right?

assumes that 'projection' completely eliminates part of the underlying relation, but why does that have to be the case?

If a relation includes 'selected fields' and 'hidden fields', and project just 'hides' the fields it doesn't project, while orderBy can operate on either projected or hidden fields, this ends up being perfectly sound.

Even the more complex example which is translated as follows:

   translate('select a+1 as c from test order by b,c')
   =>
   table('test').project('a','b').addColumn('a+1', as='c').orderBy('b','c').project('a')
would work fine as:

   table('test')                 // selected: [a, b, ...], hidden: []
      .addColumn('a+1', as='c')  // selected: [a, b, c, ...], hidden: []
      .project('c')              // selected: [c], hidden: [a, b, ...]
      .orderBy('b','c')          // selected: [c], hidden: [a, b, ...] 
(not sure why there's a .project('a') on the end of their version)

Which is a reasonably local, algebraic transformation.
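The "project hides rather than deletes" idea above can be sketched in a few lines. A toy relation class (names, rows, and the API are all invented for illustration, not taken from the article):

```python
# Toy relational pipeline where project() keeps hidden fields around,
# so orderBy can still reference columns that were "projected away".
class Rel:
    def __init__(self, rows, selected, hidden=()):
        self.rows = list(rows)          # each row is a dict of ALL fields
        self.selected = list(selected)  # fields visible in the output
        self.hidden = list(hidden)      # fields retained for orderBy etc.

    def add_column(self, name, fn):
        rows = [{**r, name: fn(r)} for r in self.rows]
        return Rel(rows, self.selected + [name], self.hidden)

    def project(self, *names):
        # Hide, don't delete: everything not projected moves to hidden.
        hidden = [f for f in self.selected + self.hidden if f not in names]
        return Rel(self.rows, list(names), hidden)

    def order_by(self, *names):
        # May reference selected *or* hidden fields.
        rows = sorted(self.rows, key=lambda r: tuple(r[n] for n in names))
        return Rel(rows, self.selected, self.hidden)

    def output(self):
        return [tuple(r[f] for f in self.selected) for r in self.rows]

# select a+1 as c from test order by b, c
test = Rel([{"a": 2, "b": 1}, {"a": 0, "b": 1}, {"a": 5, "b": 0}], ["a", "b"])
result = (test.add_column("c", lambda r: r["a"] + 1)
              .project("c")
              .order_by("b", "c")   # "b" is hidden by now, but still usable
              .output())
```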


Relational algebra IR is implemented in MonetDB and discussed in their paper. Definitely worth reading.

Not trying to be picky, but pure relational algebra doesn't map to SQL, and IMO it's not a good idea to attempt that: relational algebra treats relations as mathematical sets (no duplicate tuples, no ordering) while SQL works on bags (and has to deal with nullability).
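The set-vs-bag (and null) mismatch is easy to observe directly. A quick sqlite sketch (table and values invented):

```python
import sqlite3

# SQL keeps duplicate rows unless you ask for DISTINCT,
# and NULL = NULL evaluates to unknown, not true.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (None,)])

bag = conn.execute("SELECT x FROM t WHERE x = 1").fetchall()           # duplicates kept
as_set = conn.execute("SELECT DISTINCT x FROM t WHERE x = 1").fetchall()
null_eq = conn.execute("SELECT x FROM t WHERE x = x").fetchall()       # NULL row drops out
```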


A few tweaks here and there and it would be nice enough for me. Most of them are actually implemented by some engines but are not part of the standard. Just changing the order of the FROM and SELECT clauses, so autocomplete can know what fields you can use, would be a nice enough change.


I tried to study C. J. Date's books to understand relational theory... suffice it to say that I got nothing from his books, except a deep irritation partly due to his absolute pedantry...

I finally learned SQL with a gentle introduction by Alan Beaulieu. I stumbled upon another book that's about the theory: Applied Mathematics for Database Professionals, by Lex de Haan and Toon Koppelaars. Maybe these authors will benevolently teach me relational theory.

But please avoid C. J. Date's books. And don't be him when writing a book or trying to explain something to another human being.


SQL is pretty good all things considered.

But I've always looked out for languages that can represent relational algebra concepts more directly. Maybe CozoaDB is close, though still immature. Any recommendations?


I highly recommend the Third Manifesto. I could link this under most posts in this thread but I'll limit myself to two.

https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

The only problem there is that you might want to use a D language, and well. You can't. There was a product called Dataphor which one can find some writeups on but, baffling though I find this, there are no robust open-source relational databases which use a D language.


I’ve been using Pandas which exposes a python slicing syntax for manipulating relational data. It also has a builtin join() function.

“select id, date from orders” is orders[["id", "date"]].

It’s meant for in-memory datasets but the syntax could be extended to work for other backends. I’m not sure if anyone is working on that.


Ibis takes the notion of a dataframe and abstracts it from SQL backends.


This is a great write up. There appear to be a few camps forming in the comments and I’m in camp “SQL is confusing and attempts to explain it in terms of relational algebra have felt inadequate to me”.

It also gives me some good follow up material to read. I’m particularly interested in that one link that forms subqueries and lateral joins in terms of a new “dependent join” operator.


Go read: Database Design for Mere Mortals.

ERD's are your friend. Learn how to generate one, and how to read it.

The relations (not relational, not algebra) are IN the design; they are IN the ERD (as a tool to visualize). Even if you're not a visual thinker, the ERD might help you find a path between two distant tables.

Needing a subquery is rare. It happens, but a lot of subqueries would be better off as joins. The moment you grasp the design of something, you're less likely to want to subquery.

Explain is your friend. Reading an explain plan is going to give you some good insight into what is going on UNDER the hood. Not only will it help you tune slow queries but it is more insight into how large queries decompose.

Lastly, there is nothing worse than having to query a badly designed DB. If you do a shit job on the first part everything else is going to be painful.


Can we call it syntactic ashtray? Because it feels like I'm sucking on a 1970s ashtray when I see or use it.

Those who have read their Spivak 2017 will know that databases are just Co-presheaves of Ologs over the Kleisli Category of the Power-Set Monad, the Identity Monad, or the Giry Monad. I would like a QL that acts like it!


I know some of these words.


Snark of grandparent aside: https://arxiv.org/abs/1102.1889 if you want to read more.


If you're interested in what it would take to put relational databases back on the rigorous footing of relational algebras, the Third Manifesto is a good place to start.

https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

I find it somewhat sad that an implementation of a database with a proper D language hasn't broken out and become a ubiquitous tool for the profession. There were some proprietary versions shortly after the manifesto's publication, but it never caught on.


I find that most people who object to SQL do not use TVFs. If you don’t have any tools to easily break down the steps of the work, of course SQL will feel like an opaque Write-only language. With TVFs you can easily iteratively add more complex steps to your query while checking your work while you build.
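For readers unfamiliar with the idea, here is a sketch of breaking a query into checkable steps with a table-valued function (Postgres syntax; the `orders` table and columns are invented for illustration):

```sql
-- Each named step can be run and inspected on its own before composing further.
CREATE FUNCTION recent_orders(cutoff date)
RETURNS TABLE (id int, total numeric) AS $$
  SELECT o.id, o.total FROM orders o WHERE o.created_at >= cutoff
$$ LANGUAGE sql STABLE;

SELECT * FROM recent_orders(DATE '2024-01-01') WHERE total > 100;
```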


What does TVF mean? I haven't been able to find anything on Google; all I get is an Indian streaming service...

"Truth value function"?


Table-valued function.


See C.J. Date's "An Introduction to Database Systems," https://www.amazon.com/Introduction-Database-Systems-8th/dp/... This is not news.


Discussions of what is/is not syntactic sugar are unapproachable for me because I cannot get past the abuse of sugar's essential functions in the tortured metaphor.


You should look at LINQ in C#/.net . The SQL-like syntax always has a function-first equivalent, that gets across this point fairly eloquently.


That diagram separating the syntactic vs. semantic layers of a SQL statement (from Neumann & Leis paper) is brilliant.


An analogy I like is - Are Perl-style regexes (used in Python, Ruby, Java, .NET, etc.) syntactic sugar for regular languages?

The answer is no, because Perl added all sorts of imperative doodads to regexes, which can’t be easily represented and executed in the automata-based paradigm. Trying to do this is like a “research paper generator” (and not in a bad way), e.g.

Derivative Based Nonbacktracking Real-World Regex Matching with Backtracking Semantics - https://dl.acm.org/doi/abs/10.1145/3591262 (2023)

That was the case until Go and Rust, which used automata-based regexes from the beginning. I don’t think users have lost much.

Purely automata-based engines are kind of pleasant to write, because almost everything is in the compiler, and not in the runtime, e.g. https://github.com/andychu/rsc-regexp/blob/master/py/README....

That is, features like ? + * really are syntactic sugar for repetition. There’s also a lot of syntax sugar around character classes like [^a], and the runtime is very small.
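The desugarings are easy to verify by checking that both forms match the same strings (a minimal sketch; patterns and inputs chosen arbitrarily for illustration):

```python
import re

inputs = ["", "a", "aa", "aaa", "b"]

def matches(pat):
    # Which inputs does this pattern fully match?
    return [s for s in inputs if re.fullmatch(pat, s)]

assert matches("a+") == matches("aa*")         # one-or-more = one, then zero-or-more
assert matches("a?") == matches("(a|)")        # optional = alternation with empty
assert matches("a{2,3}") == matches("aa|aaa")  # bounded repetition = enumeration
```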

---

Likewise, SQL seems to have so many non-relational doodads in its language design, which cause problems for implementers. In this case, I think there’s an incentive problem with SQL: It benefits vendors if their dialect is harder to re-implement. Although certainly they’ve added many useful features too in 4-5 decades!

To me a language design issue is we never really “learned” to compose languages with different paradigms:

- the set-based paradigms like relational algebra and regular languages, with

- Turing-machine like code. (and also I/O!)

We never learned polyglot programming, so each language becomes its own source of “reckless growth” – its own parochial backwater.

Both regexes and SQL should be able to “escape” to normal code, and that would greatly simplify them. This can be done both by language implementers and by application programmers, i.e. “factoring” across languages. It’s not always obvious how to do this, but it certainly it can be done more than we do it today.

---

I’d argue the same phenomenon – lack of language composition – leads to programming languages within YAML. Github Actions is nominally some kind of “declarative” scheduler specification, or graph (job -> job dependencies), but that’s not enough for many problems.

So it also has a bunch of doodads for escaping that model (to the extent it has a model).

Shell, Awk, and Make also grew many doodads (https://www.oilshell.org/blog/2016/11/14.html), which are not very well designed. They used to be declarative languages, but no longer are.

Although there is some distinction between “formerly set-based languages” like SQL and regex, and other “declarative” non-Turing-complete languages. But I think the language composition problem is approximately the same. Part of it is syntax, but a lot of it is semantics.

(copy of lobste.rs comment)



