
The main takeaway from this for me is that SQLite’s query planner seems to be pretty limited. It’s reliant on stuff like the order in which WHERE conditions are specified, isn’t able to use multiple indexes in queries in many cases, bails out to scans when a variety of different operations show up in queries, etc.

It might be the case that SQLite has a simpler or less sophisticated query planner than other databases like Postgres or MariaDB, but in my experience those DBs struggle a lot with query planning as well. I’ve spent many hours in the past on issues like Postgres suddenly starting to ignore an index entirely because its computed statistics about the table’s data distribution got out of balance, or having to add manual hints like STRAIGHT_JOIN to MariaDB queries to get them to run faster.

I’m guessing that this is a really hard problem since it doesn’t seem to be really “solved” by any major DB vendor I’ve seen. A lot of modern DB engines like Clickhouse tend to just work around this problem by being so fast at full table scans that they don’t even need any sophisticated indexing set up at all.



> The main takeaway from this for me is that SQLite’s query planner seems to be pretty limited.

This doesn't appear to be true at all.

The order of WHERE conditions does not matter; the order of columns in an index does.

Everything you're describing is pretty much just how indexes fundamentally work in all databases. Which is why you're saying it hasn't been "solved" by anyone.

Indexes aren't magic -- if you understand how they work as a tree, it becomes very clear what can be optimized and what can't.
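A sketch of what that means in SQLite (table and index names hypothetical): an index on (a, b) is a tree ordered by a first, then by b within each a, so it can serve a lookup on a, or on a and b together, but not on b alone:

```
CREATE TABLE t(a INTEGER, b INTEGER, c TEXT);
CREATE INDEX t_ab ON t(a, b);

-- served by the index: the tree is ordered by a first, then b
EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1 AND b = 2;

-- typically a full scan: matching b values are scattered across the tree
-- (absent SQLite's skip-scan optimization, which needs ANALYZE statistics)
EXPLAIN QUERY PLAN SELECT * FROM t WHERE b = 2;
```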

It is true that occasionally query planners get it wrong, but it's also often the case that your query was written in a non-obvious way that is equivalent in terms of its formal results, but is not idiomatic -- and making it more idiomatic means the query planner can more easily understand which indexes to use where.


(copying my reply from the other comment that said the same thing as you)

The order of conditions in a WHERE clause definitely does matter, especially in cases where the conditions are on non-indexed columns or there are CPU-intensive operations like regex matching, string ops, etc.

I just ran this test locally with a table I created that has 50 million rows:

```
» time sqlite3 test.db "select count(*) from test WHERE a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f' AND f = 'g'"
sqlite3 test.db  5.50s user 0.72s system 99% cpu 6.225 total

» time sqlite3 test.db "select count(*) from test WHERE f = 'g' AND a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f'"
sqlite3 test.db  1.51s user 0.72s system 99% cpu 2.231 total
```

The only difference is swapping the `f = 'g'` condition from last to first. That condition never matches in this query, so it's able to fail fast and skip all of the work of checking the other conditions.
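The schema and data aren't shown above, but a minimal setup along these lines (entirely assumed: six text columns, filled with a constant 'x' that passes every != check and never matches f = 'g') should reproduce the effect:

```
sqlite3 test.db <<'SQL'
CREATE TABLE test(a TEXT, b TEXT, c TEXT, d TEXT, e TEXT, f TEXT);
-- 50 million identical rows: 'x' satisfies every != condition, f = 'g' never matches
WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 50000000)
INSERT INTO test SELECT 'x', 'x', 'x', 'x', 'x', 'x' FROM n;
SQL
```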


Sorry, I should have clarified -- the order of WHERE conditions doesn't matter for whether an index is utilized. I thought that was the context of the original comment, but now I realize maybe it was unclear.

Yes, of course you can skip evaluating the remaining conditions once one side of an AND fails, and that can affect speed. That's the same short-circuit evaluation as in most programming languages.
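To check that in SQLite (hypothetical schema), EXPLAIN QUERY PLAN should report the same plan for both orderings:

```
CREATE TABLE t(a INTEGER, b INTEGER);
CREATE INDEX t_a ON t(a);

-- both orderings should yield the same plan, something like:
--   SEARCH t USING INDEX t_a (a=?)
-- (exact wording varies by SQLite version)
EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1 AND b = 2;
EXPLAIN QUERY PLAN SELECT * FROM t WHERE b = 2 AND a = 1;
```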


> A lot of modern DB engines like Clickhouse tend to just work around this problem by being so fast at full table scans that they don’t even need any sophisticated indexing set up at all.

There's only so much you can do with this approach due to how the algorithmic complexity scales as more joins are added. At some point you'll need some additional data structures to speed things up, though they may not be indexes in name (e.g. materialized views).
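As a sketch of the idea: SQLite has no native materialized views, but you can emulate one with a summary table kept in sync by triggers (all names here are hypothetical, and only the insert path is shown; UPDATE/DELETE on the base table would need analogous triggers):

```
-- summary table standing in for the materialized view
CREATE TABLE order_totals(customer_id INTEGER PRIMARY KEY, total REAL NOT NULL);

CREATE TRIGGER orders_ai AFTER INSERT ON orders
BEGIN
  -- ensure a row exists for the customer, then fold in the new amount
  INSERT OR IGNORE INTO order_totals(customer_id, total) VALUES (NEW.customer_id, 0);
  UPDATE order_totals SET total = total + NEW.amount
   WHERE customer_id = NEW.customer_id;
END;
```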


Clickhouse isn’t fast at table scans, it’s just columnar. Indexes are basically a maintained transform from row storage to column storage; columnar databases are essentially already “indexed” by their nature (and they auto-apply some additional indexes on top, like zone maps). It’s only fast for table scans in the sense that you probably aren’t doing a select * from table, so it’s only iterating over a few columns of data, whereas SQLite would end up iterating over literally everything — a table scan doesn’t really mean the same thing between the two (a columnar database’s worst fear is selecting every column; a row-based database wants to avoid selecting every row).

Their problem is instead that getting back to a row, even within a table, is essentially a join. Which is why they fundamentally suck at point lookups, and they strongly favor analytic queries that largely work column-wise.


Columnar databases are not "already 'indexed'". Their advantage instead comes from their ability to load only the relevant parts of rows when doing scans.


"The indexes are the database" is a common perspective in column database implementations because it works quite well for ad hoc OLAP.


They’re indexed in the sense that they’re already halfway to the structure of an index — which is why they’re happy to toss indexes on top arbitrarily, instead of demanding that the user manage a minimal subset.


What does it even mean to be "halfway" to the structure of an index? Do they allow filtering a subset of rows with a complexity that's less than linear in the total number of rows or not?


An index in a row-based database is a column-wise copy of the data, with mechanisms to skip forward during scanning. You maintain a separate copy of the column to support this, which makes indexes expensive, and thus the DBA is asked to maintain a minimal subset.

A columnar database’s index is simply laid out on top of the column data. If the column is the key, then it’s sorted by definition, and no index is really required outside of maybe a zone map, because you can binary search. A non-key column gets a zone map / skip index laid out on top, which is cheap to maintain… because it’s already a column-wise slice of the data.

You don’t often add indexes to an OLAP system because every column is indexed by default — the indexing is cheap to maintain there, since you don’t need a separate column-wise copy of the data: the storage already is a column-wise copy of the data.
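To make the zone map idea concrete, here is a toy model in plain SQL (layout entirely hypothetical): store a min/max per block of rows, and consult it to decide which blocks a filter can skip:

```
-- one row per block of stored column values
CREATE TABLE zone_map(block_id INTEGER PRIMARY KEY, min_x INTEGER, max_x INTEGER);

-- to evaluate "x = 42", read only blocks whose [min, max] range could contain 42;
-- every other block is skipped without touching its column data
SELECT block_id FROM zone_map WHERE min_x <= 42 AND max_x >= 42;
```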


> A non-key column gets a zone map / skip index laid out on top, which is cheap to maintain… because it’s already a column-wise slice of the data.

I don't see how that's different from storing a traditional index. You can't just lay it on top of the column, because the column is stored in a different order than what the index wants.


Zone map / skip indexes don’t require sorting, still provide significantly improved searching over full table scans, and are typically applied to every column by default. Sorting is even better, just at the cost of a second copy of the dataset.

In a row-based rdbms, any indexing whatsoever is a copy of the column-data, so you might as well store it sorted every time. It’s not inherent to the definition.


> Zonemap / skip indexes don’t require sorting

That's still a separate index though, no? It's not intrinsic to the column storage itself, although I guess it works best with it if you end up having to do a full scan of the column section anyway.

> Sorting is even better, just at the cost of a second copy of the dataset.
> ...
> In a row-based rdbms, any indexing whatsoever is a copy of the column-data

So the same thing, no?


I’m not saying columnar databases don’t have indexes; I’m saying that they get to have indexes for cheap (and importantly: without maintaining a separate copy of the data being indexed), to the point that every column is indexed by definition. It’s a separate data structure, but it’s not a separate db object exposed to the user — it’s just part of the definition.

> So the same thing, no?

Consider it like this: for a given filtered query, row-based storage is doing a table scan if no index exists. There is no middle ground. Say 0% value or 100%.

A columnar database’s baseline is a decent index, and if there’s a sorted index then even better. Say 60% value vs 100%.

The relative importance of having a separate, explicit, sorted index is much lower in a columnar database, because the baseline is different. (Although maintaining extra sorted indexes in a columnar database is much more expensive — you basically have to keep a second copy of the entire table, sorted on the new key(s).)


Just to clarify one thing: the order of WHERE conditions in a query does not matter. The order of columns in an index does.


It definitely does matter, especially in cases where the conditions are on non-indexed columns or there are CPU-intensive search operations like regex, string ops, etc.

I just ran this test locally with a table I created that has 50 million rows:

```
» time sqlite3 test.db "select count(*) from test WHERE a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f' AND f = 'g'"
sqlite3 test.db  5.50s user 0.72s system 99% cpu 6.225 total

» time sqlite3 test.db "select count(*) from test WHERE f = 'g' AND a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f'"
sqlite3 test.db  1.51s user 0.72s system 99% cpu 2.231 total
```

The only difference is swapping the `f = 'g'` condition from last to first. That condition never matches in this query, so it's able to fail fast and skip all of the work of checking the other conditions.



