Hacker News | _peregrine_'s comments

yeah I mean that's basically what Javi talks about in the post... if you can throw hardware at it you can scale it (ingestion scales linearly with shards)

but the post has some interesting thoughts on how you do the high-scale ingestion while also handling background merge processes, reads, etc.


definitely interesting and related


nice one


Pretty solid at SQL generation, too. Just tested it in our SQL generation benchmark: https://llm-benchmark.tinybird.live/

Not quite as good as Claude, but by far the best Qwen model so far, and 2x as fast as qwen3-235b-a22b-07-25

Specific results for qwen3-coder here: https://llm-benchmark.tinybird.live/models/qwen3-coder


According to our SQL Generation Benchmark (methodology linked in the results dash), Claude Opus 4 is the best of the popular models at SQL generation by a pretty decent margin.


Already tested Opus 4 and Sonnet 4 in our SQL Generation Benchmark (https://llm-benchmark.tinybird.live/)

Opus 4 beat all other models. It's good.


It's weird that Opus 4 is the worst at one-shot; it requires on average two attempts to generate a valid query.

If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?


Don’t talk to Opus before it’s had its coffee. Classic high-performer failure mode.


Interestingly, both Claude-3.7-Sonnet and Claude-3.5-Sonnet rank better than Claude-Sonnet-4.


yeah that surprised me too


This is a pretty interesting benchmark because it seems to break the common ordering we see with all the other benchmarks.


Yeah, SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL queries can look structurally dissimilar while still answering the prompt semantically.

There are some interesting takeaways we learned after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...
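
To make that concrete, here's a toy sketch of the idea (hypothetical schema and data, not our actual grading code) showing why comparing result sets is more robust than comparing query text:

    # Two structurally different queries can be semantically equivalent, so
    # grade what they return, not how they're written. Schema is made up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (user_id INTEGER, event_type TEXT);
        INSERT INTO events VALUES (1, 'click'), (1, 'view'), (2, 'click'), (3, 'click');
    """)

    # Query A: plain aggregation
    query_a = """
        SELECT event_type, COUNT(DISTINCT user_id) AS users
        FROM events
        GROUP BY event_type
        ORDER BY users DESC
    """

    # Query B: same answer, written with a deduplicating subquery
    query_b = """
        SELECT event_type, COUNT(*) AS users
        FROM (SELECT DISTINCT event_type, user_id FROM events)
        GROUP BY event_type
        ORDER BY users DESC
    """

    def result_set(sql):
        # Sort rows so equivalent results compare equal regardless of row order
        return sorted(conn.execute(sql).fetchall())

    print(result_set(query_a) == result_set(query_b))  # True: same semantics, different SQL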


I pay for Claude premium but actually use Grok quite a bit; the 'think' function usually gets me where I want more often than not. Odd you don't have any xAI models listed. Sure, Grok is a terrible name, but it surprises me more often. I haven't tried the $250 ChatGPT model yet though; I just don't like OpenAI's practices lately.


Not saying you're wrong about "OpenAI practices", but that's kind of a strange thing to complain about right after praising an LLM that was only recently inserting claims of "white genocide" into every other response.


For real, though.

Even if you don't care about racial politics, or even good-vs-evil or legal-vs-criminal, the fact that that entire LLM got (obviously, and ineptly) tuned to the whim of one rich individual — even if he wasn't as creepy as he is — should be a deal-breaker, shouldn't it?


Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).


“Your model has memorized all knowledge, how do you know it’s smart?”


Sonnet 3.7 > Sonnet 4? Interesting.


How does Qwen3 do on this benchmark?


Looks like this is one-shot generation, right?

I wonder how much the results would change with a more agentic flow (e.g. allowing it to see an error or run select * from the_table first).

Sonnet seems particularly good at in-session learning (e.g. correcting its own mistakes based on a linter).


Actually no, we give each model up to 3 attempts. In fact, Opus 4 failed on 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.


Interesting!

Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?


No, it's definitely interesting. It suggests that Opus 4 actually failed to write proper syntax on the first attempt, but given feedback it absolutely nailed the 2nd attempt. My takeaway is that this is great for peer-coding workflows - less "FIX IT CLAUDE".
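
For reference, the retry flow is roughly this (a simplified sketch, not the actual benchmark harness; generate_sql and run_query are hypothetical stand-ins):

    MAX_ATTEMPTS = 3

    def attempt_question(question, generate_sql, run_query):
        feedback = None
        for attempt in range(1, MAX_ATTEMPTS + 1):
            sql = generate_sql(question, feedback)  # feedback is None on the first try
            try:
                rows = run_query(sql)               # e.g. run against the database
                return {"sql": sql, "rows": rows, "attempts": attempt}
            except Exception as err:
                # Feed the error (and the failing SQL) back to the model for the next try
                feedback = f"The previous query failed with: {err}\nPrevious SQL:\n{sql}"
        return {"sql": sql, "rows": None, "attempts": MAX_ATTEMPTS, "failed": True}

So a model that "needs two attempts" is really a model that recovers well once it sees the database error.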


That's a really useful benchmark; could you add 4.1-mini?


Yeah we're always looking for new models to add


Please add GPT o3.


Noted, also feel free to add an issue to the GitHub repo: https://github.com/tinybirdco/llm-benchmark


Why is o3-mini there but not o3?


We should definitely add o3 - probably will soon. Also looking at testing the Qwen models


Did you try Sonnet 4?


It placed 10th, below claude-3.5-sonnet, GPT-4.1, and o3-mini.


Yeah, this was a surprising result. Of course, bear in mind that testing an LLM on SQL generation is pretty nuanced, so take everything with a grain of salt :)


what about o3?


We need to add it


Chance to win >$3000 in credits from devtools like Vercel, Tinybird, Dub, Resend, etc.


Since MCP Servers are installed locally, it can be a bit painful to log and analyze usage of an MCP Server you've built. My coworker built a utility to capture remote logging events from our MCP Server; it could be extended to any MCP Server. Free to use, easy to set up. It uses Tinybird to capture events and generate Prometheus endpoints for Grafana, Datadog, etc.
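
The core of it is just forwarding structured events over HTTP. A minimal sketch of the idea (not the actual utility's code; the datasource name, event fields, and TB_TOKEN env var are made up):

    # Forward a log event from an MCP server's tool handler to Tinybird's Events API.
    import json
    import os
    import time

    import requests

    TINYBIRD_URL = "https://api.tinybird.co/v0/events?name=mcp_logs"  # hypothetical datasource
    TB_TOKEN = os.environ["TB_TOKEN"]

    def log_mcp_event(tool: str, duration_ms: float, ok: bool) -> None:
        event = {
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "tool": tool,
            "duration_ms": duration_ms,
            "ok": ok,
        }
        # The Events API takes JSON/NDJSON in the request body
        requests.post(
            TINYBIRD_URL,
            headers={"Authorization": f"Bearer {TB_TOKEN}"},
            data=json.dumps(event),
            timeout=5,
        )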


This is the full, unedited transcript of our conversation with Claude, whose context-awareness is provided by a v0 Tinybird MCP Server.


I think Tinybird is a nice option here. It's sort of a managed service for ClickHouse with some other nice abstractions. For your streaming case, they have an HTTP endpoint you can stream to that accepts up to 1k EPS, and you can micro-batch events if you need to send more than that. They also have some good connectors for BigQuery, Snowflake, DynamoDB, etc.
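
Micro-batching is straightforward since the Events API accepts NDJSON - a rough sketch (the datasource name, token handling, and batch size are illustrative, not prescriptive):

    # Buffer events and flush them to Tinybird's Events API as one NDJSON request.
    import json
    import os

    import requests

    EVENTS_URL = "https://api.tinybird.co/v0/events?name=my_stream"  # hypothetical datasource
    TOKEN = os.environ["TB_TOKEN"]
    BATCH_SIZE = 500  # arbitrary; tune to your event rate

    _buffer = []

    def send(event):
        _buffer.append(event)
        if len(_buffer) >= BATCH_SIZE:
            flush()

    def flush():
        if not _buffer:
            return
        # One JSON object per line, so a single request carries the whole batch
        body = "\n".join(json.dumps(e) for e in _buffer)
        requests.post(EVENTS_URL, headers={"Authorization": f"Bearer {TOKEN}"}, data=body, timeout=10)
        _buffer.clear()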

