Hacker News | _peregrine_'s comments

yeah I mean that's basically what Javi talks about in the post... if you can throw hardware at it you can scale it (ingestion scales linearly with shards)

but the post has some interesting thoughts on how you do the high-scale ingestion while also handling background merge processes, reads, etc.


definitely interesting and related


nice one


Pretty solid at SQL generation, too. Just tested it in our SQL generation benchmark: https://llm-benchmark.tinybird.live/

Not quite as good as Claude, but by far the best Qwen model so far, and 2x as fast as qwen3-235b-a22b-07-25

Specific results for qwen3-coder here: https://llm-benchmark.tinybird.live/models/qwen3-coder


According to our SQL Generation Benchmark (methodology linked in the results dash), Claude Opus 4 is the best of the popular models at SQL generation by a pretty decent margin.


Already tested Opus 4 and Sonnet 4 in our SQL Generation Benchmark (https://llm-benchmark.tinybird.live/)

Opus 4 beat all other models. It's good.


It's weird that Opus 4 is the worst at one-shot; it requires on average two attempts to generate a valid query.

If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?


Don’t talk to Opus before it’s had its coffee. Classic high-performer failure mode.


Interestingly, both Claude-3.7-Sonnet and Claude-3.5-Sonnet rank better than Claude-Sonnet-4.


yeah that surprised me too


This is a pretty interesting benchmark because it seems to break the common ordering we see with all the other benchmarks.


Yeah, SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL queries can look structurally dissimilar while still answering the prompt semantically.

There are some interesting takeaways we learned after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...
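
To make that concrete, here's a toy sketch of the idea (hypothetical schema and data, not our actual grading code) showing why comparing result sets is more robust than comparing query text:

    # Two structurally different queries can be semantically equivalent, so
    # grade what they return, not how they're written. Schema is made up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (user_id INTEGER, event_type TEXT);
        INSERT INTO events VALUES (1, 'click'), (1, 'view'), (2, 'click'), (3, 'click');
    """)

    # Query A: plain aggregation
    query_a = """
        SELECT event_type, COUNT(DISTINCT user_id) AS users
        FROM events
        GROUP BY event_type
        ORDER BY users DESC
    """

    # Query B: same answer, written with a deduplicating subquery
    query_b = """
        SELECT event_type, COUNT(*) AS users
        FROM (SELECT DISTINCT event_type, user_id FROM events)
        GROUP BY event_type
        ORDER BY users DESC
    """

    def result_set(sql):
        # Sort rows so equivalent results compare equal regardless of row order
        return sorted(conn.execute(sql).fetchall())

    print(result_set(query_a) == result_set(query_b))  # True: same semantics, different SQL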


I pay for Claude premium but actually use Grok quite a bit; the 'think' function usually gets me where I want more often than not. Odd you don't have any xAI models listed. Sure, Grok is a terrible name, but it surprises me more often. I haven't tried the $250 ChatGPT model yet though; I just don't like OpenAI's practices lately.


Not saying you're wrong about "OpenAI practices", but that's kind of a strange thing to complain about right after praising an LLM that was only recently inserting claims of "white genocide" into every other response.


For real, though.

Even if you don't care about racial politics, or even good-vs-evil or legal-vs-criminal, the fact that that entire LLM got (obviously, and ineptly) tuned to the whim of one rich individual — even if he wasn't as creepy as he is — should be a deal-breaker, shouldn't it?


Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).


“Your model has memorized all knowledge, how do you know it’s smart?”


Sonnet 3.7 > Sonnet 4? Interesting.


How does Qwen3 do on this benchmark?


Looks like this is one-shot generation, right?

I wonder how much the results would change with a more agentic flow (e.g. allowing it to see an error or run select * from the_table first).

Sonnet seems particularly good at in-session learning (e.g. correcting its own mistakes based on a linter).


Actually no, we give each model up to 3 attempts. In fact, Opus 4 failed on 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.


Interesting!

Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?


No, it's definitely interesting. It suggests that Opus 4 actually failed to write proper syntax on the first attempt, but given feedback it absolutely nailed the 2nd attempt. My takeaway is that this is great for peer-coding workflows - less "FIX IT CLAUDE".
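
For reference, the retry flow is roughly this (a simplified sketch, not the actual benchmark harness; generate_sql and run_query are hypothetical stand-ins):

    MAX_ATTEMPTS = 3

    def attempt_question(question, generate_sql, run_query):
        feedback = None
        for attempt in range(1, MAX_ATTEMPTS + 1):
            sql = generate_sql(question, feedback)  # feedback is None on the first try
            try:
                rows = run_query(sql)               # e.g. run against the database
                return {"sql": sql, "rows": rows, "attempts": attempt}
            except Exception as err:
                # Feed the error (and the failing SQL) back to the model for the next try
                feedback = f"The previous query failed with: {err}\nPrevious SQL:\n{sql}"
        return {"sql": sql, "rows": None, "attempts": MAX_ATTEMPTS, "failed": True}

So a model that "needs two attempts" is really a model that recovers well once it sees the database error.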


That's a really useful benchmark; could you add 4.1-mini?


Yeah we're always looking for new models to add


Please add GPT o3.


Noted, also feel free to add an issue to the GitHub repo: https://github.com/tinybirdco/llm-benchmark


Why is o3-mini there but not o3?


We should definitely add o3 - probably will soon. Also looking at testing the Qwen models


Did you try Sonnet 4?


It placed 10th, below claude-3.5-sonnet, GPT-4.1, and o3-mini.


Yeah, this was a surprising result. Of course, bear in mind that testing an LLM on SQL generation is pretty nuanced, so take everything with a grain of salt :)


what about o3?


We need to add it


Chance to win >$3000 in credits from devtools like Vercel, Tinybird, Dub, Resend, etc.


Since MCP Servers are installed locally, it can be a bit painful to log and analyze usage of an MCP Server you've built. My coworker built a utility to capture remote logging events from our MCP Server; it could be extended to any MCP Server. Free to use, easy to set up. It uses Tinybird to capture events and generate Prometheus endpoints for Grafana, Datadog, etc.
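
The core of it is just forwarding structured events over HTTP. A minimal sketch of the idea (not the actual utility's code; the datasource name, event fields, and TB_TOKEN env var are made up):

    # Forward a log event from an MCP server's tool handler to Tinybird's Events API.
    import json
    import os
    import time

    import requests

    TINYBIRD_URL = "https://api.tinybird.co/v0/events?name=mcp_logs"  # hypothetical datasource
    TB_TOKEN = os.environ["TB_TOKEN"]

    def log_mcp_event(tool: str, duration_ms: float, ok: bool) -> None:
        event = {
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "tool": tool,
            "duration_ms": duration_ms,
            "ok": ok,
        }
        # The Events API takes JSON/NDJSON in the request body
        requests.post(
            TINYBIRD_URL,
            headers={"Authorization": f"Bearer {TB_TOKEN}"},
            data=json.dumps(event),
            timeout=5,
        )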


This is the full, unedited transcript of our conversation with Claude, whose context-awareness is provided by a v0 Tinybird MCP Server.


I think Tinybird is a nice option here. It's sort of a managed service for ClickHouse with some other nice abstractions. For your streaming case, they have an HTTP endpoint you can stream to that accepts up to 1k EPS, and you can micro-batch events if you need to send more than that. They also have some good connectors for BigQuery, Snowflake, DynamoDB, etc.
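
Micro-batching is straightforward since the Events API accepts NDJSON - a rough sketch (the datasource name, token handling, and batch size are illustrative, not prescriptive):

    # Buffer events and flush them to Tinybird's Events API as one NDJSON request.
    import json
    import os

    import requests

    EVENTS_URL = "https://api.tinybird.co/v0/events?name=my_stream"  # hypothetical datasource
    TOKEN = os.environ["TB_TOKEN"]
    BATCH_SIZE = 500  # arbitrary; tune to your event rate

    _buffer = []

    def send(event):
        _buffer.append(event)
        if len(_buffer) >= BATCH_SIZE:
            flush()

    def flush():
        if not _buffer:
            return
        # One JSON object per line, so a single request carries the whole batch
        body = "\n".join(json.dumps(e) for e in _buffer)
        requests.post(EVENTS_URL, headers={"Authorization": f"Bearer {TOKEN}"}, data=body, timeout=10)
        _buffer.clear()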

