Hacker News
Show HN: Vanna AI – Open-sourced text-to-SQL in Python (vanna.ai)
33 points by zainhoda on Sept 8, 2023 | 12 comments
Hey there HN!

We've just open-sourced Vanna – a Python package that allows you to transform questions into SQL. We've leveraged LLMs to enable you to "ask" databases what you need, bypassing the need to "write" complex SQL.

Quick Overview:

- "Train" using DDL statements, documentation, or known correct SQL statements.

- "Ask" questions in natural language and receive SQL, tables, and charts in return.

- Open Source Flexibility: Swap storage mechanisms, customize LLMs, and choose your databases.

- Local or Hosted: Operate everything locally or use our hosted version for free (including complimentary LLM calls).

- Use it wherever Python is applicable. We provide code examples for integration in Jupyter, Streamlit, Slack, and more.

We would greatly appreciate your feedback, insights on issues, and contributions:

https://github.com/vanna-ai/vanna



Looks nice but I'm confused why it's advertised as if it's training anything. It's just doing retrieval from a vector db and generating a prompt with openai/LLM.


We use "train" in a colloquial sense because you're right -- it's RAG. We've tried to think about a different term but "train" seems to be the one that most data analysts resonate with.
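To make the "train is really RAG" point concrete, here is a minimal sketch. It is not Vanna's code: `ToyStore`, `embed`, and `build_prompt` are made-up names, and the word-count "embedding" stands in for whatever embedding model and vector DB a real setup would use. The point is only that "training" indexes text and "asking" retrieves the closest pieces and pastes them into a prompt; no model weights change.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real setup would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyStore:
    """Hypothetical stand-in for the vector DB; no model weights are ever updated."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def train(self, text):
        # "Training" here is just indexing the text for later retrieval.
        self.items.append((text, embed(text)))

    def retrieve(self, question, k=2):
        # Rank stored texts by similarity to the question and keep the top k.
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

    def build_prompt(self, question, k=2):
        # The retrieved texts become the context section of the LLM prompt.
        context = "\n".join(self.retrieve(question, k))
        return f"Context:\n{context}\n\nQuestion: {question}\nSQL:"


store = ToyStore()
store.train("CREATE TABLE customers (id INT, name TEXT, country TEXT)")
store.train("CREATE TABLE orders (id INT, customer_id INT, total NUMERIC)")
store.train("The total column in orders is stored in US dollars")
prompt = store.build_prompt("What is the total of all orders?", k=2)
```

With `k=2`, the prompt carries the two `orders`-related entries and leaves out the unrelated `customers` DDL, which is the whole trick.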


Just curious what kinds of models are used in your project?

https://arxiv.org/abs/2306.08891 One paper seems to suggest that a trained, specialized model can outperform an LLM on some tasks.


Thanks for the link! I'll check it out.

In the meantime, this post we wrote might be interesting for you:

https://vanna.ai/blog/ai-sql-accuracy.html


Impressive! Quick question - is it possible to generate SQL for a slight variant of SQL? My project augments standard SQL with a few new constructs.


It probably could but it would require adjusting the prompt. You’d have to override the generate_prompt function and tell it that you’re using a variant of SQL and describe the differences.
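A rough sketch of what that override might look like. `BaseTextToSQL` below is a hypothetical stand-in for the package's base class, its `generate_prompt(question, context)` signature is an assumption (the real one may take different arguments), and the `SAMPLE n ROWS` construct is invented purely for illustration.

```python
class BaseTextToSQL:
    """Hypothetical stand-in for the package's base class; the real signature may differ."""

    def generate_prompt(self, question, context):
        return (
            "You write SQL.\n"
            f"Context:\n{context}\n"
            f"Question: {question}\n"
        )


class DialectTextToSQL(BaseTextToSQL):
    # Made-up dialect description, used only to show where such notes would go.
    DIALECT_NOTES = (
        "The target database uses a variant of standard SQL.\n"
        "Extra construct: SAMPLE n ROWS may follow a FROM clause.\n"
    )

    def generate_prompt(self, question, context):
        # Reuse the default prompt and prepend the description of the dialect.
        base = super().generate_prompt(question, context)
        return self.DIALECT_NOTES + base


prompt = DialectTextToSQL().generate_prompt(
    "Show 5 random users", "CREATE TABLE users (id INT)"
)
```

Calling `super()` keeps whatever the default prompt already does, so the subclass only has to describe the differences from standard SQL.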


Is there a token limit for the train step? Like length of documentation, number of example SQL queries?


There isn't technically a limit on the storage side but it's generally better if you keep documentation to a manageable length.

You call vn.train(sql=...) on each individual SQL statement that you have.

Under the hood, the package will use the 10 most relevant SQL statement examples, the 10 most relevant pieces of documentation, and the 10 most relevant DDL statements.

If using 10 examples exceeds the (approximate) token limit for the model, it'll pare down to a smaller number that'll fit into the context limit.
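The paring-down step could be sketched roughly like this. This is not the package's actual logic: `approx_tokens` uses a crude 4-characters-per-token guess rather than a real tokenizer, and `fit_to_budget` and its default budget are made-up names and numbers for illustration.

```python
def approx_tokens(text):
    """Rough heuristic: about 4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)


def fit_to_budget(ranked_items, max_items=10, token_budget=1000):
    """Take up to max_items in relevance order, stopping before the budget is exceeded."""
    chosen, used = [], 0
    for item in ranked_items[:max_items]:
        cost = approx_tokens(item)
        if used + cost > token_budget:
            break  # the next item would overflow the context, so stop here
        chosen.append(item)
        used += cost
    return chosen


# 15 candidate examples, assumed to be sorted most-relevant first already.
examples = [f"SELECT col_{i} FROM some_table WHERE id = {i}" for i in range(15)]

roomy = fit_to_budget(examples)                   # budget is ample, so capped at 10
tight = fit_to_budget(examples, token_budget=35)  # budget only fits a few
```

Because the list is in relevance order, trimming from the end drops the least relevant examples first.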


These results are impressive, guys. I'm saving a pointer to dig deeper to glean the details. Tip of the hat for making it open.


Open source, but requires an API key?


API key is optional -- that's only if you want to use the hosted vector database. When running locally, no Vanna API key is necessary. Instead, you can put in an OpenAI API key as shown here: https://vanna.ai/docs/local.html

If you want to use a locally-hosted LLM, that's also possible by implementing the necessary abstract methods: https://vanna.ai/docs/vanna.html#open-source-and-extending
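For a feel of what "implementing the abstract methods" means, here is a generic sketch of the pattern. The class and method names (`TextToSQLBase`, `retrieve_context`, `submit_prompt`) are assumptions made up for this example, not the package's documented interface; check the linked docs for the real method list. `fake_llm` stands in for a call to a locally hosted model.

```python
from abc import ABC, abstractmethod


class TextToSQLBase(ABC):
    """Hypothetical base class; method names are illustrative, not the package's API."""

    @abstractmethod
    def retrieve_context(self, question):
        """Return a list of relevant text snippets (DDL, docs, example SQL)."""

    @abstractmethod
    def submit_prompt(self, prompt):
        """Send the prompt to an LLM and return its text response."""

    def ask(self, question):
        # Shared logic lives in the base class; subclasses only plug in the backends.
        context = "\n".join(self.retrieve_context(question))
        return self.submit_prompt(f"{context}\n\nQuestion: {question}\nSQL:")


class LocalTextToSQL(TextToSQLBase):
    def __init__(self, llm):
        self.llm = llm        # any callable taking a prompt string and returning a string
        self.snippets = []    # trivially simple in-memory "storage"

    def retrieve_context(self, question):
        return self.snippets

    def submit_prompt(self, prompt):
        return self.llm(prompt)


def fake_llm(prompt):
    # Placeholder for a local model call; always returns the same query.
    return "SELECT 1"


model = LocalTextToSQL(fake_llm)
answer = model.ask("anything")
```

Swapping in a different model or storage layer then means writing another small subclass rather than touching the shared `ask` logic.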


oh how exciting :)



