* Don't use a single embedding per content item; use multiple to increase retrieval quality
Can you share some specific examples of what you mean by this? How would you process specific info types (eg: news article, or web page, or product catalogue data) this way, and how would you handle retrieval that makes the quality "better"?
*Edit: Thanks for all replies so far - yes, I am aware of splitting or chunking the data, but I'm interested in a good write-up of techniques and the pros/cons of each with examples. Eg: chunking sentences vs. paragraphs, providing context around the embedding result, asking GPT to generate questions for chunks and embedding those instead, combining interaction data (eg: purchases or clicks after search queries) with actual content data before embedding, embedding attributes around data, and so on.
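One concrete version of the "generate questions per chunk" idea, as a rough sketch assuming the pre-1.0 openai Python client (model choice and prompt wording are just placeholders):

```python
import openai

def questions_for_chunk(chunk: str, n: int = 3) -> list[str]:
    # Ask the model which questions this chunk answers; those questions get
    # embedded instead of (or alongside) the raw chunk text, so user queries
    # phrased as questions land closer in embedding space.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write {n} short questions that the following text answers, "
                       f"one per line:\n\n{chunk}",
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def embed(texts: list[str]) -> list[list[float]]:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [d["embedding"] for d in resp["data"]]
```

Each chunk then gets several vectors in the index (one per generated question), all pointing back to the same chunk id, which is one way to read "multiple embeddings per content item".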
Another approach is HyDE, where you ask the LLM to come up with a plausible (but likely wrong) answer and use the embedding of that wrong answer to find the appropriate chunk. Pretty clever.
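In code, HyDE is only a couple of calls. A rough sketch, again assuming the pre-1.0 openai client (model names are placeholders):

```python
import openai

def hyde_query_vector(question: str) -> list[float]:
    # 1. Ask the LLM for a plausible (possibly wrong) answer.
    hypothetical = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Answer briefly: {question}"}],
    ).choices[0].message.content
    # 2. Embed the hypothetical answer instead of the question; answers tend to
    #    sit closer to the relevant chunks in embedding space than questions do.
    return openai.Embedding.create(
        model="text-embedding-ada-002", input=[hypothetical]
    )["data"][0]["embedding"]
```

The returned vector is then used exactly like a normal query embedding for the similarity search.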
Presumably, they're referring to chunking up the data into discrete semantic units—smaller vectorizable subsections (e.g., paragraphs) more precisely capturing different parts of the data.
I'm curious about more sophisticated answers to this question, but the obvious approach would be to split the article or web page into sentences and do an embedding per sentence.
When I was playing around with search via embeddings (as a test I was using Vampire the Masquerade V5 sourcebooks, and asking rules questions), I got the best results -- in terms of correct answers -- by using sentence embeddings. I'd search the query against the sentence embeddings, and then retrieve more context surrounding the winning sentence(s). That context would be passed to the LLM.
It wasn't perfect, though. I'm tempted to try the avenue of having an LLM generate questions for each passage and then use those embeddings, but it sounds a bit expensive to set up given the length of the books.
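A minimal sketch of that sentence-level search plus surrounding context, assuming the sentence vectors were already computed in document order with whatever embedding API you use for the query (window size and k are arbitrary):

```python
import numpy as np

def retrieve_with_context(query_vec, sent_vecs, sentences, window=2, k=3):
    # Rank individual sentences by cosine similarity to the query...
    sims = sent_vecs @ query_vec / (
        np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    passages = []
    for i in np.argsort(-sims)[:k]:
        # ...but hand the LLM the winning sentence plus its neighbours, so it
        # sees enough surrounding context to actually answer from.
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        passages.append(" ".join(sentences[lo:hi]))
    return passages
```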
Sentence embeddings + retrieving context makes sense.
By any chance, have you done any work with indexing code using embeddings? I'd like to do something similar there, but there's no obvious notion of "sentence", especially across languages.
Probably the closest analogue is just lines of code, but breaking on newlines might split an expression in the middle, removing meaning from both halves.
I was planning on trying indexing overlapping groups of lines but haven't had time yet.
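For what it's worth, the overlapping-window idea is only a few lines; window and overlap sizes here are arbitrary and probably want tuning per language:

```python
def chunk_code(source: str, window: int = 20, overlap: int = 10):
    # Slide a fixed-size window over the file with overlap, so an expression
    # cut at one chunk boundary is still intact in the neighbouring chunk.
    lines = source.splitlines()
    step = window - overlap
    for start in range(0, max(len(lines) - overlap, 1), step):
        chunk = "\n".join(lines[start:start + window])
        if chunk.strip():
            yield start, chunk  # keep the start line so you can show context later
```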
Read the README and the code. This is a breathless pronouncement of a thin wrapper around some other wrappers. We really need to watch the hype, lest we kill the actually important stuff going on in ML.
To be fair, the author mentions in the README that the project was built to validate the idea. I think if you're just trying to validate an idea, it's totally fine to be using thin wrappers.
I think you need a small video and much clearer, simpler marketing material. For example, these don't mean anything to me: "Boxcars, CableReady, and StimulusReflex".
This may be a naive question. But is there a way to embed a chat bot like this that only queries the data we feed it, and not the universe of other stuff in gpt? Like I don’t want people using the chatbot in our product to query a good strawberry shortcake recipe. We just want them to query about data we allow it to query which is native to our business.
When you're building applications on top of LLMs, there are a number of central problems that you're trying to solve, and this is one of them. Solutions are numerous and widely variable, everything from basic regex parsing to fine-tuning validator models to new programming/modeling languages. Here are some examples:
There’s no foolproof way to do what you want at this point. You could have a separate model trained to infer whether a query is relevant to your product, and then reject the query if it’s predicted to be irrelevant. That’s not 100%, though.
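A cheap zero-shot stand-in for that idea (not a trained classifier, just asking a small model to gate the query; prompt wording and model are placeholders):

```python
import openai

def is_on_topic(query: str, product_description: str) -> bool:
    # Reject anything the cheaper model judges unrelated to the product,
    # before the query ever reaches the main answering chain.
    verdict = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"Product: {product_description}\nQuery: {query}\n"
                       "Answer YES if the query is about this product, otherwise NO.",
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")
```

As noted above, it isn't foolproof; it just raises the effort needed to get the strawberry shortcake recipe out of it.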
I don't think you can really prevent it 100%. Even OpenAI has issues with GPT responding in undesired ways (google "DAN", where people craft instructions to get responses the OpenAI team doesn't want). However, I think you can make it more difficult (as in, the person trying to misuse your chatbot will need to put in some effort to get the strawberry shortcake recipe, if you have instructed it beforehand to only give information about your product).
Look into https://github.com/NVIDIA/NeMo-Guardrails. Specific to your question, it has "topical rails" to ensure the conversation stays on a set of topics you've greenlighted.
Also takes care of jailbreaks and allows custom conversation flow templates.
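If I remember the Python API correctly (worth double-checking against the repo), wiring it in looks roughly like this; the config directory holding the config.yml and Colang files that define your greenlighted topics is assumed:

```python
from nemoguardrails import LLMRails, RailsConfig

# "./config" is assumed to contain a config.yml plus Colang files defining
# the topical rails (allowed topics and canned off-topic responses).
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

reply = rails.generate(messages=[
    {"role": "user", "content": "What do you think about the election?"}
])
print(reply["content"])  # ideally steered back to the allowed topics
```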
I'm curious how that works, as the documentation is a little under-specified. It seems like it requires specifying exact "utterances" from the user, but I don't think that can be the case -- wouldn't it be flatly useless that way? But it's not clear how to use it to, for example, disallow talking about politics. Or to disallow talking about topics unrelated to the dev's product, for that matter.
This is somewhat possible. I've created a way to chat with our company's material publicly. We used a lot of prompt engineering and custom guardrails to achieve this. However, it severely limited the length of the conversation a user can have.
This is a great idea and would love to see something like this succeed!
If I understand how all of these OpenAI-dependent apps work, none of them actually host the LLM or do any kind of heavy processing themselves. AFAIK, they're all packaging your data, submitting it to OpenAI on every request, and then repackaging the output. There's no real indexing, no real tangible thing; you have to start from scratch every time. So it's likely going to be very expensive and super slow.
For most applications, packaging all the data and submitting it to OpenAI won't be feasible due to the limited token window size.
I think the most common design pattern nowadays goes like this:
1. Chunk all your data (e.g. per paragraph of content)
2. Generate an embedding for each chunk
3. Index embeddings in a vector database
4. When a query comes in, find chunks relevant to the query (based on embedding similarity) and ONLY send the relevant chunks + query to an LLM to formulate the answer
Quickly glancing through the repository from this post, I can see that it also follows this pattern. It uses OpenAI's embedding API for step 2 and Pinecone for step 3.
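For anyone who wants the shape of it in code, here's a toy end-to-end version of those four steps, assuming the pre-1.0 openai Python client and a plain in-memory numpy index standing in for Pinecone:

```python
import numpy as np
import openai

def build_index(chunks: list[str]) -> np.ndarray:
    # Steps 1-3: chunks in, one embedding per chunk out. A real setup would
    # store these in a vector DB (Pinecone, pgvector, ...) rather than in memory.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=chunks)
    return np.array([d["embedding"] for d in resp["data"]])

def answer(query: str, chunks: list[str], index: np.ndarray, k: int = 4) -> str:
    # Step 4: embed the query, pick the k most similar chunks, and send only
    # those plus the question to the LLM.
    q = np.array(openai.Embedding.create(
        model="text-embedding-ada-002", input=[query]
    )["data"][0]["embedding"])
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(-sims)[:k])
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```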
I've seen this described as the common approach and have argued for it, but with my limited knowledge I have difficulty countering the argument that it would be best to just fine-tune the model on your own data.
I don't think it is so much the context window size, because you would chunk your data anyway. I think the counterargument is either that fine-tuning is limited by the risk of overfitting and catastrophic forgetting, or that it is cost-prohibitive. I think it is more the former. Am I on the right track with these arguments?
Another point to consider: the vector DB contains an exact version of your data and returns it verbatim, whereas a fine-tuned model will only be able to answer vaguely or by paraphrasing.
Once I dug into the fine-tuning APIs [1], I realized that the phrase "training the model on your docs" often doesn't make sense for the use case people are trying to solve. You provide hundreds of input examples and tell the model how it should complete those prompts. Fine-tuning has a lot of use cases, but "keeping the LLM generally grounded in the facts of my website" is not one of them.
> Fine-tuning has a lot of use cases, but "keeping the LLM generally grounded in the facts of my website" is not one of them.
Yes, that's what everyone says, and it makes total sense to me. I'm looking for (technical, but not too technical) arguments for why it is not possible. I'm not so much interested in the "grounded in the facts of my website" point as in the similar "take the data from my large private knowledge base into consideration" point.
In other words I don't want to restrict the knowledge the model has or the answers it gives. I want to add a considerable amount of my own knowledge. This seems not to be possible without training from scratch. The question is "Why?"
This seems to be mainly a wrapper around the OpenAI API. From the repo, they want to integrate open-source LLMs in the future too.
Lately I feel GPT-4 is superb in performance, but locked up. Using a weaker model feels better because I can just spin up a server and run it on my own. Recent Twitter/Reddit changes are a reminder that relying on others can be a bad thing.
Mentioned this in a previous reply, but this was something Convostack wanted to solve by allowing anyone to integrate their Langchain agent with a production-ready chatbot. It's completely open-source and also has pre-built React UI components. As a disclaimer, I helped work on the project, but I'm curious to hear what you guys think: https://github.com/ConvoStack/convostack
If I understand what this tool is doing, there is an important security caveat.
>providing PDF files, websites, and soon, integrations with platforms like Notion, Confluence, and Office 365.
This means that anything you feed this chatbot gets turned into data that's uploaded to OpenAI. So if you're using an internal Confluence, consider all of that data public now. We've already seen intranet pages show up in ChatGPT/OpenAI in the past.
> Use of Content to Improve Services. We do not use Content that you provide to or receive from our API ("API Content") to develop or improve our Services. We may use Content from Services other than our API ("Non-API Content") to help develop and improve our Services. You can read more here about how Non-API Content may be used to improve model performance. If you do not want your Non-API Content used to improve Services, you can opt out by filling out this form. Please note that in some cases this may limit the ability of our Services to better address your specific use case.
What was seen was intranet sites being cited by OpenAI. So maybe someone copied and pasted them into the web interface. Maybe some big multinational firms had their intranets exposed to the internet. I don't know how the data got there; heck, no one may know.
I went ahead and installed it in a Proxmox container; it was fairly easy on x64 (ARM support would be nice).
One suggestion: it would be nice to have a short-term memory - a la ChatGPT. With the token limit at 4-8k for GPT-4, it would be nice to take advantage of that with both the "long-term memory" (vector store) but also a "short-term" one (as in, sending the previous questions/answers for context).
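Something like a rolling window over past turns would probably cover it; a rough sketch where count_tokens is a placeholder (e.g. tiktoken) and the budget is arbitrary:

```python
def build_messages(history, retrieved_context, question, count_tokens, budget=3000):
    # Long-term memory: the chunks pulled from the vector store for this question.
    system = {"role": "system",
              "content": f"Answer from this context:\n{retrieved_context}"}
    # Short-term memory: previous user/assistant turns, oldest dropped first
    # whenever the running token count exceeds the budget.
    turns = list(history)
    while turns and count_tokens([system] + turns) > budget:
        turns.pop(0)
    return [system] + turns + [{"role": "user", "content": question}]
```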
"Support offline open-source models (e.g., Alpaca, LLM drivers)" is already on the roadmap, which is great! There's just so much cool stuff we can try with LLMs now…
What's the best discussion forum to exchange ideas, experiences and to collaborate on using and customizing (local) LLMs for (indy) gaming and other cool projects?
This was the goal with ConvoStack, which allows people to implement our IAgent Express.js interface with any custom Langchain agent. We saw this as an issue with other chatbot platforms, which were limiting for developers. It comes with pre-built React components, plus Redis for caching, so you can easily have a production-ready chat interface. It's completely open-source too, so it can be self-hosted and modified to your liking. As a disclaimer, I helped develop this, but I would love for you to check it out and see if it's something you're looking for.
https://github.com/ConvoStack/convostack
I've scanned the codebase and IMO the statement is mostly misleading. I think LangChain agents (which this is a thin wrapper around) do have some default compression/reranking behavior to help fit relevant context into the context window, but it's very, very, very far from actually infinite short-term memory.
I think the framing where the statement could be considered true is if you assume "memory = persistent storage", which is IMO not what most people will do.
Note: I'm not trying to imply the authors are being intentionally misleading.
A few thoughts:
* Allow custom endpoint URLs; this way people can use open-source LLMs with a fake OpenAI API backend like basaran [2] or llama-api-server [3] (rough sketch after the links below)
* Look into better embedding methods for info retrieval, like InstructorEmbeddings or Document Summary Index
* Don't use a single embedding per content item; use multiple to increase retrieval quality
[1] https://github.com/underlines/awesome-marketing-datascience/...
[2] https://github.com/hyperonym/basaran
[3] https://github.com/iaalm/llama-api-server
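On the custom endpoint suggestion above: with the pre-1.0 openai Python client that's just overriding api_base, since basaran and llama-api-server present an OpenAI-compatible API (URL and model name below are placeholders):

```python
import openai

# Point the client at a self-hosted OpenAI-compatible server instead of api.openai.com.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "not-needed-locally"

resp = openai.Completion.create(model="local-model", prompt="Hello", max_tokens=32)
print(resp.choices[0].text)
```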