My $0.02: correct chunking can improve accuracy, but it doesn't change the fact that retrieval is still a single-shot operation. I have commented on this before, so I am repeating myself, but what RAG is trying to do is the equivalent of looking something up via a search engine and hoping the correct answer happens to be in the first 5 results - not the links but the actual excerpts from the crawled pages. You don't need many evals to figure out that this only sometimes works. So chunking improves performance as long as the search phrase can discover the correct information, but it doesn't help when the search itself is wrong or needs further refinement. Add to the mix that vectorisation of the records does not work well for non-tokens, made-up words, foreign languages, etc, and you start to get an idea of the complexity involved. This is why more context is better, but only up to a limit.
IMHO, in most use cases, chunking optimisation strategies will not substantially improve performance. What I think might improve performance is running N search strategies with multiple variations of the search phrase and picking the best answer. But this is currently expensive and slow.
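To make that concrete, here is a minimal sketch of what I mean, where `rewrite_query` and `retrieve` are hypothetical placeholders for an LLM call and whatever search backend you use, and the rankings are merged with reciprocal rank fusion:

```python
# Sketch of multi-query retrieval: run several phrasings of the same
# question and fuse the results. `rewrite_query` and `retrieve` are
# hypothetical stand-ins for an LLM call and a vector/keyword search.
from collections import defaultdict

def multi_query_search(question: str, n_variants: int = 4, k: int = 60):
    variants = [question] + rewrite_query(question, n=n_variants - 1)
    scores = defaultdict(float)
    docs = {}
    for q in variants:
        for rank, doc in enumerate(retrieve(q, top_k=10)):
            # Reciprocal rank fusion: documents that rank well across
            # several query variants float to the top.
            scores[doc.id] += 1.0 / (k + rank + 1)
            docs[doc.id] = doc
    best = sorted(docs.values(), key=lambda d: scores[d.id], reverse=True)
    return best[:10]
```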
Having developed a RAG platform more than a year and a half ago, I find many of these challenges strikingly familiar.
There's far more to a RAG pipeline than chunking documents; chunking is just one way to interface with a file. In our case we use query decomposition, document summaries, and chunking to achieve strong results.
You're right that chunking is just one piece of this. But without quality chunks you're either going to miss context come query time (bad chunks) or use 100x the tokens (full-file context).
> When a user asks a question there is no guarantee that the relevant results can be returned with a single query. Sometimes to answer a question we need to split it into distinct sub-questions, retrieve results for each sub-question, and then answer using the cumulative context.
> For example if a user asks: “How is Web Voyager different from reflection agents”, and we have one document that explains Web Voyager and one that explains reflection agents but no document that compares the two, then we’d likely get better results by retrieving for both “What is Web Voyager” and “What are reflection agents” and combining the retrieved documents than by retrieving based on the user question directly.
> This process of splitting an input into multiple distinct sub-queries is what we refer to as query decomposition. It is also sometimes referred to as sub-query generation.
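A rough sketch of how that decomposition might look in code, with `llm` and `retrieve` as hypothetical placeholders rather than any specific library:

```python
# Sketch of query decomposition: split the user question into
# sub-questions, retrieve for each, then answer over the combined context.
# `llm` and `retrieve` are hypothetical placeholders, not a specific API.
def answer_with_decomposition(question: str) -> str:
    sub_questions = llm(
        f"Split this question into independent sub-questions, one per line:\n{question}"
    ).splitlines()
    context = []
    for sub_q in sub_questions or [question]:
        context.extend(retrieve(sub_q, top_k=5))
    combined = "\n\n".join(doc.text for doc in context)
    return llm(
        f"Answer the question using only this context:\n{combined}\n\nQuestion: {question}"
    )
```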
> What I think might improve performance is running N search strategies with multiple variations of the search phrase and picking up the best answer. But this is currently expensive and slow.
Eerily similar to Thinking, Fast and Slow, and may help explain (when combined with biological and social evolutionary theory) why people have such a strong aversion to System 2 thinking.
It'd be funny if humanity was permanently stalled at the stage AI is currently at. Well, funny if one could watch it remotely like on The Truman Show instead of being trapped within it.
This looks great! You might be interested in surya - https://github.com/VikParuchuri/surya (I'm the author). It does OCR (much more accurate than tesseract), layout analysis, and text detection.
The OCR is slow on CPU (working on it), but on GPU it's faster than tesseract (which is CPU-only).
You could probably replace pymupdf, tesseract, and some layout heuristics with this.
Happy to discuss more, feel free to email me (in profile).
It should be possible to call a GPL library in a separate process (surya can batch process from the CLI) and avoid GPL - ocrmypdf does this with ghostscript.
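Roughly something like this - the `surya_ocr` command name, flags, and output layout here are assumptions, so check the actual CLI docs:

```python
# Sketch: invoke a GPL-licensed tool as a separate process instead of
# linking to it, and read its output from disk. The command name, flags,
# and output layout below are illustrative; check surya's docs.
import json
import subprocess
import tempfile
from pathlib import Path

def ocr_via_cli(pdf_path: str) -> dict:
    with tempfile.TemporaryDirectory() as out_dir:
        subprocess.run(
            ["surya_ocr", pdf_path, "--results_dir", out_dir],
            check=True,
        )
        # Assume the CLI writes JSON results somewhere under the output dir.
        results_file = next(Path(out_dir).rglob("*.json"))
        return json.loads(results_file.read_text())
```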
Can I send a PR extending the benchmark against doctr and potentially textract? I believe these represent the SOTA for open and proprietary OCR.
The benefit is to let people evaluate surya against the open source and commercial SOTA, improving the integrity and applicability of the benchmark in a business or research setting.
There's a risk: it could make surya's benchmark look less attractive. Also, picking textract to represent the proprietary SOTA might be dicey, since it has competitors (Google Cloud OCR, Azure OCR).
Still, ranking surya alongside doctr, textract, and tesseract would be a really nice baseline. As a research user, business user, or open source contributor, those are the results I need to quickly understand surya's potential.
The recent Real Python pod has some anecdotal insights from a real-world project with respect to dealing with decades-old unstructured PDFs.
https://realpython.com/podcasts/rpp/199/
I see this in the README under the "How is this different from other layout parsers" section.
> Commercial Solutions: Requires sharing your data with a vendor.
But I also see that to use the Semantic Processing example, you have to have an OpenAI API key. Are there any plans to support locally hosted embedding models for this kind of processing?
Relatedly, the OCR component relies on PyMuPDF, which has a license that requires releasing source code, which isn’t possible for most commercial applications. Is there any plan to move away from PyMuPDF, or is there a way to use an alternative?
FWIW PyMuPDF doesn't do OCR. It extracts embedded text from a PDF, which in some cases is either non-existent or done with poor quality OCR (like some random implementation from whatever it was scanned with).
This implementation bolts on Tesseract which IME is typically not the best available.
Author here. I'm very open to alternatives to PyMuPDF / tesseract, because I agree the OCR results are suboptimal and the license is restrictive. I tried the basic alternatives and found the results to be poor.
What I want is dynamic chunking - I want to search a document for a word, and then get the largest chunk that fits within my limits and contains the found word. Has anyone worked on such a thing?
Yeah - the idea is simple - but there are so many variations as to what makes a good chunk. If it is a program - then lines are good, but maybe you'd like to set the boundaries at block endings or something. And for regular text - then maybe sentences would be better than lines? Or paragraphs. And maybe it should not go beyond a boundary for a text section or chapter. And then there might also be tables. With tables - the good solution would be to fit some rows - but maybe the headers should also be copied together with the rows in the middle? But if a previous chunk with the headers was already loaded - then maybe not duplicate the headers?
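A minimal sketch of the simplest variant (plain text, paragraph boundaries, a character budget), just to show the shape of it:

```python
# Sketch: grow a chunk outward from the paragraph containing the hit,
# adding neighbouring paragraphs until a character budget is reached.
def chunk_around(text: str, word: str, max_chars: int = 2000) -> str:
    paragraphs = text.split("\n\n")
    hit = next((i for i, p in enumerate(paragraphs) if word in p), None)
    if hit is None:
        return ""
    lo, hi = hit, hit
    while True:
        grew = False
        # Try extending backwards, then forwards, while staying under the budget.
        if lo > 0 and len("\n\n".join(paragraphs[lo - 1:hi + 1])) <= max_chars:
            lo -= 1
            grew = True
        if hi + 1 < len(paragraphs) and len("\n\n".join(paragraphs[lo:hi + 2])) <= max_chars:
            hi += 1
            grew = True
        if not grew:
            return "\n\n".join(paragraphs[lo:hi + 1])
```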
Figures, too! Yeah you could write some logic essentially on top of a library like this, and tune based on optimizing for some notion of recall (grab more surrounding context) and precision (direct context around the word, e.g. only the paragraph or 5 surrounding table rows) for your specific application needs.
Using the models underlying a library like this, there's maybe room for fine-tuning as well if you have a set of documents with specific semantic boundaries that current approaches don't capture. (And you spend an hour drawing bounding boxes to make that happen).
OpenSearch perhaps? The search query returns a list of hits (matches) with a text_entry field that has the matching excerpt from the source doc.
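Something along these lines, where the host, index, and field names are placeholders for your own setup:

```python
# Sketch: a match query plus highlighting, so each hit carries the
# matching excerpt. Host, index, and field names are placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.search(
    index="documents",
    body={
        "query": {"match": {"text_entry": "reflection agents"}},
        "highlight": {"fields": {"text_entry": {}}},
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["text_entry"], hit.get("highlight"))
```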
Author here. Optionally we implement UniTable, which represents the current state of the art in table recognition. Camelot / Tabula use much simpler, traditional extraction techniques.
Unitable itself has shockingly good accuracy, although we’re still working on better table detection which sometimes negatively affects results.
I've been using camelot, which builds on the lower-level Python PDF libraries, to extract tables from PDFs. Haven't tried anything exotic, but it seems to work. The tables I parse tend to be full page or the most dominant element.
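For reference, the basic camelot usage is only a few lines (page range and flavor depend on your PDFs; "lattice" assumes ruled tables):

```python
# Sketch: extract tables with camelot and get them as pandas DataFrames.
# "lattice" works for ruled tables; "stream" for whitespace-separated ones.
import camelot

tables = camelot.read_pdf("report.pdf", pages="1-end", flavor="lattice")
print(f"found {tables.n} tables")
df = tables[0].df  # first table as a DataFrame
```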
One thing I've noticed with pdfminer is that it can have horrible load times for some PDFs. I've seen 20-page PDFs take upwards of 45 seconds due to layout analysis. Its analysis engine is also decent, but it takes newlines into account in weird ways sometimes - especially if you're asking for vertical text analysis.
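If it helps anyone hitting the same thing, you can pass your own LAParams to tune the layout analysis, e.g. disabling vertical text detection (the parameter values here are just illustrative):

```python
# Sketch: tune pdfminer.six layout analysis via LAParams; disabling
# detect_vertical avoids the odd vertical-text grouping in some PDFs.
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

text = extract_text(
    "slow_document.pdf",
    laparams=LAParams(detect_vertical=False, line_margin=0.5),
)
```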
I finished watching this video today where the host and guests were discussing challenges in a RAG pipeline, and certainly chunking documents the right way is still very challenging. Video: https://www.youtube.com/watch?v=Y9qn4XGH1TI&ab_channel=Prole... .
I was already scratching my head on how I was going to tackle this challenge... It seems your library is addressing this problem.
Folks who want to extract data from really complex documents - not just complex tables, but also checkboxes and tables spanning multiple pages - should try LLMWhisperer. https://llmwhisperer.unstract.com/