Show HN: Beyond text splitting – improved file parsing for LLMs (github.com/filimoa)
206 points by serjester on April 8, 2024 | 43 comments


My $0.02: correct chunking can improve accuracy, but it doesn't change the fact that retrieval is still a single-shot operation. I've commented on this before, so I'm repeating myself, but what RAG systems are trying to do is the equivalent of looking up some information (say, via a search engine) and hoping the correct answer happens to be in the first 5 results - not the links, but the actual excerpts from the crawled pages. You don't need many evals to figure out that this will only sometimes work. So chunking improves performance as long as the search phrase can discover the correct information, but it doesn't account for the possibility that the search itself is wrong or needs further evaluation. Add to the mix that vectorisation of the records doesn't work well for non-tokens, made-up words, foreign languages, etc., and you start to get an idea of the complexity involved. This is why more context is better, but only up to a limit.

IMHO, in most use cases, chunking optimisation strategies will not substantially improve performance. What I think might improve performance is running N search strategies with multiple variations of the search phrase and picking the best answer. But this is currently expensive and slow.
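
For what it's worth, a minimal sketch of that multi-query idea, assuming a hypothetical `llm_complete` (LLM call) and `retrieve` (vector-store search) rather than any particular library's API:

    # Minimal sketch of multi-query retrieval: generate several rephrasings of
    # the question, retrieve for each, and answer from the combined context.
    # `llm_complete` and `retrieve` are hypothetical helpers, not a real API.

    def multi_query_answer(question: str, n_variants: int = 3) -> str:
        prompt = (
            f"Rewrite the following question {n_variants} different ways, "
            f"one per line:\n{question}"
        )
        variants = [question] + llm_complete(prompt).splitlines()[:n_variants]

        chunks = []
        for q in variants:
            chunks.extend(retrieve(q, top_k=5))        # search per variant
        unique_chunks = list(dict.fromkeys(chunks))    # dedupe, keep order

        context = "\n\n".join(unique_chunks)
        return llm_complete(f"Context:\n{context}\n\nQuestion: {question}")

Every extra variant multiplies both the retrieval calls and the tokens sent to the model, which is exactly the cost problem I mean.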

Having developed a RAG platform more than a year and a half ago, I find many of these challenges strikingly familiar.


There's far more to a RAG pipeline than chunking documents; chunking is just one way to interface with a file. In our case we use query decomposition, document summaries, and chunking to achieve strong results.

You're right that chunking is just one piece of this. But without quality chunks you're either going to miss context come query time (bad chunks) or use 100x the tokens (full-file context).


Can you describe in a little more detail what your strategy is for query decomposition?


Here is a description: https://js.langchain.com/docs/use_cases/query_analysis/techn...

> When a user asks a question there is no guarantee that the relevant results can be returned with a single query. Sometimes to answer a question we need to split it into distinct sub-questions, retrieve results for each sub-question, and then answer using the cumulative context.

> For example if a user asks: “How is Web Voyager different from reflection agents”, and we have one document that explains Web Voyager and one that explains reflection agents but no document that compares the two, then we’d likely get better results by retrieving for both “What is Web Voyager” and “What are reflection agents” and combining the retrieved documents than by retrieving based on the user question directly.

> This process of splitting an input into multiple distinct sub-queries is what we refer to as query decomposition. It is also sometimes referred to as sub-query generation.
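
A rough sketch of that flow, where `llm` and `retrieve` are hypothetical stand-ins for an LLM call and a retriever rather than LangChain's actual API:

    # Rough sketch of query decomposition: split the question into standalone
    # sub-questions, retrieve for each, then answer from the combined context.
    # `llm` and `retrieve` are hypothetical stand-ins, not LangChain's API.

    def decompose(question: str) -> list[str]:
        prompt = (
            "Break the question below into the minimal set of standalone "
            "sub-questions needed to answer it, one per line:\n" + question
        )
        return [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    def answer_with_decomposition(question: str) -> str:
        sub_questions = decompose(question) or [question]
        context = []
        for sub_q in sub_questions:
            context.extend(retrieve(sub_q, top_k=4))    # retrieve per sub-question
        joined = "\n\n".join(dict.fromkeys(context))    # dedupe, keep order
        return llm(f"Context:\n{joined}\n\nQuestion: {question}")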


> What I think might improve performance is running N search strategies with multiple variations of the search phrase and picking up the best answer. But this is currently expensive and slow.

Eerily similar to Thinking, Fast and Slow, and may help explain (when combined with biological and social evolutionary theory) why people have such a strong aversion to System 2 thinking.


Ha, never thought of that. Thank you :)


It'd be funny if humanity was permanently stalled at the stage AI is currently at. Well, funny if one could watch it remotely like on The Truman Show instead of being trapped within it.


This looks great! You might be interested in surya - https://github.com/VikParuchuri/surya (I'm the author). It does OCR (much more accurate than tesseract), layout analysis, and text detection.

The OCR is slow on CPU (working on it), but faster than tesseract (CPU-only) on GPU.

You could probably replace pymupdf, tesseract, and some layout heuristics with this.

Happy to discuss more, feel free to email me (in profile).


OP: please don't poison your MIT license w/ surya's GPL license


It should be possible to call a GPL library in a separate process (surya can batch process from the CLI) and avoid GPL - ocrmypdf does this with ghostscript.
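
Something like this, where the `surya_ocr` command and `--results_dir` flag are assumptions on my part - check surya's README for the real invocation:

    # Keeping a GPL tool at arm's length: shell out to its CLI from a separate
    # process and only exchange files, never import or link against it.
    # The `surya_ocr` command and `--results_dir` flag are assumptions here.
    import json
    import subprocess
    import tempfile
    from pathlib import Path

    def ocr_via_cli(pdf_path: str) -> dict:
        with tempfile.TemporaryDirectory() as out_dir:
            subprocess.run(
                ["surya_ocr", pdf_path, "--results_dir", out_dir],
                check=True,
            )
            # Assume the tool writes one JSON results file into out_dir.
            results_file = next(Path(out_dir).glob("**/*.json"))
            return json.loads(results_file.read_text())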


Can I send a PR extending the benchmark against doctr and potentially textract? I believe these represent the SOTA for open and proprietary OCR.

The benefit is to let people evaluate surya against the open source and commercial SOTA, improving the integrity and applicability of the benchmark in a business or research setting.

There's a risk: it could make surya's benchmark look less attractive. Also, picking textract to represent the proprietary SOTA might be dicey, since it has competitors (Google Cloud OCR, Azure OCR).

Still, ranking surya against doctr, textract, and tesseract would be a really nice baseline. As a research user, business user, or open source contributor, those are the results I need to quickly understand surya's potential.


I've benchmarked against google cloud ocr, but the results are on Twitter, not the repo yet - https://twitter.com/VikParuchuri/status/1765440195124691339 . The reason I didn't benchmark against doctr is language support.


The recent Real Python podcast has some anecdotal insights from a real-world project about dealing with decades-old unstructured PDFs. https://realpython.com/podcasts/rpp/199/


Neat and timely. My biggest challenge is tables contained in PDFs.

Are there any similar projects that are lower level (for those of us not using Python)? Something in Rust that I could call out to, for example?


Very cool!

I see this in the README under the "How is this different from other layout parsers" section.

> Commercial Solutions: Requires sharing your data with a vendor.

But I also see that to use the Semantic Processing example, you have to have an OpenAI API key. Are there any plans to support locally hosted embedding models for this kind of processing?
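
For context, a locally hosted embedding model could stand in for that step; this is just a sketch with sentence-transformers, not part of this library's API:

    # Sketch of a locally hosted embedding step using sentence-transformers.
    # This is not this library's API - just what a local alternative to the
    # OpenAI embedding call could look like. The model name is one example.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # downloads once, runs locally

    def embed_chunks(chunks: list[str]) -> list[list[float]]:
        # normalize_embeddings=True makes cosine similarity a plain dot product
        return model.encode(chunks, normalize_embeddings=True).tolist()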


Relatedly, the OCR component relies on PyMuPDF, which has a license that requires releasing source code, which isn’t possible for most commercial applications. Is there any plan to move away from PyMuPDF, or is there a way to use an alternative?


FWIW PyMuPDF doesn't do OCR. It extracts embedded text from a PDF, which in some cases is either non-existent or the product of poor-quality OCR (like some random implementation from whatever the document was scanned with).

This implementation bolts on Tesseract, which IME is typically not the best available.
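
Concretely, the PyMuPDF step amounts to roughly this (the file name is a placeholder):

    # What the PyMuPDF step amounts to: pulling the text layer that is already
    # embedded in the PDF. No OCR happens here - a scanned page with no (or bad)
    # text layer returns little or nothing. "scan.pdf" is a placeholder path.
    import fitz  # PyMuPDF

    with fitz.open("scan.pdf") as doc:
        for page in doc:
            text = page.get_text("text")          # embedded text only
            if not text.strip():
                print(f"page {page.number}: no text layer, OCR would be needed")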


Author here. I'm very open to alternatives to PyMuPDF / tesseract because I agree the OCR results are suboptimal and PyMuPDF has a restrictive license. I tried the basic alternatives and found the results to be poor.


This article compares multiple solutions and recommends docTR (Apache License 2.0): https://source.opennews.org/articles/our-search-best-ocr-too...


Coming soon!


What I want is dynamic chunking - I want to search a document for a word, and then get the largest chunk that fits within my limits and contains the found word. Has anyone worked on such a thing?


    grep -C $n word document
will get you $n lines of context on either side of the matching lines.


Yeah, the idea is simple, but there are so many variations as to what makes a good chunk. If it's a program, then lines are good, but maybe you'd like to set the boundaries at block endings or something. For regular text, maybe sentences would be better than lines? Or paragraphs. And maybe the chunk shouldn't cross a section or chapter boundary. And then there are tables. With tables, the good solution would be to fit some rows, but maybe the headers should also be copied along with the rows in the middle? But if a previous chunk with the headers was already loaded, then maybe don't duplicate the headers?
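
A bare-bones version of the "expand around the hit" idea, paragraph-based and ignoring the table/header subtleties above (`max_chars` stands in for whatever limit you have):

    # Bare-bones dynamic chunking: find the paragraph containing the hit, then
    # greedily pull in neighbouring paragraphs (left first, then right) while
    # the chunk stays under the limit. Tables, code blocks, and chapter
    # boundaries would all need extra rules on top of this.

    def chunk_around(text: str, word: str, max_chars: int = 2000) -> str | None:
        paragraphs = text.split("\n\n")
        hits = [i for i, p in enumerate(paragraphs) if word in p]
        if not hits:
            return None
        lo = hi = hits[0]                         # start from the first match
        while True:
            current = "\n\n".join(paragraphs[lo:hi + 1])
            if lo > 0 and len(current) + len(paragraphs[lo - 1]) + 2 <= max_chars:
                lo -= 1
            elif hi + 1 < len(paragraphs) and len(current) + len(paragraphs[hi + 1]) + 2 <= max_chars:
                hi += 1
            else:
                return current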


Figures, too! Yeah you could write some logic essentially on top of a library like this, and tune based on optimizing for some notion of recall (grab more surrounding context) and precision (direct context around the word, e.g. only the paragraph or 5 surrounding table rows) for your specific application needs.

Using the models underlying a library like this, there's maybe room for fine-tuning as well if you have a set of documents with specific semantic boundaries that current approaches don't capture. (And you spend an hour drawing bounding boxes to make that happen).


OpenSearch, perhaps? The search query returns a list of hits (matches) with a text_entry field containing the matching excerpt from the source doc.
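
Roughly like this with the opensearch-py client; the index name, field name, and fragment size are assumptions about how the documents were indexed:

    # Sketch of an OpenSearch query that returns matching excerpts via
    # highlighting. The index name, field name, and fragment size are
    # assumptions about how the documents were indexed.
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    response = client.search(
        index="docs",
        body={
            "query": {"match": {"text_entry": "some word"}},
            "highlight": {"fields": {"text_entry": {"fragment_size": 500}}},
        },
    )
    for hit in response["hits"]["hits"]:
        print(hit["highlight"]["text_entry"])     # list of excerpts around the match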


Do you need to find the longest common substring? Because there are several methods to accomplish that.

[0]: https://en.m.wikipedia.org/wiki/Longest_common_substring


How accurate is table detection/parsing in PDFs? I found this part the most challenging, and none of the open-source PDF parsers worked well.


Author here. Optionally we implement unitable, which represents the current state of the art in table detection. Camelot / Tabula use much simpler, traditional extraction techniques.

Unitable itself has shockingly good accuracy, although we’re still working on better table detection which sometimes negatively affects results.


Is this the unitable you mentioned? https://github.com/poloclub/unitable


I've been using camelot, which builds on the lower-level Python PDF libraries, to extract tables from PDFs. Haven't tried anything exotic, but it seems to work. The tables I parse tend to be full page or the most dominant element.

https://camelot-py.readthedocs.io/en/master/

I like Camelot because it gives me back pandas dataframes. I don't want markdown; I can make that from a dataframe if needed.
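
The basic flow, for reference (the file name and page are placeholders; "lattice" assumes ruled tables, "stream" is the alternative for whitespace-separated ones):

    # Typical camelot usage: read tables from a page and work with the result
    # as a pandas DataFrame. The file name and page number are placeholders.
    import camelot

    tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")
    if tables.n:
        df = tables[0].df                  # pandas DataFrame
        print(tables[0].parsing_report)    # accuracy / whitespace diagnostics
        print(df.head())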


Have you checked Surya?


I did and I had issues when tables had mixed text and numbers.

Example:

£243,234 would be £234,

Or £243 234

Or £243,234 (correct).

Some cells weren't even detected.



worked 100% of the time for me


which software?


One thing I've noticed with pdfminer is that it can have horrible load times for some PDFs. I've seen 20-page PDFs take upwards of 45 seconds due to layout analysis. Its analysis engine is also decent, but it takes newlines into account in weird ways sometimes, especially if you're asking for vertical text analysis.
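
For anyone hitting this, the layout analysis in pdfminer.six is controlled by LAParams, so that's where the tuning happens; a rough sketch (parameter values are illustrative, the file name is a placeholder):

    # Layout analysis in pdfminer.six is driven by LAParams, so this is where
    # the speed/quality trade-off lives. Values here are illustrative, and
    # "slow.pdf" is a placeholder path.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    laparams = LAParams(
        detect_vertical=True,    # needed for vertical text, but costs time
        line_margin=0.5,         # how aggressively lines merge into text boxes
        char_margin=2.0,         # how aggressively characters merge into lines
    )
    text = extract_text("slow.pdf", laparams=laparams)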


This looks great! And incredibly timely too!

I finished watching this video today where the host and guests were discussing challenges in a RAG pipeline, and certainly chunking documents the right way is still very challenging. Video: https://www.youtube.com/watch?v=Y9qn4XGH1TI&ab_channel=Prole... .

I was already scratching my head on how I was going to tackle this challenge... It seems your library is addressing this problem.

Thanks for the good work.


Looks dope!


Folks who want to extract data from really complex documents (not just complex tables, but also checkboxes and tables spanning multiple pages) should try LLMWhisperer. https://llmwhisperer.unstract.com/


I need to spend more time figuring out how the layout detection works under the hood, but if it's not an NC (non-commercial) model then this could be really good.


Is this limited to PDFs or could the same chunking and parsing be applied to plain text, html, and other input file types?


Is this only for PDFs? Or does it support other formats too? E.g. markdown, text, docx etc.


How does this compare to LayoutLMv3? Was it trained on forms at all?



