"Low-hanging" is relative, at least from my perspective. A significant part of my work involves cleaning up structured and unstructured data.
An example: More than ten years ago a friend of mine was fascinated by the German edition of the book "A Cultural History of Physics" by Károly Simonyi. He scanned the book (600+ pages) and created a PDF with (nearly) the same layout.
Against my advice, he used Adobe tools for it instead of creating an EPUB or something like DocBook.
The PDF looks great, but the text inside is impossible to use as training data for a small LLM. The lines from the two columns are interleaved and a lot of spaces are randomly placed (which makes cleanup particularly difficult, because mathematical formulas often appear in the text itself).
After many attempts (with regex and LLMs), I gave up, rendered each page as an image, and had a large LLM extract the text.