
How hard would it be to do some kind of OCR (maybe combined with machine learning) to “scan” anything that looks like a table into a more accessible format? It seems like there’s a lot of room for ML to improve various aspects of screen readers and other assistive technology.


I claim such a solution would be morally wrong (I am not going to comment on whether it would be good in practice or not), because LaTeX largely throws away the data it has on the table structure, and this solution tries to piece it back together afterwards. I claim a morally better solution is to have LaTeX not throw away the information it had, but I concede it's often harder to modify something big, old, and complex than to produce something new that is supposed to always work.


Reconstructing content and formatting from a PDF through OCR, machine learning, etc. may not be the best possible solution. But it would still make more information accessible to more people. So I must vehemently disagree that it would be morally wrong.


Can you elaborate on the moral principles you're employing here?


I assumed your parent was using 'morally' as it is sometimes used in mathematics: https://math.stackexchange.com/questions/1434043/sources-of-...


Morally wrong? What the actual f. The proposed approach is a perfectly reasonable and decoupled solution that would also solve the problem for troves of existing, unstructured paper documents. In fact, it likely already exists.


Tabula does this. There are Python and R wrappers; a minimal Python example is sketched below the links.

- Java: https://tabula.technology

- Python: https://github.com/chezou/tabula-py

- R: https://github.com/ropensci/tabulizer
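
For example, a minimal sketch using the Python wrapper (tabula-py, which drives the Java Tabula engine and therefore needs a Java runtime installed); the file name and output paths here are placeholders:

    import tabula

    # Ask Tabula to detect and extract every table in the PDF.
    # The result is a list of pandas DataFrames, one per detected table.
    tables = tabula.read_pdf("paper.pdf", pages="all")  # "paper.pdf" is a placeholder

    # Write each table out as CSV so it can be consumed as plain,
    # structured text rather than reconstructed from PDF layout.
    for i, df in enumerate(tables):
        df.to_csv(f"table_{i}.csv", index=False)

If all you want is CSV output, tabula.convert_into("paper.pdf", "tables.csv", output_format="csv", pages="all") does the same thing in one call.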


ABBYY FineReader does an excellent job.



