
How hard would it be to do some kind of OCR (maybe combined with machine learning) to “scan” anything that looks like a table into a more accessible format? It seems like there’s a lot of room for ML to improve various aspects of screen readers and other assistive technology.


I claim such a solution would be morally wrong (I am not going to comment on whether it would be good in practice or not), because LaTeX largely throws away the data it has on the table structure, and this solution tries to piece it back together afterwards. I claim a morally better solution is to have LaTeX not throw away the information it had, but I concede it's often harder to modify something big, old, and complex than to produce something new that is supposed to always work.


Reconstructing content and formatting from a PDF through OCR, machine learning, etc. may not be the best possible solution. But it would still make more information accessible to more people. So I must vehemently disagree that it would be morally wrong.


Can you elaborate on the moral principles you're employing here?


I assumed your parent was using 'morally' as it is sometimes used in mathematics: https://math.stackexchange.com/questions/1434043/sources-of-...


Morally wrong? What the actual f. The proposed approach is a perfectly reasonable and decoupled solution that would also solve the problem for troves of existing, unstructured paper documents. In fact, it likely already exists.


Tabula does this. There are Python and R wrappers; a minimal Python example is sketched below the links.

- Java: https://tabula.technology

- Python: https://github.com/chezou/tabula-py

- R: https://github.com/ropensci/tabulizer
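
For example, a minimal sketch using the Python wrapper (tabula-py, which drives the Java Tabula engine and therefore needs a Java runtime installed); the file name and output paths here are placeholders:

    import tabula

    # Ask Tabula to detect and extract every table in the PDF.
    # The result is a list of pandas DataFrames, one per detected table.
    tables = tabula.read_pdf("paper.pdf", pages="all")  # "paper.pdf" is a placeholder

    # Write each table out as CSV so it can be consumed as plain,
    # structured text rather than reconstructed from PDF layout.
    for i, df in enumerate(tables):
        df.to_csv(f"table_{i}.csv", index=False)

If all you want is CSV output, tabula.convert_into("paper.pdf", "tables.csv", output_format="csv", pages="all") does the same thing in one call.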


ABBYY FineReader does an excellent job.



