
> It's trivial to normalise the various formats,

Ha. Ha. ha ha ha.

As someone who has pretty broadly tried to normalize a pile of books and documents I have legitimate access to: no, it is not.

You can get good results 80% of the time, usable but messy results 18% of the time, and complete garbage the remaining 2%. More effort seems to yield only marginal improvements.

98% sounds good enough for the use case suggested here.

Writing good validators for data is hard. You can be 100% sure there will be bad data in that 98%. From my own experience: I thought I had 50% of the books converted correctly, then found I still had junk data and gave up. It is not an impossible problem; I just was not motivated to fix it on my own. Working with your own copies is fine, but when you try to share them you run into legal issues that I just do not find interesting to solve.
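To make the point concrete, a validator here tends to be a pile of heuristics like the sketch below (plain stdlib Python, thresholds made up for illustration); rules like these are exactly what junk data slips past, which is why you only discover it later.

  import unicodedata

  def looks_like_junk(text: str, min_chars: int = 500) -> bool:
      """Heuristically flag text that is probably a failed conversion, not prose."""
      if len(text) < min_chars:            # truncated or empty chapter
          return True
      replacement = text.count("\ufffd")   # U+FFFD marks bad decoding
      control = sum(1 for ch in text
                    if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t")
      letters = sum(ch.isalpha() for ch in text)
      # Visible decoding damage, or mostly non-letter content, suggests garbage.
      return (replacement > 0
              or control / len(text) > 0.01
              or letters / len(text) < 0.5)

  print(looks_like_junk("\ufffd" * 600))                 # True: mojibake page
  print(looks_like_junk("plain readable prose " * 40))   # False: ordinary text

Checks like these catch the obvious failures but say nothing about subtler ones (dropped footnotes, merged paragraphs, wrong chapter order), which is where the remaining junk hides.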

Edit: my point is that I would like to share my work, but that is hard to do legally. That is the main reason I gave up.


2% garbage, if that garbage ends up in the wrong places, is more than enough to seriously degrade search result quality.

It's better than nothing, and nothing is what we currently have.


