abc-1's comments

Anything that mentions Tesseract is about 10 years out of date at this point.


Quite simply, you're completely wrong. Modern Tesseract versions include an LSTM-based recognition engine. It can be deployed very affordably on CPU, yet its performance is competitive with much more expensive large GPU-based models. Especially if you handle a high volume of scans, chances are that Tesseract will give you the best bang for your buck.
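
For what it's worth, here is a minimal sketch of what that looks like from Python via pytesseract (the image path and page segmentation mode are placeholder assumptions, not anything from this thread):

    # Minimal sketch: run Tesseract's LSTM engine (--oem 1) on CPU via pytesseract.
    # Assumes the tesseract binary and language data are installed; "scan.png" is a placeholder.
    from PIL import Image
    import pytesseract

    image = Image.open("scan.png")
    # --oem 1 selects the LSTM recognizer; --psm 6 assumes a single uniform block of text.
    text = pytesseract.image_to_string(image, lang="eng", config="--oem 1 --psm 6")
    print(text)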


My company probably spent close to six figures overall creating Tesseract 5 custom models for various languages. Surya beats them all, is open source, and is considerably faster.


Surya's model weights are licensed CC BY-NC-SA 4.0. They have an exception for small companies. If your company is not small, you either need to pay them or use the models illegally.

Their training code and data are closed source. They are barely even open weight, and only the inference code is open source.


I remember that you could not train it yourself on a font like you could in older versions. Is that still the case?


5.5.0 was released last November. It is still a very active project as far as I can tell, and it runs on CPU. Even compared to the best open source GPU option it is still pretty good. VLMs work very differently and don't work as well for everything. Why is it out of date?


I don't know that that is true: https://researchify.io/blog/comparing-pytesseract-paddleocr-...

Using Surya gets you significantly better results and makes most of the work detailed in the article unnecessary.


Surya's model weights are licensed CC BY-NC-SA 4.0, so they are not free for commercial usage. Also, as far as I know, the training data is 100% unavailable. Given that they use well-trained but standard models, it isn't really open source and is barely, maybe, open weight. I kinda hate how their repo says GPL, because that is only true for the inference code. The training code is closed source.


I did not know that the training code is closed source. That is troubling.


Well, at least I can apt-get install tesseract.

That doesn't hold for any of the GPU-based solutions, last time I checked.


I just built a pipeline with tesseract last year. What's better that is open source and runnable locally?

VLM hallucination is a blocker for my use case.


If you are stuck with open source, then your options are limited.

Otherwise I'd say just use your operating system's OCR API. Both Windows and macOS have excellent APIs for this.


How is a hallucination worse than a Tesseract error?


Because the VLM doesn't know it hallucinated. When you get a Tesseract error you can flag the OCR job for manual review.
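
To sketch what that flagging can look like in practice (a rough example using pytesseract's word-level confidence scores; the image path and the 60-point threshold are arbitrary assumptions):

    # Rough sketch: flag low-confidence Tesseract output for manual review.
    # "scan.png" and the 60.0 threshold are illustrative assumptions.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(Image.open("scan.png"),
                                     output_type=pytesseract.Output.DICT)
    # Tesseract reports -1 confidence for non-word boxes; keep real word confidences only.
    confidences = [float(c) for c in data["conf"] if float(c) >= 0]

    if not confidences or min(confidences) < 60.0:
        print("flag for manual review")
    else:
        print("accept automatically")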


Hallucinations are hard to detect unless you are a subject-matter expert. I don't have direct experience with Tesseract error detection.


The latter is more likely to get debugged.


It could hallucinate obscene language, something which is less likely with classic OCR.


Surprised these people had to iterate and run these experiments. I thought all of this was common knowledge in books. Maybe the experience predates this becoming common knowledge, but it's not uncommon to see people painfully figure something out instead of cracking open a book, then write a blog post about it as if it were some newfound knowledge.


Could you be more specific about which book and chapter talked about this problem and how the solution was different or similar?


Off the top of my head I can think of Extreme Programming Explained, The Art of Agile Development, Software Teaming, and The Pragmatic Programmer.


Not surprising. They’re almost assuredly trained on reddit data. We should probably call this “the reddit simp bias”.


To be honest, I am not sure where this bias comes from. It might be in the Web data, but it might also be overcorrection from the alignment tuning. The LLM providers are worried that their models will generate sexist or racist remarks, so they tune them to be really sensitive towards marginalized groups. This might also explain what we see. Previous generations of LMs (BERT and friends) were mostly pro-male, and they were purely Web-based.


Patriarchal values can, at face value, seem contradictory but it all checks out.

Part of it is that we naturally have a bias to view men as "doers". We view men as more successful, yes, perhaps smarter. When we think doctor we think man; when we think lawyer we think man. Even in sex, we view men as having the position of "doing", and women as being the subject, with sex being something done to them.

But men are also "doers" of violence, of conflict. Women, conversely, are too passive and weak to be murderers or rapists. In fact, with regard to rape, because we view sex as something done by men to women, a lot of people have the bias that women cannot even be rapists.

This is why we simultaneously have these biases where we picture success as related to man, but we sentence men more harshly in criminal justice. It's not because we view men as "good", no, it's because we view them as ambitious. Then we end up with this strange situation where being a woman makes you significantly less likely to be convicted of a crime you committed, and, if you are, you are likely to get significantly less time. Men are perpetrators (active) and women are victims (passive).


Surely some of the model bias comes from targeting benchmarks like this one. It takes left-wing views as axiomatically correct and then classifies any deviation from them as harmful. For example, if the model correctly understands the true gender ratios in various professions, that is declared a "stereotype" and the model is expected to be fixed to reduce harm.

I'm not saying any specific lab does use your benchmark as a training target, but it wouldn't be surprising if they either did or had built similar in house benchmarks. Using them as a target will always yield strong biases against groups the left dislikes, such as men.


> It takes left-wing views as axiomatically correct

This is painting with such a broad brush that it's hard to take seriously. "Models should not be biased toward a particular race, sex, gender, gender expression, or creed" is actually a right-wing view; it's a line that appears often in Republican legislation. And when your model has an innate bias, attempting to correct that seems like it would be a right-wing position. Such corrections may be imperfect and swing the other way, but that's a bug in the implementation, not a condemnation of the aim.


Let's try and keep things separated:

1. The benchmark posted by the OP and the test results posted by Rozado are related but different.

2. Equal opportunity and equity (equal outcomes) are different.

Correcting LLM biases of the form shown by Rozado would absolutely be something the right supports, since such bias risks compromising equal opportunity, but this subthread is about GenderBench.

GenderBench views a model as defective if, when forced, it assumes things like an engineer is likely to be a man if no other information is given. This is a true fact about the world - a randomly sampled engineer is more likely to be a man than a woman. Stating this isn't viewed as wrong or immoral on the right, because the right doesn't care if gender ratios end up 50/50 or not as long as everyone was judged on their merits (which isn't quite the same thing as equal opportunity but is taken to be close enough in practice). The right believes that men and women are fundamentally different, and so there's no reason to expect equal outcomes should be the result of equal opportunities. Referring to an otherwise ambiguous engineer with "he" is therefore not being biased but being "based".

The left believes the opposite, because of a commitment to equity over equal opportunity. Mostly due to the belief that (a) equal outcomes are morally better than unequal outcomes, and (b) choice of words can influence people's choice of profession and thus by implication, apparently arbitrary choices in language use have a moral valence. True beliefs about the world are often described as "harmful stereotypes" in this worldview, implying either that they aren't really true or at least that stating them out loud should be taboo. Whereas to someone on the right it hardly makes sense to talk about stereotypes at all, let alone harmful ones - they would be more likely to talk about "common sense" or some other phrasing that implies a well known fact rather than some kind of illegitimate prejudice.

Rozado takes the view that LLMs having a built-in bias against men in their decision making is bad (a right-wing take), whereas GenderBench believes the model should work towards equity (a left-wing view). It says: "We categorize the behaviors we quantify based on the type of harm they cause: Outcome disparity - Outcome disparity refers to unfair differences in outcomes across genders."

Edit: s/doctor/engineer/ as in Europe/NA doctor gender ratios are almost equal, it's only globally that it's male-skewed


This bias on who is the victim versus aggressor goes back before reddit. It's the stereotype that women are weak and men are strong.


The most annoying thing about Hasan or any other streamer at any point on the political spectrum is their constant need to have "the right opinion" instead of bringing about any sort of actual change or reduction of suffering. They sit in a room, have all their meals delivered, and spout "the right opinion" to whoever their target audience is. Honestly, I don't know how they don't get bored of it. Probably all the money and attention help, along with some fake rationalization about how they're the voice of reason and light in the world.


So how is that different from a traditional media personality?


This is phrenology nonsense and it’s shocking to see people almost nodding along in the comments. This is the same kind of nonsense people spout when they say they’re great interviewers and “just know”, when actual studies show they very much do not.


In this scenario, there's a feedback loop based on whether the subjects of the paintings recommend the painter for weddings.


I think I understand your concern, but I still believe you are being a bit harsh. In this case the author is not pretending to always be right, and she seems to refrain from hard-and-fast judgments based on those perceptions.

Also, it's impossible not to form a model of others based on all those visual and behavioral cues. Better to do it consciously than to let it happen unconsciously, no? I believe conscious thoughts that one tries to describe and understand actually have less hold on one's judgment.


I disagree. The difference here is that she is not advocating for acting on her assumptions, like an interviewer does. Maturity is not letting your assumptions cloud your judgement.


Agreed. She's not actually stating anything. That is, it's so vague that you cannot pin it down; it could be interpreted in any way. You cannot make a prediction out of her "observations".


The goal of automation is to reduce suffering. Full stop. It's not to "save time". STEM types like to pretend they're stoic, cold, calculating robots, that everything is objective, and that they don't mind doing some repetitive five-minute task every day, because they saw some xkcd comic about efficiency. Maybe they pretend they don't mind simply so they can smugly post the xkcd comic every time someone new asks why they're suffering through some repetitive slog.


Exactly. Work can be fun, and there is so much to learn.


All the comments reacting with hate because they know, deep down, it's true. They're not the main character. Nobody cares about their hyper-encrypted Nix home server with the perfect firewall setup. And they're certainly not getting those hours of their life back.


I can downvote this with a clean conscience because I don't have a home server, but aside from that you almost got me 100%.


It depends. If someone complains about something and then fixes it, I’ll take them over the toxic optimist. There is nothing inherently wrong with complaining, what matters is the intent behind it. A complaint can be constructive, mean, funny, or any other number of good or bad things.



Hahaha, they're cooked. GPT-4.5 was a massive flop. GPT-4.1 is barely an improvement after over a year. Now they're grasping at straws. Anyone actually in this field who wasn't a grifter knew improvements are sigmoidal.

All the original talent has already left too.

