
I'm sorry, this article reads like AI slop.

It has all the hallmarks: grandiose writing ("everything changed"), the classic "It wasn't X, it was Y" (about five times in the first minute of reading the article), undue emphasis on symbolism...

All those indicators are coupled with an unpleasant level of obsession with how statistical models are "a new kind of mind"—even after claiming to "strip away the hype cycle".


Regularization as a concept is taught in introductory ML classes. A simple example is L2 regularization: you add to your loss function the sum of squares of the parameters (times some constant k). The parameters then have to trade off between fitting the training data and keeping that penalty small--which (hopefully!) reduces overfitting.
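
A minimal sketch of that loss, assuming a mean-squared-error fit term (the function name and the value of k are illustrative, not from any particular model):

  import numpy as np

  def loss_with_l2(predictions, targets, params, k=0.01):
      # Fit term: how well the model matches the training data.
      data_loss = np.mean((predictions - targets) ** 2)
      # Penalty term: k times the sum of squared parameters.
      l2_penalty = k * np.sum(params ** 2)
      return data_loss + l2_penalty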

The specific regularization techniques that any one model is trained with may not be publicly revealed, but OAI hardly deserves credit for the concept.


We absolutely should treat it just like any other software tool.


Let's distinguish between papers and preprints, please. arXiv has contributed to a blurring of the distinction. The arXiv preprints are useful but should always be taken with a grain of salt. There is nearly no filtering done on things uploaded to arXiv.

Everyone accessing someone's uncritically reviewed work is a bittersweet gift.


In mathematics, at least, papers and preprints are indeed widely considered to be the same thing. In practice, for people working in the field, they are.

Math papers tend to be highly technical, read by other specialists in the field. When it comes to correctness -- whether or not I should take a paper with a grain of salt -- the authors' reputation counts for much more than the journal's. And in the case of student authors, who are just beginning to publish, the advisor is implicitly staking their reputation on the work as well.

There are also preprints on the arXiv, written by people unknown in the community, claiming to prove the Riemann Hypothesis or some such. These aren't taken seriously by anyone.

An outsider might not be able to tell which preprints can be considered equivalent to papers, but such people are not likely to be seriously reading math research in the first place.


You can always overlay a reputation system on top of your pre-print server.

The informal one you describe here, or any formal one you can come up with.


Arxiv has been working just fine for a long time; there's no need to change it. Besides, I'm not going to voluntarily post my work so I can get publicly rated by a bunch of unknowns lol.


You're thinking of social-media-type "reputation".

Instead, think of the goal being to associate measures of worth with the reviewers. If you're publicly rated by a bunch of worthwhile people, count yourself lucky.


> Arxiv has been working just fine for a long time, there's no need to change it.

Exactly, that's why I am not suggesting any change to Arxiv.

Think more of people, e.g., submitting Arxiv URLs to Hacker News for what I have in mind. Or discussing Arxiv submissions on a forum or in a wiki, etc. You can imagine some specialised software that has better support specifically for material from Arxiv.

That's what I mean by 'overlay'.

Or think of Slatestarcodex publishing a blog post with links to his favourite Arxiv papers for that month. That's pretty much equivalent to what a journal produces. And if Slatestarcodex compiles that link list by doing some peer review and chatting with the authors of the papers, that's almost exactly what the journal does.


Yes. For example, here is a paper by some Cornell people where they reinvent machine learning model evaluation, with the only motivation I can discern being hubris and self-service:

https://browse.arxiv.org/pdf/2310.02335.pdf

Do not trust arxiv papers. They have not been vetted.


> Everyone accessing someone's uncritically reviewed work is a bittersweet gift.

Review work is not always done by senior researchers (e.g., professors). Senior researchers often hand it down to PhD students. Having 3 to 4 reviews by nice junior reviewers doesn't sound very critical.


Just to be clear: you'd expect PhD students to be trained in reviewing by their supervisors.

So PhD students writing the initial review is not weird - it is an expected part of their training. As is the supervisor going over the review and providing constructive feedback. As is the review being submitted under the supervisor's responsibility, with credits (mention in proceedings) to the student for acting as a subreviewer.

Yes, there are ways to abuse this system and yes, abuses do occur. Any system for gaining job prestige or workload reduction is a target for gaming. This doesn't mean the system should be trashed, but it does warrant additional safeguards to curb excesses.


If a late-stage PhD student in the same narrow technical field can't review the paper, then it's almost certainly a problem with the paper. After all, junior people are the primary audience for any paper. Also, PhD students often have more depth on their research topic than the professors.

The sibling comments about making sure that most reviews are written by senior researchers also make good points. That should be checked by the program committee or editor.


They have to say they did this, and you are forgetting the editor's role in paper evaluation. This criticism can be, and is, taken into account, and you can send papers out for more reviews if you get conflicting ones. In my experience as an editor, junior people typically give better reviews than senior ones (unless the senior reviewer is emeritus and so has unlimited time). I suppose this has to do with the junior person's lower confidence, which leads them to question their own review.


Arxiv paper quality is better than the average journal paper's quality. Because publishing on Arxiv doesn't count as a paper on a resume in many places, there are far fewer people who publish just to pad their resumes.


It’s how science worked for 3 centuries before the current review system was instituted just a generation ago.


Let's do a quick analogy: arxiv = github. It's all collaborative writing, right? You publish data, code, and your paper continuously. Then you have releases. Perhaps they get tagged with which publication venues accepted them.


I'm confused. Do you accept published papers as gospel? They should be taken with a grain of salt too.


Depends on the field, certainly. A paper in the Annals of Mathematics is definitely a lot more rock solid than whatever goes up on the arXiv, or than reviewed papers in certain fields that are particular magnets for junk science.


Funny you should mention Annals. A journal famous for publishing two papers in three years by the same author, one proving some theorem, and the other disproving the theorem. Sure, tons of other journals have done so, but Annals is definitely the highest profile one. Maybe take a look at https://mathoverflow.net/questions/282742/endless-controvers... or https://mathoverflow.net/questions/35468/widely-accepted-mat... It's also a nice way to pad your CV if you manage to get the wrong theorem published - you get two Annals papers for the price of one.

It is of course true that published papers have been vetted. But very often, that simply means that (1) an editor glanced at it, (2) optionally, a peer provided a quick positive opinion on the paper without checking its correctness, and (3) one or two independent referees presumably read the paper and produced a report on it. It's not nothing, but it doesn't mean you should blindly accept everything published as truth.

For context, I'm an associate professor of mathematics at a large research university.


The way I look at it, we've passed the point where there are so many people publishing that no one can read all the papers in their field any more.

Peer review is the first filter that papers go through. It's not perfect (it makes mistakes in both directions), but the output of the peer review process definitely has a higher signal to noise ratio than the input.


This repo is a joke, right? I'd be embarrassed peddling this as AGI.

We're a long way from AGI existing at all. Even if you disagree, it's widely agreed that we're not there yet. For this repo to call its offering AGI is an even more dramatic mischaracterization than saying I'm an Olympic sprinter because I went for a morning jog.

What can I say? Hucksters gonna huckster.


Hello liliumregale, thank you for your brilliant and constructive criticism. For version 2 my repo is going to contain a REAL AGI!! Then you are going to be able to code your only friend.

Take care


indeed.

even more so, it is already a joke that we have to call it AGI now. back in the day it was just "AI", but then the marketing people came along and called every mediocre machine-learning system or algorithm with more than 10 ifs "AI", as if it had already passed the Turing test and become self-aware along the way.


Hello dinkblam, thank you for discovering this elaborate scam about agi. I am going to tell my marketing people that Christmas won't be the same this year.

Take care


Google absolutely has their own internal models that do exactly this. It wouldn't surprise me if Microsoft indeed does have an internal Copilot trained on their data, but given even the smallest risk of leaking their code, they wouldn't share that particular model.


What does "absolutely has" mean here? Have you actually heard anything about such internal models?


Why wouldn’t they? Meta does, and they write openly about it


The paper has recently been called into question for overestimating its performance relative to BERT: https://news.ycombinator.com/item?id=36758433. It might be good for the blog's author to take this into account in their explainer. The author's perspective sounds a bit too positive (and borderline salesmanlike).


The second-to-last section, "some potential issues with the paper", discusses the top-2 finding.


Yes - it's mentioned, but doesn't the framing below make it sound like they're still advocating for this paper?

> In essence, it's advisable to take the paper’s reported figures with a grain of salt, particularly as they cannot be precisely reproduced as described. Nonetheless, this approach continues to deliver unexpectedly well.

A "grain of salt" is different from "critical evaluation flaw," and if the reproduction's results are true, then the method doesn't after all "deliver unexpectedly well".


I take your point that it could have been more strongly worded. The reason I say it "delivers unexpectedly well" is because the whole concept of using gzip for classification is unintuitive, and even after fixing the flaw it still manages to get decent accuracy (even though it is no longer beating state-of-the-art models).


Further analysis shows that it doesn’t perform well at all—successes are tied to things like test set leakage.

https://kenschutte.com/gzip-knn-paper2/

This paper isn’t any surprisingly effective result. It’s thoroughly shoddy scholarship by which the authors should feel embarrassed.


Thank you for reading :-)

I mentioned it towards the end, in the second-to-last paragraph. Those issues in the evaluation do bring its accuracy down a bit, but even then it performs better than expected, considering it is doing kNN on compressed data.
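
For readers who haven't seen the paper, the core idea is roughly this (a minimal sketch of the general gzip-distance-plus-kNN approach, not the paper's exact code or hyperparameters):

  import gzip

  def ncd(x, y):
      # Normalized compression distance from gzip'd lengths.
      cx = len(gzip.compress(x.encode()))
      cy = len(gzip.compress(y.encode()))
      cxy = len(gzip.compress((x + " " + y).encode()))
      return (cxy - min(cx, cy)) / max(cx, cy)

  def classify(test_text, train_set, k=3):
      # train_set: list of (text, label) pairs; vote among the k nearest.
      nearest = sorted(train_set, key=lambda tl: ncd(test_text, tl[0]))[:k]
      labels = [label for _, label in nearest]
      return max(set(labels), key=labels.count)

The surprising part is that a general-purpose compressor's output lengths carry enough signal for nearest-neighbour voting to work at all.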


Well…one that peeks at the test set labels.

https://kenschutte.com/gzip-knn-paper2/


The title wordplay dates back to at least Drew McDermott's 1976 essay "Artificial Intelligence Meets Natural Stupidity" [0]. The intro is phenomenal.

---

> As a field, artificial intelligence has always been on the border of respectability, and therefore on the border of crackpottery. Many critics <Dreyfus, 1972>, <Lighthill, 1973> have urged that we are over the border. We have been very defensive toward this charge, drawing ourselves up with dignity when it is made and folding the cloak of Science about us. On the other hand, in private, we have been justifiably proud of our willingness to explore weird ideas, because pursuing them is the only way to make progress.

> Unfortunately, the necessity for speculation has combined with the culture of the hacker in computer science <Weizenbaum, 1975> to cripple our self-discipline. In a young field, self-discipline is not necessarily a virtue, but we are not getting any younger. In the past few years, our tolerance of sloppy thinking has led us to repeat many mistakes over and over. If we are to retain any credibility, this should stop.

> This paper is an effort to ridicule some of these mistakes.

---

[0]: Drew McDermott. 1976. Artificial intelligence meets natural stupidity. SIGART Bull., 57 (April 1976), 4–9. https://doi.org/10.1145/1045339.1045340


>In this paper, I have criticized AI researchers very harshly. Let me express my faith that people in other fields would, on inspection, be found to suffer from equally bad faults. Most AI workers are responsible people who are aware of the pitfalls of a difficult field and produce good work in spite of them. However, to say anything good about anyone is beyond the scope of this paper.

This paragraph concludes the section titled "Benediction." Ouch.


There's also a great line in a Terry Pratchett novel, Hogfather:

> Natural stupidity beats artificial intelligence every time.


Now that you mention it, the way Hex was used at the UU was a good prediction of how we're interacting with LLMs.


I'm going to add a contrarian take here: this preprint is not a research paper. While it's nice to see that there is an improvement here on their one task, this is not "semantically" driven tokenization. It's morphologically driven. For it to be semantically driven, it would be reasonable to expect synonyms to have similar representations. I got really excited by the title, and the content is a let-down.

The line of research here has been going on for 30+ years, from Michael Brent's work, to Linguistica, to Morfessor, and now several approaches to incorporate morphology into tokenizers. The stand-out example is [0]. This paper doesn't seem to acknowledge any of that intellectual legacy. It's not a _research_ paper.

I'm getting a bit tired of people putting their class projects or quick engineering projects on arXiv. I don't know why they're surfacing so high on HN either.

[0]: https://aclanthology.org/2021.acl-long.279/


If a transformer has a good "place" to assign meanings to, I think it does a pretty good job of (1) discovering similar meanings in synonyms, and (2) representing words differently based on context. That latter one is a huge advance over word embeddings, which I thought were holding progress back instead of advancing it.

You're right that what they are doing is morphological, not semantic, but it helps a lot. I would say that

   日本語
"Japanese Language" is a good token to apply embedding, attention, etc. to because it has a definite meaning to which the transformer can attach whatever syntax and semantics it learns in terms of activations. If BPE gives up and processes it as UTF-8 bytes

  e6 97 a5 e6 9c ac e8 aa 9e
there is no clear meaning for any one of those tokens, and the model is going to have to work a lot harder.
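
You can verify those bytes with a quick Python check (purely illustrative):

  >>> ["{:02x}".format(b) for b in "日本語".encode("utf-8")]
  ['e6', '97', 'a5', 'e6', '9c', 'ac', 'e8', 'aa', '9e']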


By your first paragraph's argument, the semantics are in the Transformer, not the tokenizer.

And yes, what they do helps on their two test tasks. I'm not disputing that. It's the fact that there's no scholarship here.

There are so many thousands of knobs to twiddle with in a model these days, and they went after one that's commonly regarded in the NLP community as the 'defect'—the only part of the model that's not end-to-end trained along with the rest. Which would be great, if they acknowledged it! But there's no citation to any tokenization literature beyond BPE or SentencePiece. The literature review is as superficial as what you could find in a blog.

There are certainly byte-level or character-level tokenizers (think about CANINE or ByT5), and we can argue back and forth about their data-hungriness or slow inference. It would be nice to give more helpful units to a Transformer, so it doesn't have to learn syllables (or even characters) all on its own. Rebracketing/incorrect segmentation is a problem! And these authors have clued into that, but so have several hundred (or thousand?) researchers they don't cite.

What I'm having trouble with is the notion that this paper uncovered some exciting, revelatory fact about tokenization. Yes, "Japanese Language" would be a reasonable semantic unit! But these authors didn't discover that fact. Nobody's questioning whether 'good tokenization is better than bad tokenization'. Tokenization has seen ongoing attention in NLP forever.

These authors tried one variant, compared it against a library default option (and nothing else), evaluated on one task, put a bit of marketing around it, and called it a day. In the NLP course I used to TA, this wouldn't even qualify as a complete final project for the course.


> I'm getting a bit tired of people putting their class projects or quick engineering projects on arXiv.

Whenever something becomes a status symbol there will be people willing to exploit it. Perhaps ArXiv should hire some volunteers to check for a minimum of quality before acceptance? (/s, in case it's not clear).

Anecdotally, the second worst paper I've ever read was hosted on ArXiv and presented in an NLP group as a possible breakthrough. Tearing it apart in front of the person presenting it was no fun.


> To be semantically driven, it would be reasonable to expect that synonyms would have similar representations.

How could a tokenizer do anything about that unless the synonyms actually share substrings? The vector embedding is learned, not part of the tokenizer.


It couldn't, which is why it's a good idea to avoid the word, "semantic".

The same problem also exists in the name, "Large Language Model". Sure, the content being modeled contains language, but the model itself is not specific or limited to language patterns. We ought to call them "Large Text Models"; or better yet, "Text Inference Models".

The words we use to describe software are very important: they inform goals and expectations. They define the context that software exists in.

I see our biggest mistake as calling these tools, "Artificial Intelligence". That title began as a goal and a category of work: it doesn't belong in the title or description of software unless that software has actually met the goal.


A morpheme is the smallest *meaningful* unit in a language though.


I was being generous - stemming is a poor man's morphology. Empirically useful (ask the IR folks) but incredibly heuristic.


> I'm getting a bit tired of people putting their class projects or quick engineering projects on arXiv.

I got downvoted when I expressed a similar opinion with regard to MiniGPT4. I guess the HN crowd values usefulness more than real contribution.


Yep! Percy Liang in an interview with Chris Potts said he sees BERT and ELMo as foundation models.

