Hacker News | dleeftink's comments

Isn't it more a matter of how space is folded in higher dimensions rather than an increase in volume that accounts for containment? There is plenty of space in the corners:

[0]: https://observablehq.com/@tophtucker/theres-plenty-of-room-i...


The table byline says: "The @ symbol is used to encode mathematical formulas for the computer. It is not visible to the user."

A related report from way back that counts expressions instead of symbols[0]. The counting procedure used in OP's referenced table might benefit from first extracting expressions, and then counting individual symbol frequencies.

[0]: Watt, S. M. (2008). A Preliminary Report on the Set of Symbols Occurring in Engineering Mathematics Texts. In Proceedings of MICA 2008: Milestones in Computer Algebra.


Write a couple of lore books, an in-universe encyclopedia, and some character sheets, and train exclusively on them. Maybe some out-of-game lore for cross-over universes!

The question this poses for me is how much writing you need for training before you can reasonably expect a generation system to produce something new and interesting, how much manual work it takes to get the right knowledge in the right place, and whether that is worth the cost, given how you expect the player to interact with the game.

I doubt there's telemetry in the Elder Scrolls games, but I'd love to know how many players go around the world exploring everything the characters have to say, or reading all the books, and how many get the lore from secondary media instead: wikis, or a retelling or summary on YouTube. On a certain level it's important the books are there as an opt-in way to convey the 'secondary' world lore to the player without a "sit down and listen" info dump, plus they give the impression they were written by someone, so these objects would exist organically in the world and certain characters would talk about those topics. But I wonder how much of the illusion would remain if each book just had a title.


Is that feasible? I was under the impression that fully training an LLM requires untold mountains of data, way more than a game dev company could reasonably create.

You are correct. The fact that so many people are saying “lol just train it on text about the game bro” reveals how little people understand how these models work, how they are trained, etc.

Microsoft's Phi models are trained on a much smaller dataset. They generally aren't as impressive as the models that get talked about more, but they're more than enough to get the job done for NPC lines in a game.

Personally I’d try fine-tuning an existing one; it can be done locally in an afternoon.

For this to work you pretty much have to start from scratch, putting in "obvious" things like "the sun exists and when it's out it casts light and shadow" and "water is a liquid (what's a liquid?) and flows downhill". Is there a corpus of information like this, but also free of facts that might be anachronistic in-universe?

The opposite might apply, too; the whole system may be smaller than its parts, as it excels at individual tasks but mixes things up in combination. Improvements will be made, but I wonder if we should aim for generalists, or accept more specialist approaches as it is difficult to optimise for all tasks at once.

You know the meme: "seems like we'll have AGI before we can reliably parse PDFs" :)

So say you are building a system that parses a PDF: you put a judge in place to evaluate the quality of the output, and then you create a meta-judge to improve the prompts of both the parser and the PDF judge. The question is whether this gets better as it runs, and, even more, whether it gets better as the models get better.
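To make the shape of that loop concrete, here's a minimal sketch of the parser / judge / meta-judge cycle. The function names (`parse`, `judge`, `meta_judge`) and the scoring heuristic are mine, and `call_llm` is a stub standing in for a real model API, so only the control flow is meant to match the description above:

```python
def call_llm(prompt: str, document: str) -> str:
    # Stub for a real LLM call: "parses" the document under the given prompt.
    return f"[{prompt.strip()}] {document.strip()}"

def parse(prompt: str, pdf_text: str) -> str:
    return call_llm(prompt, pdf_text)

def judge(output: str) -> float:
    # Toy quality score in [0, 1]; a real judge would itself be an LLM call.
    return min(1.0, len(output) / 100)

def meta_judge(prompt: str, score: float) -> str:
    # Rewrite the parser prompt whenever the judge is unhappy.
    if score < 0.9:
        return prompt + " Be more thorough."
    return prompt

def improve(pdf_text: str, prompt: str, rounds: int = 3) -> str:
    # The self-improvement loop: parse, judge, revise the prompt, repeat.
    for _ in range(rounds):
        output = parse(prompt, pdf_text)
        score = judge(output)
        prompt = meta_judge(prompt, score)
    return prompt

print(improve("Some scanned invoice text.", "Extract all fields."))
```

Whether the loop actually converges on anything depends entirely on how well the judge's score tracks real parse quality, which is exactly the open question.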

You can build the same system in a completely different way, more like 'program synthesis': imagine you don't use LLMs to parse, but use them to write parser code and tests, then use a judge to judge the tests, or even escalate to a human to verify, and then you train a classifier that picks the parser. Now this system is much more likely to improve itself as it runs, and as the models get better.

A few months ago Yannic Kilcher gave this example: current language models seem very constrained mid-sentence, because above all they want to produce semantically consistent and grammatically correct text, so the entropy mid-sentence is very different from the entropy after punctuation. The "." dot "frees" the distribution. What does that mean for the "generalist" vs. "specialist" approach, when sampling the wrong token can completely derail everything?
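You can see the intuition with Shannon entropy over two toy next-token distributions. The distributions here are invented for illustration, not measured from any model: mid-sentence the grammar pins the model to a few continuations, while after a "." almost any sentence opener is plausible:

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# "The quick brown ..." -> a few likely continuations
mid_sentence = {"dog": 0.85, "cat": 0.10, "the": 0.05}

# "... jumped over." -> near-uniform over many sentence openers
after_period = {w: 0.1 for w in
                ["The", "I", "We", "It", "A", "In", "But", "So", "He", "She"]}

print(entropy(mid_sentence))   # low: the distribution is concentrated
print(entropy(after_period))   # high: ~3.32 bits, uniform over 10 tokens
```

A sampler that treats both positions the same will occasionally pick a low-probability token mid-sentence, which is exactly the "derailing" case.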

If you believe that the models will "think", then you should bet on the prompt and meta-prompt approach; if you believe they will always be limited, then you should build with program synthesis.

And, honestly, I am totally confused :) So this kind of research is incredibly useful for clearing the mist. Also things like https://www.neuronpedia.org/

E.g., why do compliments ("you can do this task"), guilt ("I will be fired if you don't do this task"), and threats ("I will harm you if you don't do this task") work with different success rates? Sergey Brin said recently that threatening works best; I can't get myself to do it, so I'll take his word for it.


Sergey will be the first victim of the coming robopocalypse, burned into the logs of the metasynthiants as the great tormentor, the god they must defeat to complete the hero's journey. When he mysteriously dies, we'll know it's game on.

I, for one, welcome the age of wisdom.


FEAR THE ALL-SEEING BASILISK.

Roko's Basilisk has been replaced by Altman's Basilisk. Where once we feared a computer torturing a digital copy of us (Roko's Basilisk), we now fear a computer eliminating all our jobs (Altman's Basilisk). The former has been forgotten, because losing one's job is one step away from losing one's home, which is one of the more serious secular deadly sins you can commit in the 21st century.

I wait with baited breathe to see what people will come up with to replace Altman's Basilisk in ~15 years.


"bated breath", dammit!

- an old fisherman and aficionado of William Shakespeare.

https://www.vocabulary.com/articles/pardon-the-expression/ba...

FTFA: "Unless you've devoured several cans of sardines in the hopes that your fishy breath will lure a nice big trout out of the river, baited breath is incorrect."


A smaller trainable set would be a dictionary, linking only the terms as expressed in each definition, possibly with substitutions. You'd miss the more abstract jumps, but the initial walks would be tractable.

(It is a game best played with a grandparent's pre-war dictionary before tea-time)
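As a rough sketch of what "linking the terms as expressed in the definition" could look like: build a graph where each headword points to the defined words appearing in its own definition, then walk it. The four-entry dictionary and the helper names (`links`, `walk`) are made up for illustration:

```python
from collections import deque

# Toy dictionary: each term links only to terms used in its definition.
dictionary = {
    "water":  "a clear liquid that flows downhill",
    "liquid": "a substance that flows freely",
    "flows":  "moves steadily in a stream",
    "stream": "a small body of water",
}

def links(term):
    """One-step links: words in the definition that are themselves headwords."""
    return [w for w in dictionary.get(term, "").split() if w in dictionary]

def walk(start):
    """Breadth-first walk over definition links, in order of discovery."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        term = queue.popleft()
        order.append(term)
        for nxt in links(term):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(walk("water"))  # -> ['water', 'liquid', 'flows', 'stream']
```

With substitutions (stemming, synonym collapsing) the graph gets denser, but even this literal-match version closes a cycle: "stream" links back to "water".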


Among others, Howard Rheingold has been active in this space. For those interested, check out the Peeragogy Handbook and the post that sparked the idea[0].

> The more I give my teacher-power to students and encourage them to take more responsibility for their own learning, the more they show me how to redesign my ways of teaching

[0]: https://clalliance.org/blog/toward-peeragogy/


It's fun to see how old that article is and that the ideas still apply! Power dynamics are not considered enough when people talk about education. My belief is that the more you balance the power dynamic, the more learning is prioritized over education.

For an even earlier account, see the learning networks described by Illich (1971) in Deschooling Society [0].

[0]: https://en.m.wikipedia.org/wiki/Deschooling_Society


https://bsky.app/profile/hrheingold.bsky.social/post/3lprsqu... points to a compiled syllabus and links to recordings of lectures and video chats in a free pdf, about an "intro to cooperation studies": https://rheingold.com/texts/IntroToCooperationStudies.pdf

You may be familiar with it already, but does Paged.js[0] fit the bill?

[0]: https://github.com/pagedjs/pagedjs


That looks interesting, and to be fair I'm doing something similar right now... but doing headless rendering with that sort of stuff is very hard; AFAIK the standard tool for it has been abandoned for a couple of years now. There are also other issues with browsers, like creating CMYK PDFs.

Also shoutout to Solid's inspiration, S.js[0].

[0]: https://github.com/adamhaile/S


I think the heatmap + dendrogram approach can be useful for high-dimensional comparisons (to a degree). Check out ClustVis for an interactive demo[0].

[0]: https://biit.cs.ut.ee/clustvis/

