BenoitP's comments | Hacker News

> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

> What's the intuition here? Law of large numbers?

Yep, the large number being the number of dimensions.

As you add another dimension to a random point on a unit sphere, you create another new way for this point to be far away from a starting neighbor. Increase the dimensions a lot and almost all random neighbors end up on the equator relative to the starting neighbor. The equator here is a 'hyperplane' (just like a 2D plane in 3D) of dimension n-1, whose normal is the starting neighbor, intersected with the unit sphere (thus becoming an n-2 dimensional 'variety', or shape, embedded in the original n-dimensional space; just like the Earth's equator is a 1-dimensional object).

The mathematical name for this is 'concentration of measure' [1]

It feels weird to think about, but there's also a unit change in here. Paris is about 1/8 of a full circle away from the North Pole (8 such angle segments of freedom). On a circle, that would pin it down. But if that were the whole definition of Paris's location, the 3D Earth would contain infinitely many Parises; there is only one. Once we take longitude into account, we also get Montreal, Vancouver, Tokyo, etc., each 1/8 away (and now we have 64 solid-angle segments of freedom).

[1] https://www.johndcook.com/blog/2017/07/13/concentration_of_m...
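
A quick way to build that intuition is to sample random unit vectors and watch their cosine similarity shrink as the dimension grows. A minimal numpy sketch:

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_abs_cosine(dim, n_pairs=1000):
        # Random points on the unit sphere: sample Gaussians, then normalize
        a = rng.standard_normal((n_pairs, dim))
        b = rng.standard_normal((n_pairs, dim))
        a /= np.linalg.norm(a, axis=1, keepdims=True)
        b /= np.linalg.norm(b, axis=1, keepdims=True)
        # Cosine similarity of unit vectors is just their dot product
        return np.abs(np.sum(a * b, axis=1)).mean()

    for d in (2, 10, 100, 1_000, 10_000):
        print(d, round(mean_abs_cosine(d), 4))
    # The mean |cosine| shrinks roughly like 1/sqrt(dim): in high dimensions,
    # two random vectors are almost certainly nearly orthogonal.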


> "theory building"

Strongly agree with your comment. I wonder now if this "theory building" can have a grammar, and be expressed in code; be versioned, etc. Sort of like a 5th-generation language (the 4th-generation being the SQL-likes where you let the execution plan be chosen by the runtime).

The closest I can think of:

* UML

* Functional analysis (ie structured text about various stakeholders)

* Database schemas

* Diagrams


Prolog/Datalog with some nice primitives for how to interact with the program in various ways? Would essentially be something like "acceptance tests" but expressed in some logic programming language.
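
As a very rough sketch of what that might look like (every name here is invented, this is not an existing tool): facts about the system as tuples, and the "theory" as declarative rules checked like acceptance tests.

    # Hypothetical sketch: the program's "theory" as facts plus declarative
    # rules, checked the way acceptance tests are. All names are made up.
    facts = {
        ("handles", "billing_service", "invoice"),
        ("handles", "billing_service", "refund"),
        ("owned_by", "invoice", "finance_team"),
        ("owned_by", "refund", "finance_team"),
    }

    def every_handled_entity_has_an_owner(facts):
        handled = {entity for (rel, _, entity) in facts if rel == "handles"}
        owned = {entity for (rel, entity, _) in facts if rel == "owned_by"}
        return handled <= owned  # the invariant the codebase must keep true

    assert every_handled_entity_has_an_owner(facts)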


Cucumber-style BDD has been trying to do this for a long time now, though I never found it to be super comfortable.



A bit late to the discussion, but this has deep connections. As a programmer, your job is to provide business invariants using complexity management techniques. And checking that your state space is small is a tool with a gigantic payoff.

Maintaining a small state space is why we want to let it crash. Each program instruction can potentially multiply the number of possible states. Erlang even has this whole "Let It Crash" philosophy as a guideline [1].

Maintaining a small state space is how you tame concurrent programs, where adding one thread can cartesian-product your state space. But there are tools like TLA+ which can help you build proofs over this state space, and build invariants that your threads can rely on safely. Here is a visualizer of that state space [2]. Notice any resemblance to the graphs you just saw in the video?
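
To make "one more thread multiplies the states" concrete, here is a tiny toy enumeration in plain Python (not TLA+, purely illustrative): two threads doing a non-atomic increment, with every interleaving explored by brute force.

    from itertools import permutations

    def run(order):
        # Toy model: two threads each perform a non-atomic read-then-write
        # increment of a shared counter; `order` is one interleaving of steps.
        shared = 0
        local = [0, 0]
        step = [0, 0]            # per-thread program counter: 0 = read, 1 = write
        for tid in order:
            if step[tid] == 0:
                local[tid] = shared        # read the shared value
            else:
                shared = local[tid] + 1    # write back (possibly stale) value + 1
            step[tid] += 1
        return shared

    # Enumerate every interleaving of the 4 steps (2 per thread, ids kept in order)
    finals = {run(order) for order in set(permutations([0, 0, 1, 1]))}
    print(finals)  # {1, 2} -- the "lost update" final state 1 is reachable

With more steps or a third thread, the set of interleavings explodes; that set is exactly the state space TLA+-style tools enumerate and check for you.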

Programming sometimes feels like this "Rush Hour" puzzle.

[1] https://wiki.c2.com/?LetItCrash [2] https://prob.hhu.de/w/index.php?title=State_space_visualizat...


I'm failing to grasp how it solves/replaces what vector DBs were created for in the first place (high-dimensional neighborhood search, where the space to be searched grows like distance^dimension).


It doesn't replace a vector DB; it's more for storing agentic memory. Think of information you would like agents to remember across conversations with users, just like humans do.


Super simplistic example, but say I mention my daughter, who is 9.

Then I mention she is 10.

A few years later she is 12, but now I call her by her name.

I have struggled to get any of the RAG approaches to handle this effectively. It is also 3 entries, but 2 of them are no longer useful; they are nothing but noise in the system.


> I have struggled to get any of the RAG approaches to handle this effectively.

You need to annotate your text chunks. For example, you can use an LLM to look over the chunks and their dates and generate metadata like a summary or entities. When you run the embedding, the combination of data + metadata will work better than the data alone.

The problem with RAG is that it only sees the surface level. For example, "10+10" will not embed close to "20", because RAG does not execute the meaning of the text, it only represents its surface form. Thus using an LLM to extract that meaning prior to embedding is a good move.

Make the implicit explicit. Circulate information across chunks prior to embedding. Treat text like code, embed <text inputs + LLM outputs> not text alone. The LLM is how you "execute" text to get its implicit meaning.
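
For the daughter example upthread, that could look roughly like this. A minimal sketch where `summarize` stands in for whatever LLM call you use; nothing here is a specific library's API:

    from typing import Callable

    def enrich_chunk(chunk_text: str, chunk_date: str,
                     summarize: Callable[[str], str]) -> str:
        # Ask the LLM to surface the implicit meaning (entities, resolved
        # dates), then embed the metadata together with the original text.
        prompt = (f"Extract the entities and a one-line summary of this note, "
                  f"resolving relative dates against {chunk_date}:\n{chunk_text}")
        metadata = summarize(prompt)
        return f"[date: {chunk_date}] [meta: {metadata}]\n{chunk_text}"

    # Placeholder summarizer just to show the shape of the enriched chunk:
    fake_summarize = lambda _prompt: "entity: daughter; age 10 as of 2023-07-01"
    print(enrich_chunk("My daughter is 10", "2023-07-01", fake_summarize))
    # The returned string is what gets embedded, not the raw chunk alone.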


Hmm, I'm trying to contextualize your insight with the example that was given.

That approach sounds great for a lot of use cases, but wouldn't it still struggle with the given example of the age changing over the years?

How old is x? -> 10

Two years later:

How old is x? -> 12


As a simplified example:

(memory) [2023-07-01] My daughter is 10

(memory) [2024-05-30] My daughter turned 11 today!

System prompt: Today is 2025-08-21

User prompt: How old is my daughter?

The vector DB does the work of fetching the daughter-age-related memories, your system decides (perhaps with another LLM) if the question needs time-based sorting or something else.
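
A minimal sketch of that division of labor, with made-up memories (the retrieved list stands in for whatever the vector DB returns):

    from datetime import date

    # Memories the vector DB returned for "How old is my daughter?"
    retrieved = [
        {"date": date(2023, 7, 1),  "text": "My daughter is 10"},
        {"date": date(2024, 5, 30), "text": "My daughter turned 11 today!"},
    ]

    # The question is time-sensitive, so keep the freshest fact and tell the
    # model what day it is; it can then reason "turned 11 in 2024 -> 12 now".
    latest = max(retrieved, key=lambda m: m["date"])
    today = date(2025, 8, 21)
    context = (f"Today is {today}. "
               f"Most recent memory ({latest['date']}): {latest['text']}")
    print(context)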


Still new to the RAG space, but the OP had an additional callout there: "and [later] I call her by her name".

Is it capable of going from `[name] -> [my daughter as a concept] -> age`?


Yeah good call, I missed that. I don't think there's a correct answer here, but it could be another step of the read or write. Either it would do another lookup of "my daughter" -> Name on read, or do a lookup on write if you already have a "my daughter is Name" memory. Whatever's less expensive in the long run. The graph memory someone else mentioned also seems like a good option there.


The agent loop takes care of that


When the RAG sees retrieved facts, it should see the timestamp for each. It will easily then use the latest fact if there aren't too many conflicting ones. I am assuming that the relevance ordering of the retrieved facts won't help.

Separating a RAG from a memory system, it is important for a memory system to be able to consolidate facts. Any decent memory system will have this feature. In our brain we even have an eight hour sleep window where memory consolidation can happen based on simulated queries via dreams.


That is because basic RAG is not very useful as a long-term knowledge base. You have to actively annotate and transform data for it to become useful knowledge. I have the same problem in the regulation domain, which also constantly evolves.

In your case, you do not want to store the age as a fact without context. Better is, e.g., to transform the relative fact (age) into an absolute fact (year of birth), or to contextualize it enough to make it absolute (age 10 in 2025).
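
A tiny illustration of that transformation, assuming the observation date is stored with the fact:

    from datetime import date

    def to_birth_year(age: int, observed: date) -> int:
        # Relative fact ("age 10", seen on some date) -> absolute fact (birth
        # year). Off by one at most, depending on whether the birthday passed.
        return observed.year - age

    birth_year = to_birth_year(10, date(2023, 7, 1))   # ~2013, stays true forever
    age_today = date(2025, 8, 21).year - birth_year    # ~12, derived on demand
    print(birth_year, age_today)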


I could be wrong, but this seems to be exactly what Zep’s “Graphiti” library is made for, built as a semantic temporal knowledge graph.


I do not necessarily think it is noise, similar to how not all history is noise.


That’s a pretty good analogy, though: noise is just information that isn’t immediately useful for the current task.

If I need to know someone's current age, I don't need to know their past ages.


Exactly. If you're not using embeddings, why would you need a vector db?

> Why are we building complex vector stores

Because we want to use embeddings.


The submission and the readme fail to explain the important thing, which is how BM25 is run. If it creates bags of words for every document for every query, that would be inefficient. If it reuses a BM25 index, it is not clear when it is constructed, when it is updated, and how it is stored.

Because BM25 ostensibly relies on word matching, there is no way it will extend to concept matching.
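
For contrast, here is roughly what reusing an index looks like, e.g. with the rank_bm25 package (API quoted from memory, so treat the details as an assumption): the bag-of-words statistics are built once, and each query only does cheap scoring against them.

    from rank_bm25 import BM25Okapi

    corpus = ["my daughter is ten years old",
              "the vector database stores embeddings",
              "bm25 ranks documents by term overlap"]
    tokenized = [doc.split() for doc in corpus]

    bm25 = BM25Okapi(tokenized)  # index built once over the whole corpus

    for query in ("how old is my daughter", "what does bm25 rank"):
        scores = bm25.get_scores(query.split())           # per-query scoring only
        best = max(range(len(corpus)), key=lambda i: scores[i])
        print(query, "->", corpus[best])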


Now testify!


Rage Against the Machine[0] delivers it best & very fitting

so many quotes in this thread (and thanks for the 90/00 playlist for the day)

[0] https://www.youtube.com/watch?v=Q3dvbM6Pias


> Rage Against the Machine[0] delivers it best & very fitting

Not really: this false equivalence between Democrats and GOP (especially their recent incarnation) is absolutely delusional. Contrast Obama, Trump 1.0, Biden, Trump 2.0 (so far). Like really?

> [0] https://www.youtube.com/watch?v=Q3dvbM6Pias

With a few decades' worth of hindsight, does anyone actually think that Gore would have handled 9/11 (leading into Afghanistan and Iraq) the same way as Bush (and Cheney/Rumsfeld/Wolfowitz)? Even at the time (I'm a GenXer) it was strange thinking.


> Not really: this false equivalence between Democrats and GOP (especially their recent incarnation) is absolutely delusional.

You may be more interested in reading the lyrics than watching the video, the song is not about "equivalence between Democrats and GOP".

It's about the powerful who control the media and how the media is used to control the narrative (some much worse than others). It's not only about <media bad>, but also about how the American people (already with an insatiable search for satisfaction in movies, glamour and tabloids) have become slaves to it [all media/types]; glaringly obvious today vs the 2000s, as media heads have gained cult status in parts, to the point of earning government appointments that only this cohort could make without getting laughed out of office.

Those who control the narrative control the present; whoever controls the past controls the future. To bring it full circle, it's fitting that rewriting reports is controlling the narrative, similar to firing the messenger of statistics.

I fully agree with you, it was strange thinking; and if we're drawing parallels, it's interesting how real election scandals led to GOP presidents [SCOTUS/hanging chads & the Mueller report].

Oh, and the "delivers it best" part is biased as I like Rage; they deliver the line better than reading it imo


You mean Nouvelle Nouvelle France ?


Touché


* with shareholder money


And not even doing the proper thing and taking it directly from their accounts... I think that too is justified for this case.


Tabular data is everywhere in the corporate world. I feel like there are a zillion opportunities to make money with these transformers. But how to enter that market remains a mystery to me.


Is this "data is oil" or something else?


Brute force it. GEMM routines can give you the best dot product among 300k vectors in well under a second.
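
As a rough illustration of the scale (sizes are made up, but typical for embedding search):

    import numpy as np

    # Brute-force nearest neighbor as a single matrix-vector product:
    # 300k stored vectors of dimension 768, scored against one query by dot product.
    rng = np.random.default_rng(0)
    db = rng.standard_normal((300_000, 768), dtype=np.float32)
    query = rng.standard_normal(768, dtype=np.float32)

    scores = db @ query            # one GEMV over every candidate
    best = int(np.argmax(scores))  # index of the best-matching vector
    print(best, float(scores[best]))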


Exactly. See this for details: https://news.ycombinator.com/item?id=43162995


LCOE is good for marginal analysis, but quite bad for a systemic and holistic view

