I used to work at a drug discovery startup. A simple model generating directly from latent space 'discovered' some novel interactions that none of our medicinal chemists had noticed; for example, it started biasing toward a distribution of molecules that was totally unexpected to us.
Our chemists were split: some argued it was an artifact, others dug deep and provided some reasoning as to why the generations were sound. Keep in mind, that was a non-reasoning, very early stage model with simple feedback mechanisms for structure and molecular properties.
In the wet lab, the model turned out to be right. That was five years ago. My point is, the same moment that arrived for our chemists will be arriving soon for theoreticians.
A lot of interesting possibilities lie in latent space. For those unfamiliar, this means the underlying set of variables that drive everything else.
For instance, you can put a thousand temperature sensors in a room, which give you 1000 temperature readouts. But all these temperature sensors are correlated, and if you project them down to latent space (using PCA or PLS if linear, projection to manifolds if nonlinear) you’ll create maybe 4 new latent variables (which are usually linear combinations of all other variables) that describe all the sensor readings (it’s a kind of compression). All you have to do then is control those 4 variables, not 1000.
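To make that compression concrete, here's a minimal sketch with scikit-learn on simulated data (the sensor count, the four hidden drivers, and all numbers are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulate 1000 correlated temperature sensors driven by 4 hidden factors
# (e.g. HVAC setpoint, sun exposure, occupancy, door draft) -- purely illustrative.
rng = np.random.default_rng(0)
n_samples, n_sensors, n_hidden = 5000, 1000, 4
hidden = rng.normal(size=(n_samples, n_hidden))     # the "true" latent drivers
mixing = rng.normal(size=(n_hidden, n_sensors))     # how each driver affects each sensor
readings = hidden @ mixing + 0.1 * rng.normal(size=(n_samples, n_sensors))

# Project the 1000 readings down to 4 latent variables.
pca = PCA(n_components=4)
latent = pca.fit_transform(readings)

print(latent.shape)                          # (5000, 4)
print(pca.explained_variance_ratio_.sum())   # close to 1.0: 4 variables explain almost everything
```

Controlling the room then means steering those 4 scores instead of 1000 raw readings.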
In the chemical space, there are thousands of possible combinations of process conditions and mixtures that produce certain characteristics, but when you project them down to latent variables, there are usually less than 10 variables that give you the properties you want. So if you want to create a new chemical, all you have to do is target those few variables. You want a new product with particular characteristics? Figure out how to get < 10 variables (not 1000s) to their targets, and you have a new product.
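As a toy illustration of "get a few latent variables to their targets", here's a hedged sketch that maps a chosen point in latent space back to a full set of process conditions; the data, the number of conditions, and the target values are all made up:

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented stand-in for historical process data: 2000 batches, 300 measured conditions,
# secretly driven by 5 hidden factors.
rng = np.random.default_rng(1)
drivers = rng.normal(size=(2000, 5))
X = drivers @ rng.normal(size=(5, 300))

pca = PCA(n_components=5).fit(X)       # ~5 latent variables summarize the process

# Pick target values for those latent variables (e.g. from a model of product properties),
# then map back to a full recipe of 300 process conditions.
latent_target = np.array([[1.2, -0.5, 0.0, 0.3, 0.8]])   # hypothetical targets
recipe = pca.inverse_transform(latent_target)
print(recipe.shape)                    # (1, 300): candidate settings for every condition
```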
PCA (essentially SVD) is the one that makes the fewest assumptions. It still works really well if your data is (locally) linear and more or less Gaussian. PLS is the regression version of PCA.
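To make "essentially SVD" concrete, a small numpy/scikit-learn sketch on random data (nothing domain-specific): center the matrix, take the SVD, and the projections match what PCA returns up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))

# PCA by hand: center, then SVD; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = U * S                       # projections onto the principal directions

scores_pca = PCA(n_components=20).fit_transform(X)

# Same result up to a per-component sign flip.
print(np.allclose(np.abs(scores_svd), np.abs(scores_pca)))   # True
```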
There are also nonlinear techniques. I’ve used UMAP and it’s excellent (particularly if your data approximately lies on a manifold).
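A quick usage sketch with the umap-learn package (the data is a random placeholder and the parameters are just common defaults, not recommendations):

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))          # stand-in for high-dimensional data

# Nonlinear projection down to 2 latent dimensions.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embedding = reducer.fit_transform(X)
print(embedding.shape)                   # (1000, 2)
```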
The most general purpose deep learning dimensionality reduction technique is of course the autoencoder (easy to code in PyTorch). Unlike the above, it makes very few assumptions, but this also means you need a ton more data to train it.
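For reference, a minimal sketch of such an autoencoder in PyTorch; the layer sizes, latent dimension, and toy training loop are just placeholders:

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features: int = 1000, latent_dim: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),              # the latent variables
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Toy training loop on random data, just to show the mechanics.
model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
X = torch.randn(256, 1000)

for _ in range(100):
    recon, z = model(X)
    loss = loss_fn(recon, X)        # reconstruction error drives the compression
    opt.zero_grad()
    loss.backward()
    opt.step()
```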
> PCA (essentially SVD) is the one that makes the fewest assumptions
Do you mean it makes the *strongest* assumptions? "your data is (locally) linear and more or less Gaussian" seems like a fairly strong assumption. Sorry for the newb question as I'm not very familiar with this space.
You’re correct in a mathematical sense: linearity and Gaussianity are restrictive assumptions.
However, I meant it colloquially, in that those assumptions are trivially satisfied by many generating processes in the physical and engineering world, and there aren’t a whole lot of other requirements that need to be met.
There's a newer method called PaCMAP, which is interesting and handles different cases better. It's not as robustly tested as UMAP, but that could be said of any new thing. I'm a little wary that it might be overfitted to common test cases. To my mind, PaCMAP feels like a partial solution, a step toward a better way of doing it.
The three-stage process of PaCMAP is asking either to be developed into a continuous system or to find an analytical reason/way to conduct a phase change.
A lot of relationships are (locally) linear so this isn’t as restrictive as it might seem. Many real-life productionized applications are based on it. Like linear regression, it has its place.
t-SNE is good for visualization and for seeing class separation, but in my experience I haven’t found it to work for dimensionality reduction per se (maybe I’m missing something). For me, it’s more of a visualization tool.
On that note, there’s a new algorithm that improves on t-SNE called PaCMAP, which preserves local and global structure better.
https://github.com/YingfanWang/PaCMAP
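A hedged usage sketch with the pacmap package; the parameter values below are the commonly shown defaults and may not suit your data:

```python
import numpy as np
import pacmap  # pip install pacmap

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))          # placeholder high-dimensional data

# n_neighbors, MN_ratio and FP_ratio are the main knobs PaCMAP exposes.
embedding = pacmap.PaCMAP(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0)
X_transformed = embedding.fit_transform(X, init="pca")
print(X_transformed.shape)               # (1000, 2)
```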
There's also Bonsai: it's parameter-free and supposedly 'better' than t-SNE, but it's clearly aimed at visualisation purposes (except that in Bonsai trees, distances between nodes are 'real', which is usually not the case in t-SNE).
I'd add that PCA/OLS is linear in the functional form (a linear combination), but the input variables can themselves be non-linear (e.g. X_new := X_{old,1} * X_{old,2}^2), so if the non-linearities are simple, basic feature engineering to strip them out before fitting PCA/OLS may be acceptable.
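A small sketch of that idea, with an invented non-linearity of exactly that form: engineer the non-linear feature first, and the model that follows is plain linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
y = 3.0 * x1 * x2**2 + rng.normal(scale=0.1, size=1000)   # non-linear in the raw inputs

# Feature engineering: build the non-linear term explicitly; the model is linear in it.
X_new = (x1 * x2**2).reshape(-1, 1)
model = LinearRegression().fit(X_new, y)
print(model.coef_)        # close to [3.0]
```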
Attention query/key/value vectors are latent variables.
More generally, a latent variable is any internal, not-directly-observed representation that compresses or restructures information from inputs into a form useful for producing outputs.
They usually capture some underlying behavior in a lower-dimensional or otherwise compressed space.
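To make that concrete, a minimal single-head attention sketch (the shapes and sizes are arbitrary): the query/key/value projections below are exactly the kind of internal, not-directly-observed representations being described.

```python
import torch
from torch import nn

d_model, seq_len = 64, 10
x = torch.randn(1, seq_len, d_model)          # input token embeddings

# Learned projections produce the latent q/k/v representations.
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
q, k, v = W_q(x), W_k(x), W_v(x)              # latent variables: never observed directly

attn = torch.softmax(q @ k.transpose(-2, -1) / d_model**0.5, dim=-1)
out = attn @ v                                # restructured information used to produce outputs
print(out.shape)                              # torch.Size([1, 10, 64])
```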
Interesting! Depending on your definition, "automated invention" has been a thing since at least the 1990s. An early success was the evolved antenna [1].
4F-* compounds are mainly research chemicals (still?) and are still being sold widely, especially where there is no blanket ban on them.
I remember 3-FPM; it was what I imagined stimulants should be doing. It did everything just right. I got it back when it was legal. No other stimulants come anywhere close, except maybe similar ones; 4-FA, for example, is mostly euphoric, which is not what I want.
My understanding is that iterating on possible sequences (of codons, base pairs, etc.) is exactly what LLMs, these feedback-looped predictor machines, are especially great at. The newest models, those that "reason about" (check) their own output, are even better at it.
Warning: the comment below comes from someone who has no formal science degree and just enjoys reading articles on the topic.
It's similar for physicists: I'm thinking of a very confusing/unconventional antenna called the “evolved antenna”, which was used on a NASA spacecraft. Its design came out of genetic programming. The science, i.e. understanding why the antenna's bends at different points increase gain, is still not well understood today.
This all boils down to empirical reasoning, which underlies the vast majority of science (and science-adjacent fields like software engineering, the social sciences, etc.).
The question, I guess, is: do LLMs, “AI”, and ML give us better hypotheses or tests to run to support empirical, evidence-based scientific breakthroughs? The answer is yes.
Will these be substantial or meaningful, or create significant improvements over today’s approaches?
You are confusing copyright and patents, which are two very different things. And yes, companies or people wielding AIs can patent anything that hasn't been claimed by others before.
Hallucinations or inhuman intuition? An obvious mistake made by a flawed machine that doesn't know the limits of its knowledge? Or a subtle pattern, a hundred scattered dots that were never connected by a human mind?
You never quite know.
Right now, it's mostly the former. I fully expect the latter to become more and more common as the performance of AI systems improves.
Ok but I have to point out something important here. Presumably, the model you're talking about was trained on chemical/drug inputs. So it models a space of chemical interactions, which means insights could be plausible.
GPT-5 (and other LLMs) are by definition language models, and though they will happily spew tokens about whatever you ask, they don't necessarily have the training data to properly encode the latent space of (e.g.) drug interactions.
Seems short-sighted to me. LLMs could have any data in their training set encoded as tokens: either new specialized tokens are explicitly included (e.g. vision models), or there's the language-encoded version of everything, which is what usually exists (e.g. the research paper and the CSV with the data).
Improving next-token prediction performance on these datasets and generalizing requires a much richer latent space. I think it could theoretically lead to better results from cross-domain connections (e.g. being fluent in a specific area of advanced mathematics, quantum mechanics, and materials engineering being key to a particular breakthrough).