You sound paranoid and schizophrenic; you should honestly try explaining your situation to someone you know or a professional. I think you’ll realize that your thinking is a bit delusional. I can’t really understand what you’re saying here.
The profile you responded to, "techbro", was made 3 years ago, has 102 karma (i.e. 1 upvote every 10 days, on average), and has never submitted anything - zero submitted articles or profiles. My profile was made 12 years ago and has 3,118 karma (approximately averaging an upvote every 1.4 days) including lots of submissions of stuff I made, for example this Show HN that I was pleased to see make it to the front page with 36 points and lots of positive comments: https://news.ycombinator.com/item?id=43141139 (the resource itself is currently offline as I've temporarily replaced my website with 50 reasons why it's wrong to disrupt communications between a husband and wife. I'll restore my website as soon as this issue is solved.)
Right up your alley: I've actually written a cryptographic case study of some of the dynamics of this, and I've just sent you a copy of it (you and I were in touch before; it was very well reviewed by professional cryptographers). In your reply to this, please acknowledge receipt of my email, and if you can, print it out as well, as it can become inaccessible later.
Of course, there are a lot of NSA-affiliated people who could come out of the woodwork to support parent's slander that I "sound paranoid and schizophrenic".
The reason they don't? They're witnesses in the FBI case and don't want to go to prison themselves. The FBI has already handwritten over 10,000 affidavits in this case. (They are writing by hand to avoid electronic tampering with evidence.)
I am not making a media story out of it yet (that would be the next step), so there are no articles about this.
My reason for not doing so is to avoid bringing extra attention to the case and simply to solve it in a straightforward and expedient manner.
CUDA optimization actually doesn’t suck that much. I think Nsight is amazing and super helpful for profiling and identifying bottlenecks in kernels.
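As a rough sketch of what that workflow can look like (assuming PyTorch, a CUDA-capable GPU, and Nsight Systems are installed; the matmul workload is just a hypothetical stand-in for whatever kernel you actually care about):

    import torch

    # Mark a region with NVTX so it shows up by name in the Nsight Systems timeline.
    def matmul_workload(n=4096):
        a = torch.randn(n, n, device="cuda")
        b = torch.randn(n, n, device="cuda")
        torch.cuda.nvtx.range_push("matmul")   # named range visible in the profile
        c = a @ b
        torch.cuda.nvtx.range_pop()
        torch.cuda.synchronize()               # ensure the kernel actually ran
        return c

    if __name__ == "__main__":
        matmul_workload()
        # Profile with: nsys profile -o report python this_script.py
        # then open report.nsys-rep in Nsight Systems and look for the "matmul" range.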
The problem is that their efficiency at converting electricity into projectile kinetic energy is really bad (like single digit percentages bad), getting electrical energy and power density is quite difficult in the first place (capacitors have abysmally bad energy density compared to gunpowder), and coils absolutely hate having their current changed quickly (which you need for this to work).
The problem with coil guns in particular is that the ferrous slug is drawn to the center of the magnetic field. The field has to be collapsed at the right time to avoid counterproductively sapping velocity from the slug.
Many designs that achieve respectable velocities use a multi-stage coil, which requires precise timing for each magnetic field, a lot of power, and high current capability. Generally, that means large batteries for a power source and large capacitors to feed the coils, which becomes heavy and expensive.
Even rifle variants rarely produce more energy than a .22 LR, a feat easily overshadowed by air guns several hundred years old.
The electromagnet isn't the issue, it's the capacitors to power the magnets. A coil gun capable of matching a small handgun would be too heavy to reasonably carry. At the scale where they could become competitive with a conventional gun, you have an artillery piece.
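A rough back-of-the-envelope sketch of why the capacitor bank dominates, using assumed ballpark figures (~160 J for a .22 LR, ~5% coil efficiency, ~500 J/kg for pulse capacitors), not measured data:

    # Ballpark assumptions, not measurements.
    MUZZLE_ENERGY_22LR_J = 160.0          # typical high-velocity .22 LR muzzle energy
    COILGUN_EFFICIENCY = 0.05             # "single digit percentages" from above
    CAP_SPECIFIC_ENERGY_J_PER_KG = 500.0  # optimistic pulse-capacitor figure (~0.5 J/g)

    stored_energy_j = MUZZLE_ENERGY_22LR_J / COILGUN_EFFICIENCY      # 3200 J per shot
    cap_mass_kg = stored_energy_j / CAP_SPECIFIC_ENERGY_J_PER_KG     # ~6.4 kg of capacitors

    print(f"energy the bank must store per shot: {stored_energy_j:.0f} J")
    print(f"capacitor mass for energy storage alone: {cap_mass_kg:.1f} kg")
    # Smokeless powder carries roughly 3 MJ/kg of chemical energy, so a .22 LR
    # cartridge does the same job with a fraction of a gram of propellant.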
I just have a hunch we're in early days still, even with Transformer architectures. The MLP (multi-layer perceptron) is such a simple mathematical structure, mostly doing linear work (a matrix multiply, i.e. tons of multiply-adds, followed by a squashing-type activation function), plus of course the attention-heads add-on from the Transformers paper (and other minor things). Ultimately it's a very easy-to-understand data structure, so it's hard for me to believe there's not massive leaps and bounds that we can take to gain orders of magnitude more performance just like the leap that the Transformers paper had.
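For what it's worth, here's a minimal numpy sketch of that structure: one MLP layer as a matrix multiply plus bias and a squashing activation (toy shapes, nothing from a real model):

    import numpy as np

    # One MLP layer: matrix multiply (many multiply-adds), bias add, squashing nonlinearity.
    def mlp_layer(x, W, b):
        return np.tanh(x @ W + b)   # tanh as the "squashing-type" activation

    rng = np.random.default_rng(0)
    x = rng.standard_normal((1, 16))    # one input vector of 16 features
    W = rng.standard_normal((16, 32))   # weight matrix
    b = np.zeros(32)                    # bias
    print(mlp_layer(x, W, b).shape)     # (1, 32)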
> We can take to gain orders of magnitude more performance just like the leap that the Transformers paper had.
AFAIK the most important benefit of transformers isn't their “performance” (in the sense of ability to perform their tasks) but their scalability, which comes from their ability to be trained and evaluated efficiently on big GPU clusters, something that isn't possible with recurrent neural networks.
And then, if I understood correctly, the benefit of state-space models is that you can train them in parallel and run them in a recurrent fashion, making inference cheaper than transformers, especially as the context size grows.
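A toy scalar sketch of that dual form, assuming a plain linear recurrence rather than any particular SSM paper's parameterization:

    import numpy as np

    # Toy scalar state-space model: h[t] = a*h[t-1] + b*x[t], y[t] = c*h[t].
    # The recurrent form is cheap per step at inference; the closed "convolution"
    # form computes the whole sequence at once, which is what makes training
    # parallel-friendly. Illustration only, not Mamba/S4 itself.
    a, b, c = 0.9, 0.5, 1.0
    x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
    T = len(x)

    # Recurrent form (sequential, O(1) state carried per step)
    h, y_rec = 0.0, []
    for t in range(T):
        h = a * h + b * x[t]
        y_rec.append(c * h)

    # Convolution form: y[t] = sum_k c * a**k * b * x[t-k], computable in parallel
    kernel = c * (a ** np.arange(T)) * b
    y_conv = [np.dot(kernel[: t + 1][::-1], x[: t + 1]) for t in range(T)]

    print(np.allclose(y_rec, y_conv))  # True: same outputs, two computation modes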
The biggest thing I had understood about the Transformers paper (Attention Is All You Need) is how the "attention head" vectors are wired up in such a way as to allow words to be "understood" in the proper context. In other words, "see Spot run" and "run a computer program" give dramatically different but specific contexts for the word "run".
It was also my understanding that without those attention heads even the scaling up to current parameter sizes we have today would not have ended up with the level of emergent intelligence that shocked the world with GPT 3.5. We needed both very large models and words put into semantic context in semantic space.
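A toy numpy sketch of that mechanism: scaled dot-product self-attention with the Q/K/V projections omitted and random vectors standing in for trained embeddings, just to show how the same word vector comes out differently in different contexts:

    import numpy as np

    # Each token's output is a context-dependent weighted mix of all tokens in the
    # sentence, so "run" ends up represented differently in different sentences.
    # Q/K/V projections are left out for brevity; random vectors, not a trained model.
    def self_attention(X):
        d = X.shape[-1]
        scores = X @ X.T / np.sqrt(d)                    # token-to-token similarities
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the context
        return weights @ X                               # context-mixed representations

    rng = np.random.default_rng(0)
    vocab = {w: rng.standard_normal(8) for w in
             ["see", "spot", "run", "a", "computer", "program"]}

    sent1 = np.stack([vocab[w] for w in ["see", "spot", "run"]])
    sent2 = np.stack([vocab[w] for w in ["run", "a", "computer", "program"]])

    run1 = self_attention(sent1)[2]   # "run" after attending to "see spot"
    run2 = self_attention(sent2)[0]   # "run" after attending to "a computer program"
    print(np.allclose(run1, run2))    # False: same input vector, different contexts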
Attention heads existed before transformers; they were used in recurrent neural networks (RNNs) to improve their performance. The paper is called “Attention is all you need” because transformers keep the attention heads while discarding the RNN part entirely.
Getting rid of the RNN vastly improved training scalability and allowed big players to start training enormous models on even more enormous training sets in ways that weren't possible with an RNN, AFAIK.
When discussing "Attention Heads" in the context of the Transformers Paper, there's no need to put the word "Self" in front of it, as in "Self-Attention". That's the context in which I used the word Attention above. Something similar to self-attention had pre-existed this paper, but not actual self-attention.
You're right that getting rid of recurrence was another innovation, but removing it was probably more of a hack to make things parallelizable than something architecturally justifiable from first principles (like self-attention is). There's definite "power" in recurrence (making it desirable); it's just too costly to run in LLMs because of compute cycles.
> removing it was probably more of a hack to make things parallelizable
But that's the entire point of it. Transformer-based LLMs are “more intelligent” precisely because this parallelization lets you make them bigger and train them on bigger datasets.
It's not just about size. Self-Attention is every bit as important as large size, because if we had the current large size but without Self-Attention, we wouldn't have the emergent intelligence. Also, "size" isn't even a new innovation. Self-Attention was a new innovation.
This doesn't match the common knowledge on the topic, which is that model size matters more than the architecture. And training-set size is even more important, which is why single-digit-billion-parameter models are stronger than hundred-billion-parameter ones from several years earlier, when “Chinchilla optimal training” was in fashion.
SSMs are literally the proof that all that really matters is training scalability.
The universal approximation theorem doesn't care about the architecture, after all.
If you parse my words a bit more carefully, you'll realize that to test my claim there's a simple thought experiment (or real experiment) you can do, which is this:
Take our "current large size" (my words from last post) LLMs, as they are currently today, and then simply remove the Self-Attention wiring, and see if that destroys the emergent intelligence aspect or not. I claim it would. But at the same time this doesn't mean you can just stick Self-Attention onto a small model and expect intelligence to once again emerge.
You are wildly overestimating the “emergent capabilities” of current models, and underestimating alternative architectures' (namely SSMs') performance at the same size.
Also, the performance of modern “small” models shows that your last sentence isn't really true either.
> wildly overestimating the “emergent capabilities”
How could I be "overestimating" the emergent capabilities when I never even quantified those capabilities other than to call them "emergent" and impressive?
> “small” models show that your last sentence isn't true either.
I never said that even a perfect architecture would make small models "intelligent". However to the extent that even smaller LLMs can exhibit surprising capabilities, that's more evidence IN FAVOR OF everything I've said, not evidence against.
EDIT: But in that last sentence (of my prior reply), by "small" I meant genuinely small, meaning non-LLM; you seem to have interpreted it as "a smaller LLM".
Even 1B-parameter models show “impressive capabilities” to anyone not accustomed to the current state of the art. And there are plenty of relatively small models that perform as well as ChatGPT 3.5 did when it was first released and felt like magic.
“All” that was needed to get there was “just” feeding it more data. The fact that we were actually able to train billion-parameter models on multiple trillions of tokens is the key property of transformers; there's no magic beyond that (it's already cool enough, though): it's not so much that they are more intelligent, it's simply that with them we can brute-force learning in a scalable fashion.
Yes, even the original Transformers model had only millions of parameters and nonetheless showed "impressive capabilities", because it also had Self-Attention.
If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.
There aren't many multi-billion-parameter non-transformer models because of path dependence, but that doesn't mean that only transformers can achieve this kind of result.
My statements (which you disagreed with, without exception) haven't been about Transformers vs. non-Transformers. Everything above has been about the importance of the Self-Attention part of it. We could remove Self-Attention from Transformers and still have a functional (but dumb) NN, and that was my point.
Your position was that the Self-Attention is a less important part (because UAT, yadda yadda), and my position was that it's the key ingredient. Every statement above that I made, that you called wrong, was correct. lol.
You are moving the goalpost. The discussion has always been about transformers vs non transformers.
You claimed that self attention was needed to achieve the level of intelligence that we've seen with GPT 3.5:
> without those attention heads even the scaling up to current parameter sizes we have today would not have ended up with the level of emergent intelligence that shocked the world with GPT 3.5. (Verbatim quote from you https://news.ycombinator.com/item?id=41986010)
This is the claim I've been disputing, by responding that the key feature of the intelligence of transformer models comes from their scalability. And now that we have alternatives that scale equally well (SSMs and RWKV), unsurprisingly we see them achieve the same level of reasoning ability.
> Every statement above that I made, that you called wrong, was correct. lol.
In the quote you're calling wrong (41986010), you're interpreting "scaling up" as "scaling up, including changing architecture". Scaling up transformers just means scaling up transformers and keeping everything else the same. In other words, you're interpreting "parameter size" as "parameter size, independent of architecture", whereas I meant the parameter size of a Transformer (in the context of with vs. without Self-Attention).
There's no straw man, and you are now at the point of trying to reinvent the definition of words in order to somehow “win the argument” without even respecting your own previous position. This behavior is legit pathetic; it's not an insult, it's a fact. Respect yourself.
A lot of people struggle with regular literacy despite being educated on it for their entire life. I don’t think more education will help these people.