
Thanks for reading! The two stories are of course deeply intertwined: we wouldn’t have found the new side channel without the cautionary tale about machine learning.

But the finding about ML misinterpretation is particularly notable because it calls a lot of existing computer architecture research into question. In the past, attacks like this were very difficult to pull off without an in-depth understanding of the side channel being exploited. But ML models (in this case, an LSTM) go well beyond simple “statistics”: they unlock much greater accuracy, making it much easier to develop powerful attacks that exploit side channels that aren’t really understood. And a lot of ML-assisted attacks are created in this fashion today: the Shusterman et al. paper alone has almost 200 citations, a huge number for a computer architecture paper.
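To make that concrete, the recipe behind most of these attacks is roughly the following (a minimal PyTorch sketch, not the exact model from any particular paper; the shapes and names are made up):

    import torch
    import torch.nn as nn

    # Sketch: classify which website produced a side-channel trace.
    # Each trace is a sequence of per-interval measurements, e.g. how many
    # times a counter was incremented during each sampling window.
    class TraceClassifier(nn.Module):
        def __init__(self, n_sites, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_sites)

        def forward(self, x):            # x: (batch, time, 1)
            _, (h, _) = self.lstm(x)     # final hidden state summarizes the trace
            return self.fc(h[-1])        # logits over candidate websites

The classifier never needs to know what the measurements actually mean, which is exactly the danger: it will latch onto whatever signal separates the classes, whether that comes from the cache or from something else entirely, like interrupts.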

The point of publishing this kind of research is to better understand our systems so we can build stronger defenses — the cost of getting this wrong and misleading the community is pretty high. And this would technically still be true even if we ultimately found that the cache was responsible for the prior attack. But of course, it helps that we discovered a new side channel along the way — this really drove our point home. I probably could have emphasized this more in my blogpost.


Websites doing this would have to be careful about it: they might become the only website triggering a lot of interrupts randomly, which then makes them easy to identify.

Our countermeasure which triggers interrupts randomly is implemented as a browser extension, the source code for which is available here: https://github.com/jackcook/bigger-fish

I'm not sure I would recommend it for daily use, though; I think our tests showed it slowed page load times by about 10%.


Thank you! Really appreciate it


Yes you're spot on, the nonlinearities come from the full Mamba blocks, which I left out of this post for simplicity/to focus on the bigger ideas the paper introduced. You can see it marked by the "X" on the right-most part of Figure 3 in the Mamba paper: https://arxiv.org/abs/2312.00752
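For anyone curious, the gating path looks roughly like this in code (a simplified sketch that leaves out the convolution, normalization, and residual connection; not the paper's actual implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Simplified Mamba block: the SSM itself is linear in its state, but
    # the block wraps it in SiLU activations and a multiplicative gate
    # (the "X" in Figure 3).
    class MambaBlockSketch(nn.Module):
        def __init__(self, d_model, d_inner):
            super().__init__()
            self.in_proj = nn.Linear(d_model, 2 * d_inner)
            self.out_proj = nn.Linear(d_inner, d_model)

        def forward(self, x, ssm):  # ssm: the selective SSM, passed as a callable
            x, z = self.in_proj(x).chunk(2, dim=-1)
            y = ssm(F.silu(x))                   # main path
            return self.out_proj(y * F.silu(z))  # gate: the nonlinearity in question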


Thank you for the kind words! I think it’s mostly to reduce complexity during training. Here’s an excerpt from page 9 of the Mamba paper:

“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via Ā = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”
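For reference, discretization (4) can be sketched like this (shapes are made up; Mamba uses zero-order hold for Ā and a simplified Euler step for B̄):

    import torch

    def discretize(A, B, delta):
        # A: (d_inner, d_state), B: (batch, seq, d_state), delta: (batch, seq, d_inner)
        A_bar = torch.exp(delta.unsqueeze(-1) * A)    # Ā = exp(∆A), zero-order hold
        B_bar = delta.unsqueeze(-1) * B.unsqueeze(2)  # B̄ ≈ ∆B, simplified Euler step
        return A_bar, B_bar  # both: (batch, seq, d_inner, d_state)

Since ∆ is computed from the input, Ā and B̄ inherit that input dependence, which is why selectivity in ∆ alone is enough.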


When I read the paper, I thought the idea was that changing \Delta permits the model to learn different things over different time scales. As you quoted, it's "the main source of improvement".

I don't have an LLM background, just controls, so I might be wrong.


Wow, it was quite a surprise to wake up and see this post near the top of HN! I wrote the post, and I'm happy to answer questions if anyone is wondering about any details.


Well, that was pretty cool. Plus, I got to learn about the way processes communicate via XPC, and that opened up a whole new rabbit hole!


Yes, you're right, I should have mentioned it in the post, but I used pure greedy sampling for the GPT-2 outputs since I couldn't do anything but that for the Apple model. So temperature was set to zero, and there was no repetition penalty.


I used greedy sampling (temperature 0) for all of them. Since I didn't have access to logits/probabilities for Apple's model, I wasn't able to do anything else in a way that would be fair.
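Concretely, the GPT-2 side was just greedy decoding, something like this (a minimal sketch with Hugging Face transformers, not my exact script):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("I'll see you at the", return_tensors="pt")

    # do_sample=False picks the argmax token at every step: pure greedy
    # decoding, equivalent to temperature 0, with no repetition penalty.
    outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))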


Some context: In the upcoming versions of macOS and iOS, Apple is including a predictive text model which offers suggestions while you type, and which they’ve said is a "transformer model". I managed to find some details about this model, including its topology (which looks a lot like GPT-2) and its tokenizer, and I was even able to peek in and see several of its top predictions while typing!

Hopefully this can give some insight into some of the trade-offs that Apple went through to put a model on every iPhone and MacBook — it’s small, it has a pretty narrow scope, and it’s not very capable on its own.



The emoji interest me. I start almost everything I post on Mastodon with an emoji. I used to look these up in Emojipedia, but my iPad has a lot of memory pressure and I'll lose what I'm working on in my YOShInOn window if I switch between too many tabs, so now I have a dropdown list of all the emoji I've actually used and only need to cut-n-paste something that isn't in the list.

Early on I thought the emoji would be associated with a clean classification (say, a ringed planet for astronomy), but I found I wasn't really doing that (I used two people holding hands for a post about a possible companion for a black hole). From the very beginning I thought about making something that automatically suggests an emoji, but I've yet to do anything about it: the existing models I have would treat it as a many-shot problem based on emoji I've used before, but I've also thought about making something trained on definitions or plain-text descriptions of the emoji and related concepts.
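If I ever get around to the description-based version, it could be as simple as nearest-neighbor search over embedded descriptions (a sketch; the model choice and descriptions are placeholders):

    from sentence_transformers import SentenceTransformer
    import numpy as np

    # Placeholder descriptions; a real version would cover the full emoji set.
    descriptions = {
        "🪐": "ringed planet, astronomy, space",
        "🧑‍🤝‍🧑": "two people holding hands, companionship, pairs",
    }

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emoji = list(descriptions)
    vecs = model.encode(list(descriptions.values()), normalize_embeddings=True)

    def suggest(post: str) -> str:
        # With normalized vectors, cosine similarity is just a dot product
        post_vec = model.encode([post], normalize_embeddings=True)[0]
        return emoji[int(np.argmax(vecs @ post_vec))]

    print(suggest("a possible companion for a black hole"))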

