>but thanks to a heroic effort of many people (and not me, really)...
That's funny because when I was reading through this I kept thinking, "Wow, this is a heroic effort."
It was a heroic individual effort that led to a heroic team effort. The team wouldn't have been assembled had the individual not raised the alarm and focused the attention. Then the team was able to handle it through their collective efforts. Doubly inspiring.
I've observed this many times over my career: a thorny problem that no one wants to touch or look at sits there for ages until I take a run at it, and as soon as I do, everyone piles in and helps.
Reminds me of the psychology behind crowdmatching a la Snowdrift.coop. If no one else is working on the problem I also don’t feel inclined to; it doesn’t seem worth it because the chance of success is low. But if I know that other people are in it with me I’m suddenly motivated because I can expect my contributions to help actually solve the problem.
It is not that the commit was wrong. It was the right commit, in the sense that it did expose and trigger the bug. OP is being too modest :)
But the root cause of the bug was in another earlier commit. In a complex system such as the linux kernel, there could be multiple contributing factors. Both the "introducing a sleep" and the "root bug" are contributing factors, but we as engineers tend to think of the "root bug" as the actual problem, because we like more elegant explanations. Also, the "root bug" is likely something that goes against the ethos of the software project, and "common sense best practices", to a much larger degree than "introducing a sleep" is.
There was a rare race condition [1] bug in the linux kernel, and they didn't know precisely where it was introduced (which made it difficult to know what to change to fix it). They knew it was introduced at some point between two versions, but there were 52,363 code changes between those two versions. Because the bug was rare, you would have to test any given code change multiple times (thousands of times, in their case) to be confident that the bug was still present (if working backwards), or still absent (if working forwards). A way to do that faster is a binary search [2], or commit bisection.
Basically, you know commit 0 is fine, and change 52,363 is bad, so you run 1000 tests on change 26,180 to cut in half the number of commits that could have introduced the problem. If the bug is still there, then you run 1000 tests on change 13,090; if the bug is not in change 26,180, then you run 1000 tests on change 39,270. Eventually (after 15 or 16 iterations) you'll narrow down the problem to a single change which introduced the problem. Unfortunately code is complicated, so the first time they ran that procedure, their tests didn't find the right bug, so they did it again but slightly differently.
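To make that procedure concrete, here is a minimal sketch in Python (purely illustrative; the actual effort used git bisect and automated boot testing, and the 1% per-boot reproduction rate below is an assumed number, not a measured one):

    # Sketch of bisection with a rare, flaky failure (illustrative only).
    import random

    REPRO_RUNS = 1000  # boots per candidate commit, as described above

    def bug_reproduces(commit_index, first_bad):
        """Hypothetical stand-in for 'boot this kernel and watch for the hang'.
        The bug is rare, so even a 'bad' commit only fails occasionally."""
        return commit_index >= first_bad and random.random() < 0.01

    def test_commit(commit_index, first_bad):
        """Judge a commit 'bad' if any of the repeated boots reproduces the bug."""
        return any(bug_reproduces(commit_index, first_bad) for _ in range(REPRO_RUNS))

    def bisect(n_commits, first_bad):
        good, bad = 0, n_commits          # commit 0 known good, the last known bad
        while bad - good > 1:
            mid = (good + bad) // 2
            if test_commit(mid, first_bad):
                bad = mid                 # bug present: culprit is at or before mid
            else:
                good = mid                # bug absent: culprit is after mid
        return bad                        # first commit at which the bug reproduces

    print(bisect(52_363, first_bad=26_181))  # ~16 rounds of 1000 boots each

The catch, as the rest of the thread points out, is that a single round with a false "good" sends the search down the wrong half entirely.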
Furthermore, in this case there were actually two commits related to the bug. An older commit that introduced the race condition, and a second commit that triggered it (by subtly changing how long a certain function took to run).
The long bisection process identified the second commit but it took further work to identify the true cause of the bug.
Unfortunately, when the problem is in kernel boot software, that is not a practical solution for obvious reasons, and you're left using more basic techniques like running a binary search on the commit history.
Based on my reading, under some circumstances the scheduler wasn't dedicating large enough chunks of time to a particular boot process for it to actually make progress. The fix appears to have been to change how the time chunk size was calculated.
> You see there are 52,363 commits between those two kernels,
Would it have been possible to generate a list of commits on a timeline and bisect using a method similar to binary search? Or was that already done, and 293k boots were still needed?
I am genuinely curious.
That’s why I always push telemetry onto a team. Concurrency and resource bugs are extremely expensive per iteration. Anything that can hint that a problem you are certain happened somewhere in a two-month period was more likely to have happened in a particular week saves you a ton of work.
The human brain struggles when it has to execute almost, but not quite, the same exercise over and over again. It becomes a blur, and after a while you can’t recall whether you ran all of the steps this cycle (if you can’t automate the test well), and all it takes is typing “bad” when you meant “good” to end in failure. That doesn’t sound very likely, but when you’re testing a negative it’s much easier to get your wires crossed.
Expanding on other comments: binary search is extremely sensitive to false positives and false negatives… false information generally. While other approaches will still put you in the right ballpark even with some amount of incorrect information, a binary search with a single flipped bit will consistently land you in entirely unrelated areas, so you need to be _damn_ sure your “git bisect good/bad” is correct.
There is `git bisect skip` (https://git-scm.com/docs/git-bisect#_bisect_skip), which lets you cope with a testing process that sometimes doesn't yield a true/false result. But it's more useful for untestable commits than for inconclusive results like you'd get from a single boot here.
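For what it's worth, `git bisect run` can automate the good/bad calls: it treats exit code 0 as good, 125 as skip, and other non-zero codes below 128 as bad. A rough sketch of a driver script, where `boot_and_check.sh` is a hypothetical stand-in for "build, boot, and watch for the hang":

    #!/usr/bin/env python3
    # Sketch of a test script for `git bisect run python3 test.py`.
    # Assumes a hypothetical ./boot_and_check.sh that boots the current
    # checkout and exits non-zero when the hang reproduces.
    import subprocess, sys

    BOOTS = 1000  # repeat enough times to be reasonably sure the bug is absent

    for _ in range(BOOTS):
        result = subprocess.run(["./boot_and_check.sh"])
        if result.returncode != 0:
            sys.exit(1)   # bug reproduced: git bisect marks this commit "bad"
    sys.exit(0)           # never reproduced: marked "good"
    # Exiting 125 instead would tell git bisect to skip an untestable commit,
    # e.g. one that doesn't build.

Even with that automation, the confidence problem remains: "never reproduced in N boots" is only probabilistic evidence of "good".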
In the other post there was an interesting discussion about different approaches to searching for a bug (other than straight bisection) when your test is unreliable: https://news.ycombinator.com/item?id=36327263
Race conditions in places like this are exceedingly hard to write reliable tests for. It may take one or two boots, or dozens, or thousands; or you may just get insanely unlucky, and whatever arbitrary number of boots you do to try to reproduce it is simply not enough. It's hard to have any level of confidence, in many cases.
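To put rough numbers on that: if the race reproduces with probability p on each boot, the chance of missing it across n independent boots is (1-p)^n, so the boots needed for a given confidence grow quickly as p shrinks (the p values below are assumptions for illustration, not measurements from this bug):

    import math

    def boots_needed(p, confidence=0.999):
        """Boots required so that P(at least one repro) >= confidence,
        assuming independent boots that each hit the race with probability p."""
        return math.ceil(math.log(1 - confidence) / math.log(1 - p))

    for p in (0.1, 0.01, 0.001):
        print(f"p={p}: {boots_needed(p)} boots")
    # p=0.1: 66 boots, p=0.01: 688 boots, p=0.001: 6905 boots

And that is per bisection step, which is why the total boot count balloons so quickly.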
There are acceptance tests, integration tests, and many other types. I'm similarly not familiar with kernel dev; I have studied both operating systems and distributed systems as well as CS. But the build checked all the right boxes.