>but thanks to a heroic effort of many people (and not me, really)...
That's funny because when I was reading through this I kept thinking, "Wow, this is a heroic effort."
It was a heroic individual effort that led to a heroic team effort. The team wouldn't have been assembled had the individual not raised the alarm and focused the attention. Then the team was able to handle it through their collective efforts. Doubly inspiring.
I've observed this many times over my career: a thorny problem that no one wants to touch or look at sits there for ages until I take a run at it, and as soon as I do, everyone piles in and helps.
Reminds me of the psychology behind crowdmatching a la Snowdrift.coop. If no one else is working on the problem I also don’t feel inclined to; it doesn’t seem worth it because the chance of success is low. But if I know that other people are in it with me I’m suddenly motivated because I can expect my contributions to help actually solve the problem.
It is not that the commit was wrong. It was the right commit, in the sense that it did expose and trigger the bug. OP is being too modest :)
But the root cause of the bug was in another earlier commit. In a complex system such as the linux kernel, there could be multiple contributing factors. Both the "introducing a sleep" and the "root bug" are contributing factors, but we as engineers tend to think of the "root bug" as the actual problem, because we like more elegant explanations. Also, the "root bug" is likely something that goes against the ethos of the software project, and "common sense best practices", to a much larger degree than "introducing a sleep" is.
There was a rare race condition [1] bug in the linux kernel, and they didn't know precisely where it was introduced (which made it difficult to know what to change to fix it). They knew it was introduced at some point between two versions, but there were 52,363 code changes between those two versions. Because the bug was rare, you would have to test any given code change multiple times (thousands of times, in their case) to be confident that the bug was still present (if working backwards), or still absent (if working forwards). A way to do that faster is a binary search [2], or commit bisection.
Basically, you know commit 0 is fine, and change 52,363 is bad, so you run 1000 tests on change 26,180 to cut in half the number of commits that could have introduced the problem. If the bug is still there, then you run 1000 tests on change 13,090; if the bug is not in change 26,180, then you run 1000 tests on change 39,270. Eventually (after 15 or 16 iterations) you'll narrow down the problem to a single change which introduced the problem. Unfortunately code is complicated, so the first time they ran that procedure, their tests didn't find the right bug, so they did it again but slightly differently.
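To make that procedure concrete, here is a minimal sketch in Python (purely illustrative; the actual effort used git bisect and automated boot testing, and the 1% per-boot reproduction rate below is an assumed number, not a measured one):

    # Sketch of bisection with a rare, flaky failure (illustrative only).
    import random

    REPRO_RUNS = 1000  # boots per candidate commit, as described above

    def bug_reproduces(commit_index, first_bad):
        """Hypothetical stand-in for 'boot this kernel and watch for the hang'.
        The bug is rare, so even a 'bad' commit only fails occasionally."""
        return commit_index >= first_bad and random.random() < 0.01

    def test_commit(commit_index, first_bad):
        """Judge a commit 'bad' if any of the repeated boots reproduces the bug."""
        return any(bug_reproduces(commit_index, first_bad) for _ in range(REPRO_RUNS))

    def bisect(n_commits, first_bad):
        good, bad = 0, n_commits          # commit 0 known good, the last known bad
        while bad - good > 1:
            mid = (good + bad) // 2
            if test_commit(mid, first_bad):
                bad = mid                 # bug present: culprit is at or before mid
            else:
                good = mid                # bug absent: culprit is after mid
        return bad                        # first commit at which the bug reproduces

    print(bisect(52_363, first_bad=26_181))  # ~16 rounds of 1000 boots each

The catch, as the rest of the thread points out, is that a single round with a false "good" sends the search down the wrong half entirely.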
Furthermore, in this case there were actually two commits related to the bug. An older commit that introduced the race condition, and a second commit that triggered it (by subtly changing how long a certain function took to run).
The long bisection process identified the second commit but it took further work to identify the true cause of the bug.
Unfortunately, when the problem is in kernel boot software, that is not a practical solution for obvious reasons, and you're left using more basic techniques like running a binary search on the commit history.
Based on my reading, under some circumstances the scheduler wasn't dedicating large enough chunks of time to a particular boot process for it to actually make progress. The fix appears to have been to change how the time chunk size was calculated.
> You see there are 52,363 commits between those two kernels,
Would it have been possible to generate a list of commits on a timeline and bisect using a method similar to binary search? Or was that already done, and 293k boots were still needed?
I am genuinely curious.
That’s why I always push telemetry onto a team. Concurrency and resource bugs are extremely expensive per iteration. Anything that can hint that a problem you are certain happened somewhere in a two-month period was more likely to have happened in a particular week saves you a ton of work.
The human brain struggles when it has to execute almost, but not quite, the same exercise over and over again. It becomes a blur, and after a while you can’t recall whether you ran all of the steps this cycle (if you can’t automate the test well), and all it takes is typing “bad” when you meant “good” to end in failure. That doesn’t sound very likely, but when you’re testing a negative it’s much easier to get your wires crossed.
Expanding on other comments: binary search is extremely sensitive to false positives and false negatives… false information generally. While other approaches will still put you in the right ballpark even with some amount of incorrect information, a binary search with a single flipped bit will consistently land you in entirely unrelated areas, so you need to be _damn_ sure your “git bisect good/bad” is correct.
There is `git bisect skip` (https://git-scm.com/docs/git-bisect#_bisect_skip), which lets you cope with a testing process that sometimes doesn't yield a true/false result. But it's more useful for untestable commits than for inconclusive results like you'd get from a single boot here.
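For what it's worth, `git bisect run` can automate the good/bad calls: it treats exit code 0 as good, 125 as skip, and other non-zero codes below 128 as bad. A rough sketch of a driver script, where `boot_and_check.sh` is a hypothetical stand-in for "build, boot, and watch for the hang":

    #!/usr/bin/env python3
    # Sketch of a test script for `git bisect run python3 test.py`.
    # Assumes a hypothetical ./boot_and_check.sh that boots the current
    # checkout and exits non-zero when the hang reproduces.
    import subprocess, sys

    BOOTS = 1000  # repeat enough times to be reasonably sure the bug is absent

    for _ in range(BOOTS):
        result = subprocess.run(["./boot_and_check.sh"])
        if result.returncode != 0:
            sys.exit(1)   # bug reproduced: git bisect marks this commit "bad"
    sys.exit(0)           # never reproduced: marked "good"
    # Exiting 125 instead would tell git bisect to skip an untestable commit,
    # e.g. one that doesn't build.

Even with that automation, the confidence problem remains: "never reproduced in N boots" is only probabilistic evidence of "good".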
In the other post there was an interesting discussion about different approaches to searching for a bug (other than straight bisection) when your test is unreliable: https://news.ycombinator.com/item?id=36327263
Race conditions in places like this are exceedingly hard to write reliable tests for. It may take one or two boots, or dozens, or thousands; or you may just get insanely unlucky, and whatever arbitrary number of boots you do to try to reproduce it is simply not enough. It's hard to have any level of confidence, in many cases.
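To put rough numbers on that: if the race reproduces with probability p on each boot, the chance of missing it across n independent boots is (1-p)^n, so the boots needed for a given confidence grow quickly as p shrinks (the p values below are assumptions for illustration, not measurements from this bug):

    import math

    def boots_needed(p, confidence=0.999):
        """Boots required so that P(at least one repro) >= confidence,
        assuming independent boots that each hit the race with probability p."""
        return math.ceil(math.log(1 - confidence) / math.log(1 - p))

    for p in (0.1, 0.01, 0.001):
        print(f"p={p}: {boots_needed(p)} boots")
    # p=0.1: 66 boots, p=0.01: 688 boots, p=0.001: 6905 boots

And that is per bisection step, which is why the total boot count balloons so quickly.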
There are acceptance tests, integration tests, and many other types. I'm similarly not familiar with kernel dev; I have studied both operating systems and distributed systems as well as CS. But the build checked all the right boxes.