Yeah, this is a bug in their quick example code. The basic idea is that you give each thread a start and stop offset with simple byte division. Each thread then handles the first line that starts after its start offset up to, but excluding, the first line that starts after its end offset. There will be a small amount of "overread" at the boundaries, but that's probably negligible.
let chunk_size = buffer.length / threads;
let start_byte = chunk_size * bx;
// The last thread also picks up the remainder left over by the integer division.
let end_byte = (bx == threads - 1) ? buffer.length : start_byte + chunk_size;

let i = start_byte;
// Every thread except the first skips the partial row at its start boundary;
// the previous thread finishes that row.
if (bx != 0) {
    while (buffer[i++] != '\n') {} // Skip until the first new row.
}
while (i < end_byte) {
    let row_start = i;
    let row_end = i;
    while (buffer[row_end] != '\n') { row_end++; } // row_end lands on the '\n'.
    process_row(buffer, row_start, row_end);
    i = row_end + 1; // Step past the '\n' to the start of the next row.
}
Basically, you first roughly split the data with simple byte division, and then each thread aligns itself to the underlying row boundaries. This alignment can be done in parallel across all threads rather than being part of a serial step that examines every byte before the parallel work starts. You need to take care that the per-thread alignment doesn't skip or duplicate any rows, but for simple data formats like this I don't think that should be a major difficulty.
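A minimal CUDA sketch of the same idea, in case it helps. Everything beyond the pseudocode above is my assumption: the text is already resident in device memory, process_row stands in for whatever per-row work is actually done, and one logical chunk maps to one CUDA thread.

#include <cstddef>

// Placeholder for whatever per-row work the real kernel does.
__device__ void process_row(const char *buf, size_t row_start, size_t row_end) {}

__global__ void parse_rows(const char *buf, size_t len, size_t total_threads) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total_threads) return;

    size_t chunk = len / total_threads;
    size_t start = tid * chunk;
    // The last thread also takes the remainder left over by the division.
    size_t end = (tid == total_threads - 1) ? len : start + chunk;

    size_t i = start;
    // Every thread except the first skips the partial row at its start
    // boundary; the previous thread finishes that row (the "overread").
    if (tid != 0) {
        while (i < len && buf[i++] != '\n') {}
    }

    while (i < end) {
        size_t row_start = i;
        size_t row_end = i;
        while (row_end < len && buf[row_end] != '\n') row_end++;
        process_row(buf, row_start, row_end); // [row_start, row_end) excludes the '\n'
        i = row_end + 1;
    }
}

Launch it as something like parse_rows<<<blocks, 256>>>(d_buf, len, (size_t)blocks * 256); each thread does its own alignment, so no serial scan over the buffer is needed before the parallel work starts.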
I tried this out today. While it works (no pre-split step is required any more), it makes the CUDA kernel run ridiculously slow. I believe it's because of the while loop:
while (i < end_byte) {
Compared to my original solution, it introduces 50x as many divergent branches (according to ncu profiling)!
The only difference between the two is that the for loop iterates a deterministic number of times, whereas this while loop iterates an unknown number of times (unknown at kernel launch time).
I admit, I don't perfectly understand the reason. But this is the most likely culprit.
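To make the comparison concrete, here is a sketch of what a pre-split, fixed-iteration kernel could look like. This is not my exact original code; row_offsets, the round-robin row assignment, and the assumption that every row (including the last) ends in '\n' are all illustrative.

#include <cstddef>

// Placeholder for the real per-row work.
__device__ void process_row(const char *buf, size_t row_start, size_t row_end) {}

// Pre-split variant: row_offsets has num_rows + 1 entries, where row_offsets[r]
// is the byte offset of row r's first character and the final entry is the
// buffer length. With boundaries known up front, the outer loop's trip count
// depends only on num_rows and total_threads, not on where newlines fall.
__global__ void parse_rows_presplit(const char *buf,
                                    const size_t *row_offsets,
                                    size_t num_rows,
                                    size_t total_threads) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total_threads) return;

    // Round-robin assignment keeps per-thread row counts within one of each
    // other, so threads in the same warp run nearly identical loops.
    for (size_t r = tid; r < num_rows; r += total_threads) {
        size_t row_start = row_offsets[r];
        size_t row_end = row_offsets[r + 1] - 1; // drop the trailing '\n'
        process_row(buf, row_start, row_end);
    }
}

Here the only data-dependent work left is inside process_row itself, which would be consistent with the pre-split version showing far fewer divergent branches in ncu.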