Yeah, this is a bug in their quick example code. The basic idea is that you give each thread a start and stop offset with simple byte division. Each thread then handles the first line that starts after its start offset up to, but excluding, the first line that starts after its end offset. There will be a small amount of "overread" at the boundaries, but that's probably negligible.
let chunk_size = buffer.length / threads;
let start_byte = chunk_size * bx;
// The last thread also picks up the remainder left over by the integer division.
let end_byte = (bx == threads - 1) ? buffer.length : start_byte + chunk_size;

let i = start_byte;
// Every thread except the first skips the partial row at its start boundary;
// the previous thread finishes that row.
if (bx != 0) {
    while (buffer[i++] != '\n') {} // Skip until the first new row.
}
while (i < end_byte) {
    let row_start = i;
    let row_end = i;
    while (buffer[row_end] != '\n') { row_end++; } // row_end lands on the '\n'.
    process_row(buffer, row_start, row_end);
    i = row_end + 1; // Step past the '\n' to the start of the next row.
}
Basically, you first roughly split the data with simple byte division, and then each thread aligns itself to the underlying row boundaries. This alignment can be done in parallel across all threads rather than being part of a serial step that examines every byte before the parallel work starts. You need to take care that the per-thread alignment doesn't skip or duplicate any rows, but for simple data formats like this I don't think that should be a major difficulty.
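A minimal CUDA sketch of the same idea, in case it helps. Everything beyond the pseudocode above is my assumption: the text is already resident in device memory, process_row stands in for whatever per-row work is actually done, and one logical chunk maps to one CUDA thread.

#include <cstddef>

// Placeholder for whatever per-row work the real kernel does.
__device__ void process_row(const char *buf, size_t row_start, size_t row_end) {}

__global__ void parse_rows(const char *buf, size_t len, size_t total_threads) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total_threads) return;

    size_t chunk = len / total_threads;
    size_t start = tid * chunk;
    // The last thread also takes the remainder left over by the division.
    size_t end = (tid == total_threads - 1) ? len : start + chunk;

    size_t i = start;
    // Every thread except the first skips the partial row at its start
    // boundary; the previous thread finishes that row (the "overread").
    if (tid != 0) {
        while (i < len && buf[i++] != '\n') {}
    }

    while (i < end) {
        size_t row_start = i;
        size_t row_end = i;
        while (row_end < len && buf[row_end] != '\n') row_end++;
        process_row(buf, row_start, row_end); // [row_start, row_end) excludes the '\n'
        i = row_end + 1;
    }
}

Launch it as something like parse_rows<<<blocks, 256>>>(d_buf, len, (size_t)blocks * 256); each thread does its own alignment, so no serial scan over the buffer is needed before the parallel work starts.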
I tried this out today. While it works (no pre-split step is required any more), it makes the CUDA kernel run ridiculously slow. I believe it's because of the while loop:
while (i < end_byte) {
Compared to my original solution, it introduces 50x as many divergent branches (according to ncu profiling)!
The only difference between the two is that the for loop iterates a deterministic number of times, whereas this while loop iterates an unknown number of times (unknown at kernel launch time).
I admit, I don't perfectly understand the reason. But this is the most likely culprit.
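To make the comparison concrete, here is a sketch of what a pre-split, fixed-iteration kernel could look like. This is not my exact original code; row_offsets, the round-robin row assignment, and the assumption that every row (including the last) ends in '\n' are all illustrative.

#include <cstddef>

// Placeholder for the real per-row work.
__device__ void process_row(const char *buf, size_t row_start, size_t row_end) {}

// Pre-split variant: row_offsets has num_rows + 1 entries, where row_offsets[r]
// is the byte offset of row r's first character and the final entry is the
// buffer length. With boundaries known up front, the outer loop's trip count
// depends only on num_rows and total_threads, not on where newlines fall.
__global__ void parse_rows_presplit(const char *buf,
                                    const size_t *row_offsets,
                                    size_t num_rows,
                                    size_t total_threads) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total_threads) return;

    // Round-robin assignment keeps per-thread row counts within one of each
    // other, so threads in the same warp run nearly identical loops.
    for (size_t r = tid; r < num_rows; r += total_threads) {
        size_t row_start = row_offsets[r];
        size_t row_end = row_offsets[r + 1] - 1; // drop the trailing '\n'
        process_row(buf, row_start, row_end);
    }
}

Here the only data-dependent work left is inside process_row itself, which would be consistent with the pre-split version showing far fewer divergent branches in ncu.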