It does; radix sort is in fact O(N) if the size of the universe of values counts as a constant. It’s just slow in practice.
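To make the constant explicit, here's a minimal LSD radix sort sketch (Python; the 32-bit key width and base 256 are my choices, not anything from upthread). Four stable byte passes give O(4 * (N + 256)) = O(N) total work, with the universe size living entirely in the constant:

    # Minimal LSD radix sort over 32-bit unsigned keys, base 256.
    # With the key width fixed, it's four stable counting-sort passes,
    # i.e. O(N) with the "universe size" hidden in the constant.
    def radix_sort_u32(keys):
        for shift in range(0, 32, 8):
            buckets = [[] for _ in range(256)]
            for k in keys:
                buckets[(k >> shift) & 0xFF].append(k)   # stable per pass
            keys = [k for b in buckets for k in b]
        return keys

    print(radix_sort_u32([3_000_000_000, 42, 7, 42, 0]))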
The definition of the machine for which the O(N log N) bound is proved is very delicate: you have to allow O(1) operations on an arbitrarily large set of values, but disallow encoding tricks that pack multiple values into one and then manipulate them unrealistically cheaply with those operations. In particular, the machine must not be able to do arbitrary arithmetic.
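To make the forbidden trick concrete, here's a sketch (mine, not from any paper; the 32-bit field width is arbitrary) of what goes wrong if arithmetic on arbitrarily wide values is unit cost: the entire counting-sort histogram packs into one integer, each key costs a single shift-and-add, and the "sort" does no comparisons at all.

    # If wide-integer arithmetic were O(1), one big integer holds the whole
    # counting-sort histogram, 32 bits per counter (assumes counts < 2**32).
    def packed_counting_sort(keys, universe):
        FIELD = 32
        acc = 0
        for k in keys:
            acc += 1 << (k * FIELD)   # one "O(1)" add per key -- the cheat
        out = []
        for v in range(universe):
            count = (acc >> (v * FIELD)) & ((1 << FIELD) - 1)
            out.extend([v] * count)
        return out

    print(packed_counting_sort([3, 1, 4, 1, 5, 9, 2, 6], universe=10))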
Kiiinda. Two-element swaps are already a stretch for merge sort, especially the O(log N)-space linked-list version (sketch below), let alone search trees and so on. At some point you also need to make sure you can’t sneak arbitrary computation into the (necessarily unlimited-magnitude) array index.
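Here's roughly what I mean (a toy Python sketch; the Node class and names are mine): linked-list merge sort never swaps two elements, it only relinks nodes, and its only extra space is the O(log N) recursion stack, so a cost model that counts swaps says almost nothing about it.

    class Node:
        def __init__(self, val, nxt=None):
            self.val, self.nxt = val, nxt

    def merge_sort(head):
        if head is None or head.nxt is None:
            return head
        # Split in half with slow/fast pointers.
        slow, fast = head, head.nxt
        while fast and fast.nxt:
            slow, fast = slow.nxt, fast.nxt.nxt
        mid, slow.nxt = slow.nxt, None
        left, right = merge_sort(head), merge_sort(mid)
        # Merge by relinking -- no element is ever moved or swapped.
        dummy = tail = Node(None)
        while left and right:
            if left.val <= right.val:
                tail.nxt, left = left, left.nxt
            else:
                tail.nxt, right = right, right.nxt
            tail = tail.nxt
        tail.nxt = left or right
        return dummy.nxt

    # Usage: build 4 -> 2 -> 3 -> 1, sort, print 1 2 3 4.
    node = merge_sort(Node(4, Node(2, Node(3, Node(1)))))
    while node:
        print(node.val)
        node = node.nxt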
I mean, it depends. In Unicode normalization you have to do a stable sort of an arbitrary number of values (code points) that can only ever map to a small finite set (< 256) of sort keys (combining classes). Insertion sort is the best choice for ordinary inputs, but for adversarial ones a counting sort is probably your best bet (sketch below).
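For illustration, a sketch of that counting-sort fallback (the function name is mine, and real canonical ordering only reorders maximal runs of nonzero-class marks rather than the whole string, but the bucketing idea is the same): one stable pass over 256 buckets, O(N + 256) no matter how adversarial the input is.

    import unicodedata

    # Stable counting sort of code points by canonical combining class.
    def sort_by_ccc(codepoints, ccc):
        buckets = [[] for _ in range(256)]
        for cp in codepoints:          # appending in input order keeps it stable
            buckets[ccc(cp)].append(cp)
        return [cp for b in buckets for cp in b]

    # q + dot below (ccc 220) + dot above (ccc 230)
    text = "q\u0323\u0307"
    print(sort_by_ccc(list(text), unicodedata.combining))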
It looks like it's rather sensitive to the input distribution. The claim in the abstract is that it's O(1) "for the priority increment distributions recently considered by Jones in his review article." The conclusion gives a bit more detail.