Since you identify an instrument or voice by its formants (which sit at more or less fixed frequencies), it's unlikely to yield good results over such a large range.
I disagree. Sure, a naive approach wouldn't work (shifting everything uniformly), but everyone's voice covers multiple octaves, so I'm sure plenty of people already know what changes need to happen if you sing in C2 but want to transpose it to C4, etc.
Of course there's some knowledge about that, but the approach in the link identifies the pitch with an NN (this step is not relevant to the current discussion) and then applies an FFT-based method for pitch shifting that doesn't take any of it into account. So it'll shift formants as well, making voices and instruments change their character substantially.
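To make the point concrete, here's a toy sketch (not the linked project's code) of a purely spectral shift in numpy: it stretches the whole frequency axis, so a strong partial standing in for a formant moves right along with the fundamental instead of staying put.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
# toy "voice": 200 Hz fundamental plus a strong partial near a
# hypothetical 1000 Hz formant
x = np.sin(2 * np.pi * 200 * t) + 0.8 * np.sin(2 * np.pi * 1000 * t)

def naive_fft_shift(x, semitones):
    """Shift pitch by stretching the FFT's frequency axis.

    Every spectral feature, formants included, moves by the same
    ratio; nothing here tries to keep the spectral envelope fixed.
    """
    ratio = 2 ** (semitones / 12)
    spec = np.fft.rfft(x)
    bins = np.arange(len(spec))
    # output bin k reads from input bin k / ratio (crude linear interp)
    shifted = np.interp(bins / ratio, bins, spec.real) \
        + 1j * np.interp(bins / ratio, bins, spec.imag)
    return np.fft.irfft(shifted, n=len(x))

y = naive_fft_shift(x, 12)  # up one octave
# the "formant" peak that was at 1000 Hz now sits near 2000 Hz
```

A formant-preserving shifter would instead move only the harmonics and re-impose the original spectral envelope, which is exactly the step this kind of naive method skips.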
This may be because your sample rate was not a multiple of 16000. If that's the case, a low-quality linear resampling is applied to the input audio to make it compatible with the pitch detection model. This resampling function should be improved.
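For anyone curious what "low-quality linear resampling" means in practice, here's a minimal sketch of that kind of resampler (an illustration, not the project's actual function): it just linearly interpolates between samples, with no anti-aliasing filter.

```python
import numpy as np

def linear_resample(x, sr_in, sr_out=16000):
    """Naive linear-interpolation resampler.

    There is no anti-aliasing lowpass: when downsampling from
    e.g. 44100 Hz, content above 8 kHz folds back into the band
    the pitch model analyses, which is why this counts as
    low quality.
    """
    n_out = int(round(len(x) * sr_out / sr_in))
    positions = np.linspace(0, len(x) - 1, n_out)
    return np.interp(positions, np.arange(len(x)), x)
```

A better version would apply a lowpass at half the target rate before interpolating (or use a polyphase resampler), which removes the aliasing at modest extra cost.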
I just pushed a change that will attempt to set your sample rate to 48000, which may improve your quality. Additionally, your sample rate will now be logged to the console.
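For what it's worth, 48000 is an exact multiple of 16000, so once the input arrives at 48 kHz the conversion for the model reduces to integer decimation. A sketch of that reduction (not the project's code; a production decimator would lowpass at 8 kHz first):

```python
import numpy as np

def decimate_48k_to_16k(x):
    # 48000 / 16000 == 3, so resampling is just keeping every 3rd
    # sample. A real decimator would apply an 8 kHz anti-alias
    # lowpass before this step; omitted here for brevity.
    return x[::3]
```

That integer ratio is why starting from 48 kHz avoids the fractional-rate interpolation (and its artifacts) entirely.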