Also, the article tests a different task (Needle in a Needlestack, which is kind of similar to Needle in a Haystack) rather than finding the differences between two documents. It's certainly useful to know that the model does OK on one and really badly on the other, but that doesn't mean the original test is flawed.