These predictions make most sense, and are most prevalent, in particular fields of NLP research such as open-domain (non-task-oriented) dialogue. Believable dialogue seems to require some kind of world knowledge, certain forms of 'common sense' (e.g. arithmetic reasoning), and the ability to track belief states and remain coherent over long stretches of conversation.
GPT-2 was critiqued for its inability to deal with simple arithmetic questions and for contradicting itself over long passages [1].
GPT-3 has similar deficiencies: it struggles to display sensible real-world knowledge and to stay coherent both with its own previous answers and with general facts about the world:
'Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”.' [2] These are difficulties of exactly the type that would torpedo attempts at open-domain dialogue over long periods.
Take a typical paper pointing out the failings of brute-force methods for dialogue. It trots out a classic example:
'Without getting into the specifics of how these systems are constructed, it is apparent that such an approach [brute force] provides a method for imitating a dialogue but not for participating in one. We can see the limitations of the approach from examples like these:
Human: what is your job ? Machine: i ’m a lawyer .
Human: what do you do ? Machine: i ’m a doctor
(Vinyals & Le, 2015).' [3]
Huge models can produce plausible, but not consistently coherent, dialogue responses. In the dialogue domain, the 'winter' will come (or has already come) when it becomes clear that Meena, BlenderBot, etc. need a little help when it comes to coherence over an arbitrary number of turns, displaying 'common sense physics', and so on.
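The lawyer/doctor exchange quoted above illustrates a failure mode that is easy to probe mechanically: ask the same question in paraphrased forms and check whether the answers agree. The sketch below is a hypothetical illustration only; `ask_bot` is a hard-coded stub standing in for any neural dialogue model, and the consistency check is deliberately crude.

```python
# Minimal sketch of a consistency probe in the spirit of the Vinyals & Le example.
# `ask_bot` is a hypothetical stand-in for a dialogue model's reply function;
# here it is a stub that reproduces the lawyer/doctor inconsistency.

def ask_bot(question: str) -> str:
    """Hypothetical dialogue model: returns a plausible but unconstrained reply."""
    canned = {
        "what is your job ?": "i 'm a lawyer .",
        "what do you do ?": "i 'm a doctor .",
    }
    return canned.get(question, "i do n't know .")

def consistent(answers: list[str]) -> bool:
    """Crude check: paraphrases of the same question should get the same answer.
    (A real evaluation would use embedding similarity or NLI, not exact match.)"""
    normalised = {a.strip().lower() for a in answers}
    return len(normalised) == 1

paraphrases = ["what is your job ?", "what do you do ?"]
replies = [ask_bot(q) for q in paraphrases]
print(replies)              # ["i 'm a lawyer .", "i 'm a doctor ."]
print(consistent(replies))  # False: the model imitates dialogue without a stable persona
```

The point of the sketch is not the stub itself but the shape of the test: persona and belief-state consistency can be checked automatically across turns, and purely brute-force models tend to fail it.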
[1] https://thegradient.pub/gpt2-and-the-nature-of-intelligence/
[2] https://arxiv.org/pdf/2005.14165.pdf
[3] https://arxiv.org/pdf/1812.01144.pdf