I mean that nothing that can be benchmarked and validated by third parties is GPT-4 quality. I know there are upcoming releases hyped as being equal to GPT-4, e.g. Gemini Ultra, which I'm very excited to get my hands on. But regardless, Ultra is not small enough to run on phones, even using the sparse ReLU flash memory optimization. And we'll see how it benchmarks once it's released; according to some benchmarks, Gemini Pro has somewhat underperformed GPT-3.5-Turbo [1], despite Google's initial claims. (There are criticisms of that benchmarking, though, and on the Chatbot Arena leaderboard [2] Gemini Pro does beat the current 1106 version of GPT-3.5-Turbo, while slightly underperforming the previous 0613 version.)

1: https://arxiv.org/pdf/2312.11444.pdf

2: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...