Without accounting for the data and model architecture, it's not a very useful number. For all we know, they may be using sparse approximations, which would throw the count off by a lot. For example, a fully connected model over N×N images has O(N^4) parameters, while a convolutional one has only O(K^2), where K < N is the window size. The raw parameter count would only be informative if we knew they had essentially stacked additional layers on top of GPT-3.5, and we know that's not the case, since they added a vision head.
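To make the scaling gap concrete, here's a quick back-of-the-envelope calculation in plain Python (the sizes N = 224 and K = 3 are just illustrative choices on my part, not anything from the model in question):

```python
# Parameter counts for a single layer over an N x N single-channel image.
N = 224  # image side length (illustrative)
K = 3    # convolution window size, K < N (illustrative)

# Fully connected: each of the N^2 input pixels connects to each of the
# N^2 output units -> (N^2)^2 = N^4 weights.
fc_params = (N ** 2) ** 2

# Convolutional: one shared K x K kernel -> K^2 weights, regardless of N
# (per input/output channel pair; one channel here for simplicity).
conv_params = K ** 2

print(f"fully connected: {fc_params:,} parameters")  # 2,517,630,976 (~2.5B)
print(f"convolutional:   {conv_params:,} parameters")  # 9
```

So two architectures solving the same task can differ by many orders of magnitude in parameter count, which is exactly why the headline number tells you so little on its own.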