LLMs typically do worse on specific extractive and classification tasks than a finetuned BERT-large model, so no, you can't actually replace everything with LLM calls and expect similar performance.
(Cf. the BloombergGPT paper, all financial benchmark tasks.)
And that's not even taking inference cost into account, but that's a business-case issue.
BERT is a language model! It was considered a "large" language model for its time, and it's even based on transformers. It's a very small language model by today's standards (340M params), and it's encoder-only instead of decoder-only, but the hard line people draw between "BERT" and "LLMs" is more about parameter count than capabilities; in fact, the original GPT-3 paper benchmarked GPT-3 against finetuned BERT-large and beat it on nearly every measure [1]. And BERT-large is hardly unique in being finetunable: finetuning Mistral 7B on your task should yield very good performance (similarly, OpenAI allows finetuning of gpt-3.5-turbo, and there are plenty of non-Mistral open-source LLMs like Yi and Qwen that should do well too).
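If it helps anyone, the finetuning workflow is essentially identical whichever base model you pick. Here's a minimal sketch using the HuggingFace Trainer API; the dataset name and label count are placeholders for your own task:

    # Minimal sequence-classification finetune. The same code works whether
    # the base checkpoint is bert-large-uncased or a small decoder LM.
    # "your_dataset" and num_labels=3 are placeholders, not real values.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "bert-large-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

    # Placeholder dataset; assumes "text" and "label" columns.
    dataset = load_dataset("your_dataset")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    dataset = dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir="finetune-out",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=2e-5,
    )

    Trainer(model=model, args=args, train_dataset=dataset["train"]).train()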
I'm not sure what BloombergGPT has to do with LLMs vs non-LLMs; BloombergGPT is an LLM [2], and its beating other LLMs on financial benchmarks doesn't prove much about large language models beyond "LLMs can be trained to be better at specific tasks."
Of course BERT is an LM, I never claimed it wasn't. It is just much smaller than what is now termed "LLM", which is typically an extremely large generative decoder-only transformer trained for general multilingual use (often instruction- and chat-tuned). I was training and finetuning dozens of LMs back when BERT-large was considered a GPU memory issue, and I have finetuned hundreds of LMs over my NLP research and engineering career. I have even implemented Hidden Markov Model LMs from scratch (mostly as an exercise; performance was bad even by 2015 standards).
When I used "specific task", I meant specialized, domain specific tasks like financial sentiment and event extraction in which I hold a PhD. As a matter of fact, for Fiqa SA finetuned Roberta scores 88% F1 while BloombergGPT scores 75% F1. [1] Still very impressive for zero/few shot learner, but depending on data availability, performance targets and inference cost tradeoffs, it might not need your meds.
My point was that "small" masked encoder transformer LMs like BERT can still hold their own on narrow domain tasks, and the OP's claim that all NLP is solved by prompting a general-purpose LLM service is simply inaccurate.
Ah, yeah, it's definitely true that prompts alone are typically beaten by finetunes on narrow domain tasks.
I hadn't read the financial paper you linked, it's very interesting! One odd bit I did notice is that they set the GPT-4 temperature to 1.0, which is... not a great setting for analysis, and probably hurt the results somewhat. Typically you'd want a value much closer to 0 for that. But while a lower, more reasonable temperature setting would probably improve GPT-4's performance, I would still expect a finetuned LM to outperform larger models with just prompting on those kinds of narrow domains, especially once cost is a factor.
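For anyone wondering, it's a one-line setting. A sketch assuming the current OpenAI Python client, with an illustrative model name and prompt:

    # Near-deterministic decoding for classification/extraction tasks:
    # temperature 0 makes the model favor the highest-probability tokens.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # not 1.0; greedy-ish decoding suits analysis tasks
        messages=[{"role": "user",
                   "content": "Classify the sentiment of: 'Shares fell 8% after earnings.'"}],
    )
    print(resp.choices[0].message.content)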
It's somewhat surprising to see how bad BloombergGPT was... Even GPT-4 trounced it on every published metric, and GPT-4 wasn't trained specifically for finance tasks. The bitter scaling lesson, I suppose.