The abilities of large language models have taken huge leaps with the launch of ChatGPT in late 2022 and now GPT-4, both from OpenAI (and, effectively, Microsoft). Unfortunately, the GPT family of language models is being held increasingly close to the chest. From GPT-3 on, the models are simply not available other than as hosted services.
Much has been written suggesting that AI will become concentrated in a few Big Tech firms because language modeling at scale has become prohibitively expensive. In particular, we have heard that democratic AI – whereby the state of the art is truly open and available to all – is impractical given the amount of compute needed to reach or surpass language models such as GPT-3, Google’s PaLM, and others.
Papers on GPT-3 have gone into some detail about the proprietary models, such as the number of layers and attention heads, as well as model width. This has allowed the AI community to glean some insights and compare the performance of other models against GPT. There are many papers comparing smaller models such as DeepMind’s Chinchilla against GPT-3, and it is not uncommon for the smaller models to outperform their larger counterpart.
In mid-March 2023, OpenAI published a paper describing GPT-4, but it gives few details of the model architecture. Even the number of parameters is not disclosed. The authors attribute GPT-4’s improvement over GPT-3.5 on exam taking to pretraining methodology, but they explicitly state that no details will be shared for competitive and other reasons. We simply don’t know whether size matters as much in GPT-4 as it did previously.
Let’s step back to Meta’s late 2022 release of Galactica, a model two-thirds the size of GPT-3 that is trained not on arbitrary internet content but on scientific literature and data. As soon as it was made available, it was harshly criticized, even though it is superior to GPT-3 in many regards. The criticism mostly concerned toxicity and inaccuracies. In response, Meta took down the demo of Galactica.
Galactica was promising. The model could be obtained from Meta, unlike those of OpenAI, and it could be deployed on readily-available and affordable hardware. And, again, it was superior to GPT-3 in various regards.
Well, Meta has upped the ante significantly with the release of LLaMA: Open and Efficient Foundation Language Models.
LLaMA models range up to roughly one-third the size of GPT-3 (65 billion parameters versus 175 billion). They incorporate architectural improvements drawn from work by Google, DeepMind, Meta, and others. This allows LLaMA to be trained more efficiently and to perform better for a given training budget. The largest LLaMA model competes handily with models three times its size (GPT-3) and eight times its size (PaLM).
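For readers who want to check the size comparisons quoted in this post, here is a minimal sketch of the arithmetic. The parameter counts are the publicly reported figures for each model's largest variant and are an addition to the post, not numbers taken from it; the names `PARAMS_B` and `size_ratio` are made up for illustration.

```python
# Publicly reported parameter counts (in billions) for each model's
# largest variant. These figures are assumptions added for illustration.
PARAMS_B = {
    "GPT-3": 175,
    "PaLM": 540,
    "Galactica": 120,
    "LLaMA": 65,
}


def size_ratio(model: str, reference: str) -> float:
    """Return the parameter count of `model` divided by that of `reference`."""
    return PARAMS_B[model] / PARAMS_B[reference]


if __name__ == "__main__":
    for name in ("GPT-3", "PaLM"):
        print(f"{name} is about {size_ratio(name, 'LLaMA'):.1f}x the size of LLaMA")
    print(f"Galactica is about {size_ratio('Galactica', 'GPT-3'):.2f}x the size of GPT-3")
```

The ratios come out a little under three for GPT-3 and a little over eight for PaLM, so "three times" and "eight times" are fair round numbers.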
LLaMA is impressive. It’s truly open AI.
But wait! There’s more …
InstructGPT and ChatGPT are much better at following instructions and chatting than language models that have not been fine-tuned to do so. Many efforts are under way to give non-GPT models such capabilities, but one in particular warrants kudos.
Stanford’s work on Alpaca: A Strong, Replicable Instruction-Following Model is quite fun and helpful. The Stanford team took a small LLaMA model and taught it to follow instructions. The fun part is that they prompted one of OpenAI’s GPT-3.5 models (text-davinci-003) to generate the instruction-following examples used for training! The result is Alpaca, a fine-tuned LLaMA. It’s worth your time to check it out!
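To make the idea of instruction fine-tuning a bit more concrete, here is a rough sketch of how a single instruction/response pair might be turned into a training string for a base model. It is not the Stanford team's code. The template wording, the `InstructionExample` class, and the `to_training_text` function are all assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    """One instruction paired with the response a stronger model produced."""
    instruction: str
    response: str


# Hypothetical prompt template; the exact wording is an assumption.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def to_training_text(example: InstructionExample) -> str:
    """Join the filled-in prompt and the target response into one string."""
    return PROMPT_TEMPLATE.format(instruction=example.instruction) + example.response


if __name__ == "__main__":
    example = InstructionExample(
        instruction="Name three primary colors.",
        response="Red, yellow, and blue.",
    )
    print(to_training_text(example))
```

Fine-tuning on many such strings is what nudges a plain language model toward answering instructions rather than merely continuing text.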
Let’s also give Anthropic an honorable mention for its work on Constitutional AI and related data sets that will enable the community to address toxicity and harm in language models.
Paul Haley is a Distinguished Engineer at Merlyn Mind. He has decades of commercial and research experience in artificial intelligence, natural language processing, and machine learning. Contact him at paul@merlyn.org.