Sharing is caring: an introduction to open-source large language models

Introducing Sharing is caring: an introduction to open-source large language models, written by NICD Data Scientists Dr Mac Misiura and Dr Matt Edwards

Introduction

Large language models have captured not only the attention of the research community, but also the wider public. For example, Reuters has recently reported that ChatGPT, a chatbot based on a large language model from OpenAI, set the record for the fastest-growing consumer application in history. However, many large language models, including ChatGPT, are not open-source, meaning that their underlying code is typically proprietary and not available to the public. Consequently, such closed-source projects are unsuitable for those who value transparency, community involvement and collaboration. In this blog post, we will introduce you to several key open-source large language models and discuss their relative merits.

What are large language models?

Currently, we lack a unified definition as to what precisely constitutes a large language model, as highlighted by Zhao et al. (2023). For the purposes of this blog post, we define a large language model as any natural language processing model that is capable of carrying out a wide range of tasks, from text classification to text generation. Moreover, large language models can be viewed as a subset of foundation models, which are defined as models trained (usually using self-supervision) on broad data that can be adapted to different downstream tasks.

Currently, such models are Transformer-based and derived from Vaswani et al. (2017). The original Transformer model was developed for machine translation by "transforming" an input sequence of suitably tokenised text into an output sequence using two components:

  • an encoder, which is a stack of self-attention layers that learns to encode the input sequence into a sequence of contextualised representations, and
  • a decoder, which is a stack of self-attention and cross-attention layers that learns to decode those representations auto-regressively into an output sequence.

Since then, models that leverage only the encoder component (e.g. BERT) or only the decoder component (e.g. GPT) have been developed, with the original aim of performing tasks such as text and token classification, or text generation, respectively.
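
As a concrete illustration, both families are exposed through the Hugging Face transformers library; the snippet below is a minimal sketch (the model names are merely illustrative defaults), contrasting an encoder-only model used for masked-token prediction with a decoder-only model used for auto-regressive text generation.

```python
from transformers import pipeline

# Encoder-only model (BERT-style): suited to classification-style tasks,
# illustrated here with masked-token prediction.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Open-source language models encourage [MASK] and collaboration."))

# Decoder-only model (GPT-style): generates text auto-regressively,
# one token at a time, conditioned on the prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("Open-source language models are", max_new_tokens=20))
```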

What makes large language models so interesting?

As noted by the likes of Chowdhery et al. (2022), scaling up Transformer models not only enhances their performance across standard natural language processing tasks but also unlocks additional capabilities. For example, GPT-3 demonstrated that large language models could be used in few-shot learning scenarios without requiring task-specific data or any finetuning. Furthermore, other large language models, such as those under the moniker of GPT-3.5, also seem to exhibit the following additional abilities:

  • responding to human instruction in a coherent manner rather than mostly outputting high-frequency prompt completion patterns within the training set;
  • zero-shot learning in addition to few-shot learning (see the prompting sketch after this list);
  • code generation and understanding in addition to text generation and understanding;
  • chain-of-thought reasoning to tackle multi-step problems.
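
To make the zero-shot versus few-shot distinction concrete, the snippet below sketches two prompts for a simple sentiment-classification task; the prompts are purely illustrative and either could be passed to any of the text-generation models discussed in this post.

```python
# Zero-shot: the task is described in the prompt, but no worked examples are given.
zero_shot_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The battery life is dreadful.\n"
    "Sentiment:"
)

# Few-shot: a handful of worked examples precede the query, and the model is
# expected to continue the pattern without any finetuning or gradient updates.
few_shot_prompt = (
    "Review: I loved every minute of it.\nSentiment: positive\n"
    "Review: The plot made no sense at all.\nSentiment: negative\n"
    "Review: The battery life is dreadful.\nSentiment:"
)
```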

Fu, Peng and Khot et al. (2022) attribute the acquisition of the first two abilities to instruction finetuning and the last two abilities to training on both text and code data. Specifically, they draw parallels between how learning to program in a procedure-oriented fashion could help with solving problems step by step and how learning to program in an object-oriented fashion could help with decomposing complex problems into smaller sub-problems.

These abilities are of interest not only from an academic standpoint, but from a commercial one too. As a result, large language models are expected to be integrated into many existing applications and to enable the development of new applications using emerging frameworks, such as LangChain. Furthermore, some industry experts expect that large language models could become the catalyst of a new technological revolution, not dissimilar to the one triggered by the introduction of the world wide web.

What are the key open-source large language models?

Most popular models

The following repo and spreadsheet provide a comprehensive overview of key large language models and whether they are open-source or otherwise. Evidently, there are many potential options to choose from and it may not be immediately clear which one would be most suitable for a particular use case. This is further amplified by the speed of innovation, with new models being released almost every month.

At the time of writing, the most popular (based on the number of Hugging Face downloads) open-source large language models appear to originate from the following research groups:

OPT

OPT is a series of eight Transformer decoder-only models with a variable size ranging from 125 million to 175 billion parameters. Their architecture is intended to resemble the architecture of the initial GPT-3 models described in Brown et al. (2020). Specifically, the OPT models have the following key features:

  • byte-level byte-pair encoding (BPE) tokenisation akin to GPT-2, which allows the tokeniser to deal efficiently with code data and arbitrary UTF-8 content, such as emojis, without out-of-vocabulary tokens; the vocabulary size is set to 50,272 tokens;
  • a training corpus created from the Pile, PushShift.io Reddit, BookCorpus, CC-Stories and CC-News v2, totalling approximately 180 billion tokens; this corpus does not explicitly include any code data and is largely English-centric;
  • self-supervised training using the causal language modelling aka the next-word prediction objective;
  • training on 992 80-gigabyte A100 GPUs using Megatron-DeepSpeed over a period of two months; note that training was reported to be unstable, with loss divergences and a considerable number of hardware failures;
  • variable number of attention heads for each decoder layer, ranging from 12 attention heads per layer for the smallest model size to 96 attention heads per layer for the largest model size;
  • maximum input sequence length of 2,048 tokens.
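
As a rough sketch of how these models can be used in practice, the smallest OPT checkpoint can be loaded and queried through the Hugging Face transformers library as shown below (the 125 million parameter variant is chosen purely to keep the example lightweight; larger checkpoints follow the same pattern but need far more memory).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-125m"  # smallest OPT variant, used here for illustration

tokenizer = AutoTokenizer.from_pretrained(checkpoint)     # byte-level BPE, ~50k vocabulary
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # causal (next-word prediction) model

inputs = tokenizer("Open-source language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```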

Since instruction finetuning has been shown to considerably improve the zero-shot and few-shot performance of large language models, Iyer et al. (2023) created instruction-tuned versions of the 30 billion and 175 billion parameter OPT models, collectively called OPT-IML. The OPT-IML models were finetuned using the approach described in the FLAN paper, namely finetuning on a collection of standard datasets described via instructions. This approach to instruction finetuning differs from the main alternative described by Ouyang et al. (2022), which uses the reinforcement learning from human feedback favoured by the GPT-3.5 model series.
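
For intuition, a FLAN-style instruction-tuning example is simply an existing dataset record rephrased as a natural-language instruction with a target response; the dictionary below is a purely illustrative sketch of that format, not an excerpt from the actual OPT-IML training data.

```python
# Hypothetical FLAN-style instruction-tuning example: the model is finetuned to
# produce the target text given the instruction (and optional input).
instruction_example = {
    "instruction": "Summarise the following article in one sentence.",
    "input": "Open-source large language models are gaining popularity because ...",
    "target": "Open-source large language models are becoming a viable "
              "alternative to proprietary ones.",
}
```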

Note also that the OPT and OPT-IML models may only be used for non-commercial research purposes according to their license. This is also the case for other Meta Research models, such as LLaMA, and their instruction-tuned versions such as Vicuna.

BLOOM

BLOOM is a series of six Transformer decoder-only models with a variable size ranging from 560 million to 176 billion parameters. The BLOOM models have the following key features:

  • byte-level byte-pair encoding (BPE) tokenisation akin to GPT-2 and the aforementioned OPT; the vocabulary size is set to 250,680 tokens;
  • the ROOTS training corpus, comprising 498 Hugging Face datasets that cover forty-six natural languages and thirteen programming languages, totalling approximately 350 billion tokens;
  • self-supervised training using the causal language modelling aka the next-word prediction objective;
  • training on 384 80-gigabyte A100 GPUs using Megatron-DeepSpeed over a period of 3.5 months;
  • variable number of attention heads for each decoder layer, ranging from 16 attention heads per layer for the smallest model size to 112 attention heads per layer for the largest model size;
  • maximum input sequence length of 2,048 tokens.
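
The much larger, multilingual vocabulary is easy to inspect by loading the BLOOM tokeniser, as sketched below (the 560 million parameter checkpoint is used only because it is small; all BLOOM checkpoints share the same tokeniser).

```python
from transformers import AutoTokenizer

# All BLOOM checkpoints share the same multilingual byte-level BPE tokeniser.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

print(tokenizer.vocab_size)  # roughly 250k tokens, versus ~50k for OPT / GPT-2

# The same tokeniser handles many natural and programming languages.
for text in ["Hello, world!", "Bonjour le monde !", "print('hello world')"]:
    print(text, "->", tokenizer.tokenize(text))
```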

Muennighoff et al. (2022) also created instruction-tuned versions of the BLOOM models in a similar manner to OPT-IML.

BLOOM's license is based on the Open & Responsible AI Licenses (OpenRAIL), which appears to be less restrictive than the OPT license since it allows some commercial use, but it embeds a specific set of restrictions in identified critical scenarios such as provision of medical advice or impersonation.

GPT-J / NeoX and Pythia

Similarly to OPT and BLOOM, models under the GPT-J / NeoX and Pythia monikers are series of Transformer decoder-only models with a variable model size, with the largest model in the series having 20 billion parameters. Relative to the other two series of models, their distinguishing features are:

  • they have been trained solely on the Pile;
  • GPT-NeoX uses rotary position embeddings instead of the learned position embeddings used by models such as OPT (a minimal sketch of the idea follows this list).
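
Rotary embeddings encode position by rotating each consecutive pair of query/key dimensions through an angle that grows with the token's position, so that attention scores depend on the relative distance between tokens. The NumPy snippet below is a minimal sketch of that idea under simplifying assumptions; real implementations, including GPT-NeoX's, apply the rotation inside the attention layers and often only to a fraction of the dimensions.

```python
import numpy as np

def rotary_embed(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of dimensions of a query/key vector `x`
    by angles proportional to its `position` in the sequence."""
    d = x.shape[-1]                            # head dimension (assumed even)
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per dimension pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                  # split the vector into pairs
    rotated = np.empty_like(x)
    rotated[0::2] = x1 * cos - x2 * sin        # standard 2-D rotation of each pair
    rotated[1::2] = x1 * sin + x2 * cos
    return rotated

# Example: rotate a random 64-dimensional query vector at sequence position 5.
q = rotary_embed(np.random.randn(64), position=5)
```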

At the time of writing, EleutherAI is yet to release instruction-tuned versions of their models, but other researchers have created such models. For example, Databricks have recently released Dolly and Dolly 2.0, which are instruction-tuned versions of GPT-J and Pythia models respectively. Another example is GPT4All-J, an instruction-tuned version of GPT-J developed as part of the GPT4All project.

Furthermore, the GPT-NeoX model is available under the Apache 2.0 license, whereas Pythia models are available under the MIT license. Both licenses are permissive and allow for commercial use.

Criteria to consider when choosing an open-source large language model

Overall, to choose a suitable large language model, it could be important to consider the following criteria:

  • specific skill set: aligning large language models to conform with users' intentions and preferences (e.g. via instruction finetuning) seems to shift their skill sets towards particular branches of abilities and potentially trade off raw performance for, e.g., user interactivity, as mentioned in Ouyang et al. (2022). Moreover, some models may be originally optimised for human dialogue, others for text or code completion, and others for zero-shot learning. Additionally, some models can handle multiple natural and programming languages, whereas others are predominantly English-centric. Having precise knowledge of what is expected from the large language model should help you not only to choose the right one but also to decide whether you need to finetune it for your specific use case or whether you can leverage prompt engineering to adjust its skill set;
  • license: ensure that the license is compatible with your intended use case and that it allows you to modify and redistribute the model as needed. Evidently, if you are planning to use a model for commercial purposes, then OPT and LLaMA-based models may not be suitable. Note also that the intellectual property landscape is complex and could change over time. For example, a lawsuit has been recently filed against GitHub Copilot, claiming that it violated the rights, including copyright, of authors who have contributed to codebases stored as public repositories on GitHub. The outcome of this lawsuit could have a significant impact on the future of large language models;
  • infrastructure: running inference for any neural network requires combining numerical parameter arrays with numerical input arrays via, e.g., matrix multiplications. This is a computationally intensive task and requires a considerable amount of memory, which increases as models get bigger. For example, to run inference using the 176 billion parameter version of the BLOOM model, eight 80GB A100 GPUs would be required, which is likely to be infeasible for many prospective users (see the back-of-the-envelope sketch after this list). Consequently, for many users, choosing a suitable large language model may revolve around finding a balance between the model size, performance and the available infrastructure.
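
As a rough rule of thumb, the memory needed just to hold a model's weights is its parameter count multiplied by the number of bytes used per parameter; the sketch below illustrates this, deliberately ignoring activation memory, the key/value cache and framework overheads, which all add further headroom.

```python
def weight_memory_gb(n_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Approximate memory (in GB) needed to hold the model weights alone."""
    return n_parameters * bytes_per_parameter / 1e9

# BLOOM-176B in float16 (2 bytes per parameter): ~352 GB of weights,
# which already spans several 80 GB A100 GPUs before any runtime overhead.
print(weight_memory_gb(176e9, bytes_per_parameter=2))

# 8-bit quantisation roughly halves that figure, at some cost in accuracy.
print(weight_memory_gb(176e9, bytes_per_parameter=1))
```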

Conclusion

It is usually prudent to exercise caution and scepticism about relying on centralised, privately owned services. Large language models are no exception to this rule. Consequently, open-source large language models are a welcome addition to the current artificial intelligence landscape, as they provide an increasingly feasible alternative to the existing proprietary models. In this post, we introduced several key open-source large language models and discussed some of the important criteria to consider when choosing one.

To find out more about working with us, get in touch.
We'd love to hear from you.