Introduction to Large Language Models

Dear reader, welcome to Building Large Language Model Applications! In this book, we will explore the fascinating new era of application development in which large language models (LLMs) are the main protagonists.

Over the last year, we have all learned the power of generative artificial intelligence (AI) tools such as ChatGPT, Bing Chat, Bard, and DALL-E. What impressed us most was their stunning ability to generate human-like content from user requests made in natural language. It is, in fact, their conversational capability that made them so easily consumable and, therefore, so popular as soon as they entered the market. Through this phase, we learned to acknowledge the power of generative AI and its core models: LLMs. However, LLMs are more than language generators: they can also be seen as reasoning engines that become the brains of our intelligent applications.

In this book, we will cover the theory and practice of building LLM-powered applications, addressing a variety of scenarios and introducing the new components and frameworks that are entering software development in this era of AI. The book starts with Part 1, where we introduce the theory behind LLMs, the most promising LLMs on the market right now, and the emerging frameworks for LLM-powered applications. We then move to a hands-on part, where we implement many applications using various LLMs, addressing different scenarios and real-world problems. Finally, we conclude with a third part covering the emerging trends in the field of LLMs, alongside the risks of AI tools and how to mitigate them with responsible AI practices.

So, let’s dive in and start with some definitions for the context we are moving in. This chapter provides an introduction to and deep dive into LLMs, the powerful set of deep learning neural networks that power the domain of generative AI.

 

What are large foundation models and LLMs?

LLMs are deep-learning-based models that use many parameters to learn from vast amounts of unlabeled text. They can perform various natural language processing tasks, such as recognizing, summarizing, translating, predicting, and generating text.

Definition

Deep learning is a branch of machine learning that is characterized by neural networks with multiple layers, hence the term “deep.” These deep neural networks can automatically learn hierarchical data representations, with each layer extracting increasingly abstract features from the input data. The depth of these networks refers to the number of layers they possess, enabling them to effectively model intricate relationships and patterns in complex datasets.

LLMs belong to a wider set of models that power the AI subfield of generative AI: large foundation models (LFMs). In the following sections, we will explore the rise and development of LFMs and LLMs, as well as their technical architecture, since understanding how they work is crucial to adopting these technologies properly within your applications.

We will start by understanding why LFMs and LLMs differ from traditional AI models and how they represent a paradigm shift in the field. We will then explore the technical functioning of LLMs: how they work and the mechanisms behind their outputs.

AI paradigm shift – an introduction to foundation models

A foundation model is a type of pre-trained generative AI model that offers immense versatility by being adaptable to various specific tasks. These models undergo extensive training on vast and diverse datasets, enabling them to grasp general patterns and relationships within the data – not limited to text but also covering other formats such as images, audio, and video. This initial pre-training phase equips the models with a strong foundational understanding across different domains, laying the groundwork for further fine-tuning. This cross-domain capability differentiates generative AI models from standard natural language understanding (NLU) algorithms.

Note

Generative AI and NLU algorithms are both related to natural language processing (NLP), which is the branch of AI that deals with human language. However, they have different goals and applications.

The difference between generative AI and NLU algorithms is that generative AI aims to create new natural language content, while NLU algorithms aim to understand existing natural language content. Generative AI can be used for tasks such as text summarization, text generation, image captioning, or style transfer. NLU algorithms can be used for tasks such as chatbots, question answering, sentiment analysis, or machine translation.

Foundation models are designed with transfer learning in mind, meaning they can effectively apply the knowledge acquired during pre-training to new, related tasks. This transfer of knowledge enhances their adaptability, making them efficient at quickly mastering new tasks with relatively little additional training.

One notable characteristic of foundation models is their large architecture, containing millions or even billions of parameters. This extensive scale enables them to capture complex patterns and relationships within the data, contributing to their impressive performance across various tasks.

Due to their comprehensive pre-training and transfer learning capabilities, foundation models exhibit strong generalization skills. This means they can perform well across a range of tasks and efficiently adapt to new, unseen data, eliminating the need for training separate models for individual tasks.

This paradigm shift in artificial neural network design offers considerable advantages, as foundation models, with their diverse training datasets, can adapt to different tasks based on users’ intent without compromising performance or efficiency. In the past, creating and training distinct neural networks for each task, such as named entity recognition or sentiment analysis, would have been necessary, but now, foundation models provide a unified and powerful solution for multiple applications.

 

As we said, LFMs are trained on a huge amount of heterogeneous data in different formats. Whenever that data consists of unstructured, natural language text, we refer to the resulting LFM as an LLM, due to its focus on text understanding and generation.

 

We can then say that an LLM is a type of foundation model specifically designed for NLP tasks. These models, such as GPT-4, BERT, Llama, and many others, are trained on vast amounts of text data and can generate human-like text, answer questions, perform translations, and more.

Nevertheless, LLMs aren’t limited to performing text-related tasks. As we will see throughout the book, these unique models can be seen as reasoning engines that are extremely good at common-sense reasoning. This means that they can assist us with complex tasks, analytical problem-solving, and drawing connections and insights among pieces of information.

In fact, since LLMs mimic the way our brains are made (as we will see in the next section), their architectures are characterized by connected neurons. Now, human brains have about 100 trillion connections, far more than those within an LLM. Nevertheless, LLMs have proven to be much better than we are at packing a lot of knowledge into those fewer connections.

Under the hood of an LLM

LLMs are a particular type of artificial neural network (ANN): a computational model inspired by the structure and functioning of the human brain. ANNs have proven to be highly effective at solving complex problems, particularly in areas such as pattern recognition, classification, regression, and decision-making tasks.

The basic building block of an ANN is the artificial neuron, also known as a node or unit. These neurons are organized into layers, and the connections between neurons are weighted to represent the strength of the relationship between them. Those weights represent the parameters of the model that will be optimized during the training process.

ANNs are, by definition, mathematical models that work with numerical data. Hence, when it comes to unstructured, textual data, as in the context of LLMs, two fundamental activities are required to prepare the data as model input:

  • Tokenization: This is the process of breaking down a piece of text (a sentence, paragraph, or document) into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the chosen tokenization scheme or algorithm. The goal of tokenization is to create a structured representation of the text that can be easily processed by machine learning models.

 

  • Embedding: Once the text has been tokenized, each token is converted into a dense numerical vector called an embedding. Embeddings are a way to represent words, subwords, or characters in a continuous vector space. These embeddings are learned during the training of the language model and capture semantic relationships between tokens. The numerical representation allows the model to perform mathematical operations on the tokens and understand the context in which they appear.

 

In summary, tokenization breaks down text into smaller units called tokens, and embeddings convert these tokens into dense numerical vectors. This relationship allows LLMs to process and understand textual data in a meaningful and context-aware manner, enabling them to perform a wide range of NLP tasks with impressive accuracy.
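To make this concrete, here is a minimal sketch using the Hugging Face transformers library (one possible choice on our part, assuming the transformers and torch packages are installed; BERT serves only as a convenient example). It tokenizes a sentence, maps the tokens to integer IDs, and looks up the corresponding embedding vectors:

import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained tokenizer and model (BERT is used only as an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Tokenization breaks text into smaller units"
tokens = tokenizer.tokenize(text)              # subword tokens, e.g., ['token', '##ization', ...]
ids = tokenizer.convert_tokens_to_ids(tokens)  # integer IDs for each token

# Look up the learned embedding vector associated with each token ID.
embedding_layer = model.get_input_embeddings() # a torch.nn.Embedding
vectors = embedding_layer(torch.tensor(ids))
print(tokens)
print(vectors.shape)                           # (number of tokens, 768) for BERT-base

Each row of vectors is the dense numerical representation of one token; these are exactly the values that the rest of the network operates on.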

For example, let’s consider a two-dimensional embedding space in which we want to vectorize the words Man, King, Woman, and Queen. The idea is that the mathematical distance between each pair of those words should be representative of their semantic similarity. This is illustrated by the following graph:

[Figure: the words Man, King, Woman, and Queen plotted as vectors in a two-dimensional embedding space]

As a result, if we properly embed the words, the relationship King − Man + Woman ≈ Queen should hold.
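To see the arithmetic at work, here is a tiny sketch with hand-picked two-dimensional vectors (the numbers are illustrative assumptions, not learned embeddings; one axis loosely encodes royalty, the other gender):

import numpy as np

# Toy 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender" (hand-picked values).
embeddings = {
    "man":   np.array([0.1,  1.0]),
    "king":  np.array([0.9,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "queen": np.array([0.9, -1.0]),
}

# King - Man + Woman should land close to Queen.
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(result)                                        # [ 0.9 -1. ]
print(np.linalg.norm(result - embeddings["queen"]))  # 0.0 for these toy vectors

With real, learned embeddings the match would not be exact, but the nearest vector to King − Man + Woman would typically be Queen.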

Once we have the vectorized input, we can pass it into the multi-layered neural network. There are three main types of layers:

  • Input layer: The first layer of the neural network receives the input data. Each neuron in this layer corresponds to a feature or attribute of the input data.
  • Hidden layers: Between the input and output layers, there can be one or more hidden layers. These layers process the input data through a series of mathematical transformations and extract relevant patterns and representations from the data.
  • Output layer: The final layer of the neural network produces the desired output, which could be predictions, classifications, or other relevant results depending on the task the neural network is designed for.

 

Training an ANN involves backpropagation: iteratively adjusting the weights of the connections between neurons based on the training data and the desired outputs.
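As a rough, self-contained sketch (a toy example in plain NumPy, not how production LLMs are actually trained), the following builds a network with an input layer of three features, one hidden layer, and an output layer, then runs forward passes and backpropagation with gradient descent on a single training example:

import numpy as np

rng = np.random.default_rng(0)

# Toy network: 3 input features -> 4 hidden units -> 1 output.
W1 = rng.normal(size=(3, 4))       # input-to-hidden weights (model parameters)
W2 = rng.normal(size=(4, 1))       # hidden-to-output weights (model parameters)

x = np.array([[0.5, -0.2, 0.1]])   # one training example (input layer)
y = np.array([[1.0]])              # desired output

learning_rate = 0.1
for step in range(100):
    # Forward pass: hidden layer with tanh activation, then a linear output layer.
    h = np.tanh(x @ W1)
    y_hat = h @ W2
    loss = ((y_hat - y) ** 2).mean()

    # Backpropagation: apply the chain rule from the loss back to each weight matrix.
    grad_y_hat = 2 * (y_hat - y)
    grad_W2 = h.T @ grad_y_hat
    grad_h = grad_y_hat @ W2.T
    grad_W1 = x.T @ (grad_h * (1 - h ** 2))   # derivative of tanh(z) is 1 - tanh(z)**2

    # Gradient descent: nudge the weights in the direction that reduces the loss.
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2

print(loss)   # close to 0 once the network has fit this single example

Real LLMs follow the same principle at a vastly larger scale: billions of weights adjusted by backpropagation over enormous text corpora.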

 


Article by Sumit Malhotra

Published 28 Feb 2024