Stanford CS229 | Machine Learning | Building Large Language Models (LLMs)

Building Large Language Models (LLMs)

Large Language Models (LLMs) are advanced neural network architectures used in many modern chatbots, such as ChatGPT from OpenAI, Claude from Anthropic, and Gemini from Google. This document outlines the key components involved in training and building LLMs, as well as important aspects of pre-training and post-training.

Key Components of LLM Training

  1. Architecture: LLMs are based on neural networks, specifically Transformers. The choice of architecture plays a vital role in model performance.
  2. Training Algorithm and Loss: The way the model learns from data is crucial. Training involves algorithms that minimize specific losses, helping the model make better predictions.
  3. Data: The data used to train LLMs defines what the model learns. High-quality, diverse datasets are essential for building robust models.
  4. Evaluation: Metrics are necessary to assess how well the model is performing in relation to its goal.
  5. Systems: System infrastructure, including hardware, is a key consideration due to the large size of these models. Efficient hardware usage significantly impacts training performance.

Pre-Training vs. Post-Training

Pre-Training

Pre-training refers to the initial phase where the model learns from large amounts of unstructured data, such as the entire internet. In this phase, models like GPT-2 and GPT-3 are trained. Pre-training involves building a probability distribution over sequences of tokens or words.

Given a sentence, like "The mouse ate the cheese", the model estimates the probability of this sentence appearing in natural language. If a sentence like "The cheese ate the mouse" is presented, the model should recognize the low likelihood of such a sentence due to semantic knowledge.

Post-Training

Post-training focuses on fine-tuning the model to act as an assistant, such as ChatGPT. This phase involves refining the pre-trained model to handle more specific tasks and interactions, like conversations.

Language Modeling

A language model assigns a probability to a sequence of words ( P(X_1, X_2, ..., X_L) ), where each ( X_i ) is a word in the sequence. Language models are generative because they can sample from this probability distribution and generate new sentences.

Autoregressive Language Models

In autoregressive models, the probability of a sentence is decomposed as a product of conditional probabilities:

$$ P(X_1, X_2, ..., X_L) = P(X_1) \cdot P(X_2 | X_1) \cdot P(X_3 | X_1, X_2) \cdot ... \cdot P(X_L | X_1, X_2, ..., X_{L-1}) $$

This follows the chain rule of probability. Autoregressive models predict one word at a time, based on the context of the previous words. While accurate, this method has a drawback: longer sequences take more time to generate, as each word is generated sequentially.

Tokenization and Inference

In autoregressive models, words are first tokenized (split into smaller subword units) and then assigned an ID. For example, "She likely prefers dogs" might be tokenized into several tokens (IDs) and passed through the model. The model outputs a probability distribution over the next token, and we sample from this distribution to predict it. This process is repeated until the sequence is complete.
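
To make this generation loop concrete, here is a minimal sketch in PyTorch. The `model` and `tokenizer` objects are hypothetical stand-ins (a callable returning next-token logits, and an encoder/decoder with an `eos_id`), not any specific library's API.

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=20):
    """Minimal autoregressive sampling loop (illustrative sketch).

    `model` is assumed to map a batch of token IDs to next-token logits;
    `tokenizer` is assumed to expose encode/decode and an `eos_id`.
    """
    token_ids = tokenizer.encode(prompt)               # e.g. "She likely prefers" -> [IDs]
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([token_ids]))      # logits for the next token, shape (1, vocab_size)
        probs = torch.softmax(logits[0], dim=-1)       # probability distribution over the vocabulary
        next_id = torch.multinomial(probs, 1).item()   # sample the next token
        token_ids.append(next_id)
        if next_id == tokenizer.eos_id:                # stop once end-of-sequence is produced
            break
    return tokenizer.decode(token_ids)
```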

During training, the model aims to predict the most likely token based on the context and adjusts its weights to improve the probability of the correct token.

Neural Network Architecture

LLMs use Transformer-based neural networks. The process involves:

  1. Embedding: Each token is represented as a vector (embedding).
  2. Neural Network (Transformer): These vectors pass through a series of Transformer layers to generate a contextualized representation of each token.
  3. Linear Layer: A linear layer maps the output of the Transformer to the vocabulary size.
  4. Softmax Layer: The softmax function is applied to convert the output into a probability distribution over all possible tokens.

The task is to predict the next token by comparing the predicted distribution to the actual token and adjusting the model's weights accordingly.
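
As a concrete (and heavily simplified) sketch of these four steps, the toy PyTorch module below wires together an embedding, a generic Transformer encoder stack standing in for the decoder-style stack a real LLM would use, a linear projection to the vocabulary, and a softmax. Positional encodings and causal masking are omitted for brevity, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Illustrative next-token predictor: embedding -> transformer -> linear -> softmax."""
    def __init__(self, vocab_size=1000, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # 1. token embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)   # 2. contextualized representations
        self.to_vocab = nn.Linear(d_model, vocab_size)               # 3. map to vocabulary size

    def forward(self, token_ids):
        x = self.embed(token_ids)
        x = self.transformer(x)               # (batch, seq, d_model)
        logits = self.to_vocab(x)             # (batch, seq, vocab_size)
        return torch.softmax(logits, dim=-1)  # 4. distribution over the next token at each position

probs = TinyLM()(torch.randint(0, 1000, (1, 5)))  # toy input of 5 token IDs
print(probs.shape)  # torch.Size([1, 5, 1000])
```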

Loss Function

The task of predicting the next word in a sequence is a classification problem, where we use cross-entropy loss:

$$ \text{Loss} = -\sum_i \text{Target}_i \cdot \log(\text{Predicted}_i) $$

The target is a one-hot encoded vector representing the actual token, and the predicted vector is the model’s output. The goal is to minimize this loss, which corresponds to maximizing the likelihood of the correct sequence of tokens.

Minimizing cross-entropy loss is mathematically equivalent to maximizing the log-likelihood of the observed data:

$$ \text{Maximize} \log P(X_1, X_2, ..., X_L) = \sum_{i=1}^{L} \log P(X_i | X_1, ..., X_{i-1}) $$
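
The snippet below is a small numerical check of this equivalence, using only PyTorch's standard cross-entropy on random logits and targets (purely for illustration):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 8, 4
logits = torch.randn(seq_len, vocab_size)            # model outputs at each position (toy values)
targets = torch.randint(0, vocab_size, (seq_len,))   # the actual next tokens

# Cross-entropy against one-hot targets, written out explicitly.
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -log_probs[torch.arange(seq_len), targets].mean()

# The same quantity via the built-in loss.
loss_builtin = F.cross_entropy(logits, targets)

# Minimizing this loss maximizes sum_i log P(x_i | x_<i): the negative loss,
# multiplied by the number of tokens, is exactly the log-likelihood of the sequence.
log_likelihood = -loss_manual * seq_len
print(loss_manual.item(), loss_builtin.item(), log_likelihood.item())
```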

Tokenization and Vocabulary

The size of the vocabulary (the number of tokens) affects the output size of the model. For each token prediction, the model outputs a probability distribution over the entire vocabulary. Efficient tokenization is critical because the model needs to cover diverse text without over-expanding the vocabulary.

In general, the larger the vocabulary, the larger the final output layer must be, leading to increased computation.

Tokenization

Tokenization is a crucial component in processing text for natural language models. It plays an essential role in converting raw text into a form that models can understand. Below is an industrial-level documentation of the key concepts and challenges associated with tokenization.

Why Tokenization?

  1. Generalization Beyond Words: Tokenization is more general than word-based segmentation. If a model relied solely on words, any typographical error could lead to unrecognized tokens, causing processing issues. For instance, words with minor typographical errors might not map to any known tokens. Tokenizers generalize beyond words and can handle errors by tokenizing substrings.

  2. Non-Latin Based Languages: Tokenization based purely on spaces works well for languages with word delimiters, like Latin-based languages. However, languages such as Thai do not have spaces between words, requiring a more sophisticated approach to break sentences into smaller units.

  3. Efficiency Concerns with Character-Level Tokenization: Tokenizing sentences character-by-character (e.g., 'a', 'b', 'c') is possible but impractical for modern models. The reason is that sequence length increases significantly, and transformer models have a computational complexity that grows quadratically with sequence length. Long sequences would drastically slow down processing.

Tokenization Algorithms

One of the most widely used algorithms is Byte Pair Encoding (BPE), which is commonly applied in various models like GPT-3 or ChatGPT. Below is an overview of how this process works:

  1. Step 1 - Corpus Preparation: Begin with a large corpus of text. This corpus will serve as the foundation for creating tokenization rules.

  2. Step 2 - Initial Token Assignment: Each character in the corpus is assigned a unique token. For instance, the word "token" will be split into t, o, k, e, n, with each character as its own token.

  3. Step 3 - Token Merging: The algorithm searches for the most common pair of consecutive tokens in the text. These pairs are merged into a new token. For instance, if the character sequence t and o appears frequently together, they will be merged into a single token to.

  4. Step 4 - Iterative Merging: This process continues iteratively, merging the most frequent token pairs. After multiple iterations, tokens are formed from common subsequences of characters, allowing the tokenizer to work on more complex patterns.

In practical tokenization for large models, this process is applied to much larger corpora, producing efficient and scalable token vocabularies.
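
The following toy sketch implements the merge loop described above. Real BPE trainers add byte-level handling and pre-tokenization and run over far larger corpora, but the core is just repeated pair counting and merging; note that without pre-tokenization, merges here can cross spaces.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int = 10):
    """Toy byte-pair-encoding trainer: start from characters, merge frequent pairs."""
    tokens = list(corpus)                             # Step 2: one token per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))      # Step 3: count adjacent token pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]      # most frequent pair
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):                        # Step 4: replace the pair everywhere
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("the tokenizer tokenizes the token", num_merges=8)
print(merges)   # learned subword merges, e.g. 'to', then longer units
print(tokens)   # corpus rewritten with merged tokens
```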

Pre-tokenization

Before the tokenization process, text undergoes a step known as pre-tokenization, where spaces, punctuation, and other delimiters are handled. This step is particularly important for languages like English. The primary role of pre-tokenization is efficiency: it ensures that spaces and punctuation act as boundaries, avoiding unnecessary complexity during the main tokenization phase. Pre-tokenizers prevent merges across spaces mainly for computational reasons; in principle, spaces could be treated like any other character.

Token Retention

During the tokenization process, the smaller, initial tokens (like individual characters) are retained even after merging. This allows the model to handle typographical errors or out-of-vocabulary (OOV) words by falling back on individual characters when necessary. For example, if a word is not recognized, it can still be represented by its constituent characters.

Unique Token Representation

Each token has a unique identifier (ID), and these IDs are fed into the model. For example, the word "bank" would have a single token ID, regardless of whether it refers to a financial institution or a riverbank. The model itself, not the tokenizer, will infer the correct meaning based on surrounding context using its learned representations.

Applying a Trained Tokenizer

When a trained tokenizer is applied to new text, it selects the longest matching token at each step. For instance, the word "tokenizer" will be split into "token" and "izer" rather than individual characters. This improves efficiency by minimizing sequence length.

Tokenization Challenges

There are several practical challenges associated with tokenization that drive research toward alternative approaches:

  1. Numerical Representation: Numbers like 327 may have a unique token, which is problematic because models cannot generalize arithmetic operations the same way humans do. Ideally, numbers should be tokenized as individual digits (3, 2, 7) so that the model can learn how to add, subtract, or manipulate them.

  2. Code Tokenization: Code presents a unique challenge, as specific tokens (e.g., indentation in Python) can be difficult for models to understand. Recent advancements, like GPT-4, have improved how code is tokenized by handling common patterns (such as indentation) more effectively.

Future Directions

There is ongoing research aimed at moving away from tokenizers altogether, as they introduce complexities that limit model performance on certain tasks, such as math or code. Potential future architectures may operate directly at the character or byte level without incurring the full quadratic cost of much longer sequences. Such developments could lead to more flexible, efficient models.

Evaluation of Language Models

Overview of Evaluation Metrics

Language models (LMs) are typically evaluated using perplexity, which serves as a measure of how well the model predicts a sample. Perplexity is closely related to validation loss but provides a more interpretable metric.

Definition of Perplexity

Perplexity is defined as:

$$ \text{Perplexity} = 2^{\text{Average Per Token Loss}} $$

Perplexity exponentiates the average per-token loss, undoing the logarithm so the number is easier for humans to interpret. Averaging the loss per token also makes perplexity independent of sequence length.

Interpretation of Perplexity Values

  • Optimal Perplexity: A perplexity of 1 indicates perfect predictions where the model assigns a probability of 1 to the correct token.
  • Worst Case: The worst-case perplexity occurs when the model predicts uniformly across all tokens in the vocabulary, yielding a perplexity equal to the vocabulary size.
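
As a quick numerical check of these two extremes, assuming the loss is measured in bits so that base 2 matches the formula above (implementations that compute the loss with natural logarithms exponentiate with e instead):

```python
import math

def perplexity(avg_loss_bits: float) -> float:
    """Perplexity from the average per-token loss measured in bits."""
    return 2 ** avg_loss_bits

print(perplexity(0.0))                # 1.0 -> perfect predictions
print(perplexity(math.log2(50_257)))  # ~50257 -> uniform guessing over a GPT-2-sized vocabulary
```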

Recent developments in language models have shown a significant reduction in perplexity values. Between 2017 and 2023, perplexity on standard datasets improved from approximately 70 to fewer than 10; intuitively, the model went from hesitating among roughly 70 plausible tokens at each step to fewer than 10. This marked enhancement reflects a decrease in uncertainty during generation.

Limitations of Perplexity

Despite its utility, perplexity is less frequently used in academic benchmarking due to its dependence on tokenizer design and the specific datasets evaluated. Nonetheless, it remains a crucial metric during the development phase of language models.

Alternative Evaluation Approaches

Classical NLP Benchmarks

An increasingly common method for evaluating language models involves aggregating results from classical natural language processing (NLP) benchmarks. This approach encompasses a variety of tasks, including:

  • Question answering
  • Sentiment analysis
  • Text classification

HELM and Open LLM Leaderboard

Two prominent evaluation frameworks include:

  1. HELM: Developed at Stanford, this framework covers a broad range of NLP tasks and allows for automated evaluations.
  2. Hugging Face Open LLM Leaderboard: This leaderboard tracks the performance of various language models across diverse tasks.

Question Answering Evaluation

In question answering tasks, models are evaluated based on their ability to generate correct answers from a given set of options. For instance, in the MMLU benchmark, questions from domains such as astronomy and physics are posed alongside multiple potential answers.

Evaluation Methodology

Evaluation can be conducted in two ways:

  1. Likelihood Assessment: The model generates a probability distribution over the potential answers, and the likelihood of each answer is computed.
  2. Direct Selection: The model is prompted to select the most likely answer from the provided options.

This can be formalized as follows:

  • Given a question ( Q ) with answer options ( A_1, A_2, A_3, A_4 ):

$$ P(A_i | Q) \quad \text{for } i \in \{1, 2, 3, 4\} $$
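
A minimal sketch of the likelihood-based evaluation is shown below. The `log_prob` argument is a hypothetical helper returning the model's total log-probability and token count for a string; length normalization is one common choice, not the only one.

```python
def pick_answer(question: str, options: list[str], log_prob) -> int:
    """Choose the option the model assigns the highest per-token log-probability to.

    `log_prob(text)` is a hypothetical helper returning (total_logprob, num_tokens)
    for the string under the language model being evaluated.
    """
    scores = []
    for option in options:
        total, n_tokens = log_prob(question + " " + option)
        scores.append(total / n_tokens)   # length-normalize so longer options are not penalized
    return max(range(len(options)), key=lambda i: scores[i])
```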

Handling Unconstrained Output

When evaluating unconstrained text generation, determining correctness becomes challenging, as semantically identical outputs may differ in tokenization. Evaluation of open-ended questions will be addressed in detail later.

Impact of Tokenizer Design on Perplexity

Perplexity is influenced by the tokenizer used in the model. For instance, if ChatGPT employs a tokenizer with 10,000 tokens and another model (e.g., Gemini from Google) utilizes one with 100,000 tokens, the upper bounds of perplexity differ. This discrepancy can lead to misleading comparisons, as the tokenizer's design choices significantly affect evaluation outcomes.

Evaluation Challenges

There are many challenges in evaluating large language models (LLMs). I'll briefly discuss two of them.

Inconsistent Evaluation Methods

First, there are various methods to evaluate these models. Historically, different companies and organizations have used different evaluation setups, leading to inconsistent results. For example, Meta's LLaMA 65B model achieved 63.7% accuracy under one evaluation framework (HELM), while another evaluation harness reported only 48.8% on the same task. This discrepancy highlights the importance of standardized evaluation methods, as inconsistencies can arise not only from the evaluation frameworks themselves but also from prompting techniques.

Train-Test Contamination

Another significant challenge is train-test contamination. This is particularly important in academia, where the provenance of training data is often unclear. While companies may have insight into their training datasets, researchers frequently do not. To assess whether a test set leaked into the training set, various methods can be employed. One interesting approach examines the order in which the model prefers to generate examples: by presenting test examples in different orders and comparing their likelihoods, researchers can infer whether specific data was part of the training set.

Data Challenges

Data is another substantial topic in the context of LLMs. At a high level, many claim that LLMs are trained on "all of the internet," but this description is vague. Some say it's "clean internet," which is even less defined. The reality is that the internet is often messy and not representative of the desired training material. If one were to download a random website, the content might be surprising and far from the quality expected from sources like Wikipedia.

Data Collection Process

To begin the training process, web crawlers are used to download data from the internet. These crawlers can access every web page available through Google, resulting in approximately 250 billion pages, which amounts to around one petabyte of data. Common Crawl is a widely-used web crawler that collects this data monthly.

After collecting raw data, several steps must be taken to make it suitable for training models:

  1. Extracting Text from HTML: The first task is to extract meaningful text from HTML pages, which can be challenging, especially for content like mathematical expressions.

  2. Filtering Undesirable Content: Next, undesirable content such as NSFW material, harmful content, and personally identifiable information (PII) is filtered out. Companies typically maintain a blacklist of sites to exclude from training datasets. Additionally, machine learning models may be trained to classify and remove PII.

  3. Deduplication: It's essential to eliminate duplicate content from the dataset. This includes removing repeated headers and footers from forums, as well as identical paragraphs from common books scattered across the internet.

  4. Heuristic Filtering: This step involves rules-based filtering to identify and remove low-quality documents. For instance, if a webpage contains an unusual distribution of tokens or has an extremely short or long length, it may be flagged for exclusion.
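
To illustrate the heuristic-filtering step (item 4 above), here is a toy rule-based filter; every threshold is an arbitrary placeholder rather than a value from any production pipeline.

```python
def passes_heuristics(doc: str) -> bool:
    """Toy rule-based quality filter; thresholds are illustrative only."""
    words = doc.split()
    if not (50 <= len(words) <= 100_000):              # too short or absurdly long
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):                  # unusual word-length distribution
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.7:                               # mostly symbols/markup -> likely boilerplate
        return False
    if len(set(words)) / len(words) < 0.2:              # highly repetitive text (e.g. spam)
        return False
    return True

print(passes_heuristics("word " * 200))   # False: flagged as repetitive
```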

Model-Based Filtering

Once a significant amount of data has been filtered, model-based filtering is applied. A classifier can be trained using high-quality references, such as Wikipedia links, to distinguish between quality content and less reliable sources. This method seeks to prioritize data likely to improve the model's performance.

Domain Classification and Weighting

Next, the data is classified into different domains (e.g., entertainment, code, books), allowing researchers to adjust the training weight of each domain. For example, if training on code data enhances reasoning abilities, that domain's representation can be increased in the dataset. In contrast, less useful domains like entertainment might be downweighted.

Final Training Steps

Finally, after completing the data preparation, the training process involves using high-quality data to fine-tune the model. This typically involves a decreased learning rate to help the model overfit on high-quality datasets, such as Wikipedia and human-generated content.

Overall, data collection and preparation are critical components of training large language models, and these processes require significant effort and resources.

Discussion

Questions on Data Processing and Team Size

A fundamental question arises regarding the amount of data remaining after filtering. Typically, after rigorous filtering, the dataset size can vary significantly, but many terabytes of data may remain.

Regarding team size, it is difficult to provide an exact number. For example, in a team like LLaMA's, which comprises about 70 people, perhaps 15 would focus on data preparation. While a relatively small team can accomplish the tasks, it often requires substantial computational resources.

As the field advances, more efficient data processing techniques and methodologies are being explored, including synthetic data generation and the integration of multimodal data.

Data Processing for Large Language Models

Overview of Data Collection

When discussing the training of large language models (LLMs), it’s common to hear that they are trained on the entire Internet or a “clean” version of it. However, what does this really entail? The reality is that the Internet contains a vast array of data, much of which is not representative of what we want to model. For instance, if you were to randomly download a webpage, the content may not resemble the structured information found on platforms like Wikipedia. In fact, a random HTML page might include incomplete sentences or irrelevant content, making it less useful for training.

To address these challenges, data collection involves several steps:

  1. Web Crawling: This is the process of using web crawlers to traverse and download webpages. Currently, the Internet has around 250 billion pages, amounting to approximately one petabyte of data. Many researchers create their own web crawlers or utilize established ones like Common Crawl, which updates its dataset monthly with newly discovered websites.

  2. Data Extraction: Once the data is collected, the next step is to extract meaningful text from the HTML files. This is complicated by the need to handle mathematical content and boilerplate information (like headers and footers) that appears repeatedly across many sites.

  3. Content Filtering: This involves removing undesirable content, including not-safe-for-work material, harmful information, and personally identifiable information (PII). Most organizations maintain a blacklist of websites to exclude from their training datasets.

  4. Deduplication: Redundant content is common on the Internet. Deduplication helps eliminate repeated headers, footers, and common paragraphs to streamline the dataset.

  5. Heuristic Filtering: This step uses rules-based filtering to identify low-quality documents. For example, pages with unusual token distributions or atypical word lengths might be flagged for exclusion.

The Role of Machine Learning in Data Filtering

The idea behind filtering undesirable content is to focus on the quality of the training data rather than penalizing the model for generating unwanted content. This focus allows for a cleaner dataset that better represents human language.

Model-Based Filtering

After significant filtering, a clever technique involves leveraging Wikipedia as a reference point. By analyzing links within Wikipedia, researchers can identify high-quality sources. A classifier is trained to differentiate between content from these high-quality references and the random web, with a goal of prioritizing the former.

Once the data has been categorized, it can be weighted according to its domain. For instance, increasing the proportion of code-related data might enhance the model’s reasoning abilities, while entertainment content could be deprioritized.

At the end of the training phase, models typically fine-tune on high-quality data sources, such as Wikipedia or curated human-generated content. This process can also include continual pre-training to extend context understanding.

Challenges in Data Processing

The complexity of data processing for LLMs cannot be overstated. When people mention training on the Internet, they often overlook the extensive effort required to curate high-quality datasets. Collecting and processing data is indeed a critical aspect of training competitive LLMs.

Data Volume and Team Size

To provide some perspective, the initial petabyte-scale raw crawl may shrink dramatically after filtering and deduplication. While the exact final size varies, substantial reductions are common. As for the team size involved in these operations, a typical data team may consist of around 15 individuals, though this number can fluctuate based on project demands.

Data Scaling

Despite the extensive work involved in processing data, companies often remain tight-lipped about their collection methodologies due to competitive dynamics and potential copyright liabilities. While the academic community may reference benchmarks, the actual scales of data used in proprietary models are often much larger.

Common Academic Benchmarks

As data collection has evolved, the size of training datasets has grown tremendously. Early benchmarks began with around 150 billion tokens (approximately 800 GB) and now encompass up to 15 trillion tokens. This scaling reflects the increasing complexity and capability of the best-performing models.

Notable Datasets and Models

For instance, the Pile dataset serves as a comprehensive open dataset, incorporating a variety of sources, including academic papers, Wikipedia, and programming repositories like GitHub. In contrast, models like Llama 2 were trained on roughly 2 trillion tokens and Llama 3 on about 15 trillion, and available leaks suggest GPT-4's training data falls within a similar range.

Evaluation Challenges

Evaluating these models presents its own set of challenges. Various methodologies exist for assessment, but the focus remains on ensuring that models perform well across different domains of knowledge.

Data Processing in Language Model Training

During the training phase, we usually focus on high-quality data, particularly at the end of training when we reduce the learning rate. This lets the model fit, and deliberately mildly overfit, the high-quality datasets, such as Wikipedia and curated human-generated data.

It's important to note that continual pre-training for extended context is a significant aspect, but I'll skip over that for now. The effort required for this stage is substantial; simply stating that one will "train on the internet" is an oversimplification. Collecting and curating real-world data is critical to the practical success of large language models; some would argue it is the key factor.

Team Size and Data Volume

To address a common question: we typically start with a raw crawl on the order of a petabyte. After filtering, the volume decreases significantly, although estimating the exact post-filtering size is complex. The number of people involved in data processing can be quite large, often exceeding the number involved in model tuning.

For instance, in the LLaMA project, which has a team of around 70 people, approximately 15 are dedicated to data processing. While fewer people are needed for this task, it requires extensive computational resources, particularly CPUs.

Challenges in Data Processing

Despite the advancement in technology, we still face numerous challenges in data pre-training:

  1. Data Processing Efficiency: We haven't yet solved how to process data efficiently.
  2. Domain Balancing: Balancing various data domains is a concern.
  3. Synthetic Data Generation: There's ongoing research into whether synthetic data can enhance performance.
  4. Multimodal Data Usage: Exploring the benefits of multimodal data instead of solely text data is crucial.

Data Collection Techniques

When discussing data, it’s crucial to clarify what "training on all of the internet" entails. The reality is that the internet is messy and not representative of the ideal datasets for training. The process typically involves:

  1. Web Crawling: Utilizing web crawlers to gather data, which currently amounts to around 250 billion pages, roughly equating to one petabyte of information.

  2. Data Extraction: Extracting clean text from the HTML content. This process can be challenging, especially when it comes to extracting complex elements like mathematical content.

  3. Content Filtering: Filtering out undesirable content, such as not safe for work (NSFW) material, personally identifiable information (PII), and other harmful data. Companies maintain extensive blacklists of websites to exclude.

  4. Deduplication: Eliminating duplicate content, which can occur due to consistent headers, footers, or repeated paragraphs across different URLs.

  5. Heuristic Filtering: Applying heuristic methods to remove low-quality documents based on token distribution, word length, and other outlier characteristics.

Addressing the Quality of Data

A point of discussion arises around the filtering of undesirable content versus using it in supervised learning. The consensus is that while it might seem viable to incorporate such data with a supervised loss function, the implications of including low-quality or harmful content could detract from the overall efficacy and reliability of the language model.

Scaling Laws in Model Training

So now, let's delve into scaling laws in model training.

Imagine a scenario: you have been given access to 10,000 GPUs for a month. A big question arises—what model would you train? How do you even approach answering this question? While this may seem hypothetical, it's precisely the challenge these companies face.

The Old vs. New Pipeline

Old Pipeline

In the older approach:

  1. Hyperparameter tuning was performed directly on the large models.
  2. With 30 days to train, you’d train 30 different models, each for a single day.
  3. You’d choose the best-performing model and proceed to use that in production.
  4. Consequently, the model ultimately deployed was trained for only one day.

New Pipeline

With scaling laws, the approach shifts:

  1. Start by finding a scaling recipe, such as "If you increase the model size, you should decrease the learning rate."
  2. Perform hyperparameter tuning on smaller models across different scales. For example, in the first 3 days, you might train various small models and tune them.
  3. Fit a scaling law to extrapolate how well the model might perform if trained as a larger model.
  4. Then, dedicate the remaining 27 days to training the final, large model.

This new pipeline means you're no longer tuning hyperparameters on the final model's scale, but rather on scaled-down versions. By understanding scaling laws, you predict how a small, well-performing model will scale up.

Example: Transformers vs. LSTMs

Say you're choosing between Transformer and LSTM models. You train Transformers at different scales and then plot test loss (y-axis) against model parameters (x-axis). Likewise, you do the same for LSTMs.

  1. After obtaining data points, fit a scaling law to see how the model would perform with increased compute.
  2. Based on this plot, you’ll see if Transformers perform better at larger scales, with LSTMs possibly showing a less linear trend.
  3. In reading these scaling laws, two components are crucial:
    • Scaling rate (slope of the law) and
    • Intercept (how the model starts relative to others).

Transformers might show a better scaling rate and intercept over LSTMs, confirming their efficiency at scale.
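
A sketch of how such a scaling law can be fit: assume the test loss follows a power law in parameter count, loss ≈ a * N^(-alpha), fit a line in log-log space, and extrapolate. The data points below are invented purely for illustration.

```python
import numpy as np

# Hypothetical (parameter count, test loss) pairs from small training runs -- made-up numbers.
params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss   = np.array([4.2, 3.9, 3.5, 3.2, 2.9])

# Assume loss ~ a * N^(-alpha); in log-log space this is a straight line.
slope, intercept = np.polyfit(np.log(params), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)   # alpha: scaling rate, a: related to the intercept

# Extrapolate to the scale of the final run (e.g. a hypothetical 10B-parameter model).
predicted_loss = a * (1e10) ** (-alpha)
print(f"scaling rate alpha={alpha:.3f}, predicted loss at 10B params={predicted_loss:.2f}")
```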

Sensitivity to Architecture Differences

Scaling laws also allow us to consider architecture sensitivity:

  • Small architectural changes (like new activations) typically only adjust the intercept, but don't drastically alter the scaling law.
  • In fact, once you see scaling laws in action, architecture and loss function adjustments tend to matter less. Data quality has a more significant effect on the scaling law's efficiency.

Resource Allocation with Scaling Laws

Scaling laws also help answer another critical question: How to optimally allocate training resources?

For example, you might wonder whether to:

  1. Train a smaller model on more data, or
  2. Train a larger model on less data.

One prominent study—Chinchilla—illustrates this. They show how training loss correlates with parameter size (model size) and compute.

  • Iso-flop curves plot models trained with equivalent compute resources by varying tokens and model sizes.
  • By optimizing parameters and compute, they found the optimal ratio for training: 20 tokens per parameter. For each additional parameter, the model should train on 20 more tokens.
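
Combining this ratio with the common approximation that training compute is roughly 6 * N * D FLOPs (parameters times tokens) gives a simple allocation rule. The sketch below treats both the constant 6 and the 20:1 ratio as the rough rules of thumb they are.

```python
def chinchilla_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a compute budget into model size and token count.

    Uses the rough approximation FLOPs ~= 6 * N * D with D = tokens_per_param * N,
    so N = sqrt(FLOPs / (6 * tokens_per_param)).
    """
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a hypothetical 1e23 FLOP training budget.
n, d = chinchilla_allocation(1e23)
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")
```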

Practical Inference Considerations

In real-world applications, inference costs are as crucial as training costs:

  • Smaller models have lower inference costs, leading companies to prefer models trained with fewer parameters.
  • Papers suggest an optimal 150 tokens per parameter for models used in production, balancing accuracy and inference cost.

Concluding Scaling Laws

Scaling laws have transformed the approach to model training, resource allocation, and architecture design, allowing us to predict the best configurations for given resources.

Post-Training and Alignment for AI Assistants

Now that we’ve covered pre-training, let's discuss post-training and alignment, which are essential for transforming language models into effective AI assistants.

The Purpose of Post-Training

Pre-trained language models like GPT-3 are excellent at generating language, but they don’t naturally provide the type of responses desired in an assistant setting. For example, if you ask a non-aligned model a question such as "Explain the moon landing to a six-year-old," the response might offer similar questions rather than a direct answer, as this pattern appears frequently online.

To turn a language model into an assistant, post-training involves alignment so that models can:

  • Follow user instructions effectively.
  • Avoid generating toxic or harmful responses (important for moderation).

In an aligned model:

  • The model generates helpful, contextually appropriate responses to questions.
  • When asked inappropriate requests, such as writing a divisive tweet, the model responds by refusing to comply.

Supervised Fine-Tuning (SFT)

The main technique for alignment is supervised fine-tuning (SFT), where:

  1. Human-annotated question-answer pairs serve as the fine-tuning data.
  2. These pairs train the model to generate desired responses, following specific instructions.

This process leverages pre-trained models that already understand language syntax and structure. For example, Open Assistant collected human responses to questions like "Can you write a short introduction about the relevance of the term monopsony?" to provide accurate answers, which are used in SFT.

SFT is crucial in transforming models like GPT-3 into widely usable tools, as seen with ChatGPT. This leap made ChatGPT broadly accessible and useful to a global audience, moving beyond the AI research community.
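
A minimal sketch of the SFT objective: each instruction/response pair is formatted as a single sequence, and the same next-token cross-entropy loss as in pre-training is applied, typically only on the response tokens. The prompt template, `model`, and `tokenizer` below are hypothetical stand-ins, not the format of any particular project.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, instruction: str, response: str) -> torch.Tensor:
    """Next-token loss on the response tokens of one instruction/response pair.

    `model` maps token IDs to logits and `tokenizer.encode` returns a list of IDs;
    both are hypothetical stand-ins for a real LLM stack.
    """
    prompt_ids = tokenizer.encode(f"### Instruction:\n{instruction}\n### Response:\n")
    response_ids = tokenizer.encode(response)
    ids = torch.tensor([prompt_ids + response_ids])

    logits = model(ids)                               # (1, seq_len, vocab_size)
    targets = ids[:, 1:]                              # predict token t+1 from tokens <= t
    logits = logits[:, :-1]
    loss_mask = torch.zeros_like(targets, dtype=torch.bool)
    loss_mask[:, len(prompt_ids) - 1:] = True         # only score positions inside the response

    return F.cross_entropy(logits[loss_mask], targets[loss_mask])
```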

Challenges in Data Collection for SFT

Human-annotated data is essential for SFT but has limitations:

  • Time and Cost: Human-generated responses are slow and expensive to collect.
  • Volume Requirements: Although pre-training data is abundant, post-training data that captures desired responses is harder to source.

Scaling Data Collection with Language Models

To address these limitations, researchers have started using language models (LLMs) to generate synthetic data, as seen in the Alpaca project:

  1. A set of 175 human-generated question-answer pairs was used as a foundation.
  2. LLMs like Text-Davinci-003 generated additional pairs based on these examples.
  3. The model (Llama 7B) was then fine-tuned with 52,000 AI-generated pairs, creating Alpaca-7B.

This process allows for rapid generation of high-quality synthetic data, reducing reliance on human annotators.

Effectiveness of SFT: Data Quantity Insights

Interestingly, SFT does not require large amounts of data to be effective:

  • Research shows that increasing SFT data from 2,000 to 32,000 samples has minimal effect on performance (as observed in the LIMA paper).
  • This indicates that scaling laws don’t apply strongly here. The model only needs enough data to learn the desired response format, not new content.

Intuition

Pre-trained models already encompass general knowledge across various response styles, e.g., lists, bullet points, or narrative answers. SFT simply guides the model to prioritize specific response styles over others.

In essence, SFT optimizes models to align with one type of user behavior already present in pre-training, without fundamentally adding new knowledge.

Reinforcement Learning from Human Feedback (RLHF)

The second component of post-training, beyond supervised fine-tuning, is Reinforcement Learning from Human Feedback (RLHF). This process addresses limitations in SFT and aims to refine AI model responses further based on human preferences.

Why RLHF?

Supervised fine-tuning (SFT) focuses on behavioral cloning—attempting to imitate human-provided responses. However, SFT alone has several limitations:

  1. Bound by Human Abilities: SFT can only reproduce human-level outputs. While humans are good at distinguishing quality, they might not always generate the most desirable or ideal responses.
  2. Potential for Hallucinations: Even with accurate supervised data, hallucinations (generation of plausible but false information) can occur. During SFT the model is trained to produce answers that may contain facts it never actually acquired during pre-training, which encourages it to state plausible-sounding information it cannot verify.
  3. Cost of Human-Generated Data: Collecting ideal human responses is time-consuming and expensive, further limiting SFT’s scalability.

The RLHF Pipeline

RLHF addresses these issues by shifting from imitating human behavior to optimizing human preference:

  1. For each instruction, the model generates two potential answers.
  2. Labelers are asked to select the preferred answer.
  3. Using reinforcement learning, the model is fine-tuned to generate more of the preferred answers.

Reinforcement Learning Strategies

Two primary strategies for applying reinforcement learning in RLHF are commonly used:

1. Direct Reward Comparison

In this approach:

  • The model output is compared with a baseline output.
  • A human evaluator decides which output is better.
  • Binary reward is assigned: +1 if better than the baseline, -1 if not.

However, binary rewards can be sparse and don’t reflect the degree of improvement, limiting the information the model gains from each evaluation.

2. Training a Reward Model

To provide more granular feedback, a reward model can be used, which acts as a classifier to quantify how much better one response is over another. Here’s the process:

  • A reward model ( R ) takes the input and one of the outputs, assigning a reward score based on how favorable it is.
  • This score is computed using a softmax function, comparing rewards for both outputs.

The reward model's goal is to differentiate the quality of responses, assigning higher scores (logits) to better answers. The probability that one response is preferred over another is given by a softmax over the two scores; this formulation is known as the Bradley-Terry model.
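
A minimal sketch of the resulting training objective: under the Bradley-Terry model, the probability that the preferred answer wins is a sigmoid (a two-way softmax) of the difference between the two reward scores, and the loss is its negative log.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-probability that the preferred answer wins under the Bradley-Terry model.

    P(chosen > rejected) = sigmoid(r_chosen - r_rejected), i.e. a softmax over the two scores.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to a batch of (chosen, rejected) answer pairs.
r_chosen = torch.tensor([1.5, 0.2, 2.0])
r_rejected = torch.tensor([0.5, 0.4, -1.0])
print(bradley_terry_loss(r_chosen, r_rejected))   # lower when chosen answers score higher
```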

Reward Model Functionality

The reward model evaluates the entire response and provides a single score. This output then acts as a training signal for the primary language model, allowing it to prioritize responses that align better with human preferences.

Post-Training Evaluation

In this section, we discuss methods for evaluating post-training effectiveness, especially in systems like ChatGPT where answers are open-ended and often have multiple acceptable variations. Evaluating these models presents several challenges:

  • Validation Loss Limitations: Using validation loss to compare models (e.g., those trained with PPO versus DPO) is ineffective, as validation loss may not correlate with human preferences.
  • Perplexity Issues: Perplexity is also unreliable for evaluation. Once models are tuned for specific user-aligned behaviors, they may not provide meaningful distributions. For example, after PPO training, the model may output only one possible response, rather than a probability distribution over multiple responses.
  • Task Diversity: Models face a wide range of user queries, from open-ended generation to question-answering and summarization, making it difficult to establish uniform benchmarks.

Chatbot Arena and Other Benchmarks

The Chatbot Arena is one common benchmark for evaluating conversational models. It uses blind pairwise testing, where users interact with two chatbots and rate their responses. This method, while effective, has limitations:

  1. User Bias: Participants in Chatbot Arena are often tech-savvy, which may skew questions toward technical topics (e.g., software or AI).
  2. Cost and Speed: Extensive human testing is costly and time-consuming, making it unsuitable for frequent evaluation during model development.

Alternative: Using Language Models for Evaluation

To mitigate these issues, we can leverage language models (LMs) instead of humans for model evaluation. The process involves:

  1. Generating outputs from both a baseline model and the model under evaluation.
  2. Using a separate LM (e.g., GPT-4) to judge which output is better.
  3. Averaging these preferences across the dataset to compute a "win rate" for the evaluated model.

This approach, known as AlpacaEval, has shown high correlation (98%) with human evaluations from Chatbot Arena and is more cost-effective, requiring only about $10 and three minutes to complete.
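
A sketch of that win-rate computation is below. The `generate_model`, `generate_baseline`, and `judge` callables are hypothetical stand-ins (the judge asks a strong LM such as GPT-4 which answer is better); presentation order is randomized here as a common precaution against position bias.

```python
import random

def win_rate(dataset, generate_model, generate_baseline, judge) -> float:
    """Fraction of instructions where the judge prefers the evaluated model.

    `judge(instruction, output_a, output_b)` is a hypothetical callable returning "a" or "b".
    """
    wins = 0
    for instruction in dataset:
        model_out = generate_model(instruction)
        base_out = generate_baseline(instruction)
        # Randomize presentation order so the judge's position bias averages out.
        if random.random() < 0.5:
            wins += judge(instruction, model_out, base_out) == "a"
        else:
            wins += judge(instruction, base_out, model_out) == "b"
    return wins / len(dataset)
```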

Potential Pitfalls in LM-based Evaluation

One drawback of using LMs to judge outputs is bias towards longer responses. Both humans and LMs tend to prefer longer responses, but this can become problematic in automated evaluations. For instance:

  • When asked to generate a verbose response, GPT-4 scored a win rate of 64.4%, compared to 50% when producing typical responses.
  • Conciseness significantly reduced win rates, highlighting an overemphasis on verbosity.

To mitigate this, regression analysis can control for response length, making evaluations less sensitive to verbosity. While verbose prompts still show slight gains, the effects are less pronounced when length bias is accounted for.

Training and System Optimization

In this section, we discuss further training optimizations and system considerations for large language models, addressing questions about fine-tuning, parameter adjustments, and GPU utilization.

Fine-Tuning Approaches and Data Scaling

Fine-tuning in industry often involves adjusting all model weights, not just a subset. In the open-source community, techniques like LoRA (Low-Rank Adaptation) instead freeze the pre-trained weights and train small low-rank update matrices added to specific layers; a minimal sketch follows the list below.

  • Data Volume: For fine-tuning, datasets are much smaller than for pre-training. Supervised fine-tuning (SFT) typically uses around 5,000–50,000 samples, while Reinforcement Learning from Human Feedback (RLHF) may use around 1 million samples. Despite this, fine-tuning has a substantial impact because the model is repeatedly exposed to this data.
  • Effective Learning Rate: The learning rate and repetition play a crucial role. Even a single sentence repeated frequently with a high learning rate will eventually dominate model behavior.
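
Below is a minimal sketch of a LoRA-style adapter: the pre-trained weight is frozen and a trainable low-rank update B * A is added on top. The rank and scaling values are illustrative, not those of any particular implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # pre-trained weights stay fixed
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)   # only the low-rank matrices are trainable
```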

Pre-Training as Initialization

Pre-training provides the initialization of the weights, and fine-tuning then adjusts the model from that starting point. Viewing pre-training as an initialization step clarifies that the final behavior reflects the fine-tuning process applied on top of those weights, rather than any explicit dependence on the pre-training corpus during fine-tuning.

GPU Utilization and Optimization Techniques

In training large models, compute resources are a primary bottleneck. While adding GPUs could help, GPUs are costly, scarce, and require optimized usage for efficient training. Several key concepts help maximize GPU efficiency:

  1. Throughput vs. Latency: GPUs are designed for throughput, performing parallel computations efficiently. Optimizing throughput with matrix multiplications is crucial, as they run faster than other operations.

  2. Memory Bottlenecks: Memory access is often slower than computation, and data transfer between GPUs (and between GPU memory and compute units) can leave hardware idle. Optimizing data movement minimizes GPU idle time.

  3. Mixed Precision: Reducing the precision of floating-point operations to 16 bits (from 32 or 64) reduces memory usage and speeds up computation with little loss in accuracy. The model’s weights are stored in 32 bits but converted to 16 bits during computations for faster processing.

  4. Operator Fusion: Operator Fusion reduces repeated data transfers by consolidating multiple operations into one kernel. Instead of moving data back and forth between the GPU processor and global memory, it performs all operations in a single transfer, reducing overhead and improving efficiency.

    • In PyTorch, the torch.compile function performs operator fusion by converting the model’s code to optimized CUDA code, increasing speed by approximately two times.
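
The sketch below combines two of these ideas on a toy model: mixed precision via torch.autocast and operator fusion via torch.compile. It assumes a CUDA-capable GPU and a recent PyTorch (2.x); actual speedups depend on the hardware and model.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

compiled_model = torch.compile(model)     # fuses operations into optimized kernels

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

# Weights stay in 32-bit; matmuls and activations run in 16-bit inside autocast.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(compiled_model(x), target)

loss.backward()
optimizer.step()
```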

Further Optimizations: Tiling and Parallelism

Additional optimization techniques like tiling and parallelism further enhance performance. Tiling divides computations into smaller parts for efficient caching, and parallelism maximizes usage of all available processing units.
