<aside> 🔏

A summary of 2000+ finetuning runs, 1000+ GPU hours, and manual analysis of hundreds of model responses.

</aside>


TL;DR:

Base models are bad at reasoning in the response space (Section 2). A small amount of SFT aligns the model's response distribution to the required multistep reasoning style: it imparts the ability to reason, even if the reasoning isn't always correct (Section 4). Further SFT helps, but data curation is expensive relative to the marginal gains it buys (Section 3). Preference finetuning, on the other hand, has a weaker per-sample reward signal, which is why many models resort to large-scale RL tuning. However, starting from an SFT checkpoint improves RL sample efficiency: the (weaker) reward signal goes toward improving reasoning accuracy rather than style, since the policy doesn't have to stray far from the SFT response distribution and incur a large KL penalty (Section 5; the objective is sketched below).

Each claim in the statement above is backed by experiments I ran throughout my research, described in the sections referenced in parentheses.
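For the RL claim in particular, the relevant quantity is the standard KL-regularized objective used in most RLHF setups (written generically here; the specific reward model $r$ and coefficient $\beta$ are implementation details of each setup, not something fixed by this blog):

$$
\max_{\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
$$

Here $\pi_{\mathrm{ref}}$ is the reference policy the KL term is measured against. When that reference is an SFT checkpoint that already produces responses in the right style, the policy can spend its KL budget on getting the reasoning right rather than on changing how responses look, which is the mechanism behind the sample-efficiency claim above.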

The rest of this blog summarizes and connects two papers I worked on that study how a model learns a capability through the pretraining-SFT-RLHF pipeline. (Please refer to the papers for more detailed experiments and empirical analysis.)

  1. Revisiting the Superficial Alignment Hypothesis - https://arxiv.org/pdf/2410.03717
  2. Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning - https://arxiv.org/pdf/2502.11284

Introduction

The journey of an LLM begins with pre-training on massive text corpora collected from the internet, books, and other diverse sources. During this phase, models develop general language understanding and acquire knowledge through unsupervised objectives like next-token prediction. This equips them with broad knowledge about the world and fundamental language patterns, essentially building their foundational capabilities. However, many of these capabilities remain latent and are not easy to elicit, so it is hard to say whether the model has learned a capability in the first place.
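To make "next-token prediction" concrete, here is a minimal sketch of the pretraining loss in PyTorch. It assumes `model(token_ids)` returns raw logits of shape `(batch, seq_len, vocab_size)`; real training stacks wrap this in far more machinery, but the objective itself is just this cross-entropy.

```python
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    """Next-token prediction: position t is trained to predict token t+1."""
    logits = model(token_ids)               # (batch, seq_len, vocab_size), assumed
    preds, targets = logits[:, :-1, :], token_ids[:, 1:]
    return F.cross_entropy(
        preds.reshape(-1, preds.size(-1)),  # flatten to (N, vocab_size)
        targets.reshape(-1),                # flatten to (N,)
    )
```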

Say I want a model M to get better at a task T. Working backward, we have a base model that is largely agnostic to T and is simply pretrained on lots of data (except for a few select domains like multilinguality and coding). The pretraining corpus may well contain data relevant to T: Shakespeare's poems appear on the internet in many places, yet a pretrained model is generally not very good at writing a rap song in Shakespeare's style when prompted to. So how does a model get better at writing like Shakespeare? Model builders look at a collection of such tasks: creative writing, mathematical problem-solving, brainstorming, real-world coding, etc. The conventional pipeline is to first "pre-train" a base model on general-purpose text data, then apply Supervised Finetuning (SFT), and finish with Reinforcement Learning from Human Feedback (RLHF) or Preference Finetuning (PFT) [Tulu3, Llama3]. Although the SFT-RLHF/PFT stage (post-training) can be run over multiple rounds, that is mostly for maximizing performance, so analyzing a single round does not lose generality.
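As a sketch of what the SFT stage optimizes: it is the same next-token objective as pretraining, but computed only on the response tokens, which is what pulls the model's response distribution toward the demonstrated style. The `prompt_lens` bookkeeping below is illustrative, not any particular library's API.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value that cross_entropy skips

def sft_loss(model, token_ids, prompt_lens):
    """Supervised finetuning: next-token loss on response tokens only.

    token_ids:   (batch, seq_len) prompt tokens followed by response tokens
    prompt_lens: (batch,) number of prompt tokens in each example
    """
    labels = token_ids.clone()
    for i, n in enumerate(prompt_lens):
        labels[i, :n] = IGNORE_INDEX     # mask the prompt: no gradient from it

    logits = model(token_ids)            # (batch, seq_len, vocab_size), assumed
    preds, targets = logits[:, :-1, :], labels[:, 1:]
    return F.cross_entropy(
        preds.reshape(-1, preds.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```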

1. Is less really more for finetuning?