Authors
Rafael Rafailov∗† Archit Sharma∗† Eric Mitchell∗†
Stefano Ermon†‡ Christopher D. Manning† Chelsea Finn†
- †Stanford University ‡CZ Biohub
Abstract
- However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.
- In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss.
- The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning.
- Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
Introduction
- We will show that the RL-based objective used by existing methods can be optimized exactly with a simple binary cross-entropy objective, greatly simplifying the preference learning pipeline.
- RLHF methods fit a reward model to a dataset of human preferences and then use RL to optimize a language model policy to produce responses assigned high reward without drifting excessively far from the original model. While RLHF produces models with impressive conversational and coding abilities, the RLHF pipeline is considerably more complex than supervised learning, involving training multiple LMs and sampling from the LM policy in the loop of training, incurring significant computational costs.
- We propose Direct Preference Optimization (DPO), an algorithm that implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint) but is simple to implement and straightforward to train.
- Intuitively, the DPO update increases the relative log probability of preferred to dispreferred responses, but it incorporates a dynamic, per-example importance weight that prevents the model degeneration that we find occurs with a naive probability ratio objective (a code sketch of this loss follows the list below).
- Our main contribution is Direct Preference Optimization (DPO), a simple RL-free algorithm for training language models from preferences. Our experiments show that DPO is at least as effective as existing methods, including PPO-based RLHF, for learning from preferences in tasks such as sentiment modulation, summarization, and dialogue, using language models with up to 6B parameters.
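For concreteness, here is a minimal PyTorch-style sketch of this binary cross-entropy loss. It assumes the per-sequence log probabilities of each preference pair under the policy and a frozen reference model have already been computed; the function name, argument names, and the default β = 0.1 are illustrative, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """DPO objective as a binary cross-entropy over implicit reward margins.

    Each *_logps tensor holds per-example sequence log probabilities
    (log-probs summed over tokens) of the preferred ("chosen") or
    dispreferred ("rejected") response under the trained policy or the
    frozen reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary cross-entropy on the margin: -log sigmoid(r_chosen - r_rejected).
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # The gradient of this loss scales each example by
    # sigmoid(r_rejected - r_chosen): pairs whose implicit rewards are
    # currently mis-ordered receive larger updates -- the dynamic
    # per-example weight mentioned above.
    example_weights = torch.sigmoid(rejected_rewards - chosen_rewards).detach()

    return losses.mean(), example_weights
```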
Preliminaries
- The RLHF pipeline usually includes three phases: 1) supervised fine-tuning (SFT); 2) preference sampling and reward learning; and 3) RL optimization (the objectives for phases 2 and 3 are written out after this list).
- SFT: RLHF typically begins by fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), to obtain a model π^SFT.
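For reference, the other two phases are usually formulated as follows (a restatement of the standard setup: π_θ is the policy, π_ref the frozen reference model, r_φ the learned reward, β the KL coefficient, and (x, y_w, y_l) a prompt with preferred and dispreferred responses). The reward is fit with a Bradley-Terry style maximum-likelihood loss, and the RL phase maximizes a KL-regularized expected reward:

$$
\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
$$

$$
\max_{\pi_\theta}\;\; \mathbb{E}_{x\sim \mathcal{D},\; y\sim \pi_\theta(\cdot\mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(y\mid x)\,\big\|\,\pi_{\mathrm{ref}}(y\mid x)\big]
$$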
Direct Preference Optimization
- Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach leverages a particular choice of reward model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop. As we will describe next in detail, our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies.
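Written out, the mapping referenced above is the standard one for the KL-regularized objective: for a given reward r, the optimal policy is a reweighted reference policy, and inverting that relation expresses the reward through the policy (the partition function Z(x) cancels when comparing two responses to the same prompt), which turns preference fitting into the classification-style loss below.

$$
\pi_r(y\mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\qquad
r(x, y) = \beta \log \frac{\pi_r(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta \log Z(x)
$$

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$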