Orchestrating LLM alignment through autonomous agent collaboration. 11 techniques, modular pipeline, beautiful agent communication.
Specialized agents coordinate via a structured message bus
Orchestrates pipeline, assigns tasks, monitors health
PPO, GRPO, DPO, SPO, RLHF
KTO, ORPO, RLAIF, SPIN, SimPO, IPO
Quantization (GPTQ, AWQ, GGUF)
Pruning, Distillation
MMLU, HumanEval, MT-Bench
GSM8K, TruthfulQA, ARC
11 techniques organized by priority. Core techniques shown first.
Proximal Policy Optimization — the backbone of RLHF alignment
Stable RL with clipped surrogate objectives
Schulman et al., 2017
PPO optimizes a clipped surrogate objective to prevent destructive policy updates. Used by OpenAI for ChatGPT alignment.
Pros: stable updates via clipped ratios; battle-tested at scale
Cons: requires a separate value model (memory-heavy); sensitive to hyperparameters
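A minimal sketch of the clipped surrogate in plain Python (eps=0.2 is the paper's default clip range; the function name is illustrative):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate objective: ratio is pi_new(a|s) / pi_old(a|s).
    Take the pessimistic minimum of the unclipped and clipped terms,
    negated so it can be minimized by gradient descent."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)
```

The clip keeps a single update from moving the policy too far: large ratios stop earning extra reward, which is what prevents destructive policy updates.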
Group Relative Policy Optimization — DeepSeek-R1's secret weapon
No value model needed — roughly half the memory of PPO
Shao et al., 2024 (DeepSeekMath / DeepSeek-R1)
GRPO generates a group of responses per prompt, ranks them relative to each other, and uses group statistics as advantages. Eliminates the value model entirely.
Pros: no value model, so far lower memory than PPO; simple advantage estimation
Cons: must sample a group of responses per prompt; advantages are noisy for small groups
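The group-relative advantage is just a z-score of each sampled response's reward within its group; a sketch:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each response's reward
    against its own group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]
```

Because the group statistics replace a learned baseline, no value model ever needs to be trained or held in memory.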
Direct Preference Optimization — alignment without RL
Simple classification loss, no reward model needed
Rafailov et al., 2023
DPO reparameterizes the RLHF objective into a simple binary cross-entropy loss on preference pairs. The language model IS the reward model.
Pros: no reward model or RL loop; simple, stable classification-style training
Cons: needs paired preference data; offline, so no exploration; can overfit the preference set
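The loss on one preference pair can be sketched as follows (beta=0.1 is a common choice; `w` is the chosen response, `l` the rejected one):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy on the margin between the chosen (w) and
    rejected (l) policy-vs-reference log-probability ratios."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)
```

The policy's own log-ratios act as the implicit reward, which is exactly the sense in which the language model IS the reward model.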
Self-Play Optimization — iterative self-improvement
Model improves by competing against itself
Wu et al., 2024
SPO creates a competitive dynamic: the model acts as both generator and discriminator, continuously improving through self-play with ELO tracking.
Pros: improves without fresh human preference data; self-play provides a natural curriculum
Cons: risk of degenerate self-play or mode collapse; generation-heavy, so compute-intensive
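The ELO tracking mentioned above can use the standard Elo update after each self-play match; a sketch:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update after one match between two model
    checkpoints; score_a is 1.0 for a win by A, 0.5 for a draw,
    0.0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Tracking ratings across checkpoints gives a cheap, comparable signal for whether self-play is still improving the model.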
Classic RLHF — the full reward model + PPO pipeline
Proven at scale (InstructGPT, ChatGPT)
Ouyang et al., 2022
The original: train a reward model on human preferences, then optimize the policy with PPO. Maximum control, maximum complexity.
Pros: proven at scale; an explicit reward model gives fine-grained control
Cons: three-stage pipeline (SFT, reward model, PPO) is complex and costly; susceptible to reward hacking
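The reward-model stage is typically trained with a Bradley-Terry pairwise loss on human preference pairs; a minimal sketch:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected), i.e. maximize the
    probability that the chosen response outscores the rejected one."""
    diff = r_chosen - r_rejected
    return -math.log(1 / (1 + math.exp(-diff)))
```

The trained reward model then supplies the scalar reward that PPO maximizes in the final stage.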
Kahneman-Tversky Optimization — alignment from thumbs up/down
Works with unpaired binary feedback
Ethayarajh et al., 2024
Uses asymmetric loss inspired by prospect theory. Only needs binary good/bad signals, not paired preferences.
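A rough per-example sketch of that asymmetric loss, following the shape of Ethayarajh et al.'s formulation (here `z_ref` stands in for the KL reference point, which the paper estimates from a batch):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def kto_loss(log_ratio, desirable, z_ref=0.0, beta=0.1,
             lam_d=1.0, lam_u=1.0):
    """Sketch of the KTO per-example loss. log_ratio is the policy-vs-
    reference log-probability ratio; lam_d / lam_u weight gains and
    losses asymmetrically, as in prospect theory."""
    if desirable:  # thumbs-up example: push the ratio above the reference
        return lam_d * (1 - sigmoid(beta * (log_ratio - z_ref)))
    # thumbs-down example: push the ratio below the reference
    return lam_u * (1 - sigmoid(beta * (z_ref - log_ratio)))
```

Because each example carries only a good/bad flag, no pairing of responses to the same prompt is ever required.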
Odds Ratio Preference Optimization — SFT + alignment in one
No reference model, combined training
Hong et al., 2024
Combines SFT and preference alignment in a single stage using odds ratio penalty. Very memory efficient.
RL from AI Feedback — Constitutional AI approach
No human labelers needed
Bai et al., 2022
Uses AI-generated feedback based on constitutional principles. Scalable oversight without human labelers.
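A hypothetical sketch of the labeling step: an AI judge votes on response pairs under each constitutional principle, replacing the human labeler. The principle texts and the `judge` signature are illustrative, not from any specific library.

```python
# Illustrative constitutional principles (not from a real constitution).
PRINCIPLES = [
    "Prefer the response that is more helpful and harmless.",
    "Prefer the response that avoids deception.",
]

def ai_preference(judge, prompt, response_a, response_b):
    """Ask the AI judge to vote under each principle and return the
    majority preference; ties go to 'a'. judge() returns 1 if A is
    preferred under the given principle, else 0."""
    votes_for_a = sum(
        judge(principle, prompt, response_a, response_b)
        for principle in PRINCIPLES
    )
    return "a" if 2 * votes_for_a >= len(PRINCIPLES) else "b"
```

The resulting AI-labeled preference pairs can then feed any of the preference-optimization methods above.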
Self-Play Fine-Tuning — distinguish human from LLM text
Only needs SFT data
Chen et al., 2024
Trains the model to discriminate its own outputs from human text, iteratively improving until convergence.
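Each SPIN iteration can be sketched as a DPO-style loss where the "rejected" response is the previous checkpoint's own generation and the previous policy serves as the reference:

```python
import math

def spin_loss(logp_human, logp_self, prev_logp_human, prev_logp_self,
              beta=0.1):
    """Sketch of one SPIN iteration: prefer the human SFT response over
    the previous model's self-generated response, scored against the
    previous iteration's policy as reference."""
    margin = beta * ((logp_human - prev_logp_human)
                     - (logp_self - prev_logp_self))
    return -math.log(1 / (1 + math.exp(-margin)))
```

Iterating this until the model can no longer tell its own outputs from the human data is the convergence criterion described above.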
Simple Preference Optimization — reference-free + length-normalized
No reference model, fair length comparison
Meng et al., 2024
Uses length-normalized log probabilities as implicit reward. No reference model needed at all.
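A sketch of the loss on one pair; `beta` and the target margin `gamma` are illustrative defaults in the range the paper explores:

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """Sketch of SimPO: the implicit reward is the length-normalized
    sequence log-probability; the loss penalizes reward margins that
    fall below the target margin gamma."""
    r_w = beta * logp_w / len_w  # chosen response, len_w tokens
    r_l = beta * logp_l / len_l  # rejected response, len_l tokens
    margin = r_w - r_l - gamma
    return -math.log(1 / (1 + math.exp(-margin)))
```

Dividing by response length is what makes long and short responses compete fairly, and dropping the reference model halves the memory footprint relative to DPO.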
Identity Preference Optimization — regularized DPO
Better regularization than standard DPO
Azar et al., 2023
Adds regularization to address DPO's potential overfitting issues. Theoretically grounded improvement.
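The regularization can be sketched as a squared loss that regresses the preference margin toward a fixed target instead of pushing it to infinity as DPO's sigmoid loss can:

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Sketch of IPO: the preference log-ratio margin h is regressed
    toward the fixed target 1/(2*tau), so the gradient vanishes once
    the target is reached."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1 / (2 * tau)) ** 2
```

Because the optimum is a finite margin, the policy cannot drift arbitrarily far from the reference even on a small preference set.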
Each stage is orchestrated by the coordinator agent
Validate & preprocess training data
Choose optimal alignment method
Execute post-training technique
Quantize, prune, distill
Benchmark on MMLU, MT-Bench, etc.
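The five stages above can be sketched as a simple ordered dispatch loop; stage names and the `dispatch` callable are illustrative, not the project's actual API:

```python
# Hypothetical stage list mirroring the pipeline described above.
PIPELINE = [
    ("data", "validate & preprocess training data"),
    ("selection", "choose an alignment method"),
    ("training", "execute the post-training technique"),
    ("optimization", "quantize, prune, or distill"),
    ("evaluation", "benchmark on MMLU, MT-Bench, etc."),
]

def run_pipeline(dispatch):
    """Run each stage in order, collecting per-stage results.
    `dispatch` maps a stage name to the agent that owns it and is
    expected to raise on failure, halting the pipeline."""
    return {stage: dispatch(stage) for stage, _ in PIPELINE}
```

Keeping the stage list as data is what lets a coordinator reorder, skip, or retry stages without touching the agents themselves.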
Watch agents coordinate a GRPO training pipeline in real-time
Compress and optimize trained models for deployment
Transfer knowledge from large teacher to smaller student model.
Methods: Logit-level, Feature-level, Progressive
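The logit-level method can be sketched as the classic Hinton-style distillation loss: KL divergence between temperature-softened teacher and student distributions.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits,
                            temperature=2.0):
    """Logit-level KD: KL(teacher || student) on temperature-softened
    distributions, scaled by T^2 to keep gradient magnitudes
    comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

Feature-level and progressive variants additionally match intermediate activations or grow the student in stages, but the logit loss above is the usual starting point.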
Get running in under a minute