Agentic Post-Training
Framework

Orchestrating LLM alignment through autonomous agent collaboration. 11 techniques, modular pipeline, beautiful agent communication.

import asyncio

from pipeline import AgenticPipeline, PipelineConfig

config = PipelineConfig(technique="grpo", model_name="gpt2")
pipeline = AgenticPipeline(config)
results = asyncio.run(pipeline.run())  # Agents coordinate automatically

Architecture

Specialized agents coordinate via a structured message bus

Coordinator Agent

Orchestrates pipeline, assigns tasks, monitors health

Message Bus (pub/sub)

Training Agent

PPO, GRPO, DPO, SPO, RLHF
KTO, ORPO, RLAIF, SPIN, SimPO, IPO

Optimization Agent

Quantization (GPTQ, AWQ, GGUF)
Pruning, Distillation

Evaluation Agent

MMLU, HumanEval, MT-Bench
GSM8K, TruthfulQA, ARC
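The pub/sub coordination above can be sketched in a few lines. This is a hypothetical minimal version, not the framework's actual message bus API: agents subscribe callbacks to topics, and the coordinator publishes messages that fan out to every subscriber.

```python
import asyncio
from collections import defaultdict

class MessageBus:
    """Minimal async pub/sub bus (illustrative sketch)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    async def publish(self, topic, message):
        # Fan the message out to every subscriber of the topic.
        for callback in self._subscribers[topic]:
            await callback(message)

received = []

async def on_training_start(msg):
    received.append(msg)

async def main():
    bus = MessageBus()
    bus.subscribe("training.start", on_training_start)
    await bus.publish("training.start", {"technique": "grpo"})

asyncio.run(main())
```

Topic names like `training.start` are illustrative; the real framework's topics are not documented here.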

Post-Training Techniques

11 techniques organized by priority. Core techniques shown first.

Core

PPO

Proximal Policy Optimization — the backbone of RLHF alignment

Stable RL with clipped surrogate objectives

Schulman et al., 2017

PPO optimizes a clipped surrogate objective to prevent destructive policy updates. Used by OpenAI for ChatGPT alignment.

Pros:

  • Well-understood, extensively validated
  • Stable training dynamics
  • Works with any reward signal

Cons:

  • Requires separate reward model
  • Memory intensive (4 models in memory)
  • Sensitive to hyperparameters
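The clipped surrogate objective is compact enough to sketch directly. A minimal, framework-agnostic version (function name and plain-list inputs are illustrative):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (negated, so lower is better)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = max(1 - clip_eps, min(1 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms,
        # which caps the incentive to move far from the old policy.
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(advantages)
```

With identical old and new log-probs the ratio is 1 and the loss reduces to the negated mean advantage; once the ratio leaves the `[1 - eps, 1 + eps]` band, the clipped term takes over.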

Core

GRPO

Group Relative Policy Optimization — DeepSeek-R1's secret weapon

No value model needed — 50% less memory than PPO

Shao et al., 2024 (DeepSeekMath / DeepSeek-R1)

GRPO generates a group of responses per prompt, ranks them relative to each other, and uses group statistics as advantages. Eliminates the value model entirely.

Pros:

  • No value model (massive memory savings)
  • Excellent for reasoning tasks
  • Scales well

Cons:

  • Multiple samples per prompt needed
  • Sensitive to group size choice
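The group-relative advantage computation is the core trick: normalize each response's reward against its own group's statistics instead of a learned value baseline. A minimal sketch (function name is illustrative):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages from group statistics: (r - mean) / std.

    No value model is needed; the group's own mean acts as the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All responses scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Responses scoring above the group mean get positive advantages, those below get negative ones, and the advantages of a group always sum to zero.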

Core

DPO

Direct Preference Optimization — alignment without RL

Simple classification loss, no reward model needed

Rafailov et al., 2023

DPO reparameterizes the RLHF objective into a simple binary cross-entropy loss on preference pairs. The language model IS the reward model.

Pros:

  • No reward model needed
  • Simple single-stage training
  • Very stable

Cons:

  • Needs paired preference data
  • Reference model still needed
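The reparameterized loss can be sketched per preference pair. This is a minimal scalar version (function name and argument order are illustrative); `logp_w`/`logp_l` are policy log-probs of the chosen and rejected responses, with `ref_*` from the frozen reference model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: binary cross-entropy on the implicit reward margin of a pair."""
    # Implicit rewards are beta * log(pi / pi_ref) for each response;
    # only their difference (the margin) enters the loss.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written in a numerically direct form.
    return math.log(1 + math.exp(-margin))
```

At zero margin the loss is log 2; widening the margin in favor of the chosen response drives it toward zero.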

Core

SPO

Self-Play Optimization — iterative self-improvement

Model improves by competing against itself

Wu et al., 2024

SPO creates a competitive dynamic: the model acts as both generator and discriminator, continuously improving through self-play with ELO tracking.

Pros:

  • Self-improving without external data
  • No reward model needed

Cons:

  • Can overfit to its own outputs
  • Computationally expensive
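The ELO tracking mentioned above follows the standard rating update; a minimal sketch (not the framework's actual tracker):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard ELO update after one self-play match.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    Rating changes are zero-sum between the two players.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Tracking ratings of successive model checkpoints gives a running measure of whether self-play is actually improving the policy.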

Core

RLHF

Classic RLHF — the full reward model + PPO pipeline

Proven at scale (InstructGPT, ChatGPT)

Ouyang et al., 2022

The original: train a reward model on human preferences, then optimize the policy with PPO. Maximum control, maximum complexity.

Pros:

  • Proven at scale by OpenAI/Anthropic
  • Maximum flexibility

Cons:

  • Complex multi-stage pipeline
  • Very memory intensive

Advanced

KTO

Kahneman-Tversky Optimization — alignment from thumbs up/down

Works with unpaired binary feedback

Ethayarajh et al., 2024

Uses asymmetric loss inspired by prospect theory. Only needs binary good/bad signals, not paired preferences.
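The asymmetry can be sketched per example. This is a simplified scalar version of a KTO-style loss (names, the zero reference point, and default weights are illustrative): desirable and undesirable examples are scored on opposite sides of a reference point, and the two weights let losses loom larger than gains, as prospect theory suggests.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def kto_loss(reward, desirable, ref_point=0.0, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """Asymmetric prospect-theory-style loss on a single binary-labeled example."""
    if desirable:
        # Push desirable examples above the reference point.
        return lambda_d * (1 - sigmoid(beta * (reward - ref_point)))
    # Push undesirable examples below it.
    return lambda_u * (1 - sigmoid(beta * (ref_point - reward)))
```

Because each example carries only a good/bad label, no pairing of responses to the same prompt is required.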

Advanced

ORPO

Odds Ratio Preference Optimization — SFT + alignment in one

No reference model, combined training

Hong et al., 2024

Combines SFT and preference alignment in a single stage using odds ratio penalty. Very memory efficient.
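The odds-ratio penalty term can be sketched on average sequence log-probabilities (function names are illustrative; in ORPO this term is added to the ordinary SFT loss on the chosen response):

```python
import math

def log_odds(logp):
    """Log odds of a sequence with average log-probability logp (logp < 0)."""
    p = math.exp(logp)
    return math.log(p / (1 - p))

def orpo_penalty(logp_chosen, logp_rejected):
    """Odds-ratio preference penalty: -log sigmoid(log odds ratio)."""
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return -math.log(1 / (1 + math.exp(-ratio)))
```

When chosen and rejected responses are equally likely the penalty is log 2; making the chosen response relatively more likely shrinks it, with no reference model anywhere in the computation.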

Advanced

RLAIF

RL from AI Feedback — Constitutional AI approach

No human labelers needed

Bai et al., 2022

Uses AI-generated feedback based on constitutional principles. Scalable oversight without human labelers.

Advanced

SPIN

Self-Play Fine-Tuning — distinguish human from LLM text

Only needs SFT data

Chen et al., 2024

Trains the model to discriminate its own outputs from human text, iteratively improving until convergence.

Advanced

SimPO

Simple Preference Optimization — reference-free + length-normalized

No reference model, fair length comparison

Meng et al., 2024

Uses length-normalized log probabilities as implicit reward. No reference model needed at all.
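The length-normalized implicit reward can be sketched per preference pair (function name and default hyperparameters are illustrative; SimPO also uses a target margin, written `gamma` here):

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO-style loss: length-normalized log-probs as implicit rewards."""
    # Dividing by length removes the bias toward shorter sequences,
    # so responses of different lengths compete fairly.
    reward_w = beta * logp_w / len_w
    reward_l = beta * logp_l / len_l
    # -log sigmoid(reward margin minus the target margin gamma).
    margin = reward_w - reward_l - gamma
    return math.log(1 + math.exp(-margin))
```

Note that no reference-model log-probs appear anywhere, which is the point: the policy's own (length-normalized) likelihood is the reward.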

Experimental

IPO

Identity Preference Optimization — regularized DPO

Better regularization than standard DPO

Azar et al., 2023

Adds regularization to address DPO's potential overfitting issues. Theoretically grounded improvement.

Pipeline Stages

Each stage is orchestrated by the coordinator agent

1

Data Prep

Validate & preprocess training data

2

Technique Selection

Choose optimal alignment method

3

Training

Execute post-training technique

4

Optimization

Quantize, prune, distill

5

Evaluation

Benchmark on MMLU, MT-Bench, etc.
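The five stages above can be sketched as a simple sequential driver. This is a hypothetical illustration of the flow, not the framework's actual coordinator, which runs asynchronously over the message bus:

```python
# Stage names mirror the five pipeline stages listed above.
STAGES = ["data_prep", "technique_selection", "training",
          "optimization", "evaluation"]

def run_stages(handlers, context):
    """Run each stage handler in order, threading a shared context through."""
    for stage in STAGES:
        context = handlers[stage](context)
    return context

# Toy handlers that just record the order they ran in.
log = []
handlers = {stage: (lambda ctx, s=stage: ctx + [s]) for stage in STAGES}
result = run_stages(handlers, log)
```

In the real pipeline each handler would be an agent task dispatched by the coordinator, with health monitoring between stages.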

Agent Communication

Watch agents coordinate a GRPO training pipeline in real time

Optimization

Compress and optimize trained models for deployment

Quantization

  • GPTQ — 8× compression, 99% quality
  • AWQ — 8× compression, 99.5% quality
  • GGUF — 8× compression, 98% quality
  • INT8 — 4× compression, 99.9% quality

Pruning

  • Magnitude — ~95% quality
  • Structured — ~92% quality
  • Movement — ~96% quality
  • Wanda — ~94% quality

Knowledge Distillation

Transfer knowledge from large teacher to smaller student model.

  • 70B → 8B — 92% quality
  • 8B → 1B — 85% quality

Methods: Logit-level, Feature-level, Progressive
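Logit-level distillation, the first method listed, can be sketched as a temperature-softened KL divergence between teacher and student output distributions (a minimal framework-agnostic version; function names are illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Logit-level distillation: KL(teacher || student) on softened outputs.

    Higher temperature exposes the teacher's 'dark knowledge' in the
    relative probabilities of wrong classes; the T^2 factor keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * temperature ** 2
```

The loss is zero when the student matches the teacher exactly and strictly positive otherwise.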

Quick Start

Get running in under a minute

$ pip install -e .
$ python3 examples/agent_demo.py
$ python3 examples/run_pipeline.py --technique grpo --model gpt2

Or use a preset from Python:

from pipeline.config import PipelineConfig

# Presets: Quick DPO, Full RLHF, Efficient GRPO, Production
config = PipelineConfig.preset("production")