Orchestrating LLM alignment through autonomous agent collaboration. 11 techniques, modular pipeline, beautiful agent communication.
Specialized agents coordinate via a structured message bus
Orchestrates pipeline, assigns tasks, monitors health
PPO, GRPO, DPO, SPO, RLHF
KTO, ORPO, RLAIF, SPIN, SimPO, IPO
Quantization (GPTQ, AWQ, GGUF)
Pruning, Distillation
MMLU, HumanEval, MT-Bench
GSM8K, TruthfulQA, ARC
11 techniques organized by priority. Core techniques shown first.
Proximal Policy Optimization — the backbone of RLHF alignment
Stable RL with clipped surrogate objectives
Schulman et al., 2017
PPO optimizes a clipped surrogate objective to prevent destructive policy updates. Used by OpenAI for ChatGPT alignment.
Pros: stable updates via clipped ratios; battle-tested at scale
Cons: requires a separate value model (memory-heavy); sensitive to hyperparameters
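A minimal sketch of the clipped surrogate in plain Python (eps=0.2 is the paper's default clip range; the function name is illustrative):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate objective: ratio is pi_new(a|s) / pi_old(a|s).
    Take the pessimistic minimum of the unclipped and clipped terms,
    negated so it can be minimized by gradient descent."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)
```

The clip keeps a single update from moving the policy too far: large ratios stop earning extra reward, which is what prevents destructive policy updates.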
Group Relative Policy Optimization — DeepSeek-R1's secret weapon
No value model needed — roughly half the memory of PPO
Shao et al., 2024 (DeepSeekMath / DeepSeek-R1)
GRPO generates a group of responses per prompt, ranks them relative to each other, and uses group statistics as advantages. Eliminates the value model entirely.
Pros: no value model, so far lower memory than PPO; simple advantage estimation
Cons: must sample a group of responses per prompt; advantages are noisy for small groups
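The group-relative advantage is just a z-score of each sampled response's reward within its group; a sketch:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each response's reward
    against its own group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]
```

Because the group statistics replace a learned baseline, no value model ever needs to be trained or held in memory.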
Direct Preference Optimization — alignment without RL
Simple classification loss, no reward model needed
Rafailov et al., 2023
DPO reparameterizes the RLHF objective into a simple binary cross-entropy loss on preference pairs. The language model IS the reward model.
Pros: no reward model or RL loop; simple, stable classification-style training
Cons: needs paired preference data; offline, so no exploration; can overfit the preference set
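The loss on one preference pair can be sketched as follows (beta=0.1 is a common choice; `w` is the chosen response, `l` the rejected one):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy on the margin between the chosen (w) and
    rejected (l) policy-vs-reference log-probability ratios."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)
```

The policy's own log-ratios act as the implicit reward, which is exactly the sense in which the language model IS the reward model.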
Self-Play Optimization — iterative self-improvement
Model improves by competing against itself
Wu et al., 2024
SPO creates a competitive dynamic: the model acts as both generator and discriminator, continuously improving through self-play with ELO tracking.
Pros: improves without fresh human preference data; self-play provides a natural curriculum
Cons: risk of degenerate self-play or mode collapse; generation-heavy, so compute-intensive
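The ELO tracking mentioned above can use the standard Elo update after each self-play match; a sketch:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update after one match between two model
    checkpoints; score_a is 1.0 for a win by A, 0.5 for a draw,
    0.0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Tracking ratings across checkpoints gives a cheap, comparable signal for whether self-play is still improving the model.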
Classic RLHF — the full reward model + PPO pipeline
Proven at scale (InstructGPT, ChatGPT)
Ouyang et al., 2022
The original: train a reward model on human preferences, then optimize the policy with PPO. Maximum control, maximum complexity.
Pros: proven at scale; an explicit reward model gives fine-grained control
Cons: three-stage pipeline (SFT, reward model, PPO) is complex and costly; susceptible to reward hacking
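The reward-model stage is typically trained with a Bradley-Terry pairwise loss on human preference pairs; a minimal sketch:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected), i.e. maximize the
    probability that the chosen response outscores the rejected one."""
    diff = r_chosen - r_rejected
    return -math.log(1 / (1 + math.exp(-diff)))
```

The trained reward model then supplies the scalar reward that PPO maximizes in the final stage.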
Kahneman-Tversky Optimization — alignment from thumbs up/down
Works with unpaired binary feedback
Ethayarajh et al., 2024
Uses asymmetric loss inspired by prospect theory. Only needs binary good/bad signals, not paired preferences.
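A rough per-example sketch of that asymmetric loss, following the shape of Ethayarajh et al.'s formulation (here `z_ref` stands in for the KL reference point, which the paper estimates from a batch):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def kto_loss(log_ratio, desirable, z_ref=0.0, beta=0.1,
             lam_d=1.0, lam_u=1.0):
    """Sketch of the KTO per-example loss. log_ratio is the policy-vs-
    reference log-probability ratio; lam_d / lam_u weight gains and
    losses asymmetrically, as in prospect theory."""
    if desirable:  # thumbs-up example: push the ratio above the reference
        return lam_d * (1 - sigmoid(beta * (log_ratio - z_ref)))
    # thumbs-down example: push the ratio below the reference
    return lam_u * (1 - sigmoid(beta * (z_ref - log_ratio)))
```

Because each example carries only a good/bad flag, no pairing of responses to the same prompt is ever required.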
Odds Ratio Preference Optimization — SFT + alignment in one
No reference model, combined training
Hong et al., 2024
Combines SFT and preference alignment in a single stage using odds ratio penalty. Very memory efficient.
RL from AI Feedback — Constitutional AI approach
No human labelers needed
Bai et al., 2022
Uses AI-generated feedback based on constitutional principles. Scalable oversight without human labelers.
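A hypothetical sketch of the labeling step: an AI judge votes on response pairs under each constitutional principle, replacing the human labeler. The principle texts and the `judge` signature are illustrative, not from any specific library.

```python
# Illustrative constitutional principles (not from a real constitution).
PRINCIPLES = [
    "Prefer the response that is more helpful and harmless.",
    "Prefer the response that avoids deception.",
]

def ai_preference(judge, prompt, response_a, response_b):
    """Ask the AI judge to vote under each principle and return the
    majority preference; ties go to 'a'. judge() returns 1 if A is
    preferred under the given principle, else 0."""
    votes_for_a = sum(
        judge(principle, prompt, response_a, response_b)
        for principle in PRINCIPLES
    )
    return "a" if 2 * votes_for_a >= len(PRINCIPLES) else "b"
```

The resulting AI-labeled preference pairs can then feed any of the preference-optimization methods above.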
Self-Play Fine-Tuning — distinguish human from LLM text
Only needs SFT data
Chen et al., 2024
Trains the model to discriminate its own outputs from human text, iteratively improving until convergence.
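Each SPIN iteration can be sketched as a DPO-style loss where the "rejected" response is the previous checkpoint's own generation and the previous policy serves as the reference:

```python
import math

def spin_loss(logp_human, logp_self, prev_logp_human, prev_logp_self,
              beta=0.1):
    """Sketch of one SPIN iteration: prefer the human SFT response over
    the previous model's self-generated response, scored against the
    previous iteration's policy as reference."""
    margin = beta * ((logp_human - prev_logp_human)
                     - (logp_self - prev_logp_self))
    return -math.log(1 / (1 + math.exp(-margin)))
```

Iterating this until the model can no longer tell its own outputs from the human data is the convergence criterion described above.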
Simple Preference Optimization — reference-free + length-normalized
No reference model, fair length comparison
Meng et al., 2024
Uses length-normalized log probabilities as implicit reward. No reference model needed at all.
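A sketch of the loss on one pair; `beta` and the target margin `gamma` are illustrative defaults in the range the paper explores:

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """Sketch of SimPO: the implicit reward is the length-normalized
    sequence log-probability; the loss penalizes reward margins that
    fall below the target margin gamma."""
    r_w = beta * logp_w / len_w  # chosen response, len_w tokens
    r_l = beta * logp_l / len_l  # rejected response, len_l tokens
    margin = r_w - r_l - gamma
    return -math.log(1 / (1 + math.exp(-margin)))
```

Dividing by response length is what makes long and short responses compete fairly, and dropping the reference model halves the memory footprint relative to DPO.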
Identity Preference Optimization — regularized DPO
Better regularization than standard DPO
Azar et al., 2023
Adds regularization to address DPO's potential overfitting issues. Theoretically grounded improvement.
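The regularization can be sketched as a squared loss that regresses the preference margin toward a fixed target instead of pushing it to infinity as DPO's sigmoid loss can:

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Sketch of IPO: the preference log-ratio margin h is regressed
    toward the fixed target 1/(2*tau), so the gradient vanishes once
    the target is reached."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1 / (2 * tau)) ** 2
```

Because the optimum is a finite margin, the policy cannot drift arbitrarily far from the reference even on a small preference set.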
Each stage is orchestrated by the coordinator agent
Validate & preprocess training data
Choose optimal alignment method
Execute post-training technique
Quantize, prune, distill
Benchmark on MMLU, MT-Bench, etc.
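The five stages above can be sketched as a simple ordered dispatch loop; stage names and the `dispatch` callable are illustrative, not the project's actual API:

```python
# Hypothetical stage list mirroring the pipeline described above.
PIPELINE = [
    ("data", "validate & preprocess training data"),
    ("selection", "choose an alignment method"),
    ("training", "execute the post-training technique"),
    ("optimization", "quantize, prune, or distill"),
    ("evaluation", "benchmark on MMLU, MT-Bench, etc."),
]

def run_pipeline(dispatch):
    """Run each stage in order, collecting per-stage results.
    `dispatch` maps a stage name to the agent that owns it and is
    expected to raise on failure, halting the pipeline."""
    return {stage: dispatch(stage) for stage, _ in PIPELINE}
```

Keeping the stage list as data is what lets a coordinator reorder, skip, or retry stages without touching the agents themselves.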
Watch agents coordinate a GRPO training pipeline in real-time
Compress and optimize trained models for deployment
Transfer knowledge from large teacher to smaller student model.
Methods: Logit-level, Feature-level, Progressive
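The logit-level method can be sketched as the classic Hinton-style distillation loss: KL divergence between temperature-softened teacher and student distributions.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits,
                            temperature=2.0):
    """Logit-level KD: KL(teacher || student) on temperature-softened
    distributions, scaled by T^2 to keep gradient magnitudes
    comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

Feature-level and progressive variants additionally match intermediate activations or grow the student in stages, but the logit loss above is the usual starting point.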
Get running in under a minute