Human-AI Alignment
Preference optimization, safety alignment, and density-ratio methods for aligning foundation models with human intent.
Goals
To build principled alignment methods that steer powerful models toward human preferences and safety constraints without sacrificing capability.
Overview
This direction develops theory and algorithms for aligning large language and multimodal models with human values. The work spans token-level preference optimization, off-policy reference tuning, and the formalization of safety alignment as density ratio matching.
Key objectives
- Develop principled token-level preference optimization
- Formalize safety alignment as a well-posed learning objective
- Enable selective off-policy reference tuning with plan guidance
- Connect preference learning to theoretical guarantees
Key topics
- Preference optimization and RLHF
- Safety alignment via density ratio matching
- Off-policy reference tuning
- Human feedback and reward modeling
Papers in this direction
BSO: Safety Alignment Is Density Ratio Matching
Nguyen, TP, Nguyen, T, Nguyen, T, Nguyen, DMH, Dinh, NT, Le, T
arXiv preprint arXiv:2605.12339
Selective Off-Policy Reference Tuning with Plan Guidance
Le, DA, Nguyen, TP, Nguyen, TH, Van, LN, Le, T
arXiv preprint arXiv:2605.11505
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Nguyen, T, Nguyen, TP, Van, LN, Nguyen, DMH, Doan, KD, Le, T
International Conference on Machine Learning (ICML)