Human-AI Alignment

Preference optimization, safety alignment, and token-based methods for aligning foundation models with human intent.

Goals

To build principled alignment methods that steer powerful models toward human preferences and safety constraints without sacrificing capability.

Overview

This direction develops theory and algorithms for aligning large language and multimodal models with human values. I aim to build foundation models that are helpful, secure, and capable, and that stay aligned with diverse human preferences, values, and cultures.

Key topics

Preference optimization and RLHF
Safety alignment that makes models safer without reducing their usefulness
Off-policy reference tuning
Multi-cultural alignment
Model sovereignty
Multi-objective preference alignment

News

May 2026
Paper TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching accepted at ICML 2026
May 2026
Paper f-Divergence Self-Play for Tabular Anomaly Detection via Large Language Models accepted at ICML 2026
Feb 2026
Paper Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization accepted at ICLR 2026
Dec 2025
Paper CTPD: Cross Tokenizer Preference Distillation accepted at AAAI 2026
Aug 2025
Paper Token-Level Self-Play with Importance-Aware Guidance for Large Language Models accepted at NeurIPS 2025

Papers in this direction

BSO: Safety Alignment Is Density Ratio Matching
Tien-Phat Nguyen, Truong Nguyen, Thin Nguyen, Duy Minh Ho Nguyen, Ngoc-Thanh Dinh, Trung Le
arXiv preprint arXiv:2605.12339
[arXiv]
f-Divergence Self-Play for Tabular Anomaly Detection via Large Language Models
Vuong Hoang Tran, Van Linh Ngo, Dang Nguyen, Thin Nguyen, Phuoc Nguyen, Mehrtash Harandi, Trung Le
International Conference on Machine Learning (ICML 2026)
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa D. Doan, Trung Le
International Conference on Machine Learning (ICML 2026)
CTPD: Cross Tokenizer Preference Distillation
Truong Nguyen, Van-Phi Dat, Ngan Nguyen, Van-Linh Ngo, Trung Le, Hong-Thanh Nguyen
AAAI Conference on Artificial Intelligence (AAAI 2026)
Token-Level Self-Play with Importance-Aware Guidance for Large Language Models
Tue Le, Hoang Tran Vuong, Quyen Tran, Linh Ngo Van, Mehrtash Harandi, Trung Le
Conference on Neural Information Processing Systems (NeurIPS 2025)
