Human-AI Alignment

Preference optimization, safety alignment, and density-ratio methods for aligning foundation models with human intent.

The goal is to build principled alignment methods that steer powerful models toward human preferences and safety constraints without sacrificing capability.

Overview

This direction develops theory and algorithms for aligning large language and multimodal models with human values. The work spans token-level preference optimization, off-policy reference tuning, and the formulation of safety alignment as density ratio matching.

Key objectives

  • Develop principled token-level preference optimization
  • Formalize safety alignment as well-posed learning objectives
  • Enable selective off-policy tuning with plan guidance
  • Ground preference learning in theoretical guarantees

Key topics

  • Preference optimization and RLHF
  • Safety alignment via density ratio matching
  • Off-policy reference tuning
  • Human feedback and reward modeling

Papers in this direction

  • 2026

    BSO: Safety Alignment Is Density Ratio Matching

    Nguyen, TP, Nguyen, T, Nguyen, T, Nguyen, DMH, Dinh, NT, Le, T

    arXiv preprint arXiv:2605.12339

  • 2026

    Selective Off-Policy Reference Tuning with Plan Guidance

    Le, DA, Nguyen, TP, Nguyen, TH, Van, LN, Le, T

    arXiv preprint arXiv:2605.11505

  • 2026

    TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    Nguyen, T, Nguyen, TP, Van, LN, Nguyen, DMH, Doan, KD, Le, T

    International Conference on Machine Learning (ICML)