Xudong Wu

吴煦东

Ph.D. Student in Reinforcement Learning, LLM Alignment & Agentic AI Systems
The University of Hong Kong

I work on post-training, RLHF/RLAIF, preference optimization, self-play agents, and long-horizon LLM-agent systems. I am actively seeking industry research internships where rigorous RL theory, scalable ML systems, and agentic AI products meet.

Seeking Industry Internship · LLM Post-training · RLHF & Preference Optimization · Agentic AI Systems

Curriculum Vitae · Email · GitHub · Scholar

wu.xudong [at] connect.hku.hk · xudongwu02 [at] gmail.com

Open to research internships in LLMs, RL, agents, and applied ML systems

Industry Internship Focus

Research-ready for LLM agents, post-training, and decision-making systems.

I am looking for industry research internship opportunities where I can contribute to practical agent systems and model-improvement pipelines while bringing a strong theoretical background in reinforcement learning, online alignment, and preference optimization.

LLM/agent research intern roles

Post-training and evaluation pipelines

RL, self-play, and preference learning

ML systems, tools, APIs, and experiment dashboards

About Me

I am a Ph.D. student at The University of Hong Kong, advised by Prof. Jiayu Chen. My research centers on reinforcement learning, LLM alignment, preference optimization, self-improving post-training, and agentic AI systems.

My current work connects theory and systems: I study convergence and stability for online LLM alignment, design robust preference optimization algorithms, and build long-horizon LLM agents that reason with tools, memory, safety constraints, and closed-loop feedback. Recent projects include EnergyBridge, self-play agent training with GRPO and PSRO, trust-region online RLHF, and occupancy-based reward shaping for offline RL.

I am especially interested in industry research environments that turn model-improvement ideas into reliable products: scalable training loops, evaluation infrastructure, agent benchmarks, data pipelines, and deployment-aware feedback systems. I enjoy work that requires both mathematical clarity and practical engineering.

Before HKU, I completed a BSc (Hons) in Mathematics and Statistics at the University of Edinburgh with First-Class Honours, and studied Information and Computing Science at Dalian University of Technology. I also worked as a research assistant at UC Irvine on LLM-assisted workflow analysis for the Texera platform.

Internship Fit

Areas where I can contribute quickly in an industry research team.

LLM Post-training & Alignment

DPO, GRPO, RLHF/RLAIF, listwise preference optimization, online alignment theory, reward modeling, and verifier-guided policy improvement.

Agentic AI Systems

Tool use, persistent memory, multi-agent negotiation, safety guardrails, scenario replay, persona simulation, and long-horizon planning loops.

Reinforcement Learning

Offline goal-conditioned RL, reward shaping, trust-region methods, self-play, game-theoretic optimization, MPC comparisons, and policy evaluation.

ML Engineering

Python, PyTorch, HuggingFace Transformers, OpenAI-compatible APIs, Docker, Linux, GitHub workflows, experiment dashboards, and automated reports.

News

Jun 2026 Updated my CV and research profile for industry research internship opportunities in LLM agents, post-training, RL, and applied ML systems.

May 2026 Started leading EnergyBridge, an end-to-end long-horizon LLM-agent system for virtual power plant coordination.

2026 Paper accepted to UAI 2026: On the Convergence of Self-Improving Online LLM Alignment.

2026 Paper accepted to ICLR 2026: Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning.

Oct 2025 Started Ph.D. study at The University of Hong Kong, advised by Prof. Jiayu Chen.

Jul 2025 Graduated from the University of Edinburgh with First-Class Honours in Mathematics and Statistics.

Selected Publications & Preprints

Recent work on online alignment, offline RL, and robust preference optimization.

UAI 2026 Accepted

On the Convergence of Self-Improving Online LLM Alignment

Xudong Wu, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.

ICLR 2026 Accepted

Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

Aravind Venugopal, Jiayu Chen, Xudong Wu, Chongyi Zheng, Benjamin Eysenbach, and Jeff Schneider.

NeurIPS 2026 Under Review

Distributionally Robust Listwise Preference Optimization

Xudong Wu, Jian Qian, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.

Selected Research

Highlights from research and project experience.

LLM Agents

EnergyBridge: Long-Horizon LLM Agents for Virtual Power Plants

HKU · Project Lead, Agentic Intelligence Lab · May 2026 - Present

Leading the development of an end-to-end home-grid coordination system that combines tool-augmented reasoning, persistent memory, preference learning, safety guardrails, multi-agent negotiation, and closed-loop execution.

LLM Agents · Preference Learning · Energy Systems · Evaluation

Post-training

Self-Play LLM Agent Training with GRPO and PSRO

HKU · Researcher, Agentic Intelligence Lab · 2026 - Present

Studying population-based reinforcement learning for LLM agents, combining GRPO, policy rollouts, verifier rewards, PSRO, opponent sampling, and held-out strategic evaluation to improve tool use, robustness, and multi-agent adaptation.

GRPO · Self-play · PSRO · RLAIF

LLM Alignment

RLHF, Preference Optimization, and Online Alignment Theory

HKU · Researcher, Agentic Intelligence Lab · 2025 - Present

Developing theoretical and algorithmic foundations for self-improving online LLM alignment, robust listwise preference optimization, distribution-shift-aware RLHF, and trust-region online policy improvement.

RLHF · Preference Optimization · Theory · Online Learning

ML Systems

LLM Agents for Automated Workflow Analysis on the Texera Platform

UC Irvine · Research Assistant · Jun 2024 - Sep 2024

Built an LLM-assisted workflow interpretation and report-generation component for Texera, improving structured summaries, backend integration, HTML report generation, and real-time visualization for data-analysis workflows.

Texera

Education

Oct 2025 - Sep 2029 (Expected)

The University of Hong Kong

Doctor of Philosophy

Research: reinforcement learning, LLM alignment, agentic AI systems, and post-training

Supervisor: Prof. Jiayu Chen

Sep 2023 - Jul 2025

University of Edinburgh

BSc (Hons) Mathematics and Statistics

Average Score: 78/100 · First-Class Honours

Dissertation: Amortized Inference

Sep 2021 - Jun 2023

Dalian University of Technology

BSc Information and Computing Science

Overall GPA: 3.99/4.00 · Top 5%