Xudong Wu
吴煦东
Ph.D. Student in Reinforcement Learning, LLM Alignment & Agentic AI Systems
The University of Hong Kong
I work on post-training, RLHF/RLAIF, preference optimization, self-play agents, and long-horizon LLM-agent systems. I am actively seeking industry research internships where rigorous RL theory, scalable ML systems, and agentic AI products meet.
Research-ready for LLM agents, post-training, and decision-making systems.
I am looking for industry research internships where I can contribute to practical agent systems and model-improvement pipelines, drawing on a strong theoretical background in reinforcement learning, online alignment, and preference optimization.
About Me
I am a Ph.D. student at The University of Hong Kong, advised by Prof. Jiayu Chen. My research centers on reinforcement learning, LLM alignment, preference optimization, self-improving post-training, and agentic AI systems.
My current work connects theory and systems: I study convergence and stability for online LLM alignment, design robust preference optimization algorithms, and build long-horizon LLM agents that reason with tools, memory, safety constraints, and closed-loop feedback. Recent projects include EnergyBridge, self-play agent training with GRPO and PSRO, trust-region online RLHF, and occupancy-based reward shaping for offline RL.
I am especially interested in industry research environments that turn model-improvement ideas into reliable products: scalable training loops, evaluation infrastructure, agent benchmarks, data pipelines, and deployment-aware feedback systems. I enjoy work that requires both mathematical clarity and practical engineering.
Before HKU, I completed a BSc (Hons) in Mathematics and Statistics at the University of Edinburgh with First-Class Honours, and studied Information and Computing Science at Dalian University of Technology. I also worked as a research assistant at UC Irvine on LLM-assisted workflow analysis for the Texera platform.
Internship Fit
Areas where I can contribute quickly in an industry research team.
LLM Post-training & Alignment
DPO, GRPO, RLHF/RLAIF, listwise preference optimization, online alignment theory, reward modeling, and verifier-guided policy improvement.
Agentic AI Systems
Tool use, persistent memory, multi-agent negotiation, safety guardrails, scenario replay, persona simulation, and long-horizon planning loops.
Reinforcement Learning
Offline goal-conditioned RL, reward shaping, trust-region methods, self-play, game-theoretic optimization, MPC comparisons, and policy evaluation.
ML Engineering
Python, PyTorch, Hugging Face Transformers, OpenAI-compatible APIs, Docker, Linux, GitHub workflows, experiment dashboards, and automated reports.
Selected Publications & Preprints
Recent work on online alignment, offline RL, and robust preference optimization.
On the Convergence of Self-Improving Online LLM Alignment
Xudong Wu, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.
Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning
Aravind Venugopal, Jiayu Chen, Xudong Wu, Chongyi Zheng, Benjamin Eysenbach, and Jeff Schneider.
Distributionally Robust Listwise Preference Optimization
Xudong Wu, Jian Qian, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.
Selected Research
Highlights from research and project experience.
EnergyBridge: Long-Horizon LLM Agents for Virtual Power Plants
Leading the development of an end-to-end home-grid coordination system that combines tool-augmented reasoning, persistent memory, preference learning, safety guardrails, multi-agent negotiation, and closed-loop execution.
Self-Play LLM Agent Training with GRPO and PSRO
Studying population-based reinforcement learning for LLM agents, combining GRPO, policy rollouts, verifier rewards, PSRO, opponent sampling, and held-out strategic evaluation to improve tool use, robustness, and multi-agent adaptation.
RLHF, Preference Optimization, and Online Alignment Theory
Developing theoretical and algorithmic foundations for self-improving online LLM alignment, robust listwise preference optimization, distribution-shift-aware RLHF, and trust-region online policy improvement.
LLM Agents for Automated Workflow Analysis on the Texera Platform
Built an LLM-assisted workflow interpretation and report-generation component for Texera, improving structured summaries, backend integration, HTML report generation, and real-time visualization for data-analysis workflows.
Education
The University of Hong Kong
Doctor of Philosophy
Research: reinforcement learning, LLM alignment, agentic AI systems, and post-training
Supervisor: Prof. Jiayu Chen
University of Edinburgh
BSc (Hons) Mathematics and Statistics
Average Score: 78/100 · First-Class Honours
Dissertation: Amortized Inference
Dalian University of Technology
BSc Information and Computing Science
Overall GPA: 3.99/4.00 · Top 5%