Curriculum Vitae

Research profile for industry internships in LLM post-training, reinforcement learning, agentic AI systems, and applied ML infrastructure.

I am a Ph.D. student working on reinforcement learning, LLM alignment, agentic AI systems, and post-training. I am seeking industry research internships where I can build reliable model-improvement pipelines, evaluate agent behavior, and connect RL theory with scalable AI systems.

LLM Agents · RLHF/RLAIF · Preference Optimization · GRPO & Self-play · ML Systems

Education

Oct 2025 - Sep 2029 (Expected)

The University of Hong Kong

Doctor of Philosophy · Hong Kong SAR

Research: reinforcement learning, LLM alignment, agentic AI systems, and post-training

Supervisor: Prof. Jiayu Chen

Sep 2023 - Jul 2025

University of Edinburgh

BSc (Hons) Mathematics and Statistics · Edinburgh, UK

Average Score: 78/100 · First-Class Honours

Dissertation: Amortized Inference

Sep 2021 - Jun 2023

Dalian University of Technology

BSc Information and Computing Science · Dalian, China

Overall GPA: 3.99/4.00 · Top 5%

Selected Publications and Preprints

UAI 2026 Accepted

On the Convergence of Self-Improving Online LLM Alignment

Xudong Wu, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.

ICLR 2026 Accepted

Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

Aravind Venugopal, Jiayu Chen, Xudong Wu, Chongyi Zheng, Benjamin Eysenbach, and Jeff Schneider.

NeurIPS 2026 Under Review

Distributionally Robust Listwise Preference Optimization

Xudong Wu, Jian Qian, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.

NeurIPS 2026 Under Review

Core-Halo Decomposition: Decentralizing Large-Scale Fixed-Point Problem

Haixiang Sun, Yang Xu, Jiefu Zhang, Xudong Wu, Zihan Zhou, Jun He, and Jiayu Chen.

Manuscript

SAIL-TRPO: A Trust-Region Online RLHF Algorithm with Guaranteed Fast Convergence

Xudong Wu, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.

Manuscript

Hierarchical Deep Counterfactual Regret Minimization

Jiayu Chen, Xudong Wu, Zhekai Wang, and Vaneet Aggarwal.

Research Experience

May 2026 - Present

EnergyBridge: Long-Horizon LLM Agents for Virtual Power Plants

Project Lead, Agentic Intelligence Lab · HKU

  • Lead the development of an end-to-end LLM-agent system for home-grid coordination, combining long-horizon planning, tool-augmented reasoning, persistent memory, preference learning, safety guardrails, and closed-loop execution.
  • Design the full agent loop from state perception and context retrieval to multi-agent negotiation, plan generation, tool/API invocation, constraint checking, device actuation, feedback collection, and memory update.
  • Build preference-aware household agents and grid-aware coordinator agents for personalized demand response, optimizing comfort, electricity cost, appliance schedules, VPP compliance, interruption cost, and user-specific constraints.
  • Develop evaluation and self-improvement infrastructure with persona simulation, human-in-the-loop feedback, trajectory logging, reward/objective scoring, failure diagnosis, scenario replay, and repeated-interaction adaptation.
LLM Agents · Energy Systems · Safety · Evaluation

2026 - Present

Self-Play LLM Agent Training with GRPO and PSRO

Researcher, Agentic Intelligence Lab · HKU

  • Work on self-play post-training for LLM agents, combining Group Relative Policy Optimization (GRPO), policy rollouts, reward/verifier signals, and Policy-Space Response Oracles (PSRO).
  • Formulate agent training as population-based reinforcement learning where policies are iteratively optimized against evolving opponent, environment, and task distributions.
  • Explore scalable RLHF/RLAIF pipelines for agentic reasoning, including trajectory sampling, outcome-based rewards, verifier-guided policy improvement, opponent sampling, and strategic evaluation.
  • Focus on strategic reasoning, robustness, tool-use behavior, and multi-agent adaptation through game-theoretic self-play and reinforcement learning post-training.
GRPO · PSRO · Self-play · RLAIF

2025 - Present

RLHF, Preference Optimization, and Online Alignment Theory

Researcher, Agentic Intelligence Lab · HKU

  • Study theoretical foundations of online LLM alignment, including self-improving preference optimization, policy improvement dynamics, stability, and convergence under model-generated feedback.
  • Develop robust RLHF algorithms beyond pairwise preference learning, including distributionally robust listwise preference optimization under preference noise, distribution shift, and ranking uncertainty.
  • Work on trust-region online RLHF methods with controlled policy updates, connecting practical post-training algorithms with classical reinforcement learning theory and convergence guarantees.
RLHF · Preference Optimization · Theory · Online Learning

2025 - Present

Reinforcement Learning, Offline RL, and Game-Theoretic Optimization

Collaborating Researcher · HKU / Collaborators

  • Contribute to projects on offline goal-conditioned RL, occupancy-based reward shaping, hierarchical counterfactual regret minimization, and large-scale fixed-point problem decomposition.
  • Focus on algorithmic formulation, proof checking, theoretical positioning, and turning RL and game-theoretic ideas into precise, publishable research contributions.

Jun 2024 - Sep 2024

LLM Agents for Automated Workflow Analysis on the Texera Platform

Research Assistant, Donald Bren School of ICS · UC Irvine

  • Worked on Texera, a large-scale workflow engine for data analytics, with emphasis on LLM-assisted workflow interpretation, backend integration, HTML report generation, and real-time visualization.
  • Designed a modular LLM-agent reporting component to interpret pipeline outputs, generate structured summaries, and support natural-language workflow analysis.
  • Implemented system-level improvements for interactive report generation and scalable execution in a workflow-oriented data-analysis platform.

Selected Project Experience

2025

Self-Rewarding Framework for Medical LLMs

Project Lead, Machine Learning Practical · University of Edinburgh

  • Designed a self-rewarding alignment framework for medical LLMs using DPO-style preference optimization and dynamically refined LLM-as-a-judge prompts.
  • Built an evaluation pipeline with self-question generation, multi-response sampling, self-judgment scoring, and reward refinement on PubMedQA and MedMCQA-style tasks.
  • Analyzed failure modes caused by reward misspecification, hallucination accumulation, and task difficulty, highlighting limits of closed-loop self-improvement.

2024 - 2025

A Comparative Study of Simulation-Based Inference Algorithms

Honours Dissertation · Advisor: Dr. Amanda Lenzi · University of Edinburgh

  • Conducted a comparative study of amortized simulation-based inference methods, including BayesFlow, Sequential Neural Likelihood, and Affine Flow Matching.
  • Designed benchmark experiments for posterior recovery and calibration under structured statistical models, with emphasis on inference quality, robustness, and diagnostic evaluation.
  • Extended the study to spatial disease-mapping settings with Poisson-CAR models, analyzing posterior calibration, parameter recovery, and robustness under strong spatial dependence.

Skills

Research

  • Reinforcement Learning
  • LLM Alignment
  • Agentic Systems
  • Preference Optimization
  • MPC
  • Simulation-Based Evaluation

Programming

  • Python
  • PyTorch
  • HuggingFace Transformers
  • scikit-learn
  • R / MATLAB
  • C / C++ / SQL / Bash

Systems

  • Git / GitHub
  • Linux
  • Docker
  • LaTeX
  • OpenAI-compatible APIs
  • EnergyPlus
  • Experiment Dashboards

Languages

  • Mandarin Chinese (native)
  • English (fluent)
  • Cantonese (beginner)

Honors & Awards

First-Class Honours - University of Edinburgh · 2025

Outstanding Student - Dalian University of Technology · 2021 - 2023

First-Class Scholarship - Dalian University of Technology · 2022 - 2023

International Study Scholarship - Dalian University of Technology · 2022 - 2023

Second-Class Scholarship - Dalian University of Technology · 2021 - 2022