Curriculum Vitae
Research profile for industry internships in LLM post-training, reinforcement learning, agentic AI systems, and applied ML infrastructure.
I am a Ph.D. student working on reinforcement learning, LLM alignment, agentic AI systems, and post-training. I am seeking industry research internships where I can build reliable model-improvement pipelines, evaluate agent behavior, and connect RL theory with scalable AI systems.
Education
The University of Hong Kong
Doctor of Philosophy · Hong Kong SAR
Research: reinforcement learning, LLM alignment, agentic AI systems, and post-training
Supervisor: Prof. Jiayu Chen
University of Edinburgh
BSc (Hons) Mathematics and Statistics · Edinburgh, UK
Average Score: 78/100 · First-Class Honours
Dissertation: Amortized Inference
Dalian University of Technology
BSc Information and Computing Science · Dalian, China
Overall GPA: 3.99/4.00 · Top 5%
Selected Publications and Preprints
On the Convergence of Self-Improving Online LLM Alignment
Xudong Wu, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.
Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning
Aravind Venugopal, Jiayu Chen, Xudong Wu, Chongyi Zheng, Benjamin Eysenbach, and Jeff Schneider.
Distributionally Robust Listwise Preference Optimization
Xudong Wu, Jian Qian, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.
Core-Halo Decomposition: Decentralizing Large-Scale Fixed-Point Problem
Haixiang Sun, Yang Xu, Jiefu Zhang, Xudong Wu, Zihan Zhou, Jun He, and Jiayu Chen.
SAIL-TRPO: A Trust-Region Online RLHF Algorithm with Guaranteed Fast Convergence
Xudong Wu, Pangpang Liu, Vaneet Aggarwal, and Jiayu Chen.
Hierarchical Deep Counterfactual Regret Minimization
Jiayu Chen, Xudong Wu, Zhekai Wang, and Vaneet Aggarwal.
Research Experience
EnergyBridge: Long-Horizon LLM Agents for Virtual Power Plants
Project Lead, Agentic Intelligence Lab · HKU
- Lead the development of an end-to-end LLM-agent system for home-grid coordination, combining long-horizon planning, tool-augmented reasoning, persistent memory, preference learning, safety guardrails, and closed-loop execution.
- Design the full agent loop from state perception and context retrieval to multi-agent negotiation, plan generation, tool/API invocation, constraint checking, device actuation, feedback collection, and memory update.
- Build preference-aware household agents and grid-aware coordinator agents for personalized demand response, optimizing comfort, electricity cost, appliance schedules, VPP compliance, interruption cost, and user-specific constraints.
- Develop evaluation and self-improvement infrastructure with persona simulation, human-in-the-loop feedback, trajectory logging, reward/objective scoring, failure diagnosis, scenario replay, and repeated-interaction adaptation.
Self-Play LLM Agent Training with GRPO and PSRO
Researcher, Agentic Intelligence Lab · HKU
- Work on self-play post-training for LLM agents, combining Group Relative Policy Optimization, policy rollouts, reward/verifier signals, and Policy-Space Response Oracles.
- Formulate agent training as population-based reinforcement learning where policies are iteratively optimized against evolving opponent, environment, and task distributions.
- Explore scalable RLHF/RLAIF pipelines for agentic reasoning, including trajectory sampling, outcome-based rewards, verifier-guided policy improvement, opponent sampling, and strategic evaluation.
- Focus on strategic reasoning, robustness, tool-use behavior, and multi-agent adaptation through game-theoretic self-play and reinforcement learning post-training.
RLHF, Preference Optimization, and Online Alignment Theory
Researcher, Agentic Intelligence Lab · HKU
- Study theoretical foundations of online LLM alignment, including self-improving preference optimization, policy improvement dynamics, stability, and convergence under model-generated feedback.
- Develop robust RLHF algorithms beyond pairwise preference learning, including distributionally robust listwise preference optimization under preference noise, distribution shift, and ranking uncertainty.
- Work on trust-region online RLHF methods with controlled policy updates, connecting practical post-training algorithms with classical reinforcement learning theory and convergence guarantees.
Reinforcement Learning, Offline RL, and Game-Theoretic Optimization
Collaborating Researcher · HKU / Collaborators
- Contribute to projects on offline goal-conditioned RL, occupancy-based reward shaping, hierarchical counterfactual regret minimization, and large-scale fixed-point problem decomposition.
- Focus on algorithmic formulation, proof checking, theoretical positioning, and turning RL and game-theoretic ideas into precisely stated research claims.
LLM Agents for Automated Workflow Analysis on the Texera Platform
Research Assistant, Donald Bren School of ICS · UC Irvine
- Worked on Texera, a large-scale workflow engine for data analytics, with emphasis on LLM-assisted workflow interpretation, backend integration, HTML report generation, and real-time visualization.
- Designed a modular LLM-agent reporting component to interpret pipeline outputs, generate structured summaries, and support natural-language workflow analysis.
- Implemented system-level improvements for interactive report generation and scalable workflow execution within Texera.
Selected Project Experience
Self-Rewarding Framework for Medical LLMs
Project Lead, Machine Learning Practical · University of Edinburgh
- Designed a self-rewarding alignment framework for medical LLMs using DPO-style preference optimization and dynamically refined LLM-as-a-judge prompts.
- Built an evaluation pipeline with self-question generation, multi-response sampling, self-judgment scoring, and reward refinement on PubMedQA and MedMCQA-style tasks.
- Analyzed failure modes caused by reward misspecification, hallucination accumulation, and task difficulty, highlighting limits of closed-loop self-improvement.
A Comparative Study of Simulation-Based Inference Algorithms
Honours Dissertation · Advisor: Dr. Amanda Lenzi · University of Edinburgh
- Conducted a comparative study of amortized simulation-based inference methods, including BayesFlow, Sequential Neural Likelihood, and Affine Flow Matching.
- Designed benchmark experiments for posterior recovery and calibration under structured statistical models, with emphasis on inference quality, robustness, and diagnostic evaluation.
- Extended the study to spatial disease-mapping settings with Poisson-CAR models, analyzing posterior calibration, parameter recovery, and robustness under strong spatial dependence.
Skills
Research
- Reinforcement Learning
- LLM Alignment
- Agentic Systems
- Preference Optimization
- Model Predictive Control (MPC)
- Simulation-Based Evaluation
Programming
- Python
- PyTorch
- HuggingFace Transformers
- scikit-learn
- R / MATLAB
- C / C++ / SQL / Bash
Systems
- Git / GitHub
- Linux
- Docker
- LaTeX
- OpenAI-compatible APIs
- EnergyPlus
- Experiment Dashboards
Languages
- Mandarin Chinese (native)
- English (fluent)
- Cantonese (beginner)