Research | Xudong Wu

May 2026 - Present

Project Lead, Agentic Intelligence Lab · The University of Hong Kong

Lead the development of an end-to-end LLM-agent system for home-grid coordination, combining long-horizon planning, tool-augmented reasoning, persistent memory, preference learning, safety guardrails, and closed-loop execution.
Design the agent loop end to end: home/grid state perception, context retrieval, and memory grounding, followed by multi-agent negotiation, plan generation, tool/API invocation, constraint checking, device actuation, feedback collection, and memory update.
Build preference-aware household agents and grid-aware coordinator agents for personalized demand response, jointly optimizing comfort, electricity cost, appliance schedules, virtual power plant (VPP) compliance, and interruption cost subject to user-specific constraints.
Develop agent evaluation and self-improvement infrastructure with persona simulation, human-in-the-loop feedback, trajectory logging, reward/objective scoring, failure diagnosis, scenario replay, and repeated-interaction preference adaptation.
Compare LLM-agent policies against rule-based agents, model predictive control (MPC), EnergyPlus-informed MPC, and RL-based controllers under realistic demand-response events.

LLM Agents · Long-Horizon Planning · Preference Learning · Energy Systems · Safety Evaluation

2026 - Present

Researcher, Agentic Intelligence Lab · The University of Hong Kong

Work on self-play post-training for LLM agents, combining Group Relative Policy Optimization (GRPO), policy rollouts, reward/verifier signals, and Policy-Space Response Oracles (PSRO).
Formulate agent training as population-based reinforcement learning, where policies are iteratively optimized against evolving opponent, environment, and task distributions rather than static supervised data.
Explore scalable RLHF/RLAIF-style pipelines for agentic reasoning, including trajectory sampling, outcome-based rewards, verifier-guided policy improvement, opponent sampling, and held-out strategic evaluation.
Focus on improving strategic reasoning, robustness, tool-use behavior, and multi-agent adaptation through game-theoretic self-play and reinforcement learning post-training.

GRPO · PSRO · Self-play · Agentic Reasoning · RLAIF

2025 - Present

Researcher, Agentic Intelligence Lab · The University of Hong Kong

Study theoretical foundations of online LLM alignment, including self-improving preference optimization, policy improvement dynamics, stability, and convergence under model-generated feedback.
Develop robust RLHF algorithms beyond pairwise preference learning, including distributionally robust listwise preference optimization under preference noise, distribution shift, and ranking uncertainty.
Work on trust-region online RLHF methods with controlled policy updates, connecting practical post-training algorithms with classical reinforcement learning theory and convergence guarantees.
Contributed to "On the Convergence of Self-Improving Online LLM Alignment," accepted to UAI 2026, and to ongoing work on SAIL-TRPO.

RLHF · Preference Optimization · Online Learning · Trust Region Theory

2025 - Present

Collaborating Researcher · HKU / Collaborators

Contribute to projects on offline goal-conditioned RL, occupancy-based reward shaping, hierarchical counterfactual regret minimization, and large-scale fixed-point problem decomposition.
Focus on algorithmic formulation, proof checking, theoretical positioning, and translating RL/game-theoretic ideas into publishable research claims.
Co-authored "Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning," accepted to ICLR 2026.
Work across RL and game-theoretic settings where evaluation, decomposition, and robust policy improvement matter for scalable decision-making systems.

Offline RL · Reward Shaping · Game Theory · Fixed-Point Problems

Jun 2024 - Sep 2024

Research Assistant, Donald Bren School of ICS · University of California, Irvine

Worked on Texera, a large-scale workflow engine for data analytics, with emphasis on LLM-assisted workflow interpretation, backend integration, HTML report generation, and real-time visualization.
Designed a modular LLM-agent reporting component to interpret pipeline outputs, generate structured summaries, and support natural-language workflow analysis.
Implemented system-level improvements for interactive report generation and scalable execution in a workflow-oriented data-analysis platform.

ML Systems · LLM Agents · Data Analytics · Report Generation

2025

Project Lead, Machine Learning Practical · University of Edinburgh

Designed a self-rewarding alignment framework for medical LLMs using DPO-style preference optimization and dynamically refined LLM-as-a-judge prompts.
Built an evaluation pipeline with self-question generation, multi-response sampling, self-judgment scoring, and reward refinement on PubMedQA and MedMCQA-style tasks.
Analyzed failure modes caused by reward misspecification, hallucination accumulation, and task difficulty, highlighting limits of closed-loop self-improvement.

LLM Alignment · DPO · Medical AI · Evaluation

2024 - 2025

Honours Dissertation · Advisor: Dr. Amanda Lenzi · University of Edinburgh

Conducted a comparative study of amortized simulation-based inference methods, including BayesFlow, Sequential Neural Likelihood, and Affine Flow Matching.
Designed controlled benchmark experiments for posterior recovery and calibration under structured statistical models, with emphasis on inference quality, robustness, and diagnostic evaluation.
Extended the study to spatial disease-mapping settings with Poisson-CAR models, analyzing posterior calibration, parameter recovery, and robustness under strong spatial dependence.
Implemented a reproducible experimental workflow and released open-source code.

Bayesian Inference · Simulation-Based Inference · Normalizing Flows · Calibration
