Research Experience
My research spans LLM post-training, RLHF/RLAIF, agentic AI systems, offline reinforcement learning, game-theoretic optimization, and applied ML systems.
May 2026 - Present
EnergyBridge: Long-Horizon LLM Agents for Virtual Power Plants
Project Lead, Agentic Intelligence Lab · The University of Hong Kong
- Lead the development of an end-to-end LLM-agent system for home-grid coordination, combining long-horizon planning, tool-augmented reasoning, persistent memory, preference learning, safety guardrails, and closed-loop execution.
- Design the full agent loop, from home/grid state perception, context retrieval, and memory grounding through multi-agent negotiation, plan generation, tool/API invocation, constraint checking, and device actuation to feedback collection and memory update.
- Build preference-aware household agents and grid-aware coordinator agents for personalized demand response, optimizing comfort, electricity cost, appliance schedules, VPP compliance, interruption cost, and user-specific constraints.
- Develop agent evaluation and self-improvement infrastructure with persona simulation, human-in-the-loop feedback, trajectory logging, reward/objective scoring, failure diagnosis, scenario replay, and repeated-interaction preference adaptation.
- Compare LLM-agent policies against rule-based agents, model predictive control (MPC), EnergyPlus-informed MPC, and RL-based controllers under realistic demand-response events.
2026 - Present
Self-Play LLM Agent Training with GRPO and PSRO
Researcher, Agentic Intelligence Lab · The University of Hong Kong
- Work on self-play post-training for LLM agents, combining Group Relative Policy Optimization (GRPO), policy rollouts, reward/verifier signals, and Policy-Space Response Oracles (PSRO).
- Formulate agent training as population-based reinforcement learning, where policies are iteratively optimized against evolving opponent, environment, and task distributions rather than static supervised data.
- Explore scalable RLHF/RLAIF-style pipelines for agentic reasoning, including trajectory sampling, outcome-based rewards, verifier-guided policy improvement, opponent sampling, and held-out strategic evaluation.
- Focus on improving strategic reasoning, robustness, tool-use behavior, and multi-agent adaptation through game-theoretic self-play and reinforcement learning post-training.
2025 - Present
RLHF, Preference Optimization, and Online Alignment Theory
Researcher, Agentic Intelligence Lab · The University of Hong Kong
- Study theoretical foundations of online LLM alignment, including self-improving preference optimization, policy improvement dynamics, stability, and convergence under model-generated feedback.
- Develop robust RLHF algorithms beyond pairwise preference learning, including distributionally robust listwise preference optimization under preference noise, distribution shift, and ranking uncertainty.
- Work on trust-region online RLHF methods with controlled policy updates, connecting practical post-training algorithms with classical reinforcement learning theory and convergence guarantees.
- Contributed to "On the Convergence of Self-Improving Online LLM Alignment," accepted to UAI 2026, and to ongoing work on SAIL-TRPO.
2025 - Present
Reinforcement Learning, Offline RL, and Game-Theoretic Optimization
Collaborating Researcher · HKU / Collaborators
- Contribute to projects on offline goal-conditioned RL, occupancy-based reward shaping, hierarchical counterfactual regret minimization, and large-scale fixed-point problem decomposition.
- Focus on algorithmic formulation, proof checking, theoretical positioning, and translating RL/game-theoretic ideas into publishable research claims.
- Co-authored "Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning," accepted to ICLR 2026.
- Work across RL and game-theoretic settings where evaluation, decomposition, and robust policy improvement matter for scalable decision-making systems.
Jun 2024 - Sep 2024
LLM Agents for Automated Workflow Analysis on the Texera Platform
Research Assistant, Donald Bren School of ICS · University of California, Irvine
- Worked on Texera, a large-scale workflow engine for data analytics, with emphasis on LLM-assisted workflow interpretation, backend integration, HTML report generation, and real-time visualization.
- Designed a modular LLM-agent reporting component to interpret pipeline outputs, generate structured summaries, and support natural-language workflow analysis.
- Implemented system-level improvements for interactive report generation and scalable execution in a workflow-oriented data-analysis platform.
2025
Self-Rewarding Framework for Medical LLMs
Project Lead, Machine Learning Practical · University of Edinburgh
- Designed a self-rewarding alignment framework for medical LLMs using DPO-style preference optimization and dynamically refined LLM-as-a-judge prompts.
- Built an evaluation pipeline with self-question generation, multi-response sampling, self-judgment scoring, and reward refinement on PubMedQA- and MedMCQA-style tasks.
- Analyzed failure modes caused by reward misspecification, hallucination accumulation, and task difficulty, highlighting limits of closed-loop self-improvement.
2024 - 2025
A Comparative Study of Simulation-Based Inference Algorithms
Honours Dissertation · Advisor: Dr. Amanda Lenzi · University of Edinburgh
- Conducted a comparative study of amortized simulation-based inference methods, including BayesFlow, Sequential Neural Likelihood, and Affine Flow Matching.
- Designed controlled benchmark experiments for posterior recovery and calibration under structured statistical models, with emphasis on inference quality, robustness, and diagnostic evaluation.
- Extended the study to spatial disease-mapping settings with Poisson-CAR models, analyzing posterior calibration, parameter recovery, and robustness under strong spatial dependence.
- Implemented a reproducible experimental workflow and released open-source code.