Research Experience

My research spans LLM post-training, RLHF/RLAIF, agentic AI systems, offline reinforcement learning, game-theoretic optimization, and applied ML systems.

May 2026 - Present

EnergyBridge: Long-Horizon LLM Agents for Virtual Power Plants

Project Lead, Agentic Intelligence Lab · The University of Hong Kong

  • Lead the development of an end-to-end LLM-agent system for home-grid coordination, combining long-horizon planning, tool-augmented reasoning, persistent memory, preference learning, safety guardrails, and closed-loop execution.
  • Design the full agent loop: home/grid state perception, context retrieval, and memory grounding, then multi-agent negotiation, plan generation, tool/API invocation, constraint checking, device actuation, feedback collection, and memory update.
  • Build preference-aware household agents and grid-aware coordinator agents for personalized demand response, optimizing comfort, electricity cost, appliance schedules, VPP compliance, interruption cost, and user-specific constraints.
  • Develop agent evaluation and self-improvement infrastructure with persona simulation, human-in-the-loop feedback, trajectory logging, reward/objective scoring, failure diagnosis, scenario replay, and repeated-interaction preference adaptation.
  • Compare LLM-agent policies against rule-based agents, model predictive control (MPC), EnergyPlus-informed MPC, and RL-based controllers under realistic demand-response events.
LLM Agents · Long-Horizon Planning · Preference Learning · Energy Systems · Safety Evaluation
2026 - Present

Self-Play LLM Agent Training with GRPO and PSRO

Researcher, Agentic Intelligence Lab · The University of Hong Kong

  • Work on self-play post-training for LLM agents, combining Group Relative Policy Optimization (GRPO), policy rollouts, reward/verifier signals, and Policy-Space Response Oracles (PSRO).
  • Formulate agent training as population-based reinforcement learning, where policies are iteratively optimized against evolving opponent, environment, and task distributions rather than static supervised data.
  • Explore scalable RLHF/RLAIF-style pipelines for agentic reasoning, including trajectory sampling, outcome-based rewards, verifier-guided policy improvement, opponent sampling, and held-out strategic evaluation.
  • Focus on improving strategic reasoning, robustness, tool-use behavior, and multi-agent adaptation through game-theoretic self-play and reinforcement learning post-training.
GRPO · PSRO · Self-Play · Agentic Reasoning · RLAIF
2025 - Present

RLHF, Preference Optimization, and Online Alignment Theory

Researcher, Agentic Intelligence Lab · The University of Hong Kong

  • Study theoretical foundations of online LLM alignment, including self-improving preference optimization, policy improvement dynamics, stability, and convergence under model-generated feedback.
  • Develop robust RLHF algorithms beyond pairwise preference learning, including distributionally robust listwise preference optimization under preference noise, distribution shift, and ranking uncertainty.
  • Work on trust-region online RLHF methods with controlled policy updates, connecting practical post-training algorithms with classical reinforcement learning theory and convergence guarantees.
  • Contributed to "On the Convergence of Self-Improving Online LLM Alignment," accepted to UAI 2026, and to ongoing work on SAIL-TRPO.
RLHF · Preference Optimization · Online Learning · Trust Region · Theory
2025 - Present

Reinforcement Learning, Offline RL, and Game-Theoretic Optimization

Collaborating Researcher · HKU / Collaborators

  • Contribute to projects on offline goal-conditioned RL, occupancy-based reward shaping, hierarchical counterfactual regret minimization, and large-scale fixed-point problem decomposition.
  • Focus on algorithmic formulation, proof checking, theoretical positioning, and translating RL/game-theoretic ideas into publishable research claims.
  • Co-authored "Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning," accepted to ICLR 2026.
  • Work across RL and game-theoretic settings where evaluation, decomposition, and robust policy improvement matter for scalable decision-making systems.
Offline RL · Reward Shaping · Game Theory · Fixed-Point Problems
Jun 2024 - Sep 2024

LLM Agents for Automated Workflow Analysis on the Texera Platform

Research Assistant, Donald Bren School of ICS · University of California, Irvine

  • Worked on Texera, a large-scale workflow engine for data analytics, with emphasis on LLM-assisted workflow interpretation, backend integration, HTML report generation, and real-time visualization.
  • Designed a modular LLM-agent reporting component to interpret pipeline outputs, generate structured summaries, and support natural-language workflow analysis.
  • Implemented system-level improvements for interactive report generation and scalable execution in a workflow-oriented data-analysis platform.
ML Systems · LLM Agents · Data Analytics · Report Generation
2025

Self-Rewarding Framework for Medical LLMs

Project Lead, Machine Learning Practical · University of Edinburgh

  • Designed a self-rewarding alignment framework for medical LLMs using DPO-style preference optimization and dynamically refined LLM-as-a-judge prompts.
  • Built an evaluation pipeline with self-question generation, multi-response sampling, self-judgment scoring, and reward refinement on PubMedQA and MedMCQA-style tasks.
  • Analyzed failure modes caused by reward misspecification, hallucination accumulation, and task difficulty, highlighting limits of closed-loop self-improvement.
LLM Alignment · DPO · Medical AI · Evaluation
2024 - 2025

A Comparative Study of Simulation-Based Inference Algorithms

Honours Dissertation · Advisor: Dr. Amanda Lenzi · University of Edinburgh

  • Conducted a comparative study of amortized simulation-based inference methods, including BayesFlow, Sequential Neural Likelihood, and Affine Flow Matching.
  • Designed controlled benchmark experiments for posterior recovery and calibration under structured statistical models, with emphasis on inference quality, robustness, and diagnostic evaluation.
  • Extended the study to spatial disease-mapping settings with Poisson-CAR models, analyzing posterior calibration, parameter recovery, and robustness under strong spatial dependence.
  • Implemented a reproducible experimental workflow and released open-source code.
Bayesian Inference · Simulation-Based Inference · Normalizing Flows · Calibration