Projects

Research and engineering projects in LLM agents, post-training, RL, evaluation infrastructure, and applied ML systems.

End-to-end LLM-agent system for virtual power plant coordination. Builds preference-aware household agents and grid-aware coordinator agents with tool use, memory, safety constraints, and closed-loop execution. Supports persona simulation, trajectory logging, scenario replay, and policy comparison against MPC, rule-based agents, and RL controllers.

Self-Play LLM Agent Training

Population-based post-training project for LLM agents using GRPO, policy rollouts, verifier rewards, opponent sampling, and PSRO-style policy improvement. Targets stronger strategic reasoning, robust tool use, multi-agent adaptation, and evaluation against evolving task and environment distributions.

Online LLM Alignment & Preference Optimization

Research program on self-improving online LLM alignment, robust listwise preference optimization, and trust-region RLHF. Connects practical post-training algorithms with convergence guarantees, stability analysis, preference-noise robustness, distribution shift, and controlled policy updates.

Contributed to Texera, a collaborative data analytics workflow platform at UC Irvine. Designed an LLM-agent reporting component that interprets pipelines, produces structured summaries and HTML reports, and integrates with the backend for real-time visualization and scalable workflow-oriented execution.

Self-Rewarding Framework for Medical LLMs

DPO-style alignment framework for medical LLMs using dynamically refined LLM-as-a-judge prompts. Built self-question generation, multi-response sampling, self-judgment scoring, and reward refinement, and analyzed failure modes including hallucination accumulation, reward misspecification, and task difficulty.

Reproducible benchmarking framework for amortized simulation-based inference algorithms. Compares BayesFlow, Sequential Neural Likelihood, and Affine Flow Matching on controlled statistical tasks and Poisson-CAR spatial disease-mapping models using posterior recovery, calibration, robustness, and diagnostic evaluation.

Offline Goal-Conditioned RL Reward Shaping

Research contribution to occupancy-based reward shaping for offline goal-conditioned reinforcement learning, improving credit assignment and policy learning in settings where direct online environment interaction is limited. Work accepted to ICLR 2026.

Game-Theoretic RL & Fixed-Point Optimization

Collaboration on hierarchical counterfactual regret minimization and large-scale fixed-point problem decomposition. Focus areas include algorithmic formulation, proof checking, theoretical positioning, and scalable decomposition for decision-making systems.