Socratic Tutoring LLM via Multi-Stage Policy Optimization
Fine-tuning open-source LLMs with a multi-stage policy optimization pipeline (SFT, offline DPO, online GRPO) to act as Socratic tutors that guide students toward solutions without prematurely revealing answers.
Under the supervision of Prof. Mrinmaya Sachan and Jakub Macina at ETH Zurich.
