Socratic Tutoring LLM via Multi-Stage Policy Optimization

Fine-tuning open-source LLMs with multi-stage policy optimization (SFT, offline DPO, online GRPO) to act as Socratic tutors that guide students without prematurely leaking answers.

Under the supervision of Prof. Mrinmaya Sachan and Jakub Macina at ETH Zurich.