AgentDebuggerEnv — GRPO Training Monitor

Training Qwen2.5-Coder-7B-Instruct on structured hypothesis-driven debugging.

  • Algorithm: GRPO (Group Relative Policy Optimization, the RL algorithm used to train DeepSeek-R1)
  • Dataset: 90 hand-validated bugs across 3 difficulty tiers
  • Curriculum: Tier 1 (steps 0–150) → Tiers 1+2 (steps 150–350) → All tiers (steps 350–500)
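
The curriculum and the GRPO signal listed above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the names `CURRICULUM`, `tiers_for_step`, and `group_relative_advantages` are hypothetical, and the sketch assumes that each curriculum phase begins at its lower step bound (so step 150 already samples from Tiers 1+2). The advantage function shows the standard group-relative normalization that GRPO applies to the rewards of completions sampled for the same prompt; the `eps` value is likewise an assumption.

```python
from statistics import mean, pstdev

# (first step of the phase, tiers sampled during that phase).
# Assumption: a phase starts at its lower bound, so step 150 is already Tier 1+2.
CURRICULUM = [
    (0, (1,)),
    (150, (1, 2)),
    (350, (1, 2, 3)),
]
TOTAL_STEPS = 500


def tiers_for_step(step: int) -> tuple[int, ...]:
    """Return the difficulty tiers that are active at a given training step."""
    if not 0 <= step <= TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS}], got {step}")
    active = CURRICULUM[0][1]
    for start, tiers in CURRICULUM:
        if step >= start:
            active = tiers  # the latest phase whose start has been reached wins
    return active


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), which is the
    group-relative baseline GRPO uses in place of a learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    for step in (0, 149, 150, 350, 500):
        print(step, tiers_for_step(step))
    print(group_relative_advantages([0.0, 1.0, 1.0, 0.0]))
```

Keeping the schedule in a small table makes it easy to move a phase boundary without touching the sampling logic. When every completion in a group earns the same reward, the advantages are all zero, so that group contributes no gradient.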