AlphaQ
GRPO-Trained RL Agent for Minesweeper

Problem
LLMs struggle with tasks that require iterative reasoning under partial information. Minesweeper is a clean testbed: it is probabilistic and partially observable, and the agent must emit consistent JSON output across many turns. Most off-the-shelf models either hallucinate structure or fail to maintain constraints across moves.
Approach
GRPO (Group Relative Policy Optimization) fine-tuning of Qwen2.5-14B-Instruct via Unsloth + LoRA, with a custom reward that combines JSON-output purity, gameplay correctness, and strategic bonuses. A phase-stratified curriculum covers 6 difficulty levels across 4,000 samples. Training ran on an AMD Instinct MI300X (192GB HBM3) under ROCm in BFloat16. A stack-based iterative engine handles boards up to 50×50 without re-prompting the full state.
At a glance
Base model
Qwen2.5-14B-Instruct
Method
GRPO + LoRA
Training samples
4,000
Difficulty levels
6 (curriculum)
Max grid
50×50
Hardware
AMD MI300X · 192GB HBM3
Tech decisions
GRPO over PPO
Sample-efficient on small datasets, with no separate value model to train or hold in memory.
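As a rough illustration of why GRPO needs no value model: each sampled completion is scored against the other completions drawn for the same prompt, so the group itself acts as the baseline. The sketch below shows that normalization in plain Python. The function name, the epsilon value, and the use of the population standard deviation are illustrative assumptions, not this project's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    The group mean serves as the baseline, so no learned value model is needed.
    `eps` (an assumed value) guards against division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by the reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions that beat the group average get a positive advantage and the rest get a negative one; a group with identical rewards contributes no learning signal.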
Phase-stratified curriculum
Starting easy prevents reward collapse and unstable rollouts on hard boards.
Composite reward (JSON + gameplay + strategy)
A single signal keeps outputs both syntactically valid and strategically coherent.
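A composite reward of this shape could be sketched as follows. Everything specific here is an assumption made for illustration: the move schema (action, row, col), the weights, and the choice of "acts next to an already revealed cell" as the strategic bonus. The project's actual reward terms are not shown in this summary.

```python
import json


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def composite_reward(completion, mines, revealed, rows, cols):
    """Score one completion. Schema and weights are hypothetical."""
    try:
        move = json.loads(completion.strip())
    except json.JSONDecodeError:
        return -1.0  # extra prose or broken JSON: nothing else can be scored
    if not isinstance(move, dict):
        return -1.0
    action, r, c = move.get("action"), move.get("row"), move.get("col")
    if action not in ("reveal", "flag") or not _is_int(r) or not _is_int(c):
        return -0.5  # parses as JSON but does not follow the move schema

    reward = 0.2  # JSON purity term
    if not (0 <= r < rows and 0 <= c < cols) or (r, c) in revealed:
        return reward - 0.5  # illegal move: off the board or already revealed

    # Gameplay correctness term
    is_mine = (r, c) in mines
    if action == "reveal":
        reward += -1.0 if is_mine else 0.5
    else:
        reward += 0.5 if is_mine else -0.3

    # Strategic bonus: the move borders a revealed cell, so it can use known counts
    if any((r + dr, c + dc) in revealed
           for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)):
        reward += 0.1
    return reward
```

Folding all three terms into one scalar means a completion cannot score well by being valid JSON alone, or by naming a good move inside malformed output.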
Stack-based iterative engine
Avoids re-prompting the full state on every move; scales to 50×50 without context blow-up.
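The engine itself is not shown in this summary. One common stack-based technique it may resemble is an iterative cascade reveal: an explicit stack replaces recursion, so opening a large empty region on a 50×50 board cannot hit Python's recursion limit. The sketch below assumes mines are stored as a set of (row, col) tuples and returns only the newly revealed cells; both choices, and the function name, are hypothetical.

```python
def reveal(mines, rows, cols, start):
    """Reveal `start` and cascade through zero-count cells with an explicit stack.

    Returns {cell: adjacent_mine_count} for the newly revealed cells only.
    """
    if start in mines:
        raise ValueError("start cell is a mine")

    def neighbors(r, c):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) != (0, 0) and 0 <= nr < rows and 0 <= nc < cols:
                    yield nr, nc

    revealed = {}
    stack = [start]
    while stack:
        cell = stack.pop()
        if cell in revealed:
            continue
        count = sum(n in mines for n in neighbors(*cell))
        revealed[cell] = count
        if count == 0:
            # No adjacent mines, so every neighbor is safe to open as well
            stack.extend(n for n in neighbors(*cell) if n not in revealed)
    return revealed


# Opening a corner of an empty 50x50 board reveals every cell without recursion.
opened = reveal(set(), 50, 50, (0, 0))
```

Returning only the delta is one way a caller could describe a move's effect to the model without serializing the whole board again.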
AMD MI300X (192GB HBM3)
Fits the 14B model plus GRPO group rollouts on a single device.