AMD AI Hackathon · Special MentionMLAI/ML

AlphaQ

GRPO-Trained RL Agent for Minesweeper

Problem

LLMs struggle with tasks requiring iterative reasoning under partial information. Minesweeper is a clean testbed: probabilistic, partially observable, and demands consistent JSON output across many turns. Most off-the-shelf models either hallucinate structure or fail to maintain constraints across moves.

Approach

GRPO (Group Relative Policy Optimization) fine-tuning on Qwen2.5-14B-Instruct via Unsloth + LoRA, with a custom reward combining JSON-output purity, gameplay correctness, and strategic bonuses. Phase-stratified curriculum across 6 difficulty levels on 4,000 samples. Trained on AMD Instinct MI300X (192GB HBM3) under ROCm in BFloat16. A stack-based iterative engine handles boards up to 50×50 without re-prompting full state.

At a glance

Base model

Qwen2.5-14B-Instruct

Method

GRPO + LoRA

Training samples

4,000

Difficulty levels

6 (curriculum)

Max grid

50×50

Hardware

AMD MI300X · 192GB HBM3

Tech decisions

GRPO over PPO
Sample-efficient on small datasets and no separate value-model overhead.
Phase-stratified curriculum
Starting easy prevents reward collapse and unstable rollouts on hard boards.
Composite reward (JSON + gameplay + strategy)
Single signal kept outputs both syntactically valid and strategically coherent.
Stack-based iterative engine
Avoids re-prompting full state on every move; scales to 50×50 without context blow-up.
AMD MI300X (192GB HBM3)
Fits the 14B model plus GRPO group rollouts in a single device.

Stack

PythonPyTorchUnslothGRPOLoRAQwen2.5ROCmBFloat16

GitHub