← All projects
AMD AI Hackathon · Special MentionMLAI/ML

AlphaQ

GRPO-Trained RL Agent for Minesweeper

AlphaQ - GRPO-Trained RL Agent for Minesweeper

Problem

LLMs struggle with tasks requiring iterative reasoning under partial information. Minesweeper is a clean testbed: probabilistic, partially observable, and demands consistent JSON output across many turns. Most off-the-shelf models either hallucinate structure or fail to maintain constraints across moves.

Approach

GRPO (Group Relative Policy Optimization) fine-tuning on Qwen2.5-14B-Instruct via Unsloth + LoRA, with a custom reward combining JSON-output purity, gameplay correctness, and strategic bonuses. Phase-stratified curriculum across 6 difficulty levels on 4,000 samples. Trained on AMD Instinct MI300X (192GB HBM3) under ROCm in BFloat16. A stack-based iterative engine handles boards up to 50×50 without re-prompting full state.

At a glance

Base model

Qwen2.5-14B-Instruct

Method

GRPO + LoRA

Training samples

4,000

Difficulty levels

6 (curriculum)

Max grid

50×50

Hardware

AMD MI300X · 192GB HBM3

Tech decisions

  • GRPO over PPO

    Sample-efficient on small datasets and no separate value-model overhead.

  • Phase-stratified curriculum

    Starting easy prevents reward collapse and unstable rollouts on hard boards.

  • Composite reward (JSON + gameplay + strategy)

    Single signal kept outputs both syntactically valid and strategically coherent.

  • Stack-based iterative engine

    Avoids re-prompting full state on every move; scales to 50×50 without context blow-up.

  • AMD MI300X (192GB HBM3)

    Fits the 14B model plus GRPO group rollouts in a single device.

Stack

PythonPyTorchUnslothGRPOLoRAQwen2.5ROCmBFloat16
GitHub