META RL Phase 2 · OpenEnv · ML · AI/ML

QuantumScribe

OpenEnv RL Environment for LLM Quantum Error Correction

Problem

Quantum processors need decoders that map stabilizer syndromes to Pauli corrections without measuring data qubits directly. PyMatching is a strong classical baseline (sparse blossom on detector graphs). DeepMind's AlphaQubit showed a transformer can beat it on hard cases, but at large-scale TPU training cost. The META RL Phase 2 track asked for an OpenEnv environment where an off-the-shelf LLM learns decoding from verifiable physics rewards.

Approach

Built QuantumScribe (Qubit-Medic): a FastAPI OpenEnv server over Stim's surface_code:rotated_memory_z circuits with SI1000 noise. The LLM emits a terminal Pauli frame (X_ERRORS / Z_ERRORS); five independent rewards score logical correction, final-round syndrome consistency, Jaccard overlap vs PyMatching, format compliance, and a pymatching_beat bonus only when the model is right and PyMatching is wrong. Curriculum runs L1_warmup → L2_target → L3_stretch. Training: LoRA SFT on PyMatching labels, then GRPO with diversity-focused rollouts (temperature 1.2) so reward variance does not collapse.
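The reward decomposition described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: the completion layout (an `X_ERRORS:` line and a `Z_ERRORS:` line, each with comma-separated qubit indices), the helper names, the 0/1 reward magnitudes, and every reward key other than logical_correction and pymatching_beat are hypothetical. The two physics checks are passed in as precomputed booleans, since in the real environment they would come from Stim.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical output layout: one line per label, each followed by
# comma-separated data-qubit indices (the list may be empty).
LABELS = ("X_ERRORS", "Z_ERRORS")
_LINE = re.compile(r"^(X_ERRORS|Z_ERRORS):\s*((?:\d+(?:\s*,\s*\d+)*)?)\s*$")

Frame = Dict[str, Set[int]]


def parse_frame(text: str) -> Optional[Frame]:
    """Return the parsed Pauli frame, or None if the text is malformed."""
    frame: Frame = {}
    for line in text.strip().splitlines():
        m = _LINE.match(line.strip())
        if m is None or m.group(1) in frame:
            return None  # unknown line or duplicated label
        body = m.group(2)
        frame[m.group(1)] = {int(tok) for tok in body.split(",")} if body else set()
    return frame if set(frame) == set(LABELS) else None


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap; two empty sets count as a perfect match."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def as_paulis(frame: Frame) -> set:
    """Flatten a frame into {('X', qubit), ('Z', qubit), ...}."""
    return {(label[0], q) for label in LABELS for q in frame[label]}


def score(text: str, pymatching_frame: Frame, logical_ok: bool,
          syndrome_ok: bool, pymatching_logical_ok: bool) -> Dict[str, float]:
    """Five separate reward components (assumed 0/1 except the overlap)."""
    rewards = dict.fromkeys(
        ("logical_correction", "syndrome_consistency", "pymatching_jaccard",
         "format_compliance", "pymatching_beat"), 0.0)
    frame = parse_frame(text)
    if frame is None:
        return rewards  # malformed output earns nothing on any component
    rewards["format_compliance"] = 1.0
    rewards["logical_correction"] = float(logical_ok)
    rewards["syndrome_consistency"] = float(syndrome_ok)
    rewards["pymatching_jaccard"] = jaccard(as_paulis(frame),
                                            as_paulis(pymatching_frame))
    # Bonus only when the model is right and PyMatching is wrong.
    rewards["pymatching_beat"] = float(logical_ok and not pymatching_logical_ok)
    return rewards


example = score("X_ERRORS: 1,4\nZ_ERRORS:",
                {"X_ERRORS": {1}, "Z_ERRORS": set()},
                logical_ok=True, syndrome_ok=True, pymatching_logical_ok=False)
```

Keeping the components in a dictionary rather than summing them immediately makes it possible to log each one separately, which is how a disclosed 0% pymatching_beat rate would show up alongside a high logical_correction rate.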

At a glance

Logical correction (GRPO): 96.4%
Base Qwen (same prompt): 92.0%
Exact-match PyMatching: 73.4%
PyMatching beat-rate: 0% (disclosed)
Training: Colab T4 · ~3 h
Model: Qwen2.5-3B + LoRA

Tech decisions

Five independent verifiable rewards
A GRPO-trained policy can game a single scalar reward; decomposed Stim/PyMatching checks block collapse to empty outputs, mimicry, and format spam by construction.
GRPO rather than offline labels alone
SFT on PyMatching labels is capped at imitating PyMatching; RL needs on-policy rollouts against real syndromes to sharpen format and logical correction.
OpenEnv HTTP contract
Same submission pattern as InferenceGym: typed reset/step, a deployable Docker Space, and a trainer that can swap between a local and a remote client.
Stim + PyMatching (not a custom simulator)
Aligns with AlphaQubit/Willow literature and gives unfakeable logical_correction ground truth.
Honest pymatching_beat reporting
The primary eval shows the model matching PyMatching, not beating it; reporting that plainly keeps portfolio claims defensible for reviewers.
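Why reward variance matters for GRPO can be shown with the group-relative advantage: each completion's reward is standardized against the other completions sampled for the same prompt. The sketch below uses made-up reward values; the function name, the epsilon, and the use of the population standard deviation are assumptions for illustration, and real trainers compute this inside the training loop.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Standardize rewards within one group of rollouts for a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a simplification
    return [(r - mu) / (sigma + eps) for r in rewards]


# All rollouts identical (e.g. low temperature): every advantage is zero,
# so the group contributes no gradient signal.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])

# Diverse rollouts: some completions are pushed up, others pushed down.
diverse = group_advantages([1.0, 0.0, 1.0, 0.5])
```

When every rollout in a group earns the same reward, the advantages vanish and the update carries no learning signal, which is the motivation for sampling at a higher temperature and for scoring several distinct components.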

Stack

Python · PyTorch · GRPO · LoRA · Unsloth · Qwen2.5 · Stim · PyMatching · OpenEnv · FastAPI · TRL · Hugging Face Spaces

Live demo · GitHub