InferenceGym
OpenEnv RL Environment for LLM Inference Serving

Problem
LLM serving systems make many interdependent decisions per second (batching, KV-cache budget, speculative decoding depth, quantization tier, prefill/decode disaggregation, priority routing) under non-stationary workloads. Hand-coded policies (Orca, vLLM-style heuristics) plateau because the trade-offs shift with every traffic pattern. The Meta × PyTorch × Hugging Face hackathon asked for a true RL environment that exposes this as a learnable control problem.
Approach
An OpenEnv-compliant environment that models the serving control loop. The action space is a typed ServeAction (batch_cap, kv_budget_fraction, speculation_depth, quantization_tier, prefill_decode_split, priority_routing). Observations surface queue depth, p50/p99 TTFT, inter-token latency, throughput, KV occupancy, GPU memory, SLO compliance, and cost-per-1k. The reward is shaped per step over throughput, SLO compliance, memory, and cost. Three deterministic tasks cover static, bursty, and adversarial multi-tenant workloads. A reference PPO agent is trained on CPU against the simulator, and an OpenAI baseline path is wired in for submission compliance.
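To make the typed action concrete, here is a minimal sketch of what a ServeAction could look like. The six field names come from the description above; the field types, the valid ranges, the QuantizationTier names, and the use of a plain dataclass (rather than whatever base class OpenEnv provides) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class QuantizationTier(Enum):
    # Hypothetical tier names; the source does not list the actual tiers.
    FP16 = "fp16"
    INT8 = "int8"
    INT4 = "int4"


@dataclass(frozen=True)
class ServeAction:
    """Illustrative typed action. Field names follow the text; types and ranges are assumed."""

    batch_cap: int                    # max requests batched per step (assumed >= 1)
    kv_budget_fraction: float         # share of KV-cache memory to use (assumed in [0, 1])
    speculation_depth: int            # speculative decoding depth (assumed >= 0, 0 = off)
    quantization_tier: QuantizationTier
    prefill_decode_split: float       # share of capacity given to prefill (assumed in [0, 1])
    priority_routing: bool            # whether priority routing is enabled (assumed boolean)

    def __post_init__(self) -> None:
        # Reject out-of-range values early so the simulator never sees them.
        if self.batch_cap < 1:
            raise ValueError("batch_cap must be >= 1")
        if not 0.0 <= self.kv_budget_fraction <= 1.0:
            raise ValueError("kv_budget_fraction must be in [0, 1]")
        if self.speculation_depth < 0:
            raise ValueError("speculation_depth must be >= 0")
        if not 0.0 <= self.prefill_decode_split <= 1.0:
            raise ValueError("prefill_decode_split must be in [0, 1]")


# Example action a policy might emit for one control step.
action = ServeAction(
    batch_cap=32,
    kv_budget_fraction=0.8,
    speculation_depth=4,
    quantization_tier=QuantizationTier.INT8,
    prefill_decode_split=0.5,
    priority_routing=True,
)
```

Validating in the constructor keeps malformed actions from reaching the simulator, which matters when an automated grader or an LLM baseline is the one producing them.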
At a glance
Hackathon: Meta × PyTorch × HF · OpenEnv
Prize pool: $30,000
Tasks: Static / Bursty / Adversarial
PPO vs heuristic (static): 0.55 vs 0.30
PPO vs heuristic (adversarial): 0.38 vs 0.20
Runtime: Docker Space on Hugging Face
Tech decisions
OpenEnv typed Action / Observation contract
Round-1 grading is automated against the OpenEnv spec, so typed schemas plus reset / step / grader / baseline endpoints are mandatory for submission.
CPU-only deterministic simulator
Keeps training, grading, and reproducibility cheap and ships the environment as a small Docker image to a Hugging Face Space.
Per-step shaped reward (throughput + SLO + memory + cost)
Pure terminal rewards collapse on bursty/adversarial workloads; per-step shaping gives the policy gradient a stable signal.
PPO as the reference RL agent
Sample-efficient on the small simulator and a known-good baseline; the goal was to demonstrate that RL beats heuristics, not to chase SOTA.
OpenAI baseline path via in-process runtime
Submission requires the official OpenAI client; running the baseline in-process means the server never has to call itself over HTTP, which stays robust on Docker Spaces.
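The per-step shaped reward listed among the tech decisions (throughput + SLO + memory + cost) could be sketched as a weighted sum of normalized terms. This is only an illustration: the StepMetrics container, the weights, the normalization constants, and the 0.9 KV-occupancy threshold are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepMetrics:
    """Hypothetical per-step metrics; names loosely mirror the observation fields."""

    throughput_tps: float   # tokens per second served this step
    slo_compliance: float   # fraction of requests meeting their SLO, in [0, 1]
    kv_occupancy: float     # fraction of KV-cache in use, in [0, 1]
    cost_per_1k: float      # serving cost per 1k units, as reported by the simulator


def shaped_reward(
    m: StepMetrics,
    *,
    max_throughput_tps: float = 1000.0,   # assumed normalization constant
    max_cost_per_1k: float = 1.0,         # assumed normalization constant
    kv_soft_limit: float = 0.9,           # assumed occupancy threshold before penalty
    weights: tuple[float, float, float, float] = (0.4, 0.3, 0.15, 0.15),  # assumed
) -> float:
    """Dense reward: reward throughput and SLO compliance, penalize memory pressure and cost."""
    w_tp, w_slo, w_mem, w_cost = weights

    # Normalize throughput and cost into [0, 1] so the weights are comparable.
    tp_term = min(m.throughput_tps / max_throughput_tps, 1.0)
    cost_term = min(m.cost_per_1k / max_cost_per_1k, 1.0)

    # Memory is only penalized once occupancy exceeds the soft limit.
    overshoot = max(m.kv_occupancy - kv_soft_limit, 0.0)
    mem_term = min(overshoot / (1.0 - kv_soft_limit), 1.0)

    return w_tp * tp_term + w_slo * m.slo_compliance - w_mem * mem_term - w_cost * cost_term


# A healthy step and an overloaded step; the healthy one should score higher.
healthy = StepMetrics(throughput_tps=500.0, slo_compliance=1.0, kv_occupancy=0.5, cost_per_1k=0.5)
overloaded = StepMetrics(throughput_tps=500.0, slo_compliance=0.6, kv_occupancy=0.98, cost_per_1k=0.5)
healthy_reward = shaped_reward(healthy)
overloaded_reward = shaped_reward(overloaded)
```

Because every step returns a bounded, informative value, the policy gradient gets feedback during a burst rather than only at episode end, which is the motivation given for shaping over a pure terminal reward.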