Meta × PyTorch × Hugging Face Hackathon · ML · AI/ML

InferenceGym

OpenEnv RL Environment for LLM Inference Serving

Problem

LLM serving systems make many interdependent decisions per second - batching, KV-cache budget, speculative decoding depth, quantization tier, prefill/decode disaggregation, priority routing - under non-stationary workloads. Hand-coded policies (Orca, vLLM-style heuristics) plateau because the trade-offs shift with every traffic pattern. The Meta × PyTorch × Hugging Face hackathon asked for a true RL environment that exposes this as a learnable control problem.

Approach

An OpenEnv-compliant environment that models the serving control loop. The action space is a typed ServeAction (batch_cap, kv_budget_fraction, speculation_depth, quantization_tier, prefill_decode_split, priority_routing). Observations surface queue depth, p50/p99 TTFT, inter-token latency, throughput, KV occupancy, GPU memory, SLO compliance, and cost-per-1k. The reward is shaped per step over throughput + SLO + memory + cost. There are three deterministic tasks (static / bursty / adversarial-multitenant). A reference PPO agent is trained on CPU against the simulator; an OpenAI baseline path is wired in for submission compliance.
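
To make the contract concrete, here is a minimal Python sketch of what a typed action, an observation, and a per-step shaped reward could look like. Only the six ServeAction field names come from the description above; the field types, value ranges, observation field names, units, reward weights, and normalisation constants are all assumptions for illustration, not the project's actual schema or reward function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServeAction:
    # Field names follow the project description; types and ranges are assumed.
    batch_cap: int
    kv_budget_fraction: float
    speculation_depth: int
    quantization_tier: int
    prefill_decode_split: float
    priority_routing: bool

    def __post_init__(self):
        # Reject out-of-range actions early (ranges are illustrative).
        if self.batch_cap < 1:
            raise ValueError("batch_cap must be >= 1")
        if not 0.0 <= self.kv_budget_fraction <= 1.0:
            raise ValueError("kv_budget_fraction must be in [0, 1]")
        if self.speculation_depth < 0:
            raise ValueError("speculation_depth must be >= 0")
        if not 0.0 <= self.prefill_decode_split <= 1.0:
            raise ValueError("prefill_decode_split must be in [0, 1]")


@dataclass(frozen=True)
class ServeObservation:
    # Hypothetical names and units for the metrics listed in the description.
    queue_depth: int
    ttft_p50_ms: float
    ttft_p99_ms: float
    inter_token_latency_ms: float
    throughput_tps: float
    kv_occupancy: float          # fraction of KV cache in use, 0..1
    gpu_memory_fraction: float   # fraction of GPU memory in use, 0..1
    slo_compliance: float        # fraction of requests meeting the SLO, 0..1
    cost_per_1k: float


def shaped_reward(
    obs: ServeObservation,
    *,
    w_throughput: float = 0.4,   # assumed weights, not the project's values
    w_slo: float = 0.4,
    w_memory: float = 0.1,
    w_cost: float = 0.1,
    max_throughput_tps: float = 1000.0,
    max_cost_per_1k: float = 1.0,
    memory_soft_limit: float = 0.9,
) -> float:
    """Weighted sum of throughput and SLO terms minus memory and cost penalties."""
    throughput_term = min(obs.throughput_tps / max_throughput_tps, 1.0)
    slo_term = obs.slo_compliance
    # Penalise only the portion of memory use above the soft limit.
    over = max(obs.gpu_memory_fraction - memory_soft_limit, 0.0)
    memory_penalty = over / (1.0 - memory_soft_limit)
    cost_penalty = min(obs.cost_per_1k / max_cost_per_1k, 1.0)
    return (
        w_throughput * throughput_term
        + w_slo * slo_term
        - w_memory * memory_penalty
        - w_cost * cost_penalty
    )
```

Validating ranges in the action constructor means a policy that emits an out-of-range value fails at the boundary instead of silently corrupting the simulator state.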

At a glance

Hackathon: Meta × PyTorch × HF · OpenEnv

Prize pool: $30,000

Tasks: Static / Bursty / Adversarial

PPO vs heuristic (static): 0.55 vs 0.30

PPO vs heuristic (adversarial): 0.38 vs 0.20

Runtime: Docker Space on Hugging Face

Tech decisions

  • OpenEnv typed Action / Observation contract

    Round-1 grading is automated against the OpenEnv spec - typed schemas plus reset / step / grader / baseline endpoints are non-optional for submission.

  • CPU-only deterministic simulator

    Keeps training and grading cheap and reproducible, and lets the environment ship as a small Docker image to a Hugging Face Space.

  • Per-step shaped reward (throughput + SLO + memory + cost)

    Pure terminal rewards collapse on bursty/adversarial workloads; per-step shaping gives the policy gradient a stable signal.

  • PPO as the reference RL agent

    Sample-efficient on the small simulator and a known-good baseline - the goal was to demonstrate that RL beats heuristics, not to chase SOTA.

  • OpenAI baseline path via in-process runtime

    Submission requires the official OpenAI client; running the baseline in-process means the server does not have to call itself over HTTP, which keeps it robust on Docker Spaces.
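
The determinism and per-step reward decisions above can be illustrated with a toy example. The sketch below is a hypothetical single-queue simulator written only with the Python standard library; the class name, arrival model, burst pattern, and reward formula are assumptions made for illustration and are not taken from InferenceGym. It shows the general pattern: all randomness flows from one seeded generator created in reset, and every step returns a reward.

```python
import random
from dataclasses import dataclass


@dataclass
class StepResult:
    queue_depth: int
    served: int
    reward: float
    done: bool


class ToyServingSim:
    """Hypothetical single-queue simulator; not the InferenceGym implementation."""

    def __init__(self, task: str = "static", horizon: int = 50):
        self.task = task
        self.horizon = horizon
        self._rng = random.Random(0)
        self._t = 0
        self._queue = 0

    def reset(self, seed: int = 0) -> int:
        # A private, seeded RNG makes every episode reproducible on CPU.
        self._rng = random.Random(seed)
        self._t = 0
        self._queue = 0
        return self._queue

    def _arrivals(self) -> int:
        base = self._rng.randint(2, 6)
        # Assumed burst pattern: 4x traffic for 2 out of every 10 steps.
        if self.task == "bursty" and self._t % 10 < 2:
            base *= 4
        return base

    def step(self, batch_cap: int) -> StepResult:
        if batch_cap < 1:
            raise ValueError("batch_cap must be >= 1")
        self._queue += self._arrivals()
        served = min(self._queue, batch_cap)
        self._queue -= served
        self._t += 1
        # Per-step signal: batch utilisation minus a small backlog penalty.
        reward = served / batch_cap - 0.01 * self._queue
        return StepResult(self._queue, served, reward, self._t >= self.horizon)


def rollout(task: str, seed: int, batch_cap: int = 8) -> list:
    """Run one episode with a fixed batch_cap and collect per-step rewards."""
    sim = ToyServingSim(task)
    sim.reset(seed)
    rewards = []
    done = False
    while not done:
        result = sim.step(batch_cap)
        rewards.append(result.reward)
        done = result.done
    return rewards
```

Calling rollout twice with the same task and seed yields identical reward sequences, which is the property an automated grader needs in order to score submissions repeatably.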

Stack

Python · PyTorch · PPO · OpenEnv · FastAPI · Docker · Hugging Face Spaces · OpenAI