ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving
Abstract
ELDR is an expert-locality-aware decode router for prefill-decode disaggregated Mixture-of-Experts serving that reduces decode latency by predicting each request's expert activations and routing requests accordingly.
In prefill-decode (PD) disaggregated LLM serving, each request is assigned to a decode worker after prefill. Existing decode routers balance only load; for mixture-of-experts (MoE) models this is incomplete: equally loaded workers can differ in latency, since each decode step loads the weights of every distinct expert its batch activates. We present ELDR, an expert-locality-aware decode router for PD-disaggregated MoE serving. From a request's prefill expert activations, ELDR builds an expert signature predicting the experts it will activate during generation. Offline, balanced K-means partitions signature space across decode workers; online, locality-band routing sends each request to the least-loaded worker among those best matching its signature. A signature cache, co-indexed with the KV cache at KV-block granularity, keeps signatures exact under prefix caching. Implemented in vLLM and evaluated on deployments of up to 40 GPUs, ELDR reduces median TPOT by 5.9-13.9% over the strongest of four load-balancing baselines across three MoE models and two workloads, with model outputs unchanged.
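The online routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the histogram signature, the cosine similarity, the `band` threshold, and all function names are assumptions chosen for illustration, and the per-worker centroids are hard-coded here rather than produced by the offline balanced K-means step.

```python
import math
from collections import Counter
from typing import List, Sequence


def expert_signature(prefill_activations: Sequence[int], num_experts: int) -> List[float]:
    """Assumed signature: normalized histogram of expert IDs seen during prefill."""
    counts = Counter(prefill_activations)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts.get(e, 0) / total for e in range(num_experts)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def route(signature: Sequence[float],
          centroids: Sequence[Sequence[float]],
          loads: Sequence[int],
          band: float = 0.1) -> int:
    """Pick the least-loaded worker among those whose centroid similarity
    is within `band` of the best match (hypothetical locality band)."""
    sims = [cosine(signature, c) for c in centroids]
    best = max(sims)
    candidates = [w for w, s in enumerate(sims) if s >= best - band]
    # Break load ties in favor of the closer centroid.
    return min(candidates, key=lambda w: (loads[w], -sims[w]))


# Toy setup: 4 experts, 3 decode workers with made-up centroids and loads.
centroids = [
    [0.5, 0.5, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
loads = [5, 2, 0]
sig = expert_signature([0, 1, 0, 1, 1], num_experts=4)

narrow = route(sig, centroids, loads, band=0.05)  # only well-matched workers compete
wide = route(sig, centroids, loads, band=1.0)     # degenerates to pure load balancing
```

In this toy setup a narrow band keeps the request on a worker whose centroid matches its signature even though another worker is idle, while a band of 1.0 ignores locality and picks the idle worker. The band width is therefore the knob that trades locality against load balance in this sketch.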
Community
This paper optimizes PD-disaggregated LLM serving by proposing a new decode routing algorithm that reduces the number of distinct experts each decode batch activates.
In PD-disaggregated MoE serving, the choice of decode worker is a real source of latency that most routers ignore by balancing only load. ELDR's approach of using expert signatures from the prefill stage to predict future activations is a pragmatic way to handle the weight-loading latency inherent in MoE architectures. By routing requests on expert locality in addition to load, it reduces the number of distinct expert weights each decode worker has to load per step. This is the kind of engineering-grounded optimization that can lower per-token latency in production. I'm curious to see how the K-means partitioning holds up under highly dynamic request distributions where expert activation patterns shift rapidly.