arxiv:2607.00466

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

Published on Jul 1

Submitted by Sukmin Cho on Jul 2

Microsoft Research

Abstract

ELDR is an expert-locality-aware decode router for prefill-decode disaggregated Mixture-of-Experts serving that lowers decode latency by predicting each request's expert activations from its prefill and routing requests to decode workers accordingly.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

In prefill-decode (PD) disaggregated LLM serving, each request is assigned to a decode worker after prefill. Existing decode routers balance only load; for mixture-of-experts (MoE) models this is incomplete: equally loaded workers can differ in latency, since each decode step loads the weights of every distinct expert its batch activates. We present ELDR, an expert-locality-aware decode router for PD-disaggregated MoE serving. From a request's prefill expert activations, ELDR builds an expert signature predicting the experts it will activate during generation. Offline, balanced K-means partitions signature space across decode workers; online, locality-band routing sends each request to the least-loaded worker among those best matching its signature. A signature cache, co-indexed with the KV cache at KV-block granularity, keeps signatures exact under prefix caching. Implemented in vLLM and evaluated on deployments of up to 40 GPUs, ELDR reduces median TPOT by 5.9-13.9% over the strongest of four load-balancing baselines across three MoE models and two workloads, with model outputs unchanged.
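The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the signature or the band, so here the signature is assumed to be a normalized histogram of prefill expert IDs, the per-worker centroids are assumed to come from the offline balanced K-means step (not shown), and `band` is a hypothetical distance tolerance. The function names are illustrative only.

```python
from collections import Counter


def expert_signature(prefill_expert_ids, num_experts):
    """Normalized histogram of expert IDs seen during prefill.

    Assumption: the paper's actual signature may differ (for example,
    it could be computed per layer or weighted by gate scores).
    """
    if not prefill_expert_ids:
        raise ValueError("need at least one prefill activation")
    counts = Counter(prefill_expert_ids)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def locality_band_route(signature, centroids, loads, band=0.1):
    """Return the least-loaded worker among the best signature matches.

    centroids[w] is the (assumed precomputed) cluster center for worker w,
    loads[w] is its current load, and `band` is a hypothetical tolerance:
    any worker within `band` of the best distance counts as a match.
    """
    dists = [squared_distance(signature, c) for c in centroids]
    best = min(dists)
    candidates = [w for w, d in enumerate(dists) if d <= best + band]
    # Least load first; ties go to the closer centroid.
    return min(candidates, key=lambda w: (loads[w], dists[w]))


# Toy setup: 4 experts, 3 decode workers.
centroids = [
    [0.5, 0.5, 0.0, 0.0],  # worker 0: experts 0 and 1
    [0.4, 0.4, 0.2, 0.0],  # worker 1: mostly experts 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # worker 2: experts 2 and 3
]
loads = [8, 3, 0]

sig = expert_signature([0, 1, 0, 1, 0, 1], num_experts=4)
chosen = locality_band_route(sig, centroids, loads, band=0.1)
strict = locality_band_route(sig, centroids, loads, band=0.0)
```

In this toy setup a pure load balancer would pick worker 2, which is idle but serves requests that use a disjoint set of experts. With `band=0.1` the sketch picks worker 1, the less loaded of the two close matches, and with `band=0.0` it falls back to the single nearest centroid, worker 0. The band width therefore trades locality against load balance.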

Community

zomss

Paper submitter about 13 hours ago

This paper optimizes PD-disaggregated LLM serving by proposing a new decode routing algorithm that reduces the number of distinct experts each decode batch activates.

O96a

about 8 hours ago

Decode-worker assignment in PD-disaggregated MoE serving is a latency factor that most routers overlook by balancing load alone. ELDR's approach of using expert signatures from the prefill stage to predict future activations is a pragmatic way to handle the weight-loading latency inherent in MoE architectures. By partitioning requests based on expert locality rather than just load, you're effectively optimizing for the memory bandwidth and cache hits of the decode workers. This is the kind of engineering-grounded optimization that actually moves the needle on tokens-per-second in production. I'm curious to see how the K-means partitioning holds up under highly dynamic request distributions where expert activation patterns shift rapidly.