WARP: An Efficient Engine for Multi-Vector Retrieval

Jan Luca Scheerer - ETH Zurich
Matei Zaharia - UC Berkeley & Databricks
Christopher Potts - Stanford University
Gustavo Alonso - ETH Zurich
Omar Khattab - Databricks

DOI: https://doi.org/10.1145/3726302.3729904

Multi-vector retrieval methods such as ColBERT and its recent variant, the ConteXtualized Token Retriever (XTR), offer high accuracy but face efficiency challenges at scale. To address this, we present WARP, a retrieval engine that substantially improves the efficiency of retrievers trained with the XTR objective through three key innovations: (1) WARPSELECT for dynamic similarity imputation; (2) implicit decompression, which avoids costly vector reconstruction during retrieval; and (3) a two-stage reduction process for efficient score aggregation. Combined with highly optimized C++ kernels, our system reduces end-to-end latency by 41x compared to XTR's reference implementation and achieves a 3x speedup over the ColBERTv2/PLAID engine, while preserving retrieval quality. WARP also reduces index sizes by a factor of 2x to 4x compared to XTR, enabling deployment on memory-constrained devices.
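
As a rough illustration of how score aggregation with imputed similarities can work, the following is a minimal sketch in plain Python. It is not WARP's C++ implementation: the function name `two_stage_reduce`, the triple layout of `hits`, and the `missing_estimates` list (one imputed score per query token, assumed to be supplied by a selection step such as WARPSELECT) are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def two_stage_reduce(hits, missing_estimates):
    """Aggregate token-level scores into a document ranking (illustrative only).

    hits: iterable of (query_token_index, doc_id, score) triples.
    missing_estimates: one imputed score per query token, used when a
        document has no retrieved token for that query token (assumed input).
    """
    # Stage 1 (token level): keep the maximum score per (document, query token).
    best = defaultdict(dict)
    for q, doc, score in hits:
        prev = best[doc].get(q)
        if prev is None or score > prev:
            best[doc][q] = score

    # Stage 2 (document level): sum over query tokens, imputing missing ones.
    totals = {}
    for doc, per_token in best.items():
        totals[doc] = sum(
            per_token.get(q, estimate)
            for q, estimate in enumerate(missing_estimates)
        )

    # Highest aggregated score first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

For example, with `hits = [(0, "d1", 0.75), (0, "d1", 0.5), (1, "d1", 0.5), (0, "d2", 1.0)]` and `missing_estimates = [0.25, 0.125]`, document `d1` scores 0.75 + 0.5 and document `d2` scores 1.0 plus the imputed 0.125 for the query token it never matched, so `d1` ranks first.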

Presented by:

Jan Luca Scheerer
ETH Zurich