Description
I’m interested in extending the LIMIT benchmark to explore how multilingual and local-language encoding affects the observed theoretical limitations of single-vector embedding-based retrieval. While LIMIT convincingly demonstrates that these limitations are language-agnostic and dimension-dependent, it would be valuable to empirically analyze whether encoding queries and documents in their native or local languages changes the failure structure, retrieval dynamics, or error distribution across languages.
Specifically, I’d like to investigate:
(1) multilingual variants of the LIMIT dataset,
(2) language-conditioned or language-aware embedding pipelines,
(3) cross-lingual versus monolingual retrieval settings, while keeping the theoretical guarantees intact.
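To make the cross-lingual vs. monolingual comparison in (3) concrete, a minimal evaluation harness could rank documents by cosine similarity and report recall@k per language setting. This is only a sketch: the embedding vectors below are hand-written placeholders standing in for the output of a multilingual encoder (which encoder, and at what dimension, are open choices, not part of LIMIT itself).

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recall_at_k(query_vec, doc_vecs, relevant_ids, k=2):
    """Rank documents by similarity to the query; return recall@k."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    hits = sum(1 for i in ranked[:k] if i in relevant_ids)
    return hits / len(relevant_ids)

# Toy example: one query, three documents; docs 0 and 2 are relevant.
# In a real run, doc 2 might be the same content in another language,
# letting one compare monolingual vs. cross-lingual recall curves.
query = [1.0, 0.0, 1.0]
docs = [[1.0, 0.1, 0.9],   # relevant, same language as the query
        [0.0, 1.0, 0.0],   # irrelevant
        [0.9, 0.2, 1.0]]   # relevant, e.g. a translated document
print(recall_at_k(query, docs, relevant_ids={0, 2}, k=2))  # → 1.0
```

Running the same harness once with monolingual document embeddings and once with translated (or natively local-language) ones would surface whether the failure structure shifts, without altering the underlying theoretical setup.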
My goal is not to circumvent the LIMIT result, but to better understand how these theoretical constraints manifest in multilingual conversational and retrieval systems. I’d appreciate feedback on whether such an extension aligns with the project’s direction, and any guidance on best practices for contributing these experiments.