This repository contains the artifact for VectorLiteRAG, including the full implementation, experiment scripts, preprocessing pipelines, and plotting utilities required to reproduce the evaluation results reported in the paper. The artifact supports both end-to-end evaluation runs and fine-grained, individual executions for profiling and ablation studies.
The artifact has been tested with Anaconda / Miniconda. A complete environment specification is provided in scripts/env.yml.
conda env create -n vlite -f scripts/env.yml
conda activate vlite
After activating the environment, build the required native component.
git submodule update --init
./scripts/build.sh
This step compiles a modified FAISS (based on version 1.9.0) with GPU support and links the required libraries, including Intel MKL.
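A quick sanity check that the build succeeded; this is a minimal sketch, assuming the custom FAISS Python bindings were installed into the active environment by build.sh:

```python
# Sanity check for the custom FAISS build; assumes the Python bindings
# were installed into the active conda environment.
import faiss

print(faiss.__version__)      # expect a 1.9.0-based version string
print(faiss.get_num_gpus())   # should report at least one visible GPU
```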
HuggingFace models are downloaded automatically at runtime. You must manually specify a local cache directory to avoid repeated downloads.
Edit vliterag/engine.py and, at the top of the file, set:
model_cache = "/path/to/your/model/cache"
Ensure the directory has sufficient disk space; models of up to 70B parameters are downloaded.
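For reference, here is a minimal sketch of how such a cache directory is typically plumbed into Hugging Face model loads; the actual wiring inside vliterag/engine.py may differ:

```python
# Illustrative only -- the real engine.py may wire the cache differently.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_cache = "/path/to/your/model/cache"

def load_model(model_name: str):
    # cache_dir pins downloaded weights to one location, so repeated runs
    # reuse the multi-GB checkpoints instead of re-downloading them.
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=model_cache)
    model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=model_cache)
    return tokenizer, model
```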
Download a dataset using:
./database/download.sh <dataset>
Supported datasets are:
- wikiall
- orcas1k
- orcas2k
To save time, we provide preprocessed artifacts (encoded/compressed query vectors and a pre-built IVFPQ index) for each dataset. These are available in the “_pp” variants of the dataset names (e.g., wikiall_pp).
Download the preprocessed package with:
./database/download.sh wikiall_pp
This will fetch:
- Pre-encoded query vectors, and
- A pre-indexed ivfpq.index (IVFPQ).
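If you want to inspect the prebuilt index, it can be opened with FAISS's standard reader; the path below is an assumption, so check database/download.sh for the actual layout:

```python
import faiss

# Path is illustrative -- see database/download.sh for where the
# preprocessed package is actually placed.
index = faiss.read_index("database/wikiall_pp/ivfpq.index")
print(index.ntotal, "vectors of dimension", index.d)
```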
Important: For a full run, you must still execute train.sh for each dataset to build the FastScan-specific indexes used by our pipeline:
./scripts/train.sh wikiall
(Repeat for orcas1k, orcas2k, etc.; use the matching dataset name.)
For datasets requiring preprocessing (e.g., ORCAS benchmarks):
./database/encode.sh <dataset>
./scripts/train.sh <dataset>
This step performs:
- Document chunking
- Vector encoding
- IVF index training
Note: full preprocessing can take 40 to 60 hours and requires 1.5 to 2 TB of storage and system memory.
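For intuition, the index training performed here corresponds roughly to the following FAISS calls; the factory strings, dimension, and training-set size are illustrative, not the artifact's actual parameters:

```python
import faiss
import numpy as np

d = 768                                            # embedding dimension (illustrative)
xt = np.random.rand(100_000, d).astype("float32")  # stand-in for encoded chunk vectors

# Base IVFPQ index, the kind shipped in the "_pp" packages.
ivfpq = faiss.index_factory(d, "IVF1024,PQ64")
ivfpq.train(xt)

# FastScan variant (4-bit PQ codes laid out for SIMD scanning),
# the kind of index train.sh builds for the pipeline.
fastscan = faiss.index_factory(d, "IVF1024,PQ64x4fs")
fastscan.train(xt)
```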
A lightweight test run can be executed before full evaluation; its outputs are discarded immediately after execution:
./scripts/runall_l40s.sh test
This performs a small CPU-based run on Wiki-All to verify correctness.
Run the full evaluation with:
./scripts/runall_l40s.sh <main|inout|dispatcher|slo|ngpu>
./scripts/runall_h100.sh <main|inout|slo|ngpu>
Available options:
- main: evaluation sweep for Figures 10 and 11
- inout: input/output length ablation
- dispatcher: dispatcher on/off ablation
- slo: SLO-level ablation
- ngpu: GPU-count ablation
Evaluation outputs are stored under VectorLiteRAG/results/.
Structure:
results/<index>/<model>/<ngpus>gpus/<mode>/
├── raw/      # request-level parquet files
└── summary/  # aggregated CSV summaries
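The request-level parquet files can be inspected with pandas; the path below is a placeholder and the schema is an assumption, since it is not documented here:

```python
import pandas as pd
from pathlib import Path

# Path and schema are assumptions -- adjust to your actual run.
raw_dir = Path("results/orcas2k/llama8b/4gpus/vlite/raw")
df = pd.concat(pd.read_parquet(p) for p in raw_dir.glob("*.parquet"))
print(df.describe())  # quick look at request-level statistics
```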
All plotting scripts are centralized under:
analysis/plot.py
Use the provided wrapper script after running evaluations:
./scripts/plotall.sh <args>
Examples:
./scripts/plotall.sh all # generate all figures
./scripts/plotall.sh main # generate main figures only
./scripts/plotall.sh 14 # generate Figure 14 only
Generated figures are saved to VectorLiteRAG/figures/.
Advanced users can invoke runs directly.
Profiling mode (is_profiling = True) generates the latency models and partitioned indexes:
python main.py \
--model llama8b \
--index wikiall \
--is_profiling
Example single run:
python main.py \
--model llama8b \
--index orcas2k \
--search_mode vlite \
--arrival_rate 32 \
--input_len 1024 \
--output_len 256
Sweep mode:
python main.py \
--model llama8b \
--index orcas2k \
--search_mode all \
--sweep
Notes:
- Results may vary slightly depending on hardware conditions.
- Large-scale preprocessing is expected to be time- and storage-intensive.
- All scripts are designed to run from the project root directory.
- Configuration flexibility is currently limited; some parameters must be modified directly in the JSON files under VectorLiteRAG/configs.
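As a sketch, such a parameter can be changed programmatically; the file name and field names below are hypothetical, so consult the actual files under VectorLiteRAG/configs for the real schema:

```python
import json
from pathlib import Path

# Hypothetical file and keys -- check VectorLiteRAG/configs for the real schema.
cfg_path = Path("configs/example.json")
cfg = json.loads(cfg_path.read_text())
cfg["arrival_rate"] = 32  # assumed parameter name, mirroring the CLI flag
cfg_path.write_text(json.dumps(cfg, indent=2))
```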