Generalizing Query Performance Prediction under Retriever and Concept Shifts via Data-driven Correction
We recommend running all components in a Linux environment. To set up the environment on a new server or machine, simply run:

```bash
bash setup.sh
```

Query performance prediction requires the following inputs: queries, the corpus, a BM25 index, an ANCE FAISS index, retrieval result files, and actual performance files.
You can download the MS MARCO passage corpus using the commands below. The dataset contains approximately 8.8M passages. The compressed file is 1.0 GB, and the extracted size is 2.9 GB.
```bash
mkdir -p datasets/collections/msmarco-passage
wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz -P datasets/collections/msmarco-passage
tar xvfz datasets/collections/msmarco-passage/collection.tar.gz -C datasets/collections/msmarco-passage
```

We convert the original corpus from TSV format to JSONL format. The raw query and qrels files are also converted to JSONL.
The corpus in TSV format is split into 9 JSONL files.
This step requires approximately 3.2 GB of disk space.
```bash
python convert_collection_to_jsonl.py \
  --collection-path datasets/collections/msmarco-passage/collection.tsv \
  --output-folder datasets/collections/msmarco-passage/collection_jsonl
```

We convert the query files from the TREC datasets (msmarcodev, DL2019, DL2020, and DLHard) into JSONL format.
```bash
python data_load.py --path_raw ./datasets/TREC --dataset_list msmarcotrain msmarcodev DL2019 DL2020 DLHard
```

We perform two types of indexing on the corpus before running the QPP models.
First, we build a Lucene index using Pyserini for language model-based QPP models.
Next, for dense retrieval, we generate FAISS indexes of the corpus embeddings using ANCE.
You can adjust the number of threads based on your system environment.
The resulting Lucene index requires approximately 4.2 GB of disk space.
```bash
python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input datasets/collections/msmarco-passage/collection_jsonl \
  --index datasets/collections/lucene-index-msmarco-passage \
  --generator DefaultLuceneDocumentGenerator \
  --threads 9 \
  --storePositions --storeDocvectors --storeRaw
```

We embed the corpus using ANCE and store the result as a FAISS index.
This process requires approximately 52 GB of disk space.
```bash
python corpus_index.py --base_model ance
```

We generate retrieval results for all combinations of retrievers (BM25, ANCE) and datasets (msmarcotrain, msmarcodev, DL2019, DL2020, DLHard).
```bash
python retrieval.py \
  --base_model_list ance \
  --dataset_list msmarcotrain msmarcodev DL2019 DL2020 DLHard

python bm25.py --base_model_list bm25 --dataset_list msmarcotrain msmarcodev DL2019 DL2020 DLHard
```

We compute target metrics such as nDCG and MRR using the retrieval results and the corresponding qrels. These metric values will later serve as ground-truth labels for supervised QPP training.
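For reference, the two target metrics can be sketched with stdlib Python. This is a simplified sketch, not the repository's evaluation script; the function names and the exponential-gain nDCG formulation are assumptions, and standard TREC tooling may differ in details such as tie handling:

```python
import math

def mrr_at_k(ranked_docids, relevant, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, docid in enumerate(ranked_docids[:k], start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_docids, rel_grades, k=10):
    """nDCG@k with graded relevance (rel_grades maps docid -> grade)."""
    dcg = sum((2 ** rel_grades.get(d, 0) - 1) / math.log2(r + 1)
              for r, d in enumerate(ranked_docids[:k], start=1))
    ideal = sorted(rel_grades.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / math.log2(r + 1)
               for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```
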
```bash
python evaluation_retrieval.py \
  --base_model_list bm25 dpr ance \
  --dataset_list msmarcotrain msmarcodev DL2019 DL2020 DLHard
```

Next, run post-retrieval QPP methods (e.g., Clarity and NQC) on the retrieval results.
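For intuition, NQC (Normalized Query Commitment) predicts higher effectiveness when the top-k retrieval scores vary more. A simplified sketch is shown below; the repository's implementation may differ, for example in how the corpus-level score is obtained:

```python
import math

def nqc(topk_scores, corpus_score):
    """Normalized Query Commitment: population standard deviation of the
    top-k retrieval scores, normalized by the query's corpus-level score."""
    k = len(topk_scores)
    mean = sum(topk_scores) / k
    std = math.sqrt(sum((s - mean) ** 2 for s in topk_scores) / k)
    return std / corpus_score
```
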
```bash
python unsupervisedQPP/post_retrieval.py \
  --base_model_list bm25 ance \
  --dataset_list msmarcotrain msmarcodev DL2019 DL2020 DLHard
```

Compute correlations (e.g., Pearson and Kendall) between predicted and actual performance.
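The two correlation measures can be sketched with stdlib Python. Note this is the tau-a variant of Kendall's coefficient; library implementations such as SciPy's default to tau-b with tie corrections, so values can differ when ties are present:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted and actual performance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```
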
```bash
python evaluation_QPP.py \
  --base_model_list bm25 ance \
  --dataset_list msmarcodev DL2019 DL2020 DLHard
```

Train the multi-label classification (MLC) model on the msmarcotrain dataset:
```bash
python supervisedQPP/QPP_MLC/main.py \
  --name QPP_MLC \
  --mode normal \
  --base_model bm25 \
  --dataset msmarcotrain \
  --dataset_list msmarcodev DL2019 DL2020 DLHard \
  --batch_size 16 \
  --lr 2e-5 \
  --top_k 10 \
  --top_m 10 \
  --embed_model bert_cross \
  --trans_nhead 8 \
  --trans_num_layers 1 \
  --class_weight one \
  --posi_weight 1.0 \
  --err True \
  --threshold 0.5 \
  --action training
```

Predict query performance on new datasets using the trained MLC model:
```bash
python supervisedQPP/QPP_MLC/main.py \
  --name QPP_MLC \
  --mode normal \
  --base_model bm25 \
  --dataset msmarcotrain \
  --base_model_list bm25 ance \
  --dataset_list msmarcodev DL2019 DL2020 DLHard \
  --batch_size 16 \
  --lr 2e-5 \
  --top_k 10 \
  --top_m 10 \
  --embed_model bert_cross \
  --trans_nhead 8 \
  --trans_num_layers 1 \
  --class_weight one \
  --posi_weight 1.0 \
  --err True \
  --threshold 0.5 \
  --action inference
```

Generate embedding and metadata files from the trained QPP-MLC model.
The output will be stored in: ./supervisedQPP/QPP_MLC/checkpoint/data_{base_model}_{dataset}QPP_MLC{target_metric}
```bash
python supervisedQPP/QPP_MLC/thres.py --action embed
```

Compute optimal thresholds for binarizing predicted relevance scores. The output will be saved in: ./supervisedQPP/QPP_MLC/checkpoint/thres_{base_model}_{dataset}QPP_MLC{target_metric}
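The criterion thres.py optimizes is not spelled out here; one common choice is to pick the threshold that maximizes F1 on validation data, which the sketch below assumes. The function name and the candidate-threshold scheme are hypothetical, not the repository's API:

```python
def best_threshold(scores, labels, candidates=None):
    """Pick the score threshold that maximizes F1 on validation data.

    scores: predicted relevance scores; labels: 0/1 ground truth.
    Assumption: 'optimal' means F1-maximizing, which may not match thres.py.
    """
    if candidates is None:
        candidates = sorted(set(scores))
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```
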
```bash
python supervisedQPP/QPP_MLC/thres.py --action gener_thres
```

Evaluate QPP-MLC-b predictions using the computed thresholds and generate result tables.
The output will be saved in: ./output/
```bash
python supervisedQPP/QPP_MLC/thres.py --action gener_result
```