This repository contains the source code to generate the dataset and results presented in our CIKM'25 publication "AOL4FOLTR: A Large-Scale Web Search Dataset for Federated Online Learning to Rank" [PDF].
AOL4FOLTR is a dataset specifically tailored for Federated Online Learning-to-Rank (FOLTR). It contains raw search queries, document contents, user IDs, and timestamps, based on AOL-IA and, originally, the 2006 AOL query logs. Furthermore, we generated top-20 result lists for each query and designed 103 features to enable learning-to-rank.
- Download Dataset
- Generate Dataset
- How to Use
- Reproduce Experimental Results
- Learning-to-Rank Feature List
- Acknowledgments
- Cite
This section provides instructions on how to generate the AOL4FOLTR dataset. Please be aware that, due to the size of this dataset, this process requires substantial CPU compute time. Access to a computing cluster is highly recommended. For this tutorial, we assume access to a SLURM cluster.
This setup assumes the AOL-IA dataset is fully downloaded and located in ~/.ir_datasets/aol-ia.
If this is not the case, please follow the instructions in the aolia-tools repository. (estimated time: 2 days)
Install project dependencies in an environment with Python 3.10.
conda create -n pyserini python=3.10
conda activate pyserini
make install
Our scripts also require Java (JDK) 21. Make sure to download the right distribution for your OS.
cd ~
wget https://download.oracle.com/java/21/archive/jdk-21.0.6_linux-x64_bin.tar.gz
tar -xvzf jdk-21.0.6_linux-x64_bin.tar.gz
rm jdk-21.0.6_linux-x64_bin.tar.gz
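Depending on your shell configuration, you may also need to make the JDK discoverable; the directory name below is an assumption based on the archive above:

```sh
export JAVA_HOME=~/jdk-21.0.6
export PATH=$JAVA_HOME/bin:$PATH
```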
The final dataset consists of two files:

- metadata.csv (~1 GB)
- letor.txt (~55 GB)

Generating it additionally produces two intermediate artifacts: indexes/ (~7 GB) and ctrs.lmdb (~28 GB).
- Indexing: Creates an index of the entire AOL-IA corpus. Output goes to a new folder indexes/ (~7 GB), which can be deleted after dataset creation. (estimated time: 15 minutes)
- Create metadata: Reconstructs the top-k results for each logged query. (estimated time: 11x12x64 CPU hours)
- Merge metadata: Merges the created metadata files and creates the qids.
- Create CTRs: Creates Clickthrough Records (CTRs) for every query-document pair. A CTR contains the qid, the relevance label, and the computed features. Results are stored in an LMDB, where each qid maps to a list of k CTRs (for k=20 candidate documents). The CTRs are stored as NumPy arrays to save space (see the sketch after this list). (estimated time: 3x12x64 CPU hours)
- Merge CTRs: Unifies the created LMDBs into one.
- Write LETOR: Writes the dataset to disk in LETOR format. Records are created based on the metadata input; the LMDB is used for feature lookup. (estimated time: 90 minutes)
You can run all steps in sequence by submitting the following job chain:
make
This section explains how to load and use this dataset for Federated Online Learning-to-Rank (FOLTR).
AOL4FOLTR consists of two files, metadata.csv and letor.txt (after decompression). Both files are linked via the qid attribute.
We intentionally used open standard formats to ensure broad accessibility and ease of use with popular libraries such as pandas.
Note that our LETOR dataset contains raw feature values. LTR models tend to learn more effectively when features are normalized, either per query or globally.
We provide two lightweight abstractions that facilitate FOLTR simulations and take care of feature-wise normalization.
from aol4foltr.data.metadata import Metadata
from aol4foltr.data.letor import AOL4FOLTRDataset
metadata = Metadata('dataset/metadata.csv')
letor_ds = AOL4FOLTRDataset('dataset/letor.txt')

For a full example of how to use this dataset for FOLTR, please refer to experiment.py.
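If you work with the raw letor.txt directly instead of these abstractions, a minimal per-query min-max normalization could look like the following sketch (NumPy-based; not the repository's own implementation):

```python
import numpy as np

def normalize_per_query(features: np.ndarray) -> np.ndarray:
    """Scale each feature column to [0, 1] within one query's result list.

    `features` is a (n_docs, n_features) array for a single qid.
    """
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns map to 0
    return (features - lo) / span
```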
This repository contains scripts for reproducing the results stated in the paper. This includes the dataset analysis (Section 4) and the FOLTR simulation with 100 clients (Section 5).
The results encompass:
- Basic statistics
- Data quantity (queries per user)
- Temporal patterns
- Feature distribution divergence
- FOLTR simulation
Make sure the dataset has been either downloaded or generated from source, and that it resides in the dataset/ directory.
All analyses and experiments can be run on consumer-grade hardware.
The most expensive workloads still finished within 1 hour on a MacBook M2 Max.
To generate results for feature distribution divergence, run:
python measure_feat_div.py
To generate results for FOLTR simulation, run:
python experiment.py
The remaining results can be extracted from the dataset itself (i.e., metadata.csv).
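For example, the queries-per-user statistic can be computed directly from metadata.csv; the user_id column name below is an assumption, so check the file's header:

```python
import pandas as pd

# Count how many queries each user issued ('user_id' is an assumed column name).
metadata = pd.read_csv('dataset/metadata.csv')
queries_per_user = metadata.groupby('user_id').size()
print(queries_per_user.describe())
```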
Analytics and plotting are done with R.
Rscript analysis.R
Basic statistics are printed to console. The rest is exported as both TEX and PDF to results/.
Each query-url pair is represented by a 103-dimensional feature vector (see code).
| ID | Description | Stream |
|---|---|---|
| 1 | BM25 | title |
| 2 | body | |
| 3 | url | |
| 4 | Min of term frequency (TF) | title |
| 5 | body | |
| 6 | url | |
| 7 | Max of term frequency (TF) | title |
| 8 | body | |
| 9 | url | |
| 10 | Sum of term frequency (TF) | title |
| 11 | body | |
| 12 | url | |
| 13 | Mean of term frequency (TF) | title |
| 14 | body | |
| 15 | url | |
| 16 | Variance of term frequency (TF) | title |
| 17 | body | |
| 18 | url | |
| 19 | Min of inverse document frequency (IDF) | title |
| 20 | body | |
| 21 | url | |
| 22 | Max of inverse document frequency (IDF) | title |
| 23 | body | |
| 24 | url | |
| 25 | Sum of inverse document frequency (IDF) | title |
| 26 | body | |
| 27 | url | |
| 28 | Mean of inverse document frequency (IDF) | title |
| 29 | body | |
| 30 | url | |
| 31 | Variance of inverse document frequency (IDF) | title |
| 32 | body | |
| 33 | url | |
| 34 | Min of TF*IDF | title |
| 35 | body | |
| 36 | url | |
| 37 | Max of TF*IDF | title |
| 38 | body | |
| 39 | url | |
| 40 | Sum of TF*IDF | title |
| 41 | body | |
| 42 | url | |
| 43 | Mean of TF*IDF | title |
| 44 | body | |
| 45 | url | |
| 46 | Variance of TF*IDF | title |
| 47 | body | |
| 48 | url | |
| 49 | Min of stream length | title |
| 50 | body | |
| 51 | url | |
| 52 | Max of stream length | title |
| 53 | body | |
| 54 | url | |
| 55 | Sum of stream length | title |
| 56 | body | |
| 57 | url | |
| 58 | Mean of stream length | title |
| 59 | body | |
| 60 | url | |
| 61 | Variance of stream length | title |
| 62 | body | |
| 63 | url | |
| 64 | Min of stream length normalized TF | title |
| 65 | body | |
| 66 | url | |
| 67 | Max of stream length normalized TF | title |
| 68 | body | |
| 69 | url | |
| 70 | Sum of stream length normalized TF | title |
| 71 | body | |
| 72 | url | |
| 73 | Mean of stream length normalized TF | title |
| 74 | body | |
| 75 | url | |
| 76 | Variance of stream length normalized TF | title |
| 77 | body | |
| 78 | url | |
| 79 | Cosine similarity | title |
| 80 | body | |
| 81 | url | |
| 82 | Covered query term number | title |
| 83 | body | |
| 84 | url | |
| 85 | Covered query term ratio | title |
| 86 | body | |
| 87 | url | |
| 88 | Character length | title |
| 89 | body | |
| 90 | url | |
| 91 | Term length | title |
| 92 | body | |
| 93 | url | |
| 94 | Total query terms | title |
| 95 | body | |
| 96 | url | |
| 97 | Exact match (bool) | title |
| 98 | body | |
| 99 | url | |
| 100 | Match ratio | title |
| 101 | body | |
| 102 | url | |
| 103 | Number of slashes | url |
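To illustrate how the aggregate features above are defined, the following sketch computes the five TF statistics for one stream (IDs 4-18 cover these statistics across the three streams). It uses naive whitespace tokenization and is not the repository's extraction code:

```python
import numpy as np

def tf_statistics(query_terms: list[str], stream_text: str) -> list[float]:
    """Min, max, sum, mean, and variance of term frequency in one stream."""
    tokens = stream_text.lower().split()
    tfs = np.array([tokens.count(term.lower()) for term in query_terms], dtype=float)
    return [tfs.min(), tfs.max(), tfs.sum(), tfs.mean(), tfs.var()]

# Example: statistics for a two-term query against a document title.
print(tf_statistics(["federated", "learning"], "federated online learning to rank"))
```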
This work was funded by the Dutch National NWO/TKI Science Grant BLOCK.2019.004 and NWO Grant KICH3.LTP.20.006.
Our implementation of FPDGD is based on the code in https://github.com/ielab/fpdgd-ictir2021.
If you use our dataset, please cite our paper.
Marcel Gregoriadis, Jingwei Kang, and Johan Pouwelse. 2025. AOL4FOLTR: A Large-Scale Web Search Dataset for Federated Online Learning to Rank. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). Association for Computing Machinery, New York, NY, USA, 6387–6391. https://doi.org/10.1145/3746252.3761651
@inproceedings{aol4foltr,
author = {Gregoriadis, Marcel and Kang, Jingwei and Pouwelse, Johan},
title = {AOL4FOLTR: A Large-Scale Web Search Dataset for Federated Online Learning to Rank},
year = {2025},
isbn = {9798400720406},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746252.3761651},
doi = {10.1145/3746252.3761651},
booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
pages = {6387–6391},
location = {Seoul, Republic of Korea},
series = {CIKM '25}
}