
AOL4FOLTR


This repository contains the source code to generate the dataset and results presented in our CIKM'25 publication "AOL4FOLTR: A Large-Scale Web Search Dataset for Federated Online Learning to Rank" [PDF].

AOL4FOLTR is a dataset tailored specifically for use in Federated Online Learning-to-Rank (short: FOLTR). It contains raw search queries, document contents, user IDs, and timestamps, and is based on AOL-IA and, originally, the 2006 AOL query logs. Furthermore, we generated top-20 result lists for each query and designed 103 features to enable learning-to-rank.


Generating Dataset

This section provides instructions on how to generate the AOL4FOLTR dataset. Please be aware that, due to the size of this dataset, the process requires substantial CPU time. Access to a computing cluster is highly recommended; for this tutorial, we assume access to a SLURM cluster.

Step 1: Download AOL-IA Dataset

This setup assumes the AOL-IA dataset is completely downloaded and located in ~/.ir_datasets/aol-ia. If this is not the case, please follow the instructions in the aolia-tools repository. (estimated time: 2 days)

Step 2: Initialize Environment

Install project dependencies in an environment with Python 3.10.

conda create -n pyserini python=3.10
conda activate pyserini
make install

Our scripts also require Java (JDK) 21. Make sure to download the right distribution for your OS.

cd ~
wget https://download.oracle.com/java/21/archive/jdk-21.0.6_linux-x64_bin.tar.gz
tar -xvzf jdk-21.0.6_linux-x64_bin.tar.gz
rm jdk-21.0.6_linux-x64_bin.tar.gz
# make the JDK visible to the scripts (extracted directory name assumed from the archive)
export JAVA_HOME=~/jdk-21.0.6
export PATH=$JAVA_HOME/bin:$PATH

Step 3: Compile Dataset from Sources

The final dataset consists of two files:

  • metadata.csv (~1 GB)
  • letor.txt (~55 GB)

Generation additionally produces intermediate artifacts, which can be deleted after dataset creation:

  • indexes/ (~7 GB)
  • ctrs.lmdb (~28 GB)

The pipeline consists of the following steps:

  1. Indexing: Creates an index of the entire AOL-IA corpus. Output goes to a new folder indexes/. (estimated time: 15 minutes)
  2. Create metadata: Reconstructs top-k results for each query log. (estimated time: 11x12x64 CPU hours)
  3. Merge metadata: Merges the created metadata files and assigns qids.
  4. Creating CTRs: Creates Clickthrough Records (CTRs) for every query-document pair. A CTR contains the qid, relevance label, and computed features. Results are stored in an LMDB, where each qid maps to a list of k CTRs (for k=20 candidate documents). The CTRs are stored as numpy arrays to save space; a read-back sketch follows this list. (estimated time: 3x12x64 CPU hours)
  5. Merge CTRs: Unifies created LMDBs into one.
  6. Write LETOR: Writes dataset to disk in LETOR format. Records are created based on metadata input; LMDB is used for lookup. (estimated time: 90 minutes)
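
For illustration, reading per-query CTRs back from the intermediate LMDB might look like the sketch below. The key encoding and array layout shown here are assumptions on our part; consult the repository source for the exact scheme.

import lmdb
import numpy as np

# minimal read-back sketch; key encoding and array layout are assumed
env = lmdb.open('dataset/ctrs.lmdb', readonly=True, lock=False)
with env.begin() as txn:
    raw = txn.get(b'12345')  # hypothetical key: a qid encoded as bytes
    if raw is not None:
        # assumed layout: k=20 CTRs flattened into one float32 array
        ctrs = np.frombuffer(raw, dtype=np.float32).reshape(20, -1)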

You can run all steps in sequence by submitting the following job chain:

make 

How to Use This Dataset

This section explains how to load and use this dataset for Federated Online Learning-to-Rank (FOLTR).

AOL4FOLTR consists of two files, metadata.csv and letor.txt (after decompression), which are linked via the qid attribute. We intentionally used open standard formats to ensure broad accessibility and ease of use with popular libraries such as pandas.
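
Since metadata.csv is a plain CSV, it can be inspected directly. A minimal sketch with pandas, where the user-ID column name is an assumption (inspect meta.columns for the actual schema):

import pandas as pd

meta = pd.read_csv('dataset/metadata.csv')
print(meta['qid'].nunique())                     # number of distinct queries
print(meta.groupby('user_id')['qid'].nunique())  # queries per user ('user_id' column name assumed)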

It is important to note that our LETOR data contains raw feature values. LTR models tend to learn more effectively when features are normalized, either per query or globally.
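
For illustration, query-level min-max normalization could look like the following sketch. This is our own code, not the repository's implementation; the abstractions introduced below take care of normalization for you.

import numpy as np

def normalize_per_query(feats):
    # feats: (k, 103) array of raw feature values for one query's k candidates
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (feats - lo) / span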

We provide two lightweight abstractions that facilitate FOLTR simulations and take care of the feature-wise normalization.

from aol4foltr.data.metadata import Metadata
from aol4foltr.data.letor import AOL4FOLTRDataset

metadata = Metadata('dataset/metadata.csv')
letor_ds = AOL4FOLTRDataset('dataset/letor.txt')

For a full example of how to use this dataset for FOLTR, please refer to experiment.py.

Reproduce Experimental Results

This repository contains scripts for reproducing the results stated in the paper. This includes the dataset analysis (Section 4) and the FOLTR simulation with 100 clients (Section 5).

The results encompass:

  • Basic statistics
  • Data quantity (queries per user)
  • Temporal patterns
  • Feature distribution divergence
  • FOLTR simulation

Requirements

Make sure the dataset is either downloaded or generated from sources into the dataset/ directory. All analyses and experiments can be run on consumer-grade hardware; the most expensive workloads finished within 1 hour on a MacBook M2 Max. We use R for analytics and plotting.

Generate Results

To generate results for feature distribution divergence, run:

python measure_feat_div.py

To generate results for FOLTR simulation, run:

python experiment.py

The remaining results can be extracted from the dataset itself (i.e., metadata.csv).

Show Results

Analytics and plotting are done with R.

Rscript analysis.R

Basic statistics are printed to the console. The rest is exported as both TeX and PDF to results/.

LTR Feature List

Each query-URL pair is represented by a 103-dimensional feature vector (see code).

Within each ID range, the feature is computed over the title, body, and url streams, in that order.

ID       Description                                    Stream
1–3      BM25                                           title, body, url
4–6      Min of term frequency (TF)                     title, body, url
7–9      Max of term frequency (TF)                     title, body, url
10–12    Sum of term frequency (TF)                     title, body, url
13–15    Mean of term frequency (TF)                    title, body, url
16–18    Variance of term frequency (TF)                title, body, url
19–21    Min of inverse document frequency (IDF)        title, body, url
22–24    Max of inverse document frequency (IDF)        title, body, url
25–27    Sum of inverse document frequency (IDF)        title, body, url
28–30    Mean of inverse document frequency (IDF)       title, body, url
31–33    Variance of inverse document frequency (IDF)   title, body, url
34–36    Min of TF*IDF                                  title, body, url
37–39    Max of TF*IDF                                  title, body, url
40–42    Sum of TF*IDF                                  title, body, url
43–45    Mean of TF*IDF                                 title, body, url
46–48    Variance of TF*IDF                             title, body, url
49–51    Min of stream length                           title, body, url
52–54    Max of stream length                           title, body, url
55–57    Sum of stream length                           title, body, url
58–60    Mean of stream length                          title, body, url
61–63    Variance of stream length                      title, body, url
64–66    Min of stream length normalized TF             title, body, url
67–69    Max of stream length normalized TF             title, body, url
70–72    Sum of stream length normalized TF             title, body, url
73–75    Mean of stream length normalized TF            title, body, url
76–78    Variance of stream length normalized TF        title, body, url
79–81    Cosine similarity                              title, body, url
82–84    Covered query term number                      title, body, url
85–87    Covered query term ratio                       title, body, url
88–90    Character length                               title, body, url
91–93    Term length                                    title, body, url
94–96    Total query terms                              title, body, url
97–99    Exact match (bool)                             title, body, url
100–102  Match ratio                                    title, body, url
103      Number of slashes                              url
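
letor.txt follows the standard LETOR record format, i.e., <label> qid:<qid> 1:<v1> 2:<v2> ... 103:<v103>, optionally followed by a # comment. The AOL4FOLTRDataset class handles parsing for you; the following is a minimal sketch of what reading one record by hand could look like, assuming the standard format:

import numpy as np

def parse_letor_line(line):
    # "<label> qid:<qid> 1:<v1> ... 103:<v103> [# comment]"
    line = line.split('#')[0].strip()  # drop a trailing comment, if present
    parts = line.split()
    label = float(parts[0])
    qid = parts[1].split(':', 1)[1]
    feats = np.array([float(p.split(':', 1)[1]) for p in parts[2:]], dtype=np.float32)
    return label, qid, feats

with open('dataset/letor.txt') as f:
    label, qid, feats = parse_letor_line(next(f))  # feats has 103 dimensions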

Acknowledgments

This work was funded by the Dutch National NWO/TKI Science Grant BLOCK.2019.004 and NWO Grant KICH3.LTP.20.006.

Our implementation of FPDGD is based on the code in https://github.com/ielab/fpdgd-ictir2021.

Cite

If you use our dataset, please cite our paper.

ACM Ref

Marcel Gregoriadis, Jingwei Kang, and Johan Pouwelse. 2025. A Large-Scale Web Search Dataset for Federated Online Learning to Rank. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). Association for Computing Machinery, New York, NY, USA, 6387–6391. https://doi.org/10.1145/3746252.3761651

BibTeX

@inproceedings{aol4foltr,
    author = {Gregoriadis, Marcel and Kang, Jingwei and Pouwelse, Johan},
    title = {A Large-Scale Web Search Dataset for Federated Online Learning to Rank},
    year = {2025},
    isbn = {9798400720406},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3746252.3761651},
    doi = {10.1145/3746252.3761651},
    booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
    pages = {6387–6391},
    location = {Seoul, Republic of Korea},
    series = {CIKM '25}
}
