
AOL4FOLTR


This repository contains the source code to generate the dataset and results presented in our CIKM'25 publication "AOL4FOLTR: A Large-Scale Web Search Dataset for Federated Online Learning to Rank" [PDF].

AOL4FOLTR is a dataset tailored specifically for use in Federated Online Learning-to-Rank (short: FOLTR). It contains raw search queries, document contents, user IDs, and timestamps, and is based on AOL-IA and, originally, the 2006 AOL query logs. Furthermore, we generated top-20 result lists for each query and designed 103 features to enable learning-to-rank.


Generating Dataset

This section provides instructions on how to generate the AOL4FOLTR dataset. Please be aware that, due to the size of this dataset, the process requires substantial CPU time. Access to a computing cluster is highly recommended; for this tutorial, we assume access to a SLURM cluster.

Step 1: Download AOL-IA Dataset

This setup assumes the AOL-IA dataset is completely downloaded and located in ~/.ir_datasets/aol-ia. If this is not the case, please follow the instructions in the aolia-tools repository. (estimated time: 2 days)

Step 2: Initialize Environment

Install project dependencies in an environment with Python 3.10.

conda create -n pyserini python=3.10
conda activate pyserini
make install

Our scripts also require Java (JDK) 21. Make sure to download the right distribution for your OS.

cd ~
wget https://download.oracle.com/java/21/archive/jdk-21.0.6_linux-x64_bin.tar.gz
tar -xvzf jdk-21.0.6_linux-x64_bin.tar.gz
rm jdk-21.0.6_linux-x64_bin.tar.gz
# make the JDK visible to the scripts (extracted directory name assumed from the archive)
export JAVA_HOME=~/jdk-21.0.6
export PATH=$JAVA_HOME/bin:$PATH

Step 3: Compile Dataset from Sources

The final dataset consists of two files:

  • metadata.csv (~1 GB)
  • letor.txt (~55 GB)

Generation additionally produces intermediate artifacts, which can be deleted after dataset creation:

  • indexes/ (~7 GB)
  • ctrs.lmdb (~28 GB)

The pipeline consists of the following steps:

  1. Indexing: Creates an index of the entire AOL-IA corpus. Output goes to a new folder indexes/. (estimated time: 15 minutes)
  2. Create metadata: Reconstructs top-k results for each query log. (estimated time: 11x12x64 CPU hours)
  3. Merge metadata: Merges the created metadata files and assigns qids.
  4. Creating CTRs: Creates Clickthrough Records (CTRs) for every query-document pair. A CTR contains the qid, relevance label, and computed features. Results are stored in an LMDB, where each qid maps to a list of k CTRs (for k=20 candidate documents). The CTRs are stored as numpy arrays to save space; a read-back sketch follows this list. (estimated time: 3x12x64 CPU hours)
  5. Merge CTRs: Unifies created LMDBs into one.
  6. Write LETOR: Writes dataset to disk in LETOR format. Records are created based on metadata input; LMDB is used for lookup. (estimated time: 90 minutes)
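
For illustration, reading per-query CTRs back from the intermediate LMDB might look like the sketch below. The key encoding and array layout shown here are assumptions on our part; consult the repository source for the exact scheme.

import lmdb
import numpy as np

# minimal read-back sketch; key encoding and array layout are assumed
env = lmdb.open('dataset/ctrs.lmdb', readonly=True, lock=False)
with env.begin() as txn:
    raw = txn.get(b'12345')  # hypothetical key: a qid encoded as bytes
    if raw is not None:
        # assumed layout: k=20 CTRs flattened into one float32 array
        ctrs = np.frombuffer(raw, dtype=np.float32).reshape(20, -1)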

You can run all steps in sequence by submitting the following job chain:

make 

How to Use This Dataset

This section explains how to load and use this dataset for Federated Online Learning-to-Rank (FOLTR).

AOL4FOLTR consists of two files, metadata.csv and letor.txt (after decompression), which are linked via the qid attribute. We intentionally used open standard formats to ensure broad accessibility and ease of use with popular libraries such as pandas.
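
Since metadata.csv is a plain CSV, it can be inspected directly. A minimal sketch with pandas, where the user-ID column name is an assumption (inspect meta.columns for the actual schema):

import pandas as pd

meta = pd.read_csv('dataset/metadata.csv')
print(meta['qid'].nunique())                     # number of distinct queries
print(meta.groupby('user_id')['qid'].nunique())  # queries per user ('user_id' column name assumed)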

It is important to note that our LETOR data contains raw feature values. LTR models tend to learn more effectively when features are normalized, either per query or globally.
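
For illustration, query-level min-max normalization could look like the following sketch. This is our own code, not the repository's implementation; the abstractions introduced below take care of normalization for you.

import numpy as np

def normalize_per_query(feats):
    # feats: (k, 103) array of raw feature values for one query's k candidates
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (feats - lo) / span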

We provide two lightweight abstractions that facilitate FOLTR simulations and take care of the feature-wise normalization.

from aol4foltr.data.metadata import Metadata
from aol4foltr.data.letor import AOL4FOLTRDataset

metadata = Metadata('dataset/metadata.csv')
letor_ds = AOL4FOLTRDataset('dataset/letor.txt')

For a full example of how to use this dataset for FOLTR, please refer to experiment.py.

Reproduce Experimental Results

This repository contains scripts for reproducing the results stated in the paper. This includes the dataset analysis (Section 4) and the FOLTR simulation with 100 clients (Section 5).

The results encompass:

  • Basic statistics
  • Data quantity (queries per user)
  • Temporal patterns
  • Feature distribution divergence
  • FOLTR simulation

Requirements

Make sure the dataset is either downloaded or generated from sources into the dataset/ directory. All analyses and experiments can be run on consumer-grade hardware; the most expensive workloads finished within 1 hour on a MacBook M2 Max. We use R for analytics and plotting.

Generate Results

To generate results for feature distribution divergence, run:

python measure_feat_div.py

To generate results for FOLTR simulation, run:

python experiment.py

The remaining results can be extracted from the dataset itself (i.e., metadata.csv).

Show Results

Analytics and plotting are done with R.

Rscript analysis.R

Basic statistics are printed to the console. The rest is exported as both TeX and PDF to results/.

LTR Feature List

Each query-URL pair is represented by a 103-dimensional feature vector (see code).

Within each ID range, the feature is computed over the title, body, and url streams, in that order.

ID       Description                                    Stream
1–3      BM25                                           title, body, url
4–6      Min of term frequency (TF)                     title, body, url
7–9      Max of term frequency (TF)                     title, body, url
10–12    Sum of term frequency (TF)                     title, body, url
13–15    Mean of term frequency (TF)                    title, body, url
16–18    Variance of term frequency (TF)                title, body, url
19–21    Min of inverse document frequency (IDF)        title, body, url
22–24    Max of inverse document frequency (IDF)        title, body, url
25–27    Sum of inverse document frequency (IDF)        title, body, url
28–30    Mean of inverse document frequency (IDF)       title, body, url
31–33    Variance of inverse document frequency (IDF)   title, body, url
34–36    Min of TF*IDF                                  title, body, url
37–39    Max of TF*IDF                                  title, body, url
40–42    Sum of TF*IDF                                  title, body, url
43–45    Mean of TF*IDF                                 title, body, url
46–48    Variance of TF*IDF                             title, body, url
49–51    Min of stream length                           title, body, url
52–54    Max of stream length                           title, body, url
55–57    Sum of stream length                           title, body, url
58–60    Mean of stream length                          title, body, url
61–63    Variance of stream length                      title, body, url
64–66    Min of stream length normalized TF             title, body, url
67–69    Max of stream length normalized TF             title, body, url
70–72    Sum of stream length normalized TF             title, body, url
73–75    Mean of stream length normalized TF            title, body, url
76–78    Variance of stream length normalized TF        title, body, url
79–81    Cosine similarity                              title, body, url
82–84    Covered query term number                      title, body, url
85–87    Covered query term ratio                       title, body, url
88–90    Character length                               title, body, url
91–93    Term length                                    title, body, url
94–96    Total query terms                              title, body, url
97–99    Exact match (bool)                             title, body, url
100–102  Match ratio                                    title, body, url
103      Number of slashes                              url
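
letor.txt follows the standard LETOR record format, i.e., <label> qid:<qid> 1:<v1> 2:<v2> ... 103:<v103>, optionally followed by a # comment. The AOL4FOLTRDataset class handles parsing for you; the following is a minimal sketch of what reading one record by hand could look like, assuming the standard format:

import numpy as np

def parse_letor_line(line):
    # "<label> qid:<qid> 1:<v1> ... 103:<v103> [# comment]"
    line = line.split('#')[0].strip()  # drop a trailing comment, if present
    parts = line.split()
    label = float(parts[0])
    qid = parts[1].split(':', 1)[1]
    feats = np.array([float(p.split(':', 1)[1]) for p in parts[2:]], dtype=np.float32)
    return label, qid, feats

with open('dataset/letor.txt') as f:
    label, qid, feats = parse_letor_line(next(f))  # feats has 103 dimensions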

Acknowledgments

This work was funded by the Dutch National NWO/TKI Science Grant BLOCK.2019.004 and NWO Grant KICH3.LTP.20.006.

Our implementation of FPDGD is based on the code in https://github.com/ielab/fpdgd-ictir2021.

Cite

If you use our dataset, please cite our paper.

ACM Ref

Marcel Gregoriadis, Jingwei Kang, and Johan Pouwelse. 2025. A Large-Scale Web Search Dataset for Federated Online Learning to Rank. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). Association for Computing Machinery, New York, NY, USA, 6387–6391. https://doi.org/10.1145/3746252.3761651

BibTeX

@inproceedings{aol4foltr,
    author = {Gregoriadis, Marcel and Kang, Jingwei and Pouwelse, Johan},
    title = {A Large-Scale Web Search Dataset for Federated Online Learning to Rank},
    year = {2025},
    isbn = {9798400720406},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3746252.3761651},
    doi = {10.1145/3746252.3761651},
    booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
    pages = {6387–6391},
    location = {Seoul, Republic of Korea},
    series = {CIKM '25}
}
