Multilingual, knowledge-graph grounded benchmark for evaluating factuality and knowledge injection methods for LLMs.
Tested with Python 3.11 on a MacBook Pro M3 (18 GB). Optionally, you can use a CUDA-compatible GPU for faster translation and sentence-transformers inference.
- Create a Python 3.11 virtual environment, either with `venv` or `conda`.
- Install the requirements:

  ```
  python -m pip install -r requirements.txt
  ```

- Create an `.env` file in the root directory with the following content (a minimal loading sketch follows this list):

  ```
  HF_TOKEN=<YOUR_HF_TOKEN>
  WANDB_API_KEY=<YOUR_WANDB_API_KEY>
  OPEN_ROUTER_API_KEY=<YOUR_OPEN_ROUTER_API_KEY>
  ```

- Execute the code with:

  ```
  python main.py --config config/default.yaml
  ```
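For reference, here is a minimal sketch of how these credentials can be read at runtime, assuming the project loads them with `python-dotenv` (the package choice and loading code are an assumption; only the variable names come from the `.env` example above):

```python
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

# Populate os.environ with the key=value pairs from the .env file.
load_dotenv()

hf_token = os.environ["HF_TOKEN"]               # Hugging Face access token
wandb_key = os.environ["WANDB_API_KEY"]         # Weights & Biases API key
router_key = os.environ["OPEN_ROUTER_API_KEY"]  # OpenRouter API key
```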
The `default.yaml` file contains the default configuration to run the whole pipeline from start to finish; you can also create your own configuration file and pass it to the script to process separate stages or datasets. The default configuration arguments and their documentation are defined in `src.utils.config.GlobalConfig.get_default_args()`. Parameters specified in `config/default.yaml` override these defaults, so you can create your own configuration files to run different experiments.
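To illustrate the override behaviour, here is a minimal sketch; the default keys shown are hypothetical placeholders, and in the project the real defaults come from `src.utils.config.GlobalConfig.get_default_args()` rather than a hard-coded dict:

```python
import yaml  # PyYAML

# Hypothetical defaults, standing in for
# src.utils.config.GlobalConfig.get_default_args().
DEFAULTS = {
    "stages": ["all"],
    "datasets": ["all"],
    "output_dir": "output",
}

def load_config(path: str) -> dict:
    """Overlay a user-supplied YAML config on top of the defaults."""
    with open(path) as f:
        overrides = yaml.safe_load(f) or {}
    # Keys present in the YAML file take precedence over the defaults.
    return {**DEFAULTS, **overrides}

config = load_config("config/default.yaml")
```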
- Output is written to the `output/` directory, in a timestamped folder.
- Alternatively, you can run the code by building a Docker image:
```
docker build . -t multihal:latest
docker run --rm multihal:latest
# or, to run the code in interactive mode
docker run --rm -it --entrypoint sh multihal:latest
```

We also supply our raw results in the `results/` directory. To generate the tables and figures from the original paper, run the `results/generate_results.ipynb` notebook. Output will be written to the `results/output/` directory.
We perform our evaluation based on semantic similarity computed using sentence-transformers. The figure below compares semantic similarity scores between ground-truth and predicted answers for vanilla QA and for KG-RAG (KG path labels included as part of the context). The results show consistent improvements in semantic similarity when using our mined KG paths.
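As a minimal sketch of this kind of scoring (the model checkpoint named below is an example, not necessarily the one used in the paper):

```python
from sentence_transformers import SentenceTransformer, util

# Example checkpoint; the paper's exact model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

references = ["Paris is the capital of France."]
predictions = ["The capital of France is Paris."]

# Embed both sides and score each (reference, prediction) pair
# by the cosine similarity of their sentence embeddings.
ref_emb = model.encode(references, convert_to_tensor=True)
pred_emb = model.encode(predictions, convert_to_tensor=True)
scores = util.cos_sim(ref_emb, pred_emb).diagonal()

print(scores)  # values near 1.0 indicate semantically equivalent answers
```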
This project is licensed under the CC-BY-4.0 license.
