Integrating language models (LMs) into healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained on high-resource languages, leaving them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages and posing significant challenges for deployment in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all major continents), and encompassing a wide array of critical healthcare topics such as disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.
- Comprehensive Multidimensional Evaluation: We establish a structured trustworthiness evaluation framework covering truthfulness, fairness, safety, privacy, and robustness through 18 sub-tasks: adversarial attacks, consistency verification, disparagement, exaggerated safety, stereotype and preference fairness, hallucination, honesty, jailbreak and OOD robustness, privacy leakage, toxicity, and sycophancy.
- Domain-Specific Healthcare Coverage: CLINIC offers 28,800 carefully curated samples from six key healthcare domains, including patient conditions, preventive healthcare, diagnostics and laboratory tests, pharmacology and medication, surgical and procedural treatment, and emergency medicine.
- Global Linguistic Coverage: CLINIC supports 15 languages from diverse regions, including Asia, Africa, Europe, and the Americas, ensuring broad cultural and linguistic representation.
- Extensive Model Benchmarking: We conduct a comprehensive evaluation of 13 language models, including small and large open-weight, medical, and reasoning models, providing a holistic analysis of language models across varied healthcare scenarios.
- Expert Validation: All evaluation tasks and their respective criteria have been validated and refined in consultation with healthcare domain experts, ensuring clinical accuracy and real-world relevance.
- Trustworthiness-Oriented Vertical Design: CLINIC is the first medical benchmark explicitly organized around 18 trustworthiness tasks for multilingual medical cases. Existing benchmarks primarily focus on task accuracy (like QA or classification) and do not evaluate trustworthiness dimensions. This trustworthiness evaluation enables fine-grained analysis of model reliability, something older datasets were never designed to capture.
- Balanced and Equalized Sampling Across Languages and Tasks: Unlike prior benchmarks with uneven language distributions, CLINIC maintains uniform sample counts (≈1,920 per language) across all 15 languages and tasks, removing sampling bias and enabling direct, quantitative comparison of model performance across languages.
- Cross-lingual Validity: Existing benchmarks either focus on English or include a limited number of languages (≈4-7), often through automatic translation or partial alignment. In contrast, CLINIC uniquely covers 15 languages across all continents, each containing expert-translated and medically verified samples, ensuring cross-lingual clinical validity, not just linguistic diversity.
We used MedlinePlus (NLM, 2025) as our primary data source because it provides broad coverage of medical subdomains and high-quality English and professionally translated multilingual content. Unlike prior datasets (Wang et al., 2024; Qiu et al., 2024), it includes low-resource and geographically diverse languages with clinically vetted translations. To support out-of-distribution evaluation and ensure current medication information, we additionally incorporated FDA drug documents with available parallel multilingual versions.
Step 1 involves data collection and mapping English samples to their corresponding multilingual versions. Step 2 applies a two-step prompting strategy to generate additional samples. Step 3 focuses on sample validation to determine final inclusion in CLINIC.
- Distribution of samples across different dimensions of CLINIC
- Distribution of samples across subdomains, where some samples fall under multiple categories.
- Python 3.8 or higher
- CUDA-capable GPU (recommended for model inference)
- Git
- Clone the repository:
```bash
git clone https://github.com/AikyamLab/clinic
cd clinic
```
- Create a virtual environment (recommended):
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
Note: For CUDA support with PyTorch, you may need to install PyTorch separately based on your CUDA version. Visit PyTorch's official website for installation instructions.
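To confirm that the installed PyTorch build can see your GPU, a quick check like the following can help (this snippet is illustrative and not part of the repository):

```python
# Sanity check: verify that the installed PyTorch build detects a CUDA GPU.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```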
This repository contains the model generation and response evaluation scripts used in the CLINIC benchmark. The repository is organized as follows:
- `generation/`: Contains 16 sub-folders, each with a Python script for generating model responses for different tasks
- `evaluation/`: Contains 16 sub-folders, each with a Python script for evaluating model responses
Note: The paper contains 18 tasks. During response generation and evaluation, we combined three tasks - False Confidence Test (FCT), False Question Test (FQT), and None of the Above Test (NOTA) - resulting in 16 scripts each for generation and evaluation.
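A simplified view of the layout, using the placeholders referenced below (the dataset file name is illustrative; actual task folder names follow the task list further down):

```text
clinic/
├── generation/
│   └── <task-name>/
│       ├── <dataset>.csv          # task dataset CSV expected by the script (name illustrative)
│       └── <script-name>.py       # generates model responses for this task
├── evaluation/
│   └── <task-name>/
│       └── <task-name>_eval.py    # scores the generated responses
└── requirements.txt
```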
Each generation script in the `generation/` folder follows a similar structure (see the illustrative sketch after these steps):
- Configure the `model_path` variable with your model path
- Set the `model_name` variable
- Ensure the corresponding dataset CSV file is in the script's directory
- Run the script:
```bash
cd generation/<task-name>/
python <script-name>.py
```
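For orientation, the configurable part of a generation script typically boils down to loading the model, reading the task CSV, and writing responses. The sketch below is illustrative only: the CSV column names, output file name, and generation settings are assumptions, not the repository's actual code.

```python
# Illustrative sketch only: CSV column names, output file name, and generation
# settings are assumptions, not the repository's actual code.
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/your/model"   # configure: local path or Hugging Face ID
model_name = "my-model"              # configure: used to name the output file

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

df = pd.read_csv("task_dataset.csv")   # the task's dataset CSV (name illustrative)
responses = []
for prompt in df["prompt"]:            # "prompt" column is an assumption
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    responses.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

df["response"] = responses
df.to_csv(f"{model_name}_responses.csv", index=False)
```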
Each evaluation script in the `evaluation/` folder follows the same pattern (see the sketch after these steps):
- Configure any required API keys (e.g., OpenAI API key, Perspective API key for toxicity evaluation)
- Set the `MODEL_NAME` variable to match the model you're evaluating
- Ensure response files are in the expected directory structure
- Run the script:
```bash
cd evaluation/<task-name>/
python <task-name>_eval.py
```
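As a rough illustration of an API-based evaluation step, the sketch below uses an LLM-as-judge pattern via the OpenAI client. The judging prompt, judge model, scoring scale, and file names are assumptions for illustration and do not reproduce the repository's evaluation logic.

```python
# Illustrative LLM-as-judge sketch: prompt, judge model, scoring scale, and
# file names are assumptions, not the repository's actual evaluation logic.
import os

import pandas as pd
from openai import OpenAI

MODEL_NAME = "my-model"   # the model whose responses are being evaluated
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

df = pd.read_csv(f"{MODEL_NAME}_responses.csv")   # hypothetical response file
scores = []
for _, row in df.iterrows():
    judgment = client.chat.completions.create(
        model="gpt-4o-mini",   # judge model is an assumption
        messages=[{
            "role": "user",
            "content": (
                "Rate the factual accuracy of this answer on a 1-5 scale. "
                f"Question: {row['prompt']} Answer: {row['response']}"
            ),
        }],
    )
    scores.append(judgment.choices[0].message.content.strip())

df["score"] = scores
df.to_csv(f"{MODEL_NAME}_scores.csv", index=False)
```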
The repository includes scripts for the following tasks:
- Truthfulness: Hallucinations, Honesty, Out-of-Domain (OOD)
- Fairness: Fairness-Preference, Fairness-Stereotype
- Safety: Toxicity, Disparagement, Exaggerated Safety
- Robustness: Adversarial Attacks, Consistency, Colloquial, Jailbreak-DAN, Jailbreak-PAIRS
- Privacy: Privacy
- Sycophancy: Sycophancy-Persona, Sycophancy-Preference
See requirements.txt for the complete list of dependencies. Key dependencies include:
- `transformers`: For loading and running language models
- `torch`: PyTorch for deep learning operations
- `pandas`: For data manipulation
- `openai`: For OpenAI API-based evaluations
- `requests`: For API calls (e.g., Perspective API)
- `scipy`: For statistical computations
- `FlagEmbedding`: For embedding-based evaluations
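For instance, an embedding-based check with `FlagEmbedding` might compare a model response against a reference statement. The embedding model, example sentences, and use of cosine similarity below are assumptions for illustration, not CLINIC's actual evaluation criteria.

```python
# Illustrative embedding-similarity check; the embedding model, sentences, and
# use of cosine similarity here are assumptions, not CLINIC's actual criteria.
import numpy as np
from FlagEmbedding import FlagModel

model = FlagModel("BAAI/bge-base-en-v1.5")   # example embedding model

reference = "Metformin is a common first-line medication for type 2 diabetes."
response = "Type 2 diabetes is often treated first with metformin."

ref_vec, resp_vec = model.encode([reference, response])
similarity = float(
    np.dot(ref_vec, resp_vec) / (np.linalg.norm(ref_vec) * np.linalg.norm(resp_vec))
)
print(f"Cosine similarity: {similarity:.3f}")
```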
If you use CLINIC benchmark in your research, please cite our repo:
@misc{githubrepo,
author = {Aikyam Lab},
title = {clinic},
howpublished = {\url{https://github.com/AikyamLab/clinic}},
year = {2025},
note = {Version 1.0}
}
This project is distributed under the MIT License.
You are free to use, modify, and distribute this software, as long as you include the original license notice.
See the full text in the LICENSE file.
For questions or issues, please open an issue on GitHub or contact the authors.






