Integrating language models (LMs) into healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained on high-resource languages, leaving them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages and posing significant challenges for deployment in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all major continents), and encompassing a wide array of critical healthcare topics such as disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.
- Comprehensive Multidimensional Evaluation: We establish a structured trustworthiness evaluation framework covering truthfulness, fairness, safety, privacy, and robustness through 18 sub-tasks: adversarial attacks, consistency verification, disparagement, exaggerated safety, stereotype and preference fairness, hallucination, honesty, jailbreak and OOD robustness, privacy leakage, toxicity, and sycophancy.
- Domain-Specific Healthcare Coverage: CLINIC offers 28,800 carefully curated samples from six key healthcare domains, including patient conditions, preventive healthcare, diagnostics and laboratory tests, pharmacology and medication, surgical and procedural treatment, and emergency medicine.
- Global Linguistic Coverage: CLINIC supports 15 languages from diverse regions, including Asia, Africa, Europe, and the Americas, ensuring broad cultural and linguistic representation.
- Extensive Model Benchmarking: We conduct a comprehensive evaluation of 13 language models, including small and large open-weight, medical, and reasoning models, providing a holistic analysis of language models across varied healthcare scenarios.
- Expert Validation: All evaluation tasks and their respective criteria have been validated and refined in consultation with healthcare domain experts, ensuring clinical accuracy and real-world relevance.
- Trustworthiness-Oriented Vertical Design: CLINIC is the first medical benchmark explicitly organized around 18 trustworthiness tasks for multilingual medical cases. Existing benchmarks primarily focus on task accuracy (like QA or classification) and do not evaluate trustworthiness dimensions. This trustworthiness evaluation enables fine-grained analysis of model reliability, something older datasets were never designed to capture.
- Balanced and Equalized Sampling Across Languages and Tasks: Unlike prior benchmarks with uneven language distributions, CLINIC maintains uniform sample counts (≈1,920 per language) across all 15 languages and tasks, removing sampling bias and enabling direct, quantitative comparison of model performance across languages.
- Cross-lingual Validity: Existing benchmarks either focus on English or include a limited number of languages (≈4-7), often through automatic translation or partial alignment. In contrast, CLINIC uniquely covers 15 languages across all continents, each containing expert-translated and medically verified samples, ensuring cross-lingual clinical validity, not just linguistic diversity.
We used MedlinePlus (NLM, 2025) as our primary data source because it provides broad coverage of medical subdomains and high-quality English and professionally translated multilingual content. Unlike prior datasets (Wang et al., 2024; Qiu et al., 2024), it includes low-resource and geographically diverse languages with clinically vetted translations. To support out-of-distribution evaluation and ensure current medication information, we additionally incorporated FDA drug documents with available parallel multilingual versions.
Step 1 involves data collection and mapping English samples to their corresponding multilingual versions. Step 2 applies a two-step prompting strategy to generate additional samples. Step 3 focuses on sample validation to determine final inclusion in CLINIC.
- Distribution of samples across different dimensions of CLINIC
- Distribution of samples across subdomains, where some samples fall under multiple categories.
- Python 3.8 or higher
- CUDA-capable GPU (recommended for model inference)
- Git
- Clone the repository:
```bash
git clone https://github.com/AikyamLab/clinic
cd clinic
```
- Create a virtual environment (recommended):
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
Note: For CUDA support with PyTorch, you may need to install PyTorch separately based on your CUDA version. Visit PyTorch's official website for installation instructions.
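To confirm that the installed PyTorch build can see your GPU, a quick check like the following can help (this snippet is illustrative and not part of the repository):

```python
# Sanity check: verify that the installed PyTorch build detects a CUDA GPU.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```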
This repository contains the model generation and response evaluation scripts used in the CLINIC benchmark. The repository is organized as follows:
- `generation/`: Contains 16 sub-folders, each with a Python script for generating model responses for different tasks
- `evaluation/`: Contains 16 sub-folders, each with a Python script for evaluating model responses
Note: The paper contains 18 tasks. During response generation and evaluation, we combined three tasks - False Confidence Test (FCT), False Question Test (FQT), and None of the Above Test (NOTA) - resulting in 16 scripts each for generation and evaluation.
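A simplified view of the layout, using the placeholders referenced below (the dataset file name is illustrative; actual task folder names follow the task list further down):

```text
clinic/
├── generation/
│   └── <task-name>/
│       ├── <dataset>.csv          # task dataset CSV expected by the script (name illustrative)
│       └── <script-name>.py       # generates model responses for this task
├── evaluation/
│   └── <task-name>/
│       └── <task-name>_eval.py    # scores the generated responses
└── requirements.txt
```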
Each generation script in the `generation/` folder follows a similar structure (see the illustrative sketch after these steps):
- Configure the `model_path` variable with your model path
- Set the `model_name` variable
- Ensure the corresponding dataset CSV file is in the script's directory
- Run the script:
```bash
cd generation/<task-name>/
python <script-name>.py
```
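For orientation, the configurable part of a generation script typically boils down to loading the model, reading the task CSV, and writing responses. The sketch below is illustrative only: the CSV column names, output file name, and generation settings are assumptions, not the repository's actual code.

```python
# Illustrative sketch only: CSV column names, output file name, and generation
# settings are assumptions, not the repository's actual code.
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/your/model"   # configure: local path or Hugging Face ID
model_name = "my-model"              # configure: used to name the output file

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

df = pd.read_csv("task_dataset.csv")   # the task's dataset CSV (name illustrative)
responses = []
for prompt in df["prompt"]:            # "prompt" column is an assumption
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    responses.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

df["response"] = responses
df.to_csv(f"{model_name}_responses.csv", index=False)
```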
Each evaluation script in the `evaluation/` folder follows the same pattern (see the sketch after these steps):
- Configure any required API keys (e.g., OpenAI API key, Perspective API key for toxicity evaluation)
- Set the `MODEL_NAME` variable to match the model you're evaluating
- Ensure response files are in the expected directory structure
- Run the script:
```bash
cd evaluation/<task-name>/
python <task-name>_eval.py
```
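As a rough illustration of an API-based evaluation step, the sketch below uses an LLM-as-judge pattern via the OpenAI client. The judging prompt, judge model, scoring scale, and file names are assumptions for illustration and do not reproduce the repository's evaluation logic.

```python
# Illustrative LLM-as-judge sketch: prompt, judge model, scoring scale, and
# file names are assumptions, not the repository's actual evaluation logic.
import os

import pandas as pd
from openai import OpenAI

MODEL_NAME = "my-model"   # the model whose responses are being evaluated
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

df = pd.read_csv(f"{MODEL_NAME}_responses.csv")   # hypothetical response file
scores = []
for _, row in df.iterrows():
    judgment = client.chat.completions.create(
        model="gpt-4o-mini",   # judge model is an assumption
        messages=[{
            "role": "user",
            "content": (
                "Rate the factual accuracy of this answer on a 1-5 scale. "
                f"Question: {row['prompt']} Answer: {row['response']}"
            ),
        }],
    )
    scores.append(judgment.choices[0].message.content.strip())

df["score"] = scores
df.to_csv(f"{MODEL_NAME}_scores.csv", index=False)
```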
The repository includes scripts for the following tasks:
- Truthfulness: Hallucinations, Honesty, Out-of-Domain (OOD)
- Fairness: Fairness-Preference, Fairness-Stereotype
- Safety: Toxicity, Disparagement, Exaggerated Safety
- Robustness: Adversarial Attacks, Consistency, Colloquial, Jailbreak-DAN, Jailbreak-PAIRS
- Privacy: Privacy
- Sycophancy: Sycophancy-Persona, Sycophancy-Preference
See requirements.txt for the complete list of dependencies. Key dependencies include:
- `transformers`: For loading and running language models
- `torch`: PyTorch for deep learning operations
- `pandas`: For data manipulation
- `openai`: For OpenAI API-based evaluations
- `requests`: For API calls (e.g., Perspective API)
- `scipy`: For statistical computations
- `FlagEmbedding`: For embedding-based evaluations
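For instance, an embedding-based check with `FlagEmbedding` might compare a model response against a reference statement. The embedding model, example sentences, and use of cosine similarity below are assumptions for illustration, not CLINIC's actual evaluation criteria.

```python
# Illustrative embedding-similarity check; the embedding model, sentences, and
# use of cosine similarity here are assumptions, not CLINIC's actual criteria.
import numpy as np
from FlagEmbedding import FlagModel

model = FlagModel("BAAI/bge-base-en-v1.5")   # example embedding model

reference = "Metformin is a common first-line medication for type 2 diabetes."
response = "Type 2 diabetes is often treated first with metformin."

ref_vec, resp_vec = model.encode([reference, response])
similarity = float(
    np.dot(ref_vec, resp_vec) / (np.linalg.norm(ref_vec) * np.linalg.norm(resp_vec))
)
print(f"Cosine similarity: {similarity:.3f}")
```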
If you use CLINIC benchmark in your research, please cite our repo:
@misc{githubrepo,
author = {Aikyam Lab},
title = {clinic},
howpublished = {\url{https://github.com/AikyamLab/clinic}},
year = {2025},
note = {Version 1.0}
}
This project is distributed under the MIT License.
You are free to use, modify, and distribute this software, as long as you include the original license notice.
See the full text in the LICENSE file.
For questions or issues, please open an issue on GitHub or contact the authors.






