This repository accompanies the thesis project: Natural Language to SQL (NL2SQL) for the MIMIC-IV Medical Database
The complexity of medical databases like MIMIC-IV often prevents clinicians and researchers from fully utilizing valuable healthcare data. By enabling natural language queries over structured clinical data, this project aims to lower technical barriers and support more inclusive, data-driven decision-making in medicine.
- Conduct a comprehensive analysis of different language models on MIMIC-IV natural language queries to assess their performance across various implementation techniques:
- Evaluate both general-purpose and medical-domain LLMs (including Phi-4, Qwen 1.5/2.5, MedQwen, Meditron, Medalpaca, and specialized SQL models)
- Compare performance under zero-shot and few-shot prompting regimes
- Assess model capabilities across two distinct tasks:
- Query generation: Converting natural language questions to executable SQL
- Query validation: Verifying and correcting generated SQL for schema compliance and logical consistency
- Develop robust evaluation metrics focused on execution success, result accuracy, and clinical relevance
- Building on the model analysis findings, the primary objective is to develop and implement an efficient two-stage natural language to SQL pipeline specifically optimized for the MIMIC-IV database:
- Select the best-performing models for both generation and validation stages based on comparative analysis
- Fine-tune selected models to enhance performance on medical text-to-SQL tasks
- Design an integrated pipeline architecture that balances accuracy with practical efficiency
- Implement schema-aware validation mechanisms to ensure query correctness
- Evaluate the end-to-end pipeline against baseline approaches using clinically relevant metrics
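
The two-stage idea above (generate, then validate and correct) can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: it assumes the database is queried through SQLite, uses a made-up two-table schema, and replaces both models with stub functions. The names `schema_error`, `run_pipeline`, `fake_generator`, and `fake_validator` are hypothetical. The check shown here covers schema compliance only; logical consistency would still need the validation model or result comparison.

```python
import sqlite3
from typing import Callable, Optional, Tuple

# Hypothetical two-table schema for illustration; not the real MIMIC-IV schema.
SCHEMA = """
CREATE TABLE patients (subject_id INTEGER, gender TEXT);
CREATE TABLE admissions (hadm_id INTEGER, subject_id INTEGER, admittime TEXT);
"""


def schema_error(schema_sql: str, query: str) -> Optional[str]:
    """Return None if the query compiles against the schema, else the error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        # EXPLAIN compiles the statement, so unknown tables or columns raise here.
        conn.execute("EXPLAIN " + query)
        return None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)
    finally:
        conn.close()


def run_pipeline(
    question: str,
    schema_sql: str,
    generate: Callable[[str, str], str],
    repair: Callable[[str, str, str, str], str],
    max_repairs: int = 1,
) -> Tuple[str, Optional[str]]:
    """Stage 1 generates SQL; stage 2 checks it and asks for a correction on failure."""
    sql = generate(question, schema_sql)
    for _ in range(max_repairs):
        error = schema_error(schema_sql, sql)
        if error is None:
            break
        sql = repair(question, schema_sql, sql, error)
    return sql, schema_error(schema_sql, sql)


# Stand-ins for the generation and validation models (assumptions for the demo).
def fake_generator(question: str, schema_sql: str) -> str:
    return "SELECT sex FROM patients"  # deliberately references a missing column


def fake_validator(question: str, schema_sql: str, sql: str, error: str) -> str:
    return sql.replace("sex", "gender")


final_sql, remaining_error = run_pipeline(
    "What is the gender of each patient?", SCHEMA, fake_generator, fake_validator
)
print(final_sql, remaining_error)
```

Passing the SQLite error message to the repair step gives the second model concrete feedback about which table or column was rejected.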
This work adopts a rigorous, multi-phase methodology:
- Dataset Preparation: The MIMIC-IV database and EHRSQL 2024 benchmark are adapted for NL2SQL evaluation, ensuring schema alignment and clinical relevance.
- Model Selection: Both general-purpose and medical-domain LLMs are selected, including Phi-4, Qwen, MedQwen, Meditron, Medalpaca, and SQL-specialized models.
- Prompting Strategies: Each model is evaluated under zero-shot, few-shot, and schema-aware prompting to assess adaptability and performance.
- Two-Stage Pipeline Design:
- Stage 1 (Generation): The best-performing models generate SQL from natural language queries.
- Stage 2 (Validation): A separate model validates and corrects generated SQL, focusing on schema compliance and logical consistency.
- Fine-Tuning: LoRA-based parameter-efficient fine-tuning is applied to selected models to improve domain-specific performance.
- Evaluation: A comprehensive framework is used, including execution accuracy, structural/component metrics, and error categorization. Clinical relevance is prioritized in metric design.
- Analysis & Visualization: Error patterns and model behaviors are analyzed using custom scripts and visualizations to inform future improvements.
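
To make the difference between the prompting strategies concrete, here is a small sketch of how the three prompt variants might be assembled. The instruction wording, the `build_prompt` helper, and the example question/SQL pair are all assumptions for illustration; the prompts used in the thesis may differ.

```python
from typing import Optional, Sequence, Tuple


def build_prompt(
    question: str,
    schema_sql: Optional[str] = None,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt; the optional arguments select the prompting regime."""
    parts = ["Translate the question into a single SQL query."]
    if schema_sql:
        # Schema-aware: include the table definitions in the prompt.
        parts.append("Schema:\n" + schema_sql.strip())
    for example_question, example_sql in examples:
        # Few-shot: prepend solved question/SQL pairs.
        parts.append(f"Question: {example_question}\nSQL: {example_sql}")
    parts.append(f"Question: {question}\nSQL:")
    return "\n\n".join(parts)


question = "How many patients are there?"
demo_schema = "CREATE TABLE patients (subject_id INTEGER, gender TEXT);"
demo_examples = [("List all patient genders.", "SELECT gender FROM patients")]

zero_shot = build_prompt(question)
few_shot = build_prompt(question, examples=demo_examples)
schema_aware = build_prompt(question, schema_sql=demo_schema)
print(schema_aware)
```

Keeping all three regimes in one function means the same question can be evaluated under each strategy with only the arguments changing.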
- A memory-efficient NL2SQL pipeline for the MIMIC-IV database, with open-source code for reproducibility.
- Systematic benchmarking of multiple LLMs and prompting strategies in the medical SQL generation context.
- A robust evaluation framework and detailed error analysis tailored to the needs of clinical data retrieval.
- Insights and recommendations for deploying NL2SQL systems in real-world healthcare research settings.
- src/ - Main entry point and orchestration scripts
- model/ - Model implementations for M1 and M2 (few-shot, finetune, schema-aware, zeroshot)
- analysis/ - Data and model analysis scripts, visualizations, and evaluation outputs
- data/ - Datasets, MIMIC-IV database files, and schema information
- helper_scripts/ - Utility scripts for data sampling, cleaning, and scoring
- utils/ - Core utility modules for dataset handling, query analysis, evaluation, and visualization
- results/ - Output results from experiments and evaluations
- Base Models: Phi-4, Qwen 1.5/2.5, MedQwen, Meditron 7B, Medalpaca 13B, DuckDB NSQL 7B, SQLCoder 7B
- Fine-Tuning: LoRA with 4-bit quantization
- Database: MIMIC-IV (adapted from EHRSQL 2024 competition schema)
- Evaluation: Custom metrics for table/column access accuracy, component similarity, execution plan analysis
- Visualization: Matplotlib, Seaborn for analysis plots
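
One way a table/column access metric could be computed is sketched below. It is an assumption-laden illustration rather than the repository's metric: it assumes SQLite, uses a made-up schema, and scores overlap with a Jaccard ratio. The helper names `accessed_columns` and `column_access_score` are hypothetical. Instead of parsing SQL text, it uses the SQLite authorizer hook (`Connection.set_authorizer`) to record which columns the engine reads while compiling each query.

```python
import sqlite3
from typing import Optional, Set, Tuple

# Hypothetical schema for illustration; not the real MIMIC-IV schema.
SCHEMA = """
CREATE TABLE patients (subject_id INTEGER, gender TEXT, dob TEXT);
CREATE TABLE admissions (hadm_id INTEGER, subject_id INTEGER, admittime TEXT);
"""


def accessed_columns(schema_sql: str, query: str) -> Set[Tuple[Optional[str], Optional[str]]]:
    """Return the (table, column) pairs SQLite reads for the query."""
    conn = sqlite3.connect(":memory:")
    seen = set()

    def record(action, arg1, arg2, db_name, source):
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        if action == sqlite3.SQLITE_READ:
            seen.add((arg1, arg2))
        return sqlite3.SQLITE_OK  # never block anything, only observe

    try:
        conn.executescript(schema_sql)
        conn.set_authorizer(record)
        conn.execute(query).fetchall()  # tables are empty, so this is cheap
    finally:
        conn.close()
    return seen


def column_access_score(schema_sql: str, gold_query: str, predicted_query: str) -> float:
    """Jaccard overlap between the columns read by the gold and predicted queries."""
    gold = accessed_columns(schema_sql, gold_query)
    try:
        predicted = accessed_columns(schema_sql, predicted_query)
    except sqlite3.Error:
        return 0.0  # a query that does not compile accesses nothing useful
    union = gold | predicted
    return len(gold & predicted) / len(union) if union else 1.0


score = column_access_score(
    SCHEMA,
    "SELECT gender FROM patients",
    "SELECT gender, dob FROM patients",
)
print(score)
```

A table-level variant follows by reducing each pair to its table name before computing the overlap.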
- Clone the repository:
git clone <repo-url>
cd NL2SQL_MIMIC
- Install Python dependencies:
pip install -r requirements.txt
- Ensure you have access to the MIMIC-IV database and the EHRSQL 2024 dataset (see the data/ folder).
- Main pipeline entry point:
python src/main.py --config config.py
- Helper scripts for data sampling, cleaning, and evaluation are in helper_scripts/.
- Model-specific scripts (few-shot, finetune, schema-aware, zeroshot) are in model/m1/ and model/m2/.
- Analysis and visualization scripts are in analysis/.
NL2SQL_MIMIC/
├── src/
├── model/
├── analysis/
├── data/
├── helper_scripts/
├── utils/
├── results/
├── requirements.txt
├── config.py
└── README.md
- Michael Han, Daniel Han, and Unsloth team. Unsloth, 2023. URL http://github.com/unslothai/unsloth.
- Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-Wei H Lehman, Mengling Feng, Marzyeh Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-IV: A freely accessible critical care database. Scientific Data, 10(1):1–14, 2023. doi:10.1038/s41597-023-02055-7.
- Ayush Kumar, Parth Nagarkar, Prabhav Nalhe, and Sanjeev Vijayakumar. Deep learning driven natural languages text to SQL query conversion: A survey. arXiv preprint arXiv:2208.04415, 2022. URL https://arxiv.org/abs/2208.04415.
- Gyubok Lee, Sunjun Kweon, Seongsu Bae, and Edward Choi. Overview of the EHRSQL 2024 shared task on reliable text-to-SQL modeling on electronic health records. In Proceedings of the 6th Clinical Natural Language Processing Workshop, pages 644–654, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.clinicalnlp-1.62. URL https://aclanthology.org/2024.clinicalnlp-1.62/.
- Ali Mohammadjafari, Anthony S. Maida, and Raju Gottumukkala. From natural language to SQL: Review of LLM-based text-to-SQL systems, 2025. URL https://arxiv.org/abs/2410.01066.
- Zheng Ning, Yuan Tian, Zheng Zhang, Tianyi Zhang, and Toby Jia-Jun Li. Insights into natural language database query errors: From attention misalignment to user handling strategies. arXiv preprint arXiv:2402.07304, 2024. URL https://arxiv.org/abs/2402.07304.
- Richard Tarbell, Kim-Kwang Raymond Choo, Glenn Dietrich, and Anthony Rios. Towards understanding the generalization of medical text-to-SQL models and datasets. arXiv preprint arXiv:2303.12898, 2023. URL https://arxiv.org/abs/2303.12898.
- Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, 2018. URL https://aclanthology.org/D18-1425/.
- Xiaohu Zhu, Qian Li, Lizhen Cui, and Yongkang Liu. Large language model enhanced text-to-SQL generation: A survey. arXiv preprint arXiv:2410.06011, 2024. URL https://arxiv.org/abs/2410.06011.
- Angelo Ziletti and Leonardo D’Ambrosi. Retrieval augmented text-to-sql generation for epidemiological question answering using electronic health records. In Proceedings of the 6th Clinical Natural Language Processing Workshop, pages 47–53. Association for Computational Linguistics, 2024. doi:10.18653/v1/2024.clinicalnlp-1.4. URL http://dx.doi.org/10.18653/v1/2024.clinicalnlp-1.4.
This project is licensed under the MIT License. See the LICENSE file for details.