In recent years, Language Models for Code (LMC) have significantly changed the landscape of software engineering (SE) on downstream tasks, such as code generation, by making software development more efficient. Therefore, a growing interest has emerged in further evaluating these language models to homogenize the quality assessment of generated code. Because the current evaluation process relies heavily on accuracy-based metrics, practitioners often seek methods to interpret LMC outputs beyond canonical benchmarks. While most research reports code generation effectiveness against the expected ground truth, scant attention has been paid to LLMs' explanations; in essence, the decision-making process behind generated code is hard to interpret. To bridge this evaluation gap, we introduce code rationales (CodeQ), a technique with rigorous mathematical underpinning, to identify subsets of tokens that can explain individual code predictions. We conducted a thorough exploratory analysis to demonstrate the method's applicability and a user study to understand the usability of code-based explanations. Our evaluation demonstrates that CodeQ is a powerful interpretability method for explaining how (less) meaningful input concepts (e.g., the natural language particle `at') strongly impact output generation (e.g., code conditionals). Moreover, participants in this study highlighted CodeQ's ability to show a causal relationship between the input and output of the model, with readable and informative explanations on code completion and test generation tasks. Additionally, CodeQ helps to uncover the model's rationale, facilitating comparison with a human rationale to promote a fair level of trust and distrust in the model.
This repository is an online appendix to the ICSE '25 paper titled "Why is Accuracy Not Enough for Interpretability? On Rationalizing Language Models For Code." It includes expanded material from the evaluation and links to the data and code.
Below we publish links to the CodeQ artifacts, including experimental notebooks, scripts, raw survey data, and analyses, together with the code repository that implements the interpretability analysis. We also explain the design of the code rationales approach and showcase an example from our survey, which demonstrates the approach's usability.
| Artifact | Repository Folder | Description |
|---|---|---|
| Documented Notebooks | results_analysis/rq1_exploratory_analysis | Statistical analysis for global explanations; includes extended figures with rationales from different datasets |
| User study analysis | results_analysis/rq2_user_study | Spreadsheets with participant answers and statistical summarization |
| Source Code | nbs | nbdev format notebooks with the code rationales experimentation |
| Source Code | code_rationales | Generated code by nbdev as a python library |
| Source Code | scripts | External libraries and utilities for running rationales experiments |
| Models | Upon paper acceptance | Pre-trained models made compatible with CodeQ (e.g., BART and CodeParrot) |
| Testbeds | Upon paper acceptance | Test sets and generated code |
The folder results_analysis/rq1_exploratory_analysis contains several notebooks for the exploratory analysis of code and test generation.
- Exploratory Data Analysis - Frequency of NL and SC
- Exploratory Data Analysis - Distribution of Meaningful and Meaningless Concepts
- Exploratory Data Analysis - Distribution of Rationales Probabilities Across Different Datasets
- Exploratory Data Analysis - Proportionality of NL and SC
- Exploratory Data Analysis - Dependencies between rationales and targets
- Test generation - Local Analysis
- Test generation - Global Analysis
After running the experiments across the suggested datasets (testbeds), we registered and stored the different analyses for each set. The captures folders under rq1_exploratory_analysis/code_completion and rq1_exploratory_analysis/test_generation contain snapshots of the statistical analyses (e.g., distributions, heatmaps, histograms). For example, the image below depicts a global analysis of rationales using a heatmap, which connects the input rationales with the generated code. Code tokens were clustered by concepts at AST levels 1 and 2.
This heatmap shows the impact of the input rationale concepts on the concepts of the generated code.
The following heatmap demonstrates the impact of the input-output of the decoder transformer (BART) for test case generation. We applied CodeQ to the decoder part of BART to interpret a generated test. Note the relevance of the focal method as a rationale concept compared with other context such as the class, fields, or constructors. For instance, we observe that `fields' only impact declarations, test declarations, and blocks. More examples can be found in the BART decoder notebook analysis.
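As a rough illustration of how such concept-level heatmaps can be assembled, the sketch below averages token-level rationale probabilities by input and output concept and plots the result; the column names, concept labels, and probability values are illustrative assumptions rather than the repository's exact schema.

```python
# Minimal sketch: aggregate token-level rationale scores into a
# concept-by-concept heatmap (column names and values are hypothetical).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Each row links one input rationale token to one generated token,
# with its rationale probability and the concept of each side.
records = pd.DataFrame({
    "rationale_concept": ["identifier", "identifier", "keyword", "noun"],
    "target_concept":    ["conditional", "block", "conditional", "declaration"],
    "probability":       [0.42, 0.18, 0.31, 0.09],
})

# Average rationale probability for every (input concept, output concept) pair.
pivot = records.pivot_table(
    index="rationale_concept",
    columns="target_concept",
    values="probability",
    aggfunc="mean",
).fillna(0.0)

sns.heatmap(pivot, annot=True, cmap="viridis")
plt.xlabel("Generated code concepts")
plt.ylabel("Input rationale concepts")
plt.tight_layout()
plt.show()
```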
This folder contains raw data from our user study and CSVs where we aggregated the results and performed statistical analysis based on our research questions.
- Raw data of user responses in CSV format: This file contains the raw data of every user's responses in CSV format. Each column in the CSV records the raw answers to each question we asked in the survey. Users' personal information is omitted for privacy reasons.
- Collection of all the user responses and statistical analysis from Qualtrics: This file contains all the user responses grouped by the standard statistical analyses performed using Qualtrics features.
- Taxonomy of error cases analysis: In this file, we present our error-case analysis of model predictions. We categorize the different types of errors in the model predictions using existing literature and show samples for each error type.
- Survey Evaluation based on our metrics including demographic information: This file contains several sheets with the evaluation of the survey based on our metrics. The sheets correspond to demographics, usefulness, reliability, readability, and alignment. The final sheet aggregates all the data in a single table.
CodeQ comprises four steps that construct an interpretability tensor from the matrix representation of the input and output token sets and their relationships.
- Stating Preconditions: The first step involves preparing the necessary conditions for using the method to interpret an LMC. This includes making the model compatible using a specific algorithm and structuring "interpretable concepts," which are meant to introduce understandable concepts related to the model's input-output. These concepts are tailored to software engineering (SE) tasks. Two types of interpretability concepts are proposed: one based on the Abstract Syntax Tree (AST) for code generation, and the other on focal methods for test case generation.
- Constructing Testbeds: The second step builds testbeds by selecting and configuring the model's input, depending on the SE task and interpretability concepts. For example, prompts are used to build a testbed for code generation, and the generated code is concatenated with the prompt to form a new testbed for applying CodeQ, which is referred to as a set of generated snippets.
- Building Interpretability Tensors: The third step involves applying the CodeQ method, which is designed to interpret predictions made by language models. CodeQ is compatible with both encoder-decoder and decoder-only models and introduces three mathematical components to transform tokens from a snippet into an "interpretability tensor".
- The interpretability approach uses the tensor $\Phi$ to generate local post hoc explanations, such as dependency maps. These maps reveal three levels of human-interpretable concepts:
  - $L_1$: fine-grain level rationales (i.e., tokens),
  - $L_2$: concept rationales (e.g., noun, preposition, identifier),
  - $L_3$: modality (e.g., natural language or programming language).
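To make these levels concrete, here is a minimal sketch of how token-level rationale probabilities ($L_1$) could be rolled up into concept-level ($L_2$) and modality-level ($L_3$) views; the tokens, taggers, and probabilities are hypothetical stand-ins, not the library's actual API.

```python
# Minimal sketch of the three aggregation levels; CONCEPTS and MODALITY are
# hypothetical stand-ins for the AST/part-of-speech taggers used in practice.
from collections import defaultdict

# L1: fine-grain rationales -- (input token, rationale probability) pairs
# explaining a single generated token.
l1_rationales = [("if", 0.35), ("at", 0.20), ("index", 0.30), ("the", 0.15)]

CONCEPTS = {"if": "keyword", "index": "identifier", "at": "preposition", "the": "determiner"}
MODALITY = {"keyword": "programming language", "identifier": "programming language",
            "preposition": "natural language", "determiner": "natural language"}

def aggregate(pairs, tag):
    """Sum rationale probabilities under a coarser grouping function."""
    totals = defaultdict(float)
    for token, prob in pairs:
        totals[tag(token)] += prob
    return dict(totals)

l2 = aggregate(l1_rationales, lambda t: CONCEPTS[t])            # concept level (L2)
l3 = aggregate(l1_rationales, lambda t: MODALITY[CONCEPTS[t]])  # modality level (L3)

print(l2)  # per-concept rationale mass, e.g., keyword vs. identifier vs. preposition
print(l3)  # programming-language vs. natural-language rationale mass
```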
Additionally, the interpretability tensor can be explored further to generate post hoc global explanations.
- RQ1 [Applicability]: How applicable is CodeQ to globally interpret code generation? This RQ explores the use of CodeQ in creating understandable explanations for the behavior of language models in code generation tasks. The hypothesis is that greedy rationalization can identify key rationales leading to predictions, providing insights into the model's prediction dependencies (a minimal sketch of greedy rationalization follows this list).
- RQ2 [Usability]: How useful is CodeQ in practical settings? This RQ assesses the practical usefulness of CodeQ through a user study, evaluating qualitative metrics such as usefulness, reliability, readability, and how well CodeQ helps in aligning language models.
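As mentioned in RQ1, greedy rationalization searches for a small subset of context tokens that is sufficient for the model to reproduce its prediction. The sketch below captures the core loop under two hypothetical callables, `predict_prob(subset, target)` and `argmax_token(subset)`, that would query a compatibilized model on a token subset; it is a simplified illustration, not the repository's implementation.

```python
# Minimal sketch of greedy rationalization. `predict_prob` and `argmax_token`
# are hypothetical callables that query the (compatibilized) model using only
# the given subset of context tokens.
def greedy_rationale(context_tokens, target, predict_prob, argmax_token):
    """Greedily grow a subset of context tokens until the model, conditioned
    only on that subset, predicts `target`; returns the rationale indices."""
    selected, remaining = set(), set(range(len(context_tokens)))
    while remaining:
        # Add the candidate token that most increases P(target | subset).
        best = max(
            remaining,
            key=lambda i: predict_prob(
                [context_tokens[j] for j in sorted(selected | {i})], target
            ),
        )
        selected.add(best)
        remaining.discard(best)
        subset = [context_tokens[j] for j in sorted(selected)]
        if argmax_token(subset) == target:  # the subset alone recovers the prediction
            break
    return sorted(selected)
```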
We propose two concept aggregation sets: one based on the Abstract Syntax Tree (AST) for code generation and one based on focal methods for test case generation.
The figure below shows the cumulative probability of concepts per dataset. We observe that the datasets with docstrings (DC) grow faster than those with only source code (signature (SG) and body (BD)).

The following figure shows the size of focal methods (in blue) and focal test methods (in orange) used in our analysis. The average size of a focal method in terms of tokens is between
In this image, we show one of the samples presented in the survey. The sample was selected based on our analysis of error cases. The image depicts our technique of presenting rationales behind predictions and also captures whether the users agree with the generated rationales. By exposing users to CodeQ, we assessed the informativeness and readability of our diagrams. The remaining samples in the survey follow the same structure.
@misc{CodeQ2025,
title={CodeQ: On Explaining (Large) Language Models For Code Using Global Code-Based Explanations},
author={David N. Palacio and Dipin Khati and Daniel Rodriguez-Cardenas and Alejandro Velasco and Denys Poshyvanyk},
year={2025},
eprint={2503.16771},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2503.16771},
}





