<figcaption style="text-align: center;">Overview of the refactoring problem. A refactoring task comprises a set of files. We refactor the files by designing a new library. Candidate refactorings are evaluated based on a refactoring metric and are expected to maintain correctness of the original code sources (pass rate). We explore several refactoring metrics in this paper.</figcaption>
<h3>Asymptotic behavior of metrics in large-sample regime</h3>
<figcaption style="text-align: center;">Asymptotic behavior of metrics for scoring libraries and refactorings. MDL produces libraries with higher function reuse compared to other metrics.</figcaption>
</figure>
<h3>Which refactoring metric do humans agree with the most?</h3>
<p>We perform a human study to corroborate these findings, using the exact same CodeContests clusters. The study compares Tokens, MDL, and Maintainability Index by (1) refactoring clusters into libraries, (2) presenting participants with the original sources and their refactorings under pairs of metrics, and (3) eliciting pairwise preferences from the participants.</p>
<p>Humans prefer MDL-minimizing libraries: although the preference is only statistically significant for MDL vs. MI, the data suggest a rank-order preference of <strong style="color: #3a7bd5;">MDL > Tokens > MI</strong>. We recruited 14 participants (eliciting 129 judgments) and already observe a general preference for compression-based metrics (MDL and Tokens), with only MDL crossing the threshold of statistical significance.</p>
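<p>As an illustration (not the study's actual analysis code), the sketch below checks one such pairwise comparison for significance with a two-sided sign test against a 50/50 null of "no preference"; the win/loss tallies are hypothetical placeholders, not the raw study data.</p>
<pre><code class="language-python"># Minimal sketch: two-sided sign test on pairwise preference counts.
# The tallies below are hypothetical placeholders, not the study's raw data.
from scipy.stats import binomtest

wins, losses = 30, 13  # e.g., times MDL's refactoring was preferred over MI's

result = binomtest(wins, n=wins + losses, p=0.5, alternative="two-sided")
print(f"p-value: {result.pvalue:.4f}")  # compare against a significance level such as 0.05
</code></pre>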
<p>Competition problems are crafted with specific variations of algorithmic approaches in mind, which yields both shared latent concepts across solutions and the test cases needed to check them. As a result, competition coding is both verifiable and ready to refactor. We therefore take solutions, prompts, and tests from CodeContests, a competition programming dataset.</p>
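<p>For concreteness, the sketch below shows one way to pull problem statements, tests, and solutions from the publicly released CodeContests dataset on the Hugging Face Hub; the dataset id and field names follow its dataset card and are not necessarily the exact pipeline used here.</p>
<pre><code class="language-python"># Minimal sketch: stream CodeContests problems with their prompts, tests,
# and reference solutions. Dataset id and field names follow the
# deepmind/code_contests dataset card; adjust if using a different mirror.
from datasets import load_dataset

ds = load_dataset("deepmind/code_contests", split="train", streaming=True)

for problem in ds:
    prompt = problem["description"]                 # natural-language problem statement
    tests = problem["public_tests"]                 # dict of parallel "input"/"output" lists
    solutions = problem["solutions"]["solution"]    # list of solution sources (mixed languages)
    print(prompt[:80], len(solutions), "solutions")
    break  # inspect a single problem
</code></pre>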
<h4>Huggingface 🤗 Transformers Library</h4>
<p>We test refactoring across implementations of large language and vision–language models from the Huggingface Transformers repository (modeling_<name>.py files, e.g., Qwen2, LLaMA, DeepSeek-V3). Unlike competition coding, these sources are production-scale, and Huggingface requires that all changes pass an extensive suite of integration tests before they are merged into the main branch. A refactoring is only deemed correct if it passes the unmodified Transformers test suite, making this a high-stakes setting that demands both correctness and compatibility.</p>
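<p>As a rough sketch of this correctness gate (the exact test selection may differ), a candidate refactoring can be checked by running the repository's own pytest suite for the touched models against an unmodified tests directory:</p>
<pre><code class="language-python"># Minimal sketch: gate a candidate refactoring on the unmodified Transformers
# test suite for the touched models. Run from a huggingface/transformers
# checkout with the refactoring applied; model directory names are illustrative.
import subprocess

def passes_model_tests(models):
    """Return True only if every listed model's test directory passes under pytest."""
    for model in models:
        proc = subprocess.run(["python", "-m", "pytest", f"tests/models/{model}", "-q"])
        if proc.returncode != 0:
            return False
    return True

print(passes_model_tests(["llama", "qwen2"]))
</code></pre>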
<h4>Huggingface 🤗 Diffusers Library</h4>
<p>We test refactoring across implementations of diffusion models from the Huggingface Diffusers repository (unet_<name>.py and scheduling_<name>.py files, e.g., the Stable Diffusion UNet, DDPMScheduler), yielding two distinct tasks. Like Transformers, Diffusers requires that all changes pass a comprehensive suite of integration tests before they are merged into the main branch.</p>
<h4>Logo & Date</h4>
<p>The library learning literature already has benchmarks of its own: typically, they seek to learn a single library from a task comprising many sources and then test that library on held-out program synthesis tasks. Logo and Date were used in the recent related work REGAL, which we incorporate wholesale to understand how our method compares to state-of-the-art library learning. The associated programming problems were created by humans, but their solutions were generated by gpt-3.5-turbo.</p>
</section>
<!-- <section>
<h3>Are these libraries useful for solving new, unseen programming problems?</h3>
<figcaption style="text-align: center;">Results for LIBRARIAN on 10 CodeContests tasks ($K=8, S=3$)</figcaption>
</div>
<div style="flex: 1; min-width: 300px;">
<table class="table-styled">
<thead>
<tr>
<th><strong>Dataset</strong></th>
<th><strong>Model</strong></th>
<th><strong>Pass Rate</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Logo</td>
<td>REGAL (gpt-3.5-turbo)</td>
<td>49.3% ±1.1</td>
</tr>
<tr>
<td>LIBRARIAN (gpt-3.5-turbo)</td>
<td>69.9% ±0.9</td>
</tr>
<tr>
<td rowspan="2">Date</td>
<td>REGAL (gpt-3.5-turbo)</td>
<td>90.2% ±0.5</td>
</tr>
<tr>
<td>LIBRARIAN (gpt-3.5-turbo)</td>
<td>94.7% ±0.7</td>
</tr>
</tbody>
</table>
<figcaption style="text-align: center;">Solving held-out program synthesis tasks using learned libraries</figcaption>
</div>
</div>
</figure>
<h3>How does LIBRARIAN perform on real-world refactoring tasks?</h3>
<p>The HuggingFace Transformers library is used by nearly 400k GitHub projects. We deploy LIBRARIAN on 10 source files, using Claude Code to sample $K=15$ refactorings per cluster of size $S=5$, on the expectation that an agent such as Claude Code would excel at repository-level edits. LIBRARIAN distilled repeated abstractions such as MLP, Attention, and Decoder classes and RoPE helper functions, lowering MDL to <strong style="color: #3a7bd5;">67.2% of its original value</strong> while still passing all integration tests. The top-3 refactorings by MDL contain an average of <strong style="color: #3a7bd5;">18 abstractions</strong> (functions and classes) in the library, each of which is called on average <strong style="color: #3a7bd5;">4.59 times</strong> in the refactored models.</p>
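<p>Schematically, the selection step above amounts to discarding sampled refactorings that fail the integration tests and ranking the survivors by the refactoring metric. The sketch below is illustrative only; the Candidate fields are assumptions, not LIBRARIAN's actual interfaces.</p>
<pre><code class="language-python"># Schematic of best-of-K selection: keep only candidates whose refactored
# sources still pass the unmodified test suite, then rank by the metric
# (lower is better, e.g. MDL). The Candidate type here is illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    refactored_sources: dict   # path -> refactored code
    passes_tests: bool         # outcome of running the unmodified test suite
    metric_value: float        # e.g. MDL of library + refactored sources

def select_best(candidates):
    """Return passing candidates, best (lowest metric value) first."""
    passing = [c for c in candidates if c.passes_tests]
    return sorted(passing, key=lambda c: c.metric_value)
</code></pre>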
<p>For Diffusers, scheduler clusters yielded top-3 MDL refactorings with an average of <strong style="color: #3a7bd5;">12.3 functions</strong> and <strong style="color: #3a7bd5;">3.0 calls per function</strong>, while UNet refactorings produced richer abstractions with an average of <strong style="color: #3a7bd5;">17.0 functions/classes</strong> and <strong style="color: #3a7bd5;">3.43 calls each</strong>.</p>
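<p>Reuse statistics like those above can be recomputed for any refactoring with a small AST pass; the sketch below counts the top-level functions and classes defined in a library file and how often each is called across the refactored sources (file paths are placeholders).</p>
<pre><code class="language-python"># Minimal sketch: count library abstractions and their call sites in the
# refactored sources. File paths are placeholders for one refactoring's output.
import ast
from pathlib import Path

def reuse_stats(library_path, refactored_paths):
    library_tree = ast.parse(Path(library_path).read_text())
    abstractions = {
        node.name
        for node in library_tree.body
        if isinstance(node, (ast.FunctionDef, ast.ClassDef))
    }
    calls = dict.fromkeys(abstractions, 0)
    for path in refactored_paths:
        for node in ast.walk(ast.parse(Path(path).read_text())):
            if isinstance(node, ast.Call):
                name = getattr(node.func, "id", getattr(node.func, "attr", None))
                if name in calls:
                    calls[name] += 1
    return calls

stats = reuse_stats("library.py", ["modeling_llama_refactored.py"])
print(len(stats), "abstractions,", sum(stats.values()) / max(len(stats), 1), "calls each")
</code></pre>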