<figcaption style="text-align: center;">Overview of the refactoring problem. A refactoring task comprises a set of files. We refactor the files by designing a new library. Candidate refactorings are evaluated based on a refactoring metric and are expected to maintain correctness of the original code sources (pass rate). We explore several refactoring metrics in this paper.</figcaption>
<h3>Asymptotic behavior of metrics in large-sample regime</h3>
<figcaption style="text-align: center;">Asymptotic behavior of metrics for scoring libraries and refactorings. MDL produces libraries with higher function reuse compared to other metrics.</figcaption>
</figure>
<h3>Which refactoring metric do humans agree with the most?</h3>
<p>We perform a human study to corroborate these findings, using the exact same CodeContests clusters. The study compares Tokens, MDL, and Maintainability Index by (1) refactoring clusters into libraries, (2) presenting participants with the original sources and their refactorings under pairs of metrics, and (3) eliciting pairwise preferences from the participants.</p>
<p>Humans prefer MDL-minimizing libraries: although the preference is only statistically significant for MDL vs. MI, the data suggest a rank-order preference of <strong style="color: #3a7bd5;">MDL > Tokens > MI</strong>. We recruited 14 participants (eliciting 129 judgments) and already observe a general preference for compression-based metrics (MDL and Tokens), with only MDL crossing the threshold of statistical significance.</p>
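<p>As an illustration (not the study's actual analysis code), the sketch below checks one such pairwise comparison for significance with a two-sided sign test against a 50/50 null of "no preference"; the win/loss tallies are hypothetical placeholders, not the raw study data.</p>
<pre><code class="language-python"># Minimal sketch: two-sided sign test on pairwise preference counts.
# The tallies below are hypothetical placeholders, not the study's raw data.
from scipy.stats import binomtest

wins, losses = 30, 13  # e.g., times MDL's refactoring was preferred over MI's

result = binomtest(wins, n=wins + losses, p=0.5, alternative="two-sided")
print(f"p-value: {result.pvalue:.4f}")  # compare against a significance level such as 0.05
</code></pre>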
<p>Competition problems are crafted with specific variations of algorithmic approaches in mind, which yields both shared latent concepts across solutions and the test cases needed to check them. As a result, competition coding is both verifiable and ready to refactor. We therefore take solutions, prompts, and tests from CodeContests, a competition programming dataset.</p>
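<p>For concreteness, the sketch below shows one way to pull problem statements, tests, and solutions from the publicly released CodeContests dataset on the Hugging Face Hub; the dataset id and field names follow its dataset card and are not necessarily the exact pipeline used here.</p>
<pre><code class="language-python"># Minimal sketch: stream CodeContests problems with their prompts, tests,
# and reference solutions. Dataset id and field names follow the
# deepmind/code_contests dataset card; adjust if using a different mirror.
from datasets import load_dataset

ds = load_dataset("deepmind/code_contests", split="train", streaming=True)

for problem in ds:
    prompt = problem["description"]                 # natural-language problem statement
    tests = problem["public_tests"]                 # dict of parallel "input"/"output" lists
    solutions = problem["solutions"]["solution"]    # list of solution sources (mixed languages)
    print(prompt[:80], len(solutions), "solutions")
    break  # inspect a single problem
</code></pre>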
<h4>Huggingface 🤗 Transformers Library</h4>
<p>We test refactoring across implementations of large language and vision–language models from the Huggingface Transformers repository (modeling_<name>.py files, e.g., Qwen2, LLaMA, DeepSeek-V3). Unlike competition coding, these sources are production-scale, and Huggingface requires that all changes pass an extensive suite of integration tests before they are merged into the main branch. A refactoring is only deemed correct if it passes the unmodified Transformers test suite, making this a high-stakes setting that demands both correctness and compatibility.</p>
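<p>As a rough sketch of this correctness gate (the exact test selection may differ), a candidate refactoring can be checked by running the repository's own pytest suite for the touched models against an unmodified tests directory:</p>
<pre><code class="language-python"># Minimal sketch: gate a candidate refactoring on the unmodified Transformers
# test suite for the touched models. Run from a huggingface/transformers
# checkout with the refactoring applied; model directory names are illustrative.
import subprocess

def passes_model_tests(models):
    """Return True only if every listed model's test directory passes under pytest."""
    for model in models:
        proc = subprocess.run(["python", "-m", "pytest", f"tests/models/{model}", "-q"])
        if proc.returncode != 0:
            return False
    return True

print(passes_model_tests(["llama", "qwen2"]))
</code></pre>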
<h4>Huggingface 🤗 Diffusers Library</h4>
<p>We test refactoring across implementations of diffusion models from the Huggingface Diffusers repository (unet_<name>.py and scheduling_<name>.py files, e.g., the Stable Diffusion UNet, DDPMScheduler), yielding two distinct tasks. Like Transformers, Diffusers requires that all changes pass a comprehensive suite of integration tests before they are merged into the main branch.</p>
<h4>Logo & Date</h4>
<p>The library learning literature already has benchmarks of its own: typically, they seek to learn a single library from a task comprising many sources and then test that library on held-out program synthesis tasks. Logo and Date were used in the recent related work REGAL, which we incorporate wholesale to understand how our method compares to state-of-the-art library learning. The associated programming problems were created by humans, but their solutions were generated by gpt-3.5-turbo.</p>
</section>
<!-- <section>
<h3>Are these libraries useful for solving new, unseen programming problems?</h3>
<figcaption style="text-align: center;">Results for LIBRARIAN on 10 CodeContests tasks ($K=8, S=3$)</figcaption>
</div>
<div style="flex: 1; min-width: 300px;">
<table class="table-styled">
<thead>
<tr>
<th><strong>Dataset</strong></th>
<th><strong>Model</strong></th>
<th><strong>Pass Rate</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Logo</td>
<td>REGAL (gpt-3.5-turbo)</td>
<td>49.3% ±1.1</td>
</tr>
<tr>
<td>LIBRARIAN (gpt-3.5-turbo)</td>
<td>69.9% ±0.9</td>
</tr>
<tr>
<td rowspan="2">Date</td>
<td>REGAL (gpt-3.5-turbo)</td>
<td>90.2% ±0.5</td>
</tr>
<tr>
<td>LIBRARIAN (gpt-3.5-turbo)</td>
<td>94.7% ±0.7</td>
</tr>
</tbody>
</table>
<figcaption style="text-align: center;">Solving held-out program synthesis tasks using learned libraries</figcaption>
</div>
</div>
</figure>
<h3>How does LIBRARIAN perform on real-world refactoring tasks?</h3>
<p>The HuggingFace Transformers library is used by nearly 400k GitHub projects. We deploy LIBRARIAN on 10 source files, using Claude Code to sample $K=15$ refactorings per cluster of size $S=5$, on the expectation that an agent such as Claude Code would excel at repository-level edits. LIBRARIAN distilled repeated abstractions such as MLP, Attention, and Decoder classes and RoPE helper functions, lowering MDL to <strong style="color: #3a7bd5;">67.2% of its original value</strong> while still passing all integration tests. The top-3 refactorings by MDL contain an average of <strong style="color: #3a7bd5;">18 abstractions</strong> (functions and classes) in the library, each of which is called on average <strong style="color: #3a7bd5;">4.59 times</strong> in the refactored models.</p>
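<p>Schematically, the selection step above amounts to discarding sampled refactorings that fail the integration tests and ranking the survivors by the refactoring metric. The sketch below is illustrative only; the Candidate fields are assumptions, not LIBRARIAN's actual interfaces.</p>
<pre><code class="language-python"># Schematic of best-of-K selection: keep only candidates whose refactored
# sources still pass the unmodified test suite, then rank by the metric
# (lower is better, e.g. MDL). The Candidate type here is illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    refactored_sources: dict   # path -> refactored code
    passes_tests: bool         # outcome of running the unmodified test suite
    metric_value: float        # e.g. MDL of library + refactored sources

def select_best(candidates):
    """Return passing candidates, best (lowest metric value) first."""
    passing = [c for c in candidates if c.passes_tests]
    return sorted(passing, key=lambda c: c.metric_value)
</code></pre>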
<p>For Diffusers, scheduler clusters yielded top-3 MDL refactorings with an average of <strong style="color: #3a7bd5;">12.3 functions</strong> and <strong style="color: #3a7bd5;">3.0 calls per function</strong>, while UNet refactorings produced richer abstractions with an average of <strong style="color: #3a7bd5;">17.0 functions/classes</strong> and <strong style="color: #3a7bd5;">3.43 calls each</strong>.</p>
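<p>Reuse statistics like those above can be recomputed for any refactoring with a small AST pass; the sketch below counts the top-level functions and classes defined in a library file and how often each is called across the refactored sources (file paths are placeholders).</p>
<pre><code class="language-python"># Minimal sketch: count library abstractions and their call sites in the
# refactored sources. File paths are placeholders for one refactoring's output.
import ast
from pathlib import Path

def reuse_stats(library_path, refactored_paths):
    library_tree = ast.parse(Path(library_path).read_text())
    abstractions = {
        node.name
        for node in library_tree.body
        if isinstance(node, (ast.FunctionDef, ast.ClassDef))
    }
    calls = dict.fromkeys(abstractions, 0)
    for path in refactored_paths:
        for node in ast.walk(ast.parse(Path(path).read_text())):
            if isinstance(node, ast.Call):
                name = getattr(node.func, "id", getattr(node.func, "attr", None))
                if name in calls:
                    calls[name] += 1
    return calls

stats = reuse_stats("library.py", ["modeling_llama_refactored.py"])
print(len(stats), "abstractions,", sum(stats.values()) / max(len(stats), 1), "calls each")
</code></pre>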