arxiv:2606.29088

Diff-Based Code Corruption using LLMs for Large-Scale Bugfix Benchmarking

Published on Jun 27

Abstract

A large-scale bugfixing benchmark called MegaBugFix is introduced, containing 12,629 Python programs with bugs injected via diff representations, revealing more challenging bugs than existing benchmarks.

There are various benchmarks to evaluate bugfixing capabilities of Large Language Models. However, most widespread benchmarks do not fully reflect real-world bugfixing practices. They are small, weakening statistical reliability, and the buggy programs are often similar to one another, potentially distorting evaluation results. The range of bug types can also be narrow, failing to capture a representative range of bugs. To address these issues, we introduce MegaBugFix, a large-scale bugfixing benchmark containing 12,629 buggy Python programs synthesized from correct ones by a Large Language Model. Bug injections were generated as diffs representing code changes. Through this approach, we were able to avoid common pitfalls of LLM-based mutation techniques like injecting overly simplistic bugs or failing to modify the input program. We evaluated 13 open-weight models on MegaBugFix and baseline benchmarks, finding consistently lower performance on MegaBugFix. This reveals that our benchmark presents more challenging bugs and exposes model failures that may remain hidden when evaluating on existing benchmarks. The benchmark and fine-tuned model used for bug injection are available at hf.co/collections/szalontaib/megabugfix.
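The abstract does not describe the injection pipeline in detail, so the following is only a minimal sketch of the general idea of representing a bug injection as a diff and rejecting the failure mode it mentions, where the input program is left unmodified. The function names (`make_diff`, `changed_line_count`, `is_valid_injection`) and the toy `total` program are hypothetical and are not taken from MegaBugFix. The syntax check is an added assumption, not a step the paper states.

```python
import ast
import difflib


def make_diff(correct: str, buggy: str) -> str:
    """Return a unified diff that turns `correct` into `buggy`."""
    return "".join(
        difflib.unified_diff(
            correct.splitlines(keepends=True),
            buggy.splitlines(keepends=True),
            fromfile="correct.py",
            tofile="buggy.py",
        )
    )


def changed_line_count(diff: str) -> int:
    """Count added and removed lines, skipping the ---/+++ file headers.

    Simplification: a changed line whose own text starts with "--" or "++"
    would be mistaken for a header and skipped.
    """
    return sum(
        1
        for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("---", "+++"))
    )


def is_valid_injection(correct: str, buggy: str) -> bool:
    """Accept a candidate only if it changes the program and still parses."""
    if changed_line_count(make_diff(correct, buggy)) == 0:
        return False  # the "failed to modify the input program" case
    try:
        ast.parse(buggy)  # assumed filter: keep only syntactically valid code
    except SyntaxError:
        return False
    return True


# Toy example (hypothetical): replace an accumulation with an assignment.
CORRECT = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)
BUGGY = CORRECT.replace("acc += x", "acc = x")

if __name__ == "__main__":
    print(make_diff(CORRECT, BUGGY))
    print(is_valid_injection(CORRECT, BUGGY))
```

A check like this only confirms that the text changed and still parses. Confirming that the change is an actual bug would additionally require running the program against tests, which this sketch leaves out.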
