
Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies

Abstract

Large Language Models (LLMs) have attracted remarkable interest in industry and academia. Yet conducting empirical studies with LLMs remains challenging and raises questions about how to achieve reproducible results.

We studied 85 articles describing LLM-centric studies published at ICSE 2024 and ASE 2024. Of the 85 articles, 18 provided research artefacts and used OpenAI models. We attempted to replicate those 18 studies, but only five were sufficiently complete and executable. For none of the five were we able to fully reproduce the results.

Our results highlight the need for stricter research artefact evaluations and more robust study designs to ensure the reproducibility of future publications.