While Large Language Models (LLMs) have shown promise in various software engineering tasks, the depth of their understanding of code semantics remains unclear. This paper introduces a novel methodology for probing the semantic understanding of LLMs by subjecting them to a rigorous test: identifying trivial equivalent mutations in C code generated by Csmith. The diversity and complexity of Csmith-generated programs allow us to stress-test the LLM's comprehension of code semantics.
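As a purely illustrative sketch (not an example from the paper, and only loosely imitating Csmith's coding style), the following hypothetical pair shows the kind of trivial equivalent mutation an LLM might be asked to judge: a multiplication rewritten as a shift-and-add that computes the same value for every input.

```c
#include <stdint.h>
#include <stdio.h>

/* Original function (hypothetical, Csmith-like naming). */
static uint32_t func_1(uint32_t g_2) {
    uint32_t l_3 = (g_2 ^ 0x5u) * 3u;
    return l_3 + g_2;
}

/* Mutated function: `* 3u` rewritten as `(t << 1) + t`, which is
 * identical for uint32_t, so the mutation is semantically equivalent. */
static uint32_t func_1_mut(uint32_t g_2) {
    uint32_t t = g_2 ^ 0x5u;
    uint32_t l_3 = (t << 1) + t;   /* same as t * 3u */
    return l_3 + g_2;
}

int main(void) {
    /* Spot-check equivalence on a few inputs. */
    for (uint32_t i = 0; i < 5; i++)
        printf("%u %u\n", func_1(i), func_1_mut(i));
    return 0;
}
```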
Our evaluation focuses on the LLM's ability to recognize semantic equivalence, provide sound justifications, and produce counterexamples when it judges programs not to be equivalent. Through these experiments, we aim to shed light on both the limitations and the potential of current LLMs in understanding code semantics, paving the way for future advances in AI-assisted software development.
Keywords: LLM, code understanding, mutation equivalence