Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier
- Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 2474–2475
Abstract
Cross-lingual code clone detection has gained attention in software development due to the use of multiple programming languages. Recent advances in machine learning, particularly Large Language Models (LLMs), have motivated a reexamination of this problem. This paper evaluates the performance of four LLMs and eight prompts for detecting cross-lingual code clones, as well as a pre-trained embedding model for classifying clone pairs. Both approaches are tested on the XLCoST and CodeNet datasets. Our findings show that while LLMs achieve high F1 scores (up to 0.98) on straightforward programming examples, they struggle with complex cases and cross-lingual understanding. In contrast, embedding models, which map code fragments from different languages into a common representation space, allow for the training of a basic classifier that outperforms LLMs by approximately 2 and 24 percentage points on the XLCoST and CodeNet datasets, respectively. This suggests that embedding models provide more robust representations, enabling state-of-the-art performance in cross-lingual code clone detection.
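The embedding-based approach the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function here is a hypothetical stand-in for the pre-trained cross-lingual embedding model (it hashes character trigrams so the example runs without any model download), and the similarity-threshold decision rule is one assumed form of the "basic classifier" trained on the shared representation space.

```python
import numpy as np

def embed(code: str, dim: int = 64) -> np.ndarray:
    # Hypothetical stand-in for a pre-trained cross-lingual code
    # embedding model: maps a code fragment to a unit vector in a
    # shared space. Here we simply count hashed character trigrams.
    vec = np.zeros(dim)
    for i in range(len(code) - 2):
        vec[hash(code[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Vectors are unit-normalized, so the dot product is the cosine.
    return float(np.dot(a, b))

def is_clone(code_a: str, code_b: str, threshold: float = 0.5) -> bool:
    # Assumed "basic classifier": declare a clone pair when the two
    # fragments lie close enough together in the shared space.
    return cosine(embed(code_a), embed(code_b)) >= threshold

# Cross-language pair: the same logic in Python and (as a string) in C.
py_code = "def add(a, b):\n    return a + b"
c_code = "int add(int a, int b) { return a + b; }"
print(is_clone(py_code, c_code))
```

In the paper's setting, a learned classifier (rather than a fixed threshold) is trained on labeled clone pairs from XLCoST and CodeNet; the key property exploited is the same one shown here, namely that the embedding model places semantically equivalent fragments from different languages near each other.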
Keywords
Cross-Language Pairs, Code Clone Detection, Large Language Model, Prompt Engineering, Embedding Model