Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier
- Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 2474–2475
Abstract
Cross-lingual code clone detection has gained attention in software development due to the use of multiple programming languages. Recent advances in machine learning, particularly Large Language Models (LLMs), have motivated a reexamination of this problem. This paper evaluates the performance of four LLMs and eight prompts for detecting cross-lingual code clones, as well as a pre-trained embedding model for classifying clone pairs. Both approaches are tested on the XLCoST and CodeNet datasets. Our findings show that while LLMs achieve high F1 scores (up to 0.98) on straightforward programming examples, they struggle with complex cases and cross-lingual understanding. In contrast, embedding models, which map code fragments from different languages into a common representation space, allow for the training of a basic classifier that outperforms LLMs by approximately 2 and 24 percentage points on the XLCoST and CodeNet datasets, respectively. This suggests that embedding models provide more robust representations, enabling state-of-the-art performance in cross-lingual code clone detection.
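The embedding-based approach the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function here is a hypothetical stand-in for the pre-trained cross-lingual embedding model (it hashes character trigrams so the example runs without any model download), and the similarity-threshold decision rule is one assumed form of the "basic classifier" trained on the shared representation space.

```python
import numpy as np

def embed(code: str, dim: int = 64) -> np.ndarray:
    # Hypothetical stand-in for a pre-trained cross-lingual code
    # embedding model: maps a code fragment to a unit vector in a
    # shared space. Here we simply count hashed character trigrams.
    vec = np.zeros(dim)
    for i in range(len(code) - 2):
        vec[hash(code[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Vectors are unit-normalized, so the dot product is the cosine.
    return float(np.dot(a, b))

def is_clone(code_a: str, code_b: str, threshold: float = 0.5) -> bool:
    # Assumed "basic classifier": declare a clone pair when the two
    # fragments lie close enough together in the shared space.
    return cosine(embed(code_a), embed(code_b)) >= threshold

# Cross-language pair: the same logic in Python and (as a string) in C.
py_code = "def add(a, b):\n    return a + b"
c_code = "int add(int a, int b) { return a + b; }"
print(is_clone(py_code, c_code))
```

In the paper's setting, a learned classifier (rather than a fixed threshold) is trained on labeled clone pairs from XLCoST and CodeNet; the key property exploited is the same one shown here, namely that the embedding model places semantically equivalent fragments from different languages near each other.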
Keywords
Cross-Language Pairs, Code Clone Detection, Large Language Model, Prompt Engineering, Embedding Model