Academic study project on JavaScript code duplication using AST parsing with text similarity.
Run:
make init
clone-analisys <PATH> <SIMILARITY INDEX>
// clone-analisys src/api-server 0.85
We select a piece of code and convert it into an Abstract Syntax Tree (AST). A cleaning and normalization phase follows, in which we remove unwanted attributes and standardize equivalent structures, for example by treating an arrow function as a regular function.
// both code snippets are classified as type 2 clones
const arrowFunction = (value) => {
  const { type } = value
  return type
}
function regularFunction(value) {
  // this is a regular function
  const { type } = value
  return type
};
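The cleaning and normalization phase described above can be sketched on plain ESTree-style objects. This is a minimal illustration, not the project's actual implementation: the list of position keys, the helper names (`stripPositions`, `arrowToFunctionDeclaration`, `normalize`) and the placeholder function name `fn` are all assumptions, and the two input trees are hand-written and simplified instead of coming from a parser.

```javascript
// Keys treated as "unwanted attributes" here; this list is an assumption.
const POSITION_KEYS = new Set(['start', 'end', 'loc', 'range']);

// Recursively copy an ESTree-style node, dropping position metadata.
function stripPositions(node) {
  if (Array.isArray(node)) return node.map(stripPositions);
  if (node === null || typeof node !== 'object') return node;
  const copy = {};
  for (const [key, value] of Object.entries(node)) {
    if (!POSITION_KEYS.has(key)) copy[key] = stripPositions(value);
  }
  return copy;
}

// Rewrite `const name = (params) => { ... }` as `function name(params) { ... }`.
// Only handles a single declarator whose arrow function has a block body.
function arrowToFunctionDeclaration(node) {
  const decl = node.type === 'VariableDeclaration' && node.declarations.length === 1
    ? node.declarations[0]
    : null;
  const init = decl && decl.init;
  if (!init || init.type !== 'ArrowFunctionExpression' || init.body.type !== 'BlockStatement') {
    return node;
  }
  return {
    type: 'FunctionDeclaration',
    id: decl.id,
    params: init.params,
    body: init.body,
    generator: false,
    async: Boolean(init.async),
  };
}

// Hypothetical pipeline: clean, standardize, then replace the function name
// with a placeholder so that differently named functions compare equal.
function normalize(node) {
  const fn = arrowToFunctionDeclaration(stripPositions(node));
  if (fn.type !== 'FunctionDeclaration') return fn;
  return { ...fn, id: { type: 'Identifier', name: 'fn' } };
}

// Hand-written, simplified trees for the two shapes of function.
const param = { type: 'Identifier', name: 'value' };
const body = {
  type: 'BlockStatement',
  body: [{ type: 'ReturnStatement', argument: { type: 'Identifier', name: 'value' } }],
};
const arrowAst = {
  type: 'VariableDeclaration', kind: 'const', start: 0, end: 40,
  declarations: [{
    type: 'VariableDeclarator',
    id: { type: 'Identifier', name: 'arrowFunction' },
    init: { type: 'ArrowFunctionExpression', params: [param], body, async: false },
  }],
};
const regularAst = {
  type: 'FunctionDeclaration', start: 41, end: 90,
  id: { type: 'Identifier', name: 'regularFunction' },
  params: [param], body, generator: false, async: false,
};

const sameAfterNormalization =
  JSON.stringify(normalize(arrowAst)) === JSON.stringify(normalize(regularAst));
console.log(sameAfterNormalization);
```

The rewrite ignores the semantic differences between arrow and regular functions (such as `this` binding), which is acceptable only because the goal is structural similarity, not behavior-preserving transformation.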
Several well-established libraries can parse code snippets into an AST:
Library | Version |
---|---|
espree | 7.3.1 |
@babel/parser | 7.14.7 |
abstract-syntax-tree | 2.19.1 |
This project uses abstract-syntax-tree because it offers more convenient utilities for manipulating an AST.
To compare ASTs, the current version had two options: (i) compare the raw ASTs directly, which only reports whether they are identical or not; or (ii) convert the ASTs to text (strings) and use libraries that measure the textual similarity between the code snippets.
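The difference between the two options can be sketched without any library. The helpers below (`astEquals`, `astToText`) are hypothetical stand-ins written for illustration; they are not the API of ast-compare or of any string library, and the sorted-key text format is an assumption.

```javascript
// Option (i): strict structural equality. The answer is only true or false.
function astEquals(a, b) {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== 'object' || typeof b !== 'object') return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (key) => Object.prototype.hasOwnProperty.call(b, key) && astEquals(a[key], b[key]),
  );
}

// Option (ii), first half: serialize a tree to text with a stable key order,
// so the result can be handed to any string similarity function.
function astToText(node) {
  if (Array.isArray(node)) return `[${node.map(astToText).join(',')}]`;
  if (node === null || typeof node !== 'object') return JSON.stringify(node);
  const fields = Object.keys(node)
    .sort()
    .map((key) => `${key}:${astToText(node[key])}`);
  return `{${fields.join(',')}}`;
}

const returnsType = { type: 'ReturnStatement', argument: { type: 'Identifier', name: 'type' } };
const returnsKind = { type: 'ReturnStatement', argument: { type: 'Identifier', name: 'kind' } };

console.log(astEquals(returnsType, returnsType)); // true
console.log(astEquals(returnsType, returnsKind)); // false, with no hint of how close they are
console.log(astToText(returnsType));
// {argument:{name:"type",type:"Identifier"},type:"ReturnStatement"}
```

Sorting the keys keeps the text independent of property insertion order, which matters once normalization starts rebuilding nodes.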
Library | Version | Type |
---|---|---|
ast-compare | 2.1.0 | Compare ASTs |
string-similarity | 4.0.4 | Compare strings |
string-comparison | 1.0.9 | Compare strings |
Comparing ASTs directly seems the most coherent choice, but so far the ast-compare library can only tell whether two snippets are identical or not. Even so, the AST still gives us a uniform, easy-to-manipulate representation for pre-processing and normalization, after which it is converted to text and compared as a string.
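To show what a graded textual score looks like, here is a minimal sketch of a Dice coefficient over character bigrams. It is an independent illustration written for this README, not the code of string-similarity, so its scores may differ from the library's; the function names are our own.

```javascript
// Count the character bigrams of a string, e.g. "abc" -> { ab: 1, bc: 1 }.
function bigramCounts(text) {
  const counts = new Map();
  for (let i = 0; i < text.length - 1; i += 1) {
    const pair = text.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b).
// Returns a value between 0 (nothing shared) and 1 (identical).
function diceSimilarity(a, b) {
  if (a === b) return 1;
  if (a.length < 2 || b.length < 2) return 0;
  const countsA = bigramCounts(a);
  const countsB = bigramCounts(b);
  let shared = 0;
  for (const [pair, countA] of countsA) {
    shared += Math.min(countA, countsB.get(pair) || 0);
  }
  return (2 * shared) / (a.length - 1 + (b.length - 1));
}

console.log(diceSimilarity('night', 'nacht')); // 0.25 (only "ht" is shared)
console.log(diceSimilarity('return type', 'return type')); // 1
```

Unlike a strict AST comparison, the result is a number, so two snippets that differ in a few tokens still receive a high score instead of a plain "not identical".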
Using the example code snippets above, we get:
ast-compare: false
string-similarity (Dice): 0.925351071692535
string-comparison (Cosine): 0.9672041516493517
string-comparison (Levenshtein): 0.9072164948453608
string-comparison (Longest Common Subsequence): 0.9357933579335793
string-comparison (Metric Longest Common Subsequence): 0.9337260677466863
ast-compare: true
string-similarity (Dice): 1
string-comparison (Cosine): 1
string-comparison (Levenshtein): 1
string-comparison (Longest Common Subsequence): 1
string-comparison (Metric Longest Common Subsequence): 1
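One way to read these numbers against the similarity index passed on the command line is a simple threshold check. The sketch below assumes the index acts as a minimum score, which this README does not state explicitly; `isCloneCandidate` is a hypothetical helper, and the scores are copied from the first set of results above.

```javascript
// Hypothetical decision rule: a pair counts as a clone candidate
// when its similarity score reaches the chosen index.
function isCloneCandidate(score, similarityIndex) {
  return score >= similarityIndex;
}

const similarityIndex = 0.85; // value from the usage example
const firstRunScores = {
  dice: 0.925351071692535,
  cosine: 0.9672041516493517,
  levenshtein: 0.9072164948453608,
  lcs: 0.9357933579335793,
  mlcs: 0.9337260677466863,
};

for (const [metric, score] of Object.entries(firstRunScores)) {
  console.log(metric, isCloneCandidate(score, similarityIndex)); // true for every metric
}
```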
To learn more about the issues addressed, read: ESTUDO EMPÍRICO SOBRE DUPLICAÇÃO DE CÓDIGO EM APLICAÇÕES REACT.JS (An Empirical Study on Code Duplication in React.js Applications).