Biraj Upadhyaya and Dr. Samarjeet Borah
Abstract- Copying of programming assignments by students, especially at the undergraduate and postgraduate levels, is a common practice. Efficient mechanisms for detecting plagiarised code are therefore needed. Text-based plagiarism detection techniques do not work well with source code. In this paper we analyse a code-based plagiarism detection technique that is employed by plagiarism detection tools such as JPlag, MOSS and CodeMatch.
Introduction
The word plagiarism is derived from the Latin word plagiarius, which means kidnapper or abductor. In academia or industry, plagiarism refers to the act of copying material without acknowledging the original source [1]. Plagiarism is considered an ethical offence that may incur serious disciplinary action, such as a sharp reduction in marks or, in severe cases, expulsion from the university. Student plagiarism primarily falls into two categories: text-based plagiarism and code-based plagiarism. Instances of text-based plagiarism include word-for-word copying, paraphrasing, plagiarism of secondary sources, plagiarism of ideas, and blunt plagiarism or authorship plagiarism. Plagiarism is considered code-based when a student copies or modifies a program that is to be submitted for a programming assignment. Code-based plagiarism includes verbatim copying, changing comments, changing white space and formatting, renaming identifiers, reordering code blocks, changing the order of operators or operands in expressions, changing data types, adding redundant statements or variables, and replacing control structures with equivalent structures [2].
Background
Text-based plagiarism detection techniques do not work well when the input is a program. Experiments have suggested that text-based systems ignore coding syntax, an indispensable part of any programming construct, which is a serious drawback. To overcome this problem, code-based plagiarism detection techniques were developed. These techniques can be classified into two categories: attribute-oriented plagiarism detection and structure-oriented plagiarism detection.
Attribute oriented plagiarism detection systems measure properties of assignment submissions[3]. The following attributes are considered:
Number of unique operators
Number of unique operands
Total number of occurrences of operators
Total number of occurrences of operands
Based on the above attributes, the degree of similarity of two programs can be estimated.
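The four counts listed above can be gathered with a few lines of code. The following is a minimal sketch only: the operator set, the regular-expression lexer and the function names attribute_counts and attribute_distance are assumptions made for illustration, keywords are simply treated as operands, and the distance measure is one possible choice rather than the measure used by any particular system.

```python
import re
from collections import Counter

# Hypothetical, deliberately small operator set for C-like code.
OPERATORS = {"=", "+", "-", "*", "/", "<", ">", "++", "--", "==", "<=", ">="}

# Two-character operators come first so they win over their one-character prefixes.
LEXEME = re.compile(r"\+\+|--|==|<=|>=|[=+\-*/<>]|[A-Za-z_]\w*|\d+")


def attribute_counts(source):
    """Return (unique operators, unique operands, total operators, total operands)."""
    operators, operands = Counter(), Counter()
    for lexeme in LEXEME.findall(source):
        if lexeme in OPERATORS:
            operators[lexeme] += 1
        else:
            operands[lexeme] += 1  # identifiers, keywords and numbers
    return (len(operators), len(operands),
            sum(operators.values()), sum(operands.values()))


def attribute_distance(a, b):
    """Sum of absolute differences of the four counts (0 means identical profiles)."""
    return sum(abs(x - y) for x, y in zip(attribute_counts(a), attribute_counts(b)))


print(attribute_counts("x = x + 1"))                  # (2, 2, 2, 3)
print(attribute_distance("x = x + 1", "y = y + 1"))   # 0
```

Renaming an identifier leaves all four counts unchanged, so the two statements in the usage lines have a distance of 0. Adding a redundant statement, on the other hand, changes the counts immediately.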
Structure-oriented plagiarism detection systems deliberately ignore easily modifiable programming elements such as comments, additional white space and variable names. This makes them less susceptible to the addition of redundant information than attribute-oriented plagiarism detection systems. A student who is aware that this kind of system is deployed at his or her institution would rather complete the assignment independently than take on a tedious and time-consuming modification task.
Scalable Plagiarism Detection
Steven Burrows, in his paper "Efficient and Effective Plagiarism Detection for Large Code Repositories" [3], provided an algorithm for code-based plagiarism detection. The algorithm comprises the following steps: tokenization, index construction and querying.
Tokenization
Let us consider a simple C program:
#include <stdio.h>
int main( ) {
    int var;
    for (var=0; var<10; var++)   /* loop bound lost in the source; 10 is an assumed value */
    {
        printf("%d\n", var);
    }
    return 0;
}
Figure 1.0
Programming Construct    Token
int                      S
main                     N
for                      R
return                   g
(                        A
)                        B
{                        j
}                        l
=                        K
<                        J
+                        D
,                        E
ALPHANUM                 N
STRING                   5
Table 1.0: Token list for program in Figure 1.0.
Here ALPHANUM refers to any function name, variable name or variable value. STRING refers to character(s) enclosed in double quotes.
The corresponding token stream for the program in Figure 1.0 is given as
SNABjSNRANKNNJNNDDBjNA5ENBlgNl
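The mapping from source text to token stream can be sketched as follows. This is an illustration under stated assumptions, not Burrows' tokenizer: the regular-expression lexer and the names TOKEN_CODES and tokenize are hypothetical, the "<" entry is inferred from the token stream rather than read from the table, and the loop bound in the sample program is an assumed value (any number or name yields the same token).

```python
import re

# Token codes from Table 1.0. The "<" entry is an assumption: "J" is the code
# that appears at the loop comparison in the token stream.
TOKEN_CODES = {
    "int": "S", "main": "N", "for": "R", "return": "g",
    "(": "A", ")": "B", "{": "j", "}": "l",
    "=": "K", "<": "J", "+": "D", ",": "E",
}
ALPHANUM, STRING = "N", "5"

# A string literal, an identifier or number, or any single non-space character.
LEXEME = re.compile(r'"[^"\n]*"|\w+|\S')


def tokenize(source):
    """Translate C-like source text into a compact token stream."""
    stream = []
    for line in source.splitlines():
        if line.lstrip().startswith("#"):
            continue  # preprocessor lines are skipped in this sketch
        for lexeme in LEXEME.findall(line):
            if lexeme.startswith('"'):
                stream.append(STRING)
            elif lexeme in TOKEN_CODES:
                stream.append(TOKEN_CODES[lexeme])
            elif lexeme[0].isalnum() or lexeme[0] == "_":
                stream.append(ALPHANUM)
            # anything else (for example ";") carries no token and is dropped
    return "".join(stream)


# The program of Figure 1.0; the loop bound 10 is an assumed value.
PROGRAM = r'''#include <stdio.h>
int main( ) {
int var;
for (var=0; var<10; var++)
{
printf("%d\n", var);
}
return 0;
}'''

print(tokenize(PROGRAM))  # SNABjSNRANKNNJNNDDBjNA5ENBlgNl
```

Because every identifier and every value collapses to the same ALPHANUM code, renaming var or changing the constant leaves the token stream untouched.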
The token stream is now converted to an N-gram representation. In our case the value of N is chosen as 4. The corresponding 4-grams of the token stream are shown below:
SNAB NABj ABjS BjSN jSNR SNRA NRAN RANK ANKN NKNN KNNJ NNJN NJNN JNND NNDD NDDB DDBj DBjN BjNA jNA5 NA5E A5EN 5ENB ENBl NBlg BlgN lgNl
These 4-grams are generated using the sliding window technique. The sliding window technique generates N-grams by moving a “window” of size N across all parts of the string from left to right of the token stream.
The use of N-grams is an appropriate method of performing structural plagiarism detection because a local change to the source code affects only a few neighbouring N-grams. A modified version of a program will therefore retain a large percentage of unchanged N-grams, so the plagiarism remains easy to detect.
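The sliding window can be written in a couple of lines. The sketch below assumes a hypothetical helper named ngrams and uses the token stream of Figure 1.0; it also changes one token to show how few 4-grams a single edit disturbs.

```python
def ngrams(stream, n=4):
    """Slide a window of size n over the token stream, one position at a time."""
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]


stream = "SNABjSNRANKNNJNNDDBjNA5ENBlgNl"
grams = ngrams(stream)
print(len(grams))   # 27, i.e. t - n + 1 with t = 30 and n = 4
print(grams[:3])    # ['SNAB', 'NABj', 'ABjS']

# Replace the single token at position 10 ("K") with "D".
changed = stream[:10] + "D" + stream[11:]
unchanged = sum(a == b for a, b in zip(grams, ngrams(changed)))
print(unchanged)    # 23: only the 4 windows covering position 10 differ
```

A substitution of one token can affect at most N windows, which is why most of the N-grams of a lightly disguised program still match the original.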
Index Construction
The second step is to create an inverted index of these N-grams. An inverted index consists of a lexicon and, for each lexicon entry, an inverted list. An example is shown below:
Lexicon    Inverted List
Apple      1: 25,3
Orange     1: 26,2
Banana     1: 22,5
Mango      3: 31,1 33,3 15,2
Grapes     2: 24,6 26,1
Table 2.0: Inverted Index
Referring to the inverted index above, the entry for Mango shows that it occurs in three documents in the collection: once in document 31, three times in document 33 and twice in document 15. In the same way, 4-grams such as those of Figure 1.0 can be stored in an inverted index. The inverted index for five sample 4-grams is shown in Table 3.0.
Lexicon    Inverted List
5ENB       2: 1,1 2,2
A5EN       2: 1,1 2,2
ABjS       2: 1,1 2,1
ANKN       2: 1,1 2,1
BgNl       1: 2,1
………        ………
Table 3.0: Inverted Index
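An inverted index of this shape can be built with a dictionary of postings. The following is a minimal sketch: build_index and format_entry are hypothetical names, document 1 is the token stream of Figure 1.0, and document 2 is an invented token stream used only to make some entries appear in two documents.

```python
from collections import Counter, defaultdict


def ngrams(stream, n=4):
    """All overlapping n-grams of a token stream."""
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]


def build_index(documents, n=4):
    """Map each n-gram to its postings: {document id: occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, stream in documents.items():
        for gram, freq in Counter(ngrams(stream, n)).items():
            index[gram][doc_id] = freq
    return dict(index)


def format_entry(index, gram):
    """Render one lexicon entry as 'documents: doc,freq doc,freq ...'."""
    postings = index[gram]
    pairs = " ".join(f"{doc},{freq}" for doc, freq in sorted(postings.items()))
    return f"{len(postings)}: {pairs}"


documents = {
    1: "SNABjSNRANKNNJNNDDBjNA5ENBlgNl",  # token stream of Figure 1.0
    2: "SNABjNA5ENBNA5ENBgNl",            # hypothetical second program
}
index = build_index(documents)
print(format_entry(index, "5ENB"))  # 2: 1,1 2,2
print(format_entry(index, "BgNl"))  # 1: 2,1
```

The lexicon is the set of dictionary keys, and each value plays the role of the inverted list: the number of documents followed by document and frequency pairs.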
Querying
The next step is to query the index. Each query is the N-gram representation of a program. For a token stream of t tokens, the query contains (t − n + 1) N-grams, where n is the length of the N-gram. Each query returns the ten indexed programs most similar to the query program, organised from most similar to least similar. If the query program is itself one of the indexed programs, we would expect it to produce the highest score. We assign a similarity score of 100% to the exact or top match [3]. All other programs are given a similarity score relative to the top score.
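The querying step can be sketched as a walk over the inverted lists of the query's N-grams. The raw score used here, the number of shared N-gram occurrences, is a simple stand-in chosen for illustration and is not claimed to be the similarity measure of [3]; the function names and file names are likewise hypothetical.

```python
from collections import Counter, defaultdict


def ngram_counts(stream, n=4):
    """Count each n-gram of a token stream."""
    return Counter(stream[i:i + n] for i in range(len(stream) - n + 1))


def build_index(documents, n=4):
    """Inverted index: n-gram -> {document id: occurrences}."""
    index = defaultdict(dict)
    for doc_id, stream in documents.items():
        for gram, freq in ngram_counts(stream, n).items():
            index[gram][doc_id] = freq
    return index


def query(index, query_stream, n=4, top_k=10):
    """Return up to top_k tuples (document id, raw score, relative score in %)."""
    scores = defaultdict(int)
    for gram, q_freq in ngram_counts(query_stream, n).items():
        # Only documents that contain this n-gram are touched.
        for doc_id, d_freq in index.get(gram, {}).items():
            scores[doc_id] += min(q_freq, d_freq)  # assumed raw score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    if not ranked:
        return []
    top_score = ranked[0][1]
    return [(doc_id, raw, round(100.0 * raw / top_score, 2)) for doc_id, raw in ranked]


documents = {
    "a.c": "SNABjSNRANKNNJNNDDBjNA5ENBlgNl",  # token stream of Figure 1.0
    "b.c": "SNABjNA5ENBNA5ENBgNl",            # hypothetical
    "c.c": "RANKNNJNNDDB",                    # hypothetical
}
results = query(build_index(documents), documents["a.c"])
for row in results:
    print(row)
# ('a.c', 27, 100.0)
# ('c.c', 9, 33.33)
# ('b.c', 7, 25.93)
```

The query program matches itself best, so its raw score becomes the 100% reference against which every other result is scaled.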
Table 4.0 presents the top ten results from Burrows' experiment for one query program file (0020.c) compared against an index of 296 programs. In this example, the file scored against itself generates the highest relative score of 100.00%. This score is ignored as a match, but it is used to generate the relative similarity scores of all other results. We can also see that the program 0103.c is very similar to program 0020.c, with a score of 93.34%.
Rank    Query File    Index File    Raw Score    Similarity Score
1       0020.c        0020.c        369.45       100%
2       0020.c        0103.c        344.85       93.34%
3       0020.c        0092.c        189.38       51.26%
4       0020.c        0151.c        185.05       50.09%
5       0020.c        0267.c        167.82       45.43%
6       0020.c        0150.c        164.67       44.57%
7       0020.c        0137.c        158.67       42.93%
8       0020.c        0139.c        154.31       41.76%
9       0020.c        0269.c        129.17       34.96%
10      0020.c        0241.c        126.87       34.33%
Table 4.0: Results of the program 0020.c compared to an index of 296 programs.
Comparison of various Plagiarism Detection Tools
JPlag
The salient features of this tool are presented below:
JPlag was developed in 1996 by Guido Malpohl
It currently supports C, C++, C#, Java, Scheme and natural language text
It is a free plagiarism detection tool
It is used to detect software plagiarism among multiple sets of source code files.
JPlag uses the Greedy String Tiling algorithm, which produces matches ranked by average and maximum similarity.
It can compare programs that vary widely in size, a variation that is probably the result of inserting dead code into a program to disguise its origin.
The results are displayed as a set of HTML pages in the form of a histogram that presents the statistics for the analyzed files.
CodeMatch
The salient features of this tool are presented below:
It was developed in 2003 by Bob Zeidman and is distributed under licence from SAFE Corporation.
This program is available as a standalone application.
It supports 26 different programming languages including C, C++, C#, Delphi, Flash ActionScript, Java, JavaScript, SQL etc.
It has a free version that allows only one trial comparison, in which the total size of all files being examined does not exceed 1 megabyte of data.
It is mostly used as forensic software in copyright infringement cases.
It determines the most highly correlated files placed in multiple directories and subdirectories by comparing their source code.
Four types of matching algorithms are used: Statement Matching, Comment Matching, Instruction Sequence Matching and Identifier Matching.
The results come in the form of a basic HTML report that lists the most highly correlated pairs of files.
MOSS
The salient features of this plagiarism detection tool are as follows:
The full form of MOSS is Measure of Software Similarity
It was developed by Alex Aiken in 1994
It is provided as a free Internet service hosted by Stanford University, and it can be used only after a user creates an account.
The program can analyze source code written in 26 programming languages including C, C++, Java, C#, Python, Pascal, Visual Basic, Perl etc.
Files are submitted through the command line and the processing is performed on the Internet server.
The current form of the program is available only for UNIX platforms.
MOSS uses the Winnowing algorithm, which is based on code-sequence matching, and it analyses the syntax or structure of the observed files.
MOSS maintains a database that stores an internal representation of programs and then looks for similarities between them.
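The general idea of winnowing can be illustrated with a short sketch: hash every k-gram of the text, slide a window over w consecutive hashes, and keep the smallest hash of each window as a fingerprint. The code below is not MOSS's implementation; the function name winnow, the parameter values k = 5 and w = 4, and the use of CRC32 as the hash function are all assumptions made for illustration.

```python
import zlib


def winnow(text, k=5, w=4):
    """Return a set of (hash, position) fingerprints chosen by winnowing."""
    # Hash every k-gram; CRC32 is an assumed, deterministic stand-in hash.
    hashes = [zlib.crc32(text[i:i + k].encode("utf-8"))
              for i in range(len(text) - k + 1)]
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, take the rightmost occurrence of the minimum in the window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


shared = "for(i=0;i<n;i++)sum+=a[i];"
a = "int x;" + shared + "return 0;"
b = "float total=0;" + shared

hashes_a = {h for h, _ in winnow(a)}
hashes_b = {h for h, _ in winnow(b)}
print(len(hashes_a & hashes_b) > 0)  # True
```

Any substring of at least w + k − 1 characters that two texts have in common contains a complete window of identical hashes, so the same minimum is selected in both texts and at least one fingerprint is shared. Texts shorter than that yield no fingerprints in this sketch.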
Comparative Analysis Table
Feature                JPlag                                        MOSS                   CodeMatch
Birth                  1996                                         1994                   2003
Inventor               Guido Malpohl                                Alex Aiken             Bob Zeidman
Availability           Free                                         Free                   Free (up to 1 MB of data)
Algorithm used         Greedy String Tiling                         Winnowing algorithm    Statement/Comment/Instruction/Identifier matching
Languages supported    C, C++, C#, Java, Scheme and natural text    26 languages           26 languages
Results displayed      HTML histogram                               HTML basic report      HTML pair code matching
Conclusion
In this paper we examined a structure-oriented, code-based plagiarism detection technique known as Scalable Plagiarism Detection. Its steps, namely tokenization, index construction and querying, were studied. We also reviewed the salient features of the code-based plagiarism detection tools JPlag, CodeMatch and MOSS.
References
[1] Gerry McAllister, Karen Fraser, Anne Morris, Stephen Hagen, Hazel White, http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/
[2] Georgina Cosma, "An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis", University of Warwick, Department of Computer Science, July 2008.
[3] Steven Burrows, "Efficient and Effective Plagiarism Detection for Large Code Repositories", School of Computer Science and Information Technology, Melbourne, Australia, October 2004.
[4] Vedran Juric, Tereza Juric and Marija Tkalec, "Performance Evaluation of Plagiarism Detection Method Based on the Intermediate Language", University of Zagreb.