Analysis of Undergraduate Student Performance Using Feature Selection Techniques on Classification Algorithms


Abstract— Educational data mining is employed in various fields and draws on attributes such as name, attendance, class test, lab test, spot test, assignment, and result for analysing student records. This research focuses on measuring the performance of undergraduate students in computer science and engineering with a predictive data mining model that combines feature selection methods with classification algorithms. Feature selection is applied in the preprocessing stage to identify the most relevant attributes so that student performance can be analysed and evaluated more accurately. We collected records of 800 final-year undergraduate students from North Western University. Four feature selection methods were used: genetic algorithms, gain ratio, Relief, and information gain, together with five classification algorithms: K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and the J48 decision tree. Experimental results show that the gain ratio feature selection method, with 10 features selected, gives the best accuracy of 87.375% with the k-NN classifier. The different feature selection techniques are compared by their student performance prediction based on students' academic records.

Keywords— data mining; feature selection; genetic algorithm; gain ratio; Relief; information gain; classification; student performance.

I.     Introduction

    Educational data mining (EDM) describes a research field concerned with the application of data mining to educational data. Classification is a very useful data exploration technique that predicts students' academic performance. In data mining, knowledge discovery refers to the broad process of finding knowledge in data, and involves steps such as data selection, data cleaning, data transformation, data integration, pattern discovery, and pattern evaluation. The application of data mining is widely prevalent in education: EDM is an emerging field that can be effectively applied there, and it uses ideas and concepts such as association rule mining, classification, and clustering [7]. The knowledge that emerges can be used to better understand students' promotion rate, retention rate, transition rate, and success [1].

Data mining is pivotal to measuring improvement in students' performance.

One of the most advanced approaches to feature selection is the genetic algorithm, a stochastic method for function optimization based on the mechanics of natural genetics and biological evolution [2]. We first create and initialize the individuals in the population; because the genetic algorithm is a stochastic optimization method, the genes of the individuals are usually initialized at random.

In this paper, we use genetic algorithms as a feature selection method to optimize the performance of a predictive model by selecting the most relevant features.

A classifier built on an association rules algorithm is used to help estimate students' performance.

Several techniques can raise the correctness of results obtained from data processing; such techniques are able to find information in the form of patterns and features, known as knowledge.

In this research, we mostly emphasise attribute selection. We used four feature selection methods: genetic algorithm (GA), gain ratio (GR), Relief, and information gain (IG). Then, we compared student performance across five classification algorithms, K-Nearest Neighbor (KNN), Naïve Bayes (NB), Bagging, Random Forest (RF), and the J48 decision tree, for each feature selection technique.

This paper is structured as follows: in Section II we discuss previous research on student performance prediction in education and its influencing factors, as well as prior efforts to deal with feature selection. In Section III, we describe the steps of the study, from data preparation and preprocessing to selecting the best features from the dataset. In Section IV we provide the experimental results and analyse them. Finally, in Section V we offer conclusions and discuss future work.

II.    Related work

    Nowadays educational data mining has emerged as a very active research area because many aspects of this field remain unexplored. Work related to student performance, student behavior analysis, faculty performance, and the impact of these factors on students' final performance needs much attention.

J K Jothi and K Venkatalakshmi analysed the performance of graduate students using data collected from the Villupuram College of Engineering and Technology. The data covered a five-year period; clustering methods were applied to address the low scores of graduate students and to raise their academic performance [3].

Feature selection is a fundamental stage affecting classification accuracy: as the dimensionality of a study domain expands, the number of features grows [2][4].

A comparison between GA and full model selection (support vector machines and particle swarm model selection) on classification problems showed that GA gave better performance on problems with high dimensionality and large training sets [5].

Mythili M S and Shanavas A R applied classification algorithms to analyse and evaluate school students' performance using WEKA. They compared various classification algorithms, namely J48, Random Forest, Multilayer Perceptron, IB1, and decision table, on data collected from the student management system [6].

Noah, Barida, and Egerton conducted a study to evaluate students' performance by grouping grades into various classes using CGPA. They used methods such as neural networks, regression, and K-means to identify weak performers for the purpose of performance improvement [8].

Baradwaj and Pal described data mining techniques that help in early identification of student dropouts and students who need special attention. They used a decision tree with information such as attendance, class test, semester, and assignment marks [9].

Ramesh, Parkavi, and Yasodha studied placement chance prediction by comparing the accuracy of techniques such as Naive Bayes Simple, Multilayer Perceptron, SMO, J48, and REPTree. From the results they concluded that the Multilayer Perceptron is more suitable than the other algorithms [10].

III.   Proposed Method

    The main ideas of the proposed approach are to increase classification accuracy and to obtain the essential features, i.e., to discover an optimal set of attributes. This task is carried out using state-of-the-art feature selection algorithms, namely genetic algorithms (GA), gain ratio (GR), Relief, and information gain attribute evaluation (IG).

 

The pipeline of the proposed method (Fig. 1) is: Data collection → Data preprocessing (data cleaning and transformation) → Feature selection (RELIEF, GR, IG, GA) → Classifiers (KNN, NB, Bagging, RF, J48) → Evaluation → Final model.

            Fig 1: Proposed method

Finally, a subset of attributes is selected for the classification stage. Attribute removal plays an important role in many classification methods. Eventually, five classification methods, considered very strong at solving non-linear problems, are chosen to estimate the class probability: K-Nearest Neighbor (KNN), Naïve Bayes (NB), Bagging, Random Forest (RF), and the J48 decision tree.

A.    Data Selection

    The data are Students' Academic Performance datasets consisting of 800 student records with 15 features, collected from the Department of Computer Science and Engineering, North Western University, Khulna, Bangladesh. The dataset attributes are shown in Table I.

This research uses 15 attributes such as id, attendance, assignment, class test, lab test, spot test, skill, central viva, extra curriculum activities, quiz test, project/presentation, and backlog. Grades are assigned to all students using the following mapping: A (91%–100%), B (71%–90%), C (61%–70%), D (41%–60%), F (0%–40%).

The final semester result and final CGPA are assigned using the following mapping: A (75%–100%), B (70%–74%), C (65%–69%), D (60%–64%), F (0%–60%).
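The two grade mappings above can be expressed as a pair of small functions (a sketch using the thresholds stated in the text; the function names are our own, not from the paper):

```python
def class_grade(score):
    """Coursework percentage -> letter grade (thresholds from the text)."""
    if score >= 91: return "A"
    if score >= 71: return "B"
    if score >= 61: return "C"
    if score >= 41: return "D"
    return "F"

def cgpa_grade(score):
    """Final semester / CGPA percentage -> letter grade."""
    if score >= 75: return "A"
    if score >= 70: return "B"
    if score >= 65: return "C"
    if score >= 60: return "D"
    return "F"

print(class_grade(85), cgpa_grade(85))  # B A
```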

TABLE I.  LIST OF DATASET ATTRIBUTES

No.  Attribute Name                       Possible Values
1    Student Id                           Id of the student
2    Attendance                           A, B, C, D, F
3    Assignment                           A, B, C, D, F
4    Class test                           A, B, C, D, F
5    Lab test                             A, B, C, D, F
6    Spot test                            A, B, C, D, F
7    Skill                                A, B, C, D, F
8    Central viva                         A, B, C, D, F
9    Extra curriculum activities (ECA)    YES/NO
10   Quiz test                            A, B, C, D, F
11   Project/Presentation                 YES/NO
12   Backlog                              YES/NO
13   Final semester result                A, B, C, D, F
14   Final CGPA                           A, B, C, D, F
15   Class                                Excellent, Very Good, Good, Average, Poor

Extra curriculum activities are encoded as two classes, yes (1) and no (0); project/presentation is likewise encoded as yes (1) and no (0); and backlog is encoded as yes (0) and no (1) for all students.

B.    Data Preprocessing

1)   Data cleaning and transform

    Data cleaning is the process of identifying and removing corrupt records from a record set; noisy records were removed from our student dataset.

    Data transformation prepares the selected data into a format ready for processing. In our experiment, we transformed the student grade data by discretizing it into categorical classes. The class attribute is divided into five classes: Excellent, Very Good, Good, Average, and Poor. The final class based on this index is shown in Table II.

             TABLE II. CATEGORICAL CLASSES

Class        Range
Excellent    11.8 – 13
Very Good    9.2 – 11.7
Good         7.9 – 9.1
Average      5.3 – 7.8
Poor         0 – 5.2

2)   Feature Selection

  Feature selection is also called attribute selection. We used four feature selection methods: genetic algorithms, gain ratio, Relief, and information gain. These methods find an optimal feature set that enables better classification accuracy.

    Genetic Algorithms (GA). Genetic algorithms are search algorithms that simulate the processes of reproduction and natural selection. Each attribute in the dataset is treated as a gene, and the genes are combined into a linear sequence called a chromosome [11].

A GA starts by initializing a population of candidate solutions. Three operators, selection, crossover, and mutation, are applied to the population, and a fitness function is evaluated until an optimal solution is reached.
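As an illustration only (this is not the paper's WEKA pipeline), a GA feature search over a 15-attribute chromosome can be sketched as follows. The `RELEVANT` set and the fitness function are hypothetical stand-ins for the wrapper accuracy a real run would compute by training a classifier on each candidate subset:

```python
import random

random.seed(42)
N_FEATURES = 15
RELEVANT = {0, 2, 3, 6, 9}  # hypothetical "informative" features for the toy fitness

def fitness(chrom):
    # Stand-in for wrapper accuracy: reward relevant features, penalise extras.
    hits = sum(chrom[i] for i in RELEVANT)
    extras = sum(chrom) - hits
    return hits - 0.4 * extras

def tournament(pop, k=3):
    # Pick the fittest of k randomly sampled individuals.
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    cut = random.randrange(1, N_FEATURES)  # one-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.05):
    return [bit ^ (random.random() < rate) for bit in chrom]  # bit-flip mutation

pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(30)]
for _ in range(40):  # generations
    pop = [mutate(crossover(tournament(pop), tournament(pop)))
           for _ in range(30)]

best = max(pop, key=fitness)
print("selected features:", [i for i, bit in enumerate(best) if bit])
```

In a real wrapper run, `fitness` would be replaced by cross-validated classifier accuracy on the candidate feature subset.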

    Gain Ratio (GR). The GR is computed as the information gain divided by the entropy of the attribute's values:

       GainRatio(Class, Attribute) = InfoGain(Class, Attribute) / H(Attribute) [12]

   where H(Attribute) is the entropy of the attribute. GR measures the relative worth of an attribute with respect to the class.

Relief. In the Relief algorithm, a good discriminating attribute is one that takes similar values within a class and dissimilar values across classes. It uses a nearest-neighbour method to calculate a relevancy score for each attribute: it repeatedly samples an instance and updates each attribute's score based on the nearest instance of the same class (the nearest hit) and of a different class (the nearest miss).
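A minimal Relief sketch on invented data illustrates the nearest-hit / nearest-miss update: feature 0 separates the two classes while feature 1 is within-class noise, so Relief should reward feature 0 and penalise feature 1.

```python
# Synthetic 2-feature dataset (illustrative only).
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0, 0, 1, 1]

def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

weights = [0.0, 0.0]
for i, (xi, yi) in enumerate(zip(X, y)):
    others = [(xj, yj) for j, (xj, yj) in enumerate(zip(X, y)) if j != i]
    hit = min((xj for xj, c in others if c == yi), key=lambda p: dist(p, xi))
    miss = min((xj for xj, c in others if c != yi), key=lambda p: dist(p, xi))
    for f in range(2):
        # Reward features that differ on the miss, penalise those that differ on the hit.
        weights[f] += abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])

print(weights)  # [4.0, -4.0]
```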

Information Gain (IG). The information gain is a measure based on entropy. The formula for IG is:

    InfoGain(Class, Attribute) = H(Class) − H(Class | Attribute) [13]

where H(Class) is the total entropy of the class, and H(Class | Attribute) is the conditional entropy of the class given the attribute.
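Both formulas can be checked with a few lines of Python (the `pass`/`fail` toy data are hypothetical): an attribute that perfectly predicts the class should score IG = GR = 1.

```python
from math import log2
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def info_gain(cls, attr):
    # H(Class) - H(Class | Attribute), weighting by each attribute value's frequency.
    n = len(cls)
    h_cond = sum(attr.count(v) / n *
                 entropy([c for c, a in zip(cls, attr) if a == v])
                 for v in set(attr))
    return entropy(cls) - h_cond

def gain_ratio(cls, attr):
    return info_gain(cls, attr) / entropy(attr)

cls  = ["pass", "pass", "fail", "fail"]
attr = ["A", "A", "F", "F"]   # perfectly predicts the class
print(info_gain(cls, attr), gain_ratio(cls, attr))  # 1.0 1.0
```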

C.    Classification

    We used five classification algorithms: K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and the J48 decision tree. These algorithms mine the data produced by the feature selection step.

    K-Nearest Neighbor (KNN). The K-Nearest Neighbor algorithm, called KNN, is a widely used classification algorithm.

KNN works on the assumption that the data lie in a feature space, so distances between data points are computed; Euclidean distance or Hamming distance is used according to the data type of the classes. A single value of K gives the number of nearest neighbours that determine the class label of an unknown sample. If K = 1, the method is called nearest-neighbour classification [14].
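A self-contained KNN sketch using Euclidean distance and majority voting (the toy points and the `Good`/`Poor` labels are illustrative only, not from the paper's dataset):

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(zip(X_train, y_train), key=lambda p: dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["Good", "Good", "Good", "Poor", "Poor", "Poor"]
print(knn_predict(X, y, (1, 1)))  # Good
```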

    Naïve Bayes (NB). Naive Bayes is a classification algorithm for binary (two class) and multi-class classification problems.

Bayes' theorem provides an equation for calculating the posterior probability P(c | x) from P(c), P(x), and P(x | c):

        p(c | x) = p(x | c) * p(c) / p(x) [15]

• P(c | x): the posterior probability of class (c, target) given predictor (x, attributes).

• P(c): the prior probability of class.

• P(x | c): the likelihood, which is the probability of predictor given class.

• P(x):  the prior probability of predictor.
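A categorical Naïve Bayes sketch applying the posterior formula above (empirical frequencies stand in for the probabilities, without Laplace smoothing; the tiny grade/yes-no style dataset is invented for illustration):

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    prior = Counter(y)                    # class counts -> P(c)
    like = defaultdict(Counter)           # like[(feature index, class)][value]
    for row, c in zip(X, y):
        for f, v in enumerate(row):
            like[(f, c)][v] += 1
    return prior, like

def predict_nb(prior, like, row):
    n = sum(prior.values())
    scores = {}
    for c, pc in prior.items():
        p = pc / n                        # P(c)
        for f, v in enumerate(row):
            p *= like[(f, c)][v] / pc     # P(x_f | c), naive independence assumption
        scores[c] = p                     # proportional to P(c | x); P(x) cancels
    return max(scores, key=scores.get)

X = [("A", "YES"), ("A", "YES"), ("F", "NO"), ("F", "YES")]
y = ["pass", "pass", "fail", "fail"]
prior, like = train_nb(X, y)
print(predict_nb(prior, like, ("A", "YES")))  # pass
```

Note that P(x) is the same for every class, so it can be dropped when only the arg-max class is needed.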

    Bagging. Bagging is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification. It reduces variance and helps to avoid overfitting, and it is usually applied to decision tree methods [16].
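A bagging sketch with bootstrap resampling and majority voting over decision stumps (one-threshold trees); the 1-D data and the stump learner are illustrative stand-ins for the tree methods mentioned above:

```python
import random
from collections import Counter

random.seed(7)

def fit_stump(sample):
    """Best one-threshold classifier on 1-D points [(x, label), ...]."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for lo, hi in (("neg", "pos"), ("pos", "neg")):
            acc = sum(lbl == (hi if x > t else lo)
                      for x, lbl in sample) / len(sample)
            if best is None or acc > best[0]:
                best = (acc, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: hi if x > t else lo

def bagging(data, n_models=15):
    models = []
    for _ in range(n_models):
        boot = [random.choice(data) for _ in data]  # bootstrap resample
        models.append(fit_stump(boot))
    # Majority vote over the ensemble.
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

data = ([(x, "neg") for x in (0, 1, 2, 3, 4)] +
        [(x, "pos") for x in (10, 11, 12, 13, 14)])
clf = bagging(data)
print(clf(2), clf(12))
```

Each stump sees a different bootstrap sample, so individual thresholds vary; the vote averages this variance away.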

    Random Forest (RF). Random forests are an ensemble learning method for classification that corrects for decision trees' habit of overfitting their training set.

A random forest combines different decision trees to classify data samples into classes, and is a commonly used statistical technique for classification. The worth of each individual tree is not essential; the purpose of the random forest is to reduce the error rate of the whole forest. The error rate depends on two factors: the correlation between trees and the strength of each tree [17].

    J48 Decision Tree. J48 is an algorithm used to generate a decision tree, developed by Ross Quinlan.

It is an implementation of C4.5 in WEKA. The algorithm uses a greedy technique to induce decision trees for classification and uses reduced-error pruning. J48 can handle both discrete and continuous attributes, attributes with differing costs, and training data with missing attribute values [18].

D.   Evaluate the results

This experiment compares the accuracy obtained using the attributes selected by each feature selection technique with each classification technique. Accuracy is calculated using ten-fold cross-validation, a technique for validating accuracy by repeatedly holding out part of the data for testing.
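The ten-fold split can be sketched as follows: with 800 records, each fold holds 80 test records, each record appears in exactly one test fold, and in each round the classifier is trained on the remaining 720 (an index-based sketch without shuffling; real tools like WEKA typically stratify the folds):

```python
def k_fold_indices(n, k=10):
    """Split range(n) into k near-equal contiguous folds (no shuffling)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(800, 10)
# Each round: one fold is the test set, the other nine form the training set.
train, test = [j for f in folds[1:] for j in f], folds[0]
print(len(folds), len(train), len(test))  # 10 720 80
```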

WEKA (Waikato Environment for Knowledge Analysis) is used as the data mining tool. It is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand, and is free software licensed under the GNU General Public License.

MATLAB and the WEKA tool were used for feature selection, preprocessing, and classification.

IV.   Experiment and Results

The suggested approach for predicting student performance is carried out in two major phases. In the first phase, the feature space is searched to reduce the number of features and prepare for the next step. This task is carried out using four dimension reduction techniques, namely the GA, GR, Relief, and IG algorithms. At the end of this step a subset of features is chosen for the next round.

The optimal features of these techniques are summarized in Table III. Afterwards, the selected features are used as the inputs to the classifiers. As mentioned previously, five classifiers are used to estimate the success probability: KNN, NB, Bagging, Random Forest, and the J48 decision tree.

     

        Fig 2: Student Data Set

        

         Fig 3: Visualize Class Attributes

In our experiment, we applied the full dataset to each feature selection method and obtained the selected feature sets shown in Table III below:

 

These selected features were then passed to the five classifiers, K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and the J48 decision tree, whose performance is reported in Table IV below.

To evaluate the goodness of each feature selection method, we performed a further experiment by classifying with the features selected in the prior stage.

 

                                   TABLE III.  LIST OF SELECTED FEATURES

FS Method   No. of Selected Features   Selected Features
GA          10                         1, 2, 3, 4, 7, 9, 10, 11, 12, 13
GR          10                         1, 3, 4, 6, 7, 8, 10, 11, 12, 13
RELIEF      9                          1, 2, 3, 4, 8, 10, 11, 12, 13
IG          10                         1, 2, 3, 4, 5, 6, 7, 8, 12, 13

      Table IV: Performance measures of selected features

Classifier    Performance index   GA      GR      RELIEF   IG
KNN           Accuracy            84.875  87.375  85.875   77.125
              Precision           0.785   0.874   0.859    0.772
              Recall              0.785   0.874   0.859    0.771
              F-Measure           0.784   0.874   0.858    0.771
              ROC Area            0.882   0.936   0.931    0.875
Naïve Bayes   Accuracy            75.5    80.375  80.125   76
              Precision           0.759   0.812   0.807    0.768
              Recall              0.755   0.804   0.801    0.760
              F-Measure           0.756   0.805   0.803    0.760
              ROC Area            0.929   0.948   0.948    0.932
Bagging       Accuracy            76.375  81.5    80       75.25
              Precision           0.769   0.816   0.801    0.752
              Recall              0.764   0.815   0.800    0.753
              F-Measure           0.764   0.815   0.800    0.752
              ROC Area            0.930   0.954   0.954    0.931
RF            Accuracy            81.625  86.75   86.625   81.125
              Precision           0.816   0.808   0.866    0.811
              Recall              0.816   0.809   0.866    0.811
              F-Measure           0.816   0.808   0.866    0.811
              ROC Area            0.951   0.878   0.973    0.957
J48           Accuracy            77.75   79.25   81.875   77
              Precision           0.777   0.792   0.819    0.771
              Recall              0.778   0.793   0.819    0.770
              F-Measure           0.777   0.792   0.819    0.770
              ROC Area            0.903   0.902   0.922    0.879

Table IV shows the accuracy results of the students' performance analysis on the student dataset. It reveals that Random Forest is a consistently strong classifier for analysing students' performance with good accuracy.

We computed the accuracy of the selected features for all four methods with each of the five classifiers (KNN, Naïve Bayes, Bagging, RF, and the J48 decision tree). The results are shown in Table V and Figure 4 below:

                      Table V: Comparison of FS methods

FS Method   KNN     NB      Bagging   RF      J48
GA          84.875  75.5    76.375    81.625  77.75
GR          87.375  80.375  81.5      87.375  79.25
RELIEF      85.875  80.125  80        86.625  81.875
IG          77.125  76      75.25     81.125  77

             

                Fig. 4: Comparison of accuracy of feature selection

      We then computed the highest accuracy achieved by each feature selection method across the classifiers. The results are shown in Table VII and Figure 5 below:

The attributes selected by each feature selection method are further tested with the classification algorithms, and the accuracy results for all selected features are compared. As Table VII shows, the Genetic Algorithm (GA) with K-Nearest Neighbor (KNN) achieves 84.875% accuracy; Gain Ratio (GR) with KNN achieves 87.375%; Relief with Random Forest (RF) achieves 86.625%; and Information Gain (IG) with RF achieves 81.125%, the lowest value on this dataset.

Gain Ratio (GR) with K-Nearest Neighbor (KNN) gives the best accuracy of 87.375%.

Table VII: Comparison of FS with classifiers

Method        Accuracy (%)
GA + KNN      84.875
GR + KNN      87.375
Relief + RF   86.625
IG + RF       81.125

Fig 5: Comparison of accuracy of feature selection

V.    Conclusion

This research aims to develop a model to classify student performance. In our experiment, the K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and J48 decision tree classification algorithms were applied with the genetic algorithm, gain ratio, Relief, and information gain feature selection methods. The experimental results showed that the performance of the student prediction model greatly depends on selecting the most relevant attributes from the student dataset. The gain ratio method with the K-Nearest Neighbor classifier showed the best accuracy among all the methods.

For future work, we will apply more feature selection algorithms and also work on optimization algorithms with large datasets.

VI.   Acknowledgment

Thanks to our supervisor, who inspired us to research this interesting area and gave us many helpful tips. The authors thank North Western University, Khulna for providing the student data.

VII.  References

[1] A. G. Sagardeep Roy, "Analyzing performance of students by using data mining techniques," in 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics (UPCON), Mathura, 2017.

[3] J. K. Jothi and K. Venkatalakshmi, "Intellectual performance analysis of students by using data mining techniques," International Journal of Innovative Research in Science, Engineering and Technology, vol. 3, special iss. 3, March 2014.

[4] T. B. A. N. A. S. I. H. Kartika Maharani, "Comparison analysis of data mining methodology and student performance improvement influence factors in small data set," in International Conference on Science in Information Technology (ICSITech), 2015.

[5] J. M. Valencia-Ramirez, J. Raya, J. R. Cedeno, R. R. Suarez, H. J. Escalante, and M. Graff, "Comparison between genetic programming and full model selection on classification problems," Power, Electronics and Computing (ROPEC), IEEE International Autumn Meeting, pp. 1-6, 2014.

[6] M. S. Mythili and A. R. Mohamed Shanavas, "An analysis of students' performance using classification algorithms," IOSR-JCE, vol. 16, iss. 1, Jan. 2014.

[8] Otobo Firstman Noah, Baah Barida, and Taylor Onate Egerton, "Evaluation of student performance using data mining over a given data space," International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, vol. 2, iss. 4, September 2013.

[9] Brijesh Kumar Baradwaj and Saurabh Pal, "Mining educational data to analyze students' performance," (IJACSA) International Journal of Advanced Computer Science and Applications, vol. 2, no. 6, 2011.

[10] V. Ramesh, P. Parkavi, and P. Yasodha, "Performance analysis of data mining techniques for placement chance prediction," International Journal of Scientific and Engineering Research, vol. 2, iss. 8, August 2011.
 
