Recommender System for STEM Enrolment in Universities Using Machine Learning Algorithms: Case of Kenyan Universities

Science, Technology, Engineering, and Mathematics (STEM) enrolment has gained a lot of research interest. The increase in demand for STEM-based skill sets has contributed to the need for systems that could potentially increase enrolments in the field. The purpose of this study was to investigate recommender systems for STEM enrolment in universities using machine learning algorithms. Students face challenges when selecting STEM courses that match their attributes. This article aims to provide a recommender system for STEM enrolment using machine learning algorithms. The article investigates three machine learning algorithms: Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naïve Bayes. Accuracy and validation techniques were applied to test the algorithms. The results demonstrated that our work performed better than that of the published research, with the ANN outperforming the other classification methods. The results position ANN as an important algorithm in building a recommender model for STEM higher education enrolment. The study also identifies high school grades and interest in STEM courses as important features in predicting STEM course enrolment in higher education. The study will guide policy on which courses to emphasize, and will help funding authorities prioritize funding allocation for STEM-based courses.


I. INTRODUCTION
Despite the need for a more skilled workforce, STEM has not attracted many students. This has been attributed to the misalignment of students' capabilities with the courses they choose in higher education [1]. According to He and Jang [2], national efforts are required to establish STEM education because of the worldwide demand for STEM workers. Technological innovation and the search for "need-based and practical solutions" drive the four distinct STEM disciplines [3]. As presented by Sustainable Development Goal (SDG) 4 [4], there is a drive for students to select courses that will result in more innovations. As stated by [5], researchers are actively looking for ways to improve STEM courses since they believe these courses can help students develop "21st-century skills" [6]. This has been reinforced by Tawbush et al. [7], who stress that researchers are increasingly concentrating on STEM because of the development of technology [8] and the rise in environmental threats. The growth experienced in STEM courses can be attributed to the need to reduce the gap between academia and industry [9]. Science, Technology, Engineering and Mathematics help propel the economy amid ever-increasing competition [3], [10]. Given the numerous choices a student must make while selecting courses, finding the appropriate courses remains daunting [11]. A recommender system can be defined as a platform that helps match students' interests with the STEM courses they are applying to [12], [13]. According to Girase et al. [14], a recommender system encompasses several artificial intelligence techniques. Techniques widely used with recommender systems include machine learning and data mining [15].
http://ijstm.inarah.co.id
Recommendation comprises three phases: the information phase, the learning phase, and finally the prediction phase [16]. The primary goal of recommender systems is to reduce users' information overload while allowing the personalization of recommendations for better decision-making [17]. Recommender systems are useful for advising students as they choose disciplines to study in higher education [18]. The outcome is improved academic performance, accompanied by a reduction in dropout rates [19]. Recommendation systems can be classified into content-based, collaborative, and hybrid recommender systems [20]-[22]. Advances have increased, especially in the development of recommender systems for enrolment in higher education. The Markov chain model [23], [24], regression [25], and the analytical hierarchy process [26], [27] have been used in the development of recommender systems in higher education. In the recent past, research has drifted to more advanced approaches for recommending course enrolments in higher education. The approaches include data mining [28], machine learning [29], and deep learning [30], [31]. The goal of machine learning, a subfield of artificial intelligence, is to produce algorithms based on "data trends and historical relationship between data" [32], [33]. Supervised learning [34] and unsupervised learning [35] are the two main classifications of machine learning algorithms [36]-[38]. According to [37], supervised learning uses "labeled datasets" to train algorithms to classify data or predict outcomes accurately.
On the other hand, unsupervised algorithms can "discover hidden patterns" without human help [39], [40]. The study builds recommendations for STEM courses in higher education based on survey data from students currently enrolled in Kenyan universities. To achieve our objective, we train three machine learning models (SVM, Naïve Bayes, and ANN) on the students' historical data. Several communities and organizations are seeking ways to provide education for all in line with SDG 4. The main focus of SDG 4 is innovative learning, which STEM can articulate. Sithole et al. [41] argue that universities face the challenge of low STEM enrolments while at the same time having high attrition rates in STEM. According to the Commission of higher education [42], in its report on 2017/2018 statistics, university enrolments weigh more towards the "arts, humanities and social sciences" than towards STEM courses. The report further states that enrolment in STEM-related courses stood at 41%, compared with 59% for the arts, humanities, and social sciences. The research aims to implement a recommender system that will attempt to increase STEM enrolments. This will be achieved by enabling students to enrol in a course based on their interests and motivations, socio-demographics, academic strengths, and educational factors. There have been attempts by researchers to develop recommender models in the past.
Mokarrama et al. [43] proposed a content-based recommender system for selecting universities. Based on a set of features, the system was able to suggest universities for students to select. The challenge with the system was its scope [44], which was more general, and the prospective students' preferences were not well captured. They did not address the Kenyan domain in the study. Wang et al. [39] developed a model based on machine learning and hybrid techniques. The model was able to recommend courses for students in successive semesters. The model implemented the matrix factorization machine learning algorithm. The system had a recall of 58% and an accuracy of 78%. The weakness of the model was the lack of information on demographics and past grades, as noted by Wang et al. [39]. The study will seek to include demographic details and previous grades as suggested by [39]. The study will also include student preference while being specific to Kenyan universities and the STEM domain. This article aims to investigate the three machine learning algorithms that are used in recommending STEM courses to students joining university. The article also aims to identify the machine learning algorithm with the highest level of accuracy and the features that are best suited to create a recommendation. The current study investigated the following research questions.
Question 1: Can we model the STEM recommender path choice according to the student's academic history by applying different ML algorithms?
Question 2: Which ML classifier offers optimal performance in predicting student STEM course selection?
Question 3: How is a student's STEM path choice associated with that student's previous academic performance?
This article is organized as follows: Section 2 discusses related work, Section 3 explains the materials and methods, and Section 4 presents and discusses the results. Section 5 is the last section and contains the conclusions and future work.

II. LITERATURE REVIEW
State of the art in Machine Learning Recommender Model in higher education
Several studies have been conducted in the area of course recommendation using machine learning algorithms in higher education. The most popular machine learning algorithms used in higher education recommendation include SVM, Naïve Bayes, and Artificial Neural Networks. Pupara et al. [45] developed an institution-based recommender system that was based on the student's characteristics. The system relied on the student's attitude and characteristics to recommend institutions of learning. The study used 1109 records of students from three universities within Pakistan. The study implemented the decision tree model and association rule mining. The accuracy of the model was 69% based on features that included the university's reputation, public confidence in institutions, student skills, and family income [44]. The research was limited to recommending courses based on attitudes and applicants' characteristics. Baskota and Ng [11] developed a machine learning recommender system for graduate school recommendation using SVM and K Nearest Neighbors (KNN). The model used personal data while at the same time utilizing data obtained from online portals. The SVM model was implemented to find an appropriate graduate school, whereas the KNN was implemented to find any other university in which the student could be enrolled based on their profile. The model achieved an accuracy of 61%. The model agreed with the graduate recommender model proposed by Aishwarya and Tiple [16]. The limitation of the study was that it applied only to graduate student enrolment. Fiarni et al. [46] developed a machine learning recommender system that predicted information system majors based on students' interest and performance. The system implemented the decision tree machine learning algorithm. The system performed well, posting a precision of 71% and a recall of 61%. The system was not evaluated using accuracy, and therefore did not show the robustness of the model compared to other similar ones.
Alsayed et al. [47] developed a system that recommended majors for students in higher education. The study implemented several supervised models that included decision trees, random forests, gradient boosting classifiers, and extra tree classifiers. The dataset was obtained from the Kaggle online database. The model was validated using the 10-fold cross-validation method, which achieved an accuracy of 75% and 61% for the random forest and gradient-boosting classifiers respectively. The shortcoming of the model was the use of the Kaggle dataset, which did not have customized input features. The study further proposed the use of survey data to enhance the accuracy of the prediction. Jena et al. [48] developed an eLearning course recommender model that uses SVD and KNN. The model aimed at recommending eLearning courses to students while achieving the highest accuracy. KNN achieved higher accuracy than the other models. The model was built in the Python environment.
The limitation of the study was that it applied to online students rather than the whole student fraternity. Zayed et al. [49] improved the intelligent recommender system developed by Alsayed et al. [47]. The study implemented supervised machine learning algorithms that include Support Vector Machine, Random Forest, and Decision Trees. The research used the same Kaggle dataset but included hyperparameter tuning to increase the accuracy. According to Zayed et al. [49], the accuracy of the random forest increased to 95%. The main limitation of the study was the inability to get input features that would fit the problem. Despite the concentration of research on machine learning recommender models in higher education, there is little evidence that researchers have concentrated on STEM education using primary data. To the best of our knowledge, no research has been conducted on machine learning recommender systems for STEM courses in Kenya. This creates the need for research that will have implications for higher education, students, and other education stakeholders.

Machine learning models for recommending enrolment in higher education

Support Vector Machine
SVMs' popularity has been on the rise due to their robustness [50]. Srivastava and Bhambhu [51] define SVM as a set of related supervised learning methods used for both classification and regression. According to Baskota and Ng [11], [52], [53], SVMs have been instrumental in text classification due to their ability to "classify both linear and non-linear data" [54]. Despite their slow training time, they attain a high level of accuracy compared to other algorithms [55]. According to Srivastava and Bhambhu [51], SVM has a high accuracy level compared to other classification algorithms. The SVM algorithm is credited with having been used in various fields, which include "text categorization, image classification [56] and object detection and data classification" [51].
According to Roy and Dutta [13], SVM is appropriate for the high-dimensional problems that arise in recommender systems. The performance of SVM is stable with or without the addition of new data. Below is a representation of the model.
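For reference, the standard textbook formulation of a soft-margin SVM is sketched here. It is a generic statement of the technique and not necessarily the representation the authors used. The symbols are assumptions introduced for illustration: $x_i$ is a student's feature vector, $y_i \in \{-1, +1\}$ a binary label, $w$ and $b$ the weights and bias, $\xi_i$ the slack variables, and $C$ the regularization constant.

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

A problem with four classes, such as the STEM categories in this study, is usually handled by combining several such binary classifiers, for example in a one-vs-one or one-vs-rest scheme.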

Artificial Neural Network
ANN is a type of artificial intelligence technique whose main strength is the ability to self-learn and generate efficient results [57], [58]. It is independent of data types and is therefore able to learn patterns on its own [59]. According to Hernandez et al. [60], an ANN system processes information in units that are referred to as neurons. ANN draws its strength from the idea of how the human brain works. The brain interconnects many neurons to make decisions faster [59]. According to Hernandez et al. [60], the strength of ANNs is their ability to perform well in terms of metrics in comparison with other classification algorithms [61]. According to Latifah et al. [62], the plasticity, nonlinearity, modularity, and openness to noisy, fuzzy, or soft data of artificial neural networks give them an edge over other algorithms.
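The idea of neurons as processing units can be illustrated with a minimal sketch of a forward pass through one hidden layer. The weights, biases, feature values, and the choice of a sigmoid activation are all hypothetical and are not taken from the study's trained model.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of its inputs plus a bias, then an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def dense_layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical scaled features for one student (not from the survey data).
features = [0.8, 0.6, 0.4]

# Hypothetical weights: a hidden layer with two neurons, then one output neuron.
hidden = dense_layer(features, [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])
output = neuron(hidden, [1.0, -1.0], 0.0)
```

During training, the weights and biases are adjusted so that outputs like `output` move closer to the known labels; the sketch shows only how a prediction is computed once weights are given.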

Naïve Bayes
Naïve Bayes is widely accepted due to its simplicity and speed of performing tasks [63]. Naïve Bayes is based on the Bayesian principle of the theory of probability [34]. The operation of Naïve Bayes is based on a "subjective probability of some unknown states under incomplete information" [64].
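As a reminder of the underlying rule, the standard Naïve Bayes classifier can be written as follows. The notation is an assumption for illustration: $c$ ranges over the course categories, $x_1, \dots, x_d$ are a student's feature values, and the features are treated as conditionally independent given the class.

```latex
P(c \mid x_1, \dots, x_d) \;\propto\; P(c) \prod_{j=1}^{d} P(x_j \mid c),
\qquad
\hat{y} = \arg\max_{c}\; P(c) \prod_{j=1}^{d} P(x_j \mid c)
```

The independence assumption is what makes the method fast: each $P(x_j \mid c)$ is estimated separately from the training data.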

III. MATERIALS AND METHODS
The study contributes to the body of knowledge by building a machine learning recommender model for STEM enrolment in Kenyan universities. The development of the machine learning recommender model comprises three stages: data collection, data pre-processing, and visualization and recommendation, as shown in Figure 1 below.

Fig 1. Workflow of machine learning algorithms [Author: The researcher]

Data Collection and Pre-processing
The data used in the study was collected through a survey conducted among students in four universities in Kenya. The universities included both private and public institutions. The targeted universities had students who were taking STEM-based courses and had previously sat the Kenya Certificate of Secondary Education (KCSE) examination. The data contained 384 sample records with 38 input features and was collected using the stratified random sampling technique. The strength of this technique is that it selects a representative sample, enabling the researcher to generalize the findings to the entire population [35].
The dataset comprises features that include, but are not limited to, gender, age bracket, family income, KCSE final score, and scores for individual STEM-based subjects. Our dataset comprises both numerical and categorical data. Feature selection was conducted using the LASSO technique [65], which resulted in 29 features. By combining the advantages of ridge regression with subset selection, LASSO enhances model interpretability and prediction accuracy [65]. The records were then augmented to 1158 using the SMOTE technique. The STEM courses were labeled 0, 1, 2, and 3, representing Engineering, Mathematics, Sciences, and Technology respectively. The research was based on survey data because of the need to use primary data. The choice of dataset was a shift from the traditional choice of data that resides within local university repositories, as suggested by Alsayed et al. [47].
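The augmentation step can be pictured with a simplified, SMOTE-style sketch: each synthetic record is placed on the line segment between an existing minority-class record and one of its nearest neighbours. This is an illustration under stated assumptions, not the study's pipeline; the function names, the toy points, and the values of `k` and `seed` are hypothetical, and full implementations (for example in the imbalanced-learn library) handle more cases.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of the chosen point, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical numeric records for an under-represented course category.
minority_class = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like_oversample(minority_class, n_new=3)
```

Because the new points are interpolations, this kind of oversampling is defined for numeric features; label-encoded categorical features need separate care, since a value halfway between two category codes may not correspond to any real category.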

Data Processing and Visualization
In this stage, the data undergoes a cleaning process to make it ready for visualization and training. The cleaning was conducted using a Python library [48]. A label encoder was then applied to transform the categorical features into numeric values. The training and testing of the model are based on three machine learning algorithms, namely SVM, Naïve Bayes, and Artificial Neural Network, with the help of the 10-fold cross-validation technique. This is done on data that has been pre-processed using the Python library scikit-learn [48]. The data is split into training and testing data using the ratio 70:30. After training, the model either determines the best model on the current study dataset or identifies patterns between input characteristics and output variables. Accuracy is used to identify the best-performing machine learning algorithm. Finally, the ML models with high accuracy are selected for model development.
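The pre-processing steps described above can be sketched in plain Python as follows. The helper names and the random seed are hypothetical, and the sketch assumes that the 10 folds are drawn from the 70% training portion, which the text does not state explicitly. In practice scikit-learn provides equivalents such as `LabelEncoder` and `train_test_split`.

```python
import random


def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def train_test_split_indices(n, test_ratio=0.3, seed=42):
    """Shuffle record indices and hold out test_ratio of them for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_ratio)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(n, k=10, seed=42):
    """Yield (train, validation) position lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, folds[i]


# The four course categories named in the text.
courses = ["Engineering", "Mathematics", "Sciences", "Technology", "Sciences"]
encoded, mapping = label_encode(courses)

# 1158 records after augmentation, split 70:30.
train_idx, test_idx = train_test_split_indices(1158)

# Ten folds over positions within the training portion (an assumption).
folds = list(k_fold_indices(len(train_idx)))
```

With sorted category names, the encoder happens to reproduce the labelling used in the study (Engineering 0, Mathematics 1, Sciences 2, Technology 3). Each model would be fitted on the `train` part of every fold and scored on the held-out part, and the ten scores averaged.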

IV. RESULT AND DISCUSSION
The selection of the appropriate STEM courses is an important component for higher education stakeholders. This section investigates STEM recommendations using machine learning algorithms in Kenyan universities. Python libraries as well as visualizations are used to answer the research questions.
Question 1: Can we model the STEM recommender path choice according to the student's academic history by applying different ML algorithms?
Question 2: Which ML classifier offers optimal performance in predicting student STEM course selection?
The first two questions were comprehensively addressed by conducting experiments for the three machine learning models.

SVM
SVM is a popular machine-learning algorithm that works best with small datasets [11]. The model is trained and tested as shown in Figure 5 below. The accuracy of the SVM is 77.22%. This is in line with the previous study by Zayed et al. [49]. The high performance is also supported by Gaye et al. [55], [66], who assert that despite the long time required to train SVM models, they post very good performance. The confusion matrix was also used to test the number of correctly predicted STEM courses, as shown in Figure 6 below. According to Kulkarni et al. [67], a confusion matrix is a predictive tool in machine learning that is deployed to compare the counts of predicted and actual classes. The results suggest that 55 engineering students were recommended correctly out of a total of 72 students. This represents 76% accurate recommendations for engineering students. It is also worth noting that 55 mathematics students were predicted accurately out of 61 students. This represents 90% correct recommendations from the test data. There were 43 accurate predictions for the sciences out of a possible 56. This represents 77% accurate predictions. Finally, technology had 30 accurate predictions, which was 44% accurate prediction. The low number of correct predictions for the technology courses is due to the level of education of most technology students. From the data, most students taking technology were certificate students who did not have the requisite qualification to be placed in a university.
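The per-course percentages quoted above follow directly from the counts in the confusion matrix. The short sketch below recomputes them for the three courses whose totals are stated in the text; the technology total is not given, so it is left out rather than guessed. The variable and function names are hypothetical.

```python
# (correctly recommended, total students of that course) as quoted in the text.
svm_counts = {
    "Engineering": (55, 72),
    "Mathematics": (55, 61),
    "Sciences": (43, 56),
}


def per_class_rate(correct: int, total: int) -> float:
    """Share of a course's students that were recommended correctly, in percent."""
    return 100.0 * correct / total


rates = {
    course: round(per_class_rate(correct, total))
    for course, (correct, total) in svm_counts.items()
}

for course, rate in rates.items():
    print(f"{course}: {rate}%")
```

Read this way, each percentage is a per-class recall (correct predictions divided by the actual members of the class), assuming the quoted totals are the true class counts in the test data.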

Naïve Bayes
Simplicity and speed of task completion are the driving factors for the Naïve Bayes machine learning algorithm [63]. The accuracy of the Naïve Bayes is 72.22%. The model is trained and tested as shown in Figure 5 below. The image below shows the confusion matrix after training the STEM recommendation based on the Naïve Bayes algorithm. According to the confusion matrix, 64% of the predictions for the Engineering courses were accurate. Mathematics posted 78% accuracy in predictions, while science and technology posted 80% and 70% respectively.

ANN
Latifah et al. [62] argue that ANN performs well compared to other algorithms. ANN draws its strength from the idea of how the human brain works. The brain interconnects many neurons to make decisions faster [59]. The study presents training of the model using the ANN machine learning model as shown in Figure 9. The accuracy posted by ANN is 82%. This is the best result in comparison to both Naïve Bayes and SVM. Figure 6 shows the confusion matrix after training the STEM recommendation based on the ANN algorithm.

Fig 6. ANN Model confusion matrix. Source: Field Data 2023
According to the confusion matrix, 92% of the predictions for the Engineering courses were accurate. Mathematics posted 95% accuracy in predictions, while science and technology posted 66% and 70% respectively. The observations agree with [13], whose findings explain that ANN produces high performance.
Question 3: How is a student's STEM path choice associated with that student's previous academic performance?
To answer the third research question, the research used the Spearman correlation, which was computed within the Python environment. Since the Spearman correlation is obtained by computing the Pearson correlation on the ranked data of two features, it is often referred to as Spearman's "rank correlation" coefficient [68]. The Spearman correlation presented the relationship between the features. The strongest positive and negative correlations are +1 and -1, respectively, in the coefficient's range of -1 to 1 [69]. Figure 11 below shows how the features are related.
Relationship of the features. Source: Field Data 2023
According to the model, there is a strong positive correlation between feature J and features AB, AC, AD, and AE. These are the features that represent STEM subjects in relation to the courses in which the participants are currently enrolled. This historical information is important for recommending courses to future students [13]. The results suggest that students who performed well in the sciences, i.e. Biology, Physics, and Chemistry, are more likely to join the sciences in universities in Kenya. Similarly, students who posted high scores in Mathematics are likely to join courses in Statistics and other mathematics-related courses. The students who posted good scores in Mathematics, Physics, and Chemistry are more likely to join the Engineering courses in Kenyan universities.
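The description of the Spearman coefficient as a Pearson correlation on ranked data can be made concrete with a small sketch. The function names and the toy score lists are hypothetical and are not taken from the survey; libraries such as SciPy (`scipy.stats.spearmanr`) and pandas (`DataFrame.corr(method="spearman")`) offer ready-made versions, and the text does not say which one the study used.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical high school mathematics scores and an ordinal outcome.
maths_scores = [45, 60, 72, 80, 91]
outcome = [1, 2, 3, 4, 5]

rho_up = spearman(maths_scores, outcome)          # perfectly increasing
rho_down = spearman(maths_scores, outcome[::-1])  # perfectly decreasing
```

The two toy cases sit at the ends of the coefficient's range: a perfectly increasing relationship gives a value at +1 and a perfectly decreasing one gives a value at -1, up to floating-point rounding.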
The study also found that it was difficult to predict the technology students because of their progression from certificate to bachelor's level. Most of the sampled students taking the technology courses had performed poorly in the STEM subjects in secondary school but managed to progress to undergraduate level through course progression. This was confirmed by Pearson's correlation and the various confusion matrices presented. The study also showed that peer influence and interest in STEM courses affect the student's choice of STEM courses. This is evident from the strong positive correlation between the peer influence feature and the course taken. This agrees with [70], whose research indicated that a student who has an interest, either of their own volition or through the influence of peers, would more likely select a STEM course in university. This is also supported by [61], who assert that awareness of a career would most likely influence a student's choice of STEM course.

V. CONCLUSION
To achieve higher education accessibility as per the SDG 4 agenda, it is important to match students' interests with the courses offered. This reduces dropout rates in universities while increasing retention and graduation rates. Machine learning recommender models have attracted a lot of research interest due to their predictive and recommendation prowess. Limited studies have been conducted in the area of recommendation within the STEM higher education domain using survey data. The study implemented the ANN, Naïve Bayes, and SVM algorithms. The study sampled four universities in Kenya that offered STEM courses, based on the stratified random sampling technique. The universities included both public and private institutions. The data was collected through questionnaires administered through Google Forms, and controls were put in place to prevent missing data. The analysis was done in the Python environment and results were presented using data visualization libraries.
The accuracy of the three algorithms was tested, with ANN giving the highest performance. A confusion matrix was generated for each of the three algorithms to present the level of accurate recommendations for STEM courses. Finally, the relationship between the features was presented, and the results suggested a strong positive correlation between the scores for STEM subjects in high school and the enrolled course. Similarly, there is a strong correlation between the interest in STEM courses and the courses in which students are enrolled. Based on the LASSO feature selection method, the previous KCSE score and exposure to STEM-related activities are good criteria for suggesting a student's field of specialization. In addition, these ML models and features could be of high value in developing a system to easily recommend a STEM course to potential applicants who are often uncertain of their desired fields of specialization. Finally, this article differs from other research on predicting student courses in a university because it is based on the Kenyan context and is specific to STEM enrolment.

VI. RECOMMENDATION
The study serves as a basis for the investigation of machine learning models in recommender systems in higher education. Future studies should be conducted on the application of deep learning algorithms in the recommendation of STEM courses within the higher education domain.

Fig 2. Distribution of STEM courses within selected universities. Source: Field Data 2023

Fig 4. SVM confusion matrix. Source: Field Data 2023
