Prediction of the Factors Affecting PISA Mathematics Literacy of Students from Different Countries by Using Data Mining Methods

The purpose of this study is to predict the mathematical literacy levels of students by applying data mining to the data obtained from the PISA 2015 exam organized by the OECD, and to determine the variables that affect mathematics literacy. For this purpose, students' mathematics literacy levels and the variables that affect them were analyzed separately for 6 countries at different proficiency levels. The population of the research is the 519,334 students from 72 countries who took the PISA 2015 exam. The sample, determined according to the purpose of the study, consists of a total of 34,565 students from Singapore, Japan, Norway, the USA, Turkey, and the Dominican Republic, which were observed to be at different proficiency levels. In the first stage of the study, analyses were performed using data mining prediction methods. At this stage, the WEKA program was employed with the M5P algorithm, one of the most common methods. In the second stage, the output variable was predicted from the input variables using artificial neural network methods to determine the extent to which the decision trees obtained by the M5P prediction method produce valid results. In the analyses carried out in the MATLAB program, the relationship between students' actual mathematics literacy scores and the scores predicted from the input variables was examined. The variable found to affect mathematics literacy was the socio-economic status index for Singapore, Norway, the USA, Turkey, and the Dominican Republic. For Japan, on the other hand, the influential variables were mathematics learning time and father's education level. The consistency of the results was as follows: 86.10% for Singapore, 31.46% for Japan, 30.10% for Norway, 39.15% for the USA, 26.43% for Turkey, and 29.24% for the Dominican Republic.
Overall, the variables that affect mathematics literacy were found to differ among countries at different proficiency levels.


Introduction
Introduce the Problem
Today, a great deal of data is collected about individuals in different settings and for different purposes. It is therefore very important to determine which data are significant and which are worthless in the decision-making process. In addition, the following are considered important problems to be addressed: What kind of data will be collected? How much data will be collected and for how long? How will the data be stored, and what kind of preprocessing will be applied? (Bienkowski et al., 2012). Since the variables related to the characteristics of individuals have a great influence on the accuracy of the results obtained in prediction and decision-making processes, it is necessary to determine which variables are important. One of the most important elements of scientific research is data and its correct analysis. Data analysis includes the organization of data sets, the computation of descriptive statistics, correlation and regression, and other statistical operations (Baykul & Güzeller, 2013). Because so many methods are available, determining which of them make more effective predictions and which produce fewer errors is of great importance for the reliability and validity of the results. PISA (Programme for International Student Assessment), TIMSS (Trends in International Mathematics and Science Study), and PIRLS (Progress in International Reading Literacy Study) are large-scale exams that collect information about a large number of variables. PISA has been administered every three years since 2000. The first application, carried out in 2000, focused on reading skills, whereas the focus was on mathematics literacy in 2003, on science in 2006, on reading skills in 2009, and on computer-based mathematics literacy in PISA 2012.
The focus area continues to rotate every three years among reading skills, mathematics literacy, and science. The proficiency levels of the students represent their achievement in the focus area of the project. In the tests, the proficiency level is set between 1 and 6, where level 2 indicates the "basic competence level" in field tests. Students below this level are considered to lack the knowledge and skills needed to deal with everyday life problems, whereas students at the 5th and 6th levels are considered successful in problem-solving-focused thinking in their daily lives.
It is necessary to determine whether the achievement of individuals can be decided based only on the test results, and to what extent predictions made using a large number of variables are correct. This study aimed to draw meaningful results from the PISA exam, one of the international large-scale exams, and from the big data on students and schools, and to use data mining (DM) methods in the field of education.
The following are also considered important problems to be addressed: what kind of data will be collected, how much data will be collected and for how long, how the data will be stored, and what kind of preprocessing will be applied (Bienkowski et al., 2012). The variables to be obtained from the databases published by the OECD (Organisation for Economic Co-operation and Development) were set by considering these problems within the scope of the study. In addition, since the information obtained from students is big data, it was confirmed that it can be used within the scope of data mining methods (Nisbet, Elder and Miner, 2009). Big data means that the information is obtained from different environments and in different formats. Although the definition of big data differs according to the application area, the size of the available data and the source from which it is obtained are important determinants of big data (Vaitsis et al., 2016). The data used in this study can be defined as big data because the PISA 2015 exam data contain student information in different formats and from different sources, and cover a vast number of students.
The main purpose of this study is to predict the mathematical literacy levels of students from the data obtained from the PISA 2015 exam organized by the OECD, using data mining and artificial neural network methods, and to determine the variables that affect mathematics literacy. In this regard, students' mathematics literacy levels and the variables that affect them were analyzed separately for 6 countries observed to be at different proficiency levels. In the study conducted by Aksu and Güzeller (2016), the variables found to affect PISA mathematical literacy were self-efficacy perception, attitude towards the course, anxiety, and study discipline.
There have been many studies on mathematical literacy, especially in recent years (Harms, 2000; Kaiser, 2002; EARGED, 2008; Tekin & Tekin, 2004; Özgen & Bindak, 2008; Akay & Boz, 2011; Duran & Bekdemir, 2013). Different studies have been carried out to determine the variables that are thought to have an effect on mathematical literacy (Dursun & Dede, 2004; Fisher, 1995; Savaş et al., 2010; Özer & Anıl, 2011). In addition, Koğar (2015) examined the factors affecting mathematical literacy with a mediation model and determined that gender, the economic, social and cultural status indices, and the time spent learning mathematics have a significant effect on mathematical literacy. This study aims to determine whether the results obtained for the affective features previously found to influence mathematical literacy differ across countries with different proficiency levels. It also aims to determine the socioeconomic features that have a significant effect on mathematics literacy using the PISA student surveys, to predict the scores that students obtained on the PISA mathematics literacy test, and to determine the consistency of the predicted test results. In this way, data mining and artificial neural network methods are applied together for different countries through an international examination. The sub-objectives of the study are listed below:
• Predicting the PISA Mathematics Literacy achievement of students for different countries,
• Identifying the variables that have a significant effect on students' PISA Mathematics Literacy for different countries,
• Determining the order of importance and the effects of the variables used in predicting students' PISA achievement for different countries,
• Determining the prediction accuracy of students' PISA achievement for different countries.

Method
This study aimed to determine the variables that affect students' mathematics literacy from the answers given to the questions in the PISA 2015 student survey, and to check whether these variables differ by country. PISA mathematics scores were treated as a continuous variable, so the procedure was a predictive analysis; and because the aim was to compare the prediction results obtained by data mining, the study followed a descriptive research model. The study is descriptive research because it addresses the prediction results obtained by data mining methods and the determination of the variables that affect PISA mathematics literacy levels (Aggarval & Ranganathan, 2019).

The Population and the Sample of the Research
The population of the research is the 519,334 students from 72 countries who took the PISA 2015 exam. Table 1 shows the average scores of the countries in the population and their country ranks in terms of mathematical literacy. The sample, determined according to the purpose of the study, consists of a total of 34,565 students from Singapore, Japan, Norway, the USA, Turkey, and the Dominican Republic, which were found to be at different proficiency levels. Singapore, Japan, and Norway were chosen because their education systems and results have been the subject of much research. The USA was chosen from the 2nd level and Turkey from the 1st level because studies in the related literature compare the two countries according to different criteria and report that the results obtained for the two countries are due to cultural differences. The Dominican Republic was included because it is the only country below level 1.
The number of students of the selected countries participating in the PISA exam and their proportions in the population are shown in Table 2.
According to Table 2, two countries from the 3rd level were included in the sample. This is because the number of countries varies across proficiency levels: 32 of the 72 countries participating in the exam were at the third level, so two countries from this level were included and their results compared. For Turkey in PISA 2015, the population of 15-year-old students was 1,324,089, whereas the accessible population was defined as 925,366 students (MoNE, 2016).

Data Collection Tools
To determine which variables from the PISA student questionnaire, the data collection tool of the study, should be included, the literature was first reviewed, and the PISA mathematics literacy levels of students were predicted using 15 variables related to mathematics achievement (Aksu & Güzeller, 2016; Duran & Bekdemir, 2013; Dursun & Dede, 2004; Harms, 2000; Koğar, 2015; Özer & Anıl, 2011; Savaş et al., 2010; Tekin & Tekin, 2004). It was determined that the results obtained with the select attributes feature, applied to the training data set without using any algorithm, differed in the estimation of the mathematical literacy of countries with different proficiency levels. This result provides evidence that the variables that affect mathematical literacy differ across countries. Accordingly, all 15 independent variables were included in the analysis to predict students' mathematical literacy levels for each country, and the results were examined to see whether they differed. The numerical values used at this stage belong to the population and were obtained from 519,334 students. The independent variables used in the study were gender, personal room, mother's education level, father's education level, out-of-school learning time, mathematics learning time, total learning time, belonging to the school, exam anxiety, motivation level, desire for collaborative work, emotional support of the family, perceived feedback, teacher's fairness, and socio-economic level; the dependent variable was mathematics literacy level. The names and codes of the variables used in the study and their descriptive statistics are shown in Table 3.

Data Analysis
In the first stage of the research, the objective was to create a model that predicts students' mathematical literacy achievement from PISA data. To determine which PVMATH score should be used as the dependent (result/output/target) variable, the relationships among the 10 different scores were examined, and PV2MATH, which has the highest correlation with other variables, was selected as the dependent variable. The analysis for each country was carried out in two stages. In the first stage, data mining prediction methods in the WEKA program were employed, whereas in the second stage, artificial neural network methods in the MATLAB program were used. In the first stage, the M5P algorithm, one of the most widely used decision tree methods in data mining, was employed. The M5 algorithm was introduced by Quinlan (1992), and M5P is the revised version developed by Wang and Witten (1997). In the model, linear regression functions are used in the leaves of the decision tree to estimate a numeric dependent variable. One reason for using this method, which is known as a regression-based decision tree, is that it yields more holistic and understandable decision trees than other methods (Wang & Witten, 1997). Another reason is that it deals successfully with missing data and makes more accurate predictions through smoothing (Breiman et al., 1984). The regression-based decision tree obtained with this method can make more accurate and more consistent predictions by using multiple linear models instead of a single linear model. In the second stage of the research, the artificial neural networks method was used.
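The idea of a regression-based decision tree with linear models in its leaves can be illustrated with a minimal sketch. This is not WEKA's M5P implementation: it makes a single split on one feature at a fixed, hypothetical threshold and fits an ordinary least-squares line in each leaf, without the pruning or smoothing that M5P applies. The function names, the threshold, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def fit_leaf(x, y):
    """Fit y = a*x + b by ordinary least squares and return (a, b)."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1]

def fit_one_split_model_tree(x, y, threshold):
    """Hypothetical helper: one root split, one linear model per leaf."""
    left = x <= threshold
    return {
        "threshold": threshold,
        "left": fit_leaf(x[left], y[left]),
        "right": fit_leaf(x[~left], y[~left]),
    }

def predict(tree, x):
    """Route each value to its leaf and apply that leaf's linear model."""
    a_l, b_l = tree["left"]
    a_r, b_r = tree["right"]
    return np.where(x <= tree["threshold"], a_l * x + b_l, a_r * x + b_r)

# Synthetic, noise-free piecewise-linear data (not PISA data).
x = np.linspace(-2.0, 2.0, 41)
y = np.where(x <= 0.0, 500.0 + 10.0 * x, 500.0 + 40.0 * x)

tree = fit_one_split_model_tree(x, y, threshold=0.0)
pred = predict(tree, x)
```

A single global line could not follow the change in slope at the threshold, which is the sense in which several leaf models can predict more consistently than one.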
The artificial neural network is one of the application areas of artificial intelligence, which is obtained by simulating in a computer environment the intelligence that combines human abilities such as learning and problem solving. It can thus produce solutions to complex problems in a way similar to how a human does (Aksu & Güzeller, 2018). Artificial neural networks have three kinds of layers: the input layer, which receives stimuli from the outside world; the output layer, which produces the results for these stimuli; and one or more hidden layers between the two. Each hidden layer produces the output values corresponding to the input values coming from the previous layer and sends them to the neurons of the next layer. In this way, human-specific properties are simulated. In order to determine the extent to which the decision trees obtained by the M5P prediction method produce valid results, the output variable was predicted from the input variables using artificial neural network methods. In the analysis performed with artificial neural networks, the fixed parameters were Levenberg-Marquardt (TRAINLM) for the training function and gradient descent with momentum (LEARNGDM) for the adaption learning function.
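The layered structure described above can be sketched as a forward pass through a feed-forward network. The sketch assumes the shape used in this study's model (15 inputs, 10 hidden units, 1 output), a tanh hidden activation, and a linear output. The activation functions, the random weights, and all names are illustrative assumptions, and no training (Levenberg-Marquardt or otherwise) is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_outputs = 15, 10, 1  # shape taken from the study's model

# Randomly initialised weights and biases (illustrative, untrained).
w_hidden = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b_hidden = np.zeros(n_hidden)
w_output = rng.normal(scale=0.1, size=(n_hidden, n_outputs))
b_output = np.zeros(n_outputs)

def forward(x):
    """Map a batch of input vectors (one per row) to predicted scores."""
    hidden = np.tanh(x @ w_hidden + b_hidden)  # hidden layer (assumed tanh)
    return hidden @ w_output + b_output        # output layer (assumed linear)

batch = rng.normal(size=(4, n_inputs))  # four synthetic students
scores = forward(batch)
```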
The theoretical model established by Artificial Neural Networks (ANN) to determine the variables that affect students' mathematics literacy for different countries is shown in Figure 1.

Figure 1. Theoretical Model Established by ANN
As seen in Figure 1, the network had 15 input variables, 10 neurons in the hidden layer, 1 output layer, and 1 output variable. These values describe the prediction of the output variable, the mathematics literacy of the students, from 15 independent variables. After this process, the relationship between students' actual mathematics literacy scores and the scores predicted from the input variables was analyzed in the MATLAB program using the curve fitting method, and the consistency of the results was determined. To determine the external validity of the results obtained from the WEKA program, the hold-out validation method (Souza et al., 2002) was used in MATLAB instead of the whole data set. In this method, the data file is divided into training and test data. First, the training data set is used to build the learned model, called the predictor. For each country, 70% of the whole data set was used for training, 15% for testing, and the remaining 15% for validation. In data mining, it is common to use one-third (1/3) of the available data for testing and the remaining two-thirds (2/3) for training. The literature states that better estimates are obtained when at least 70% of the data is used for training (Rácz et al., 2021).
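The 70/15/15 hold-out split can be sketched as follows. The rounding rule (15% rounded to the nearest integer for the test and validation sets, with the remainder assigned to training) is an assumption; it happens to reproduce the subset sizes reported for the country samples in this study. The function names and the random seed are hypothetical.

```python
import numpy as np

def holdout_sizes(n):
    """Return (train, test, validation) sizes for a 70/15/15 split."""
    n_test = (15 * n + 50) // 100  # 15% rounded to the nearest integer
    n_val = n_test
    return n - n_test - n_val, n_test, n_val

def holdout_indices(n, seed=0):
    """Shuffle row indices and cut them into train/test/validation parts."""
    order = np.random.default_rng(seed).permutation(n)
    n_train, n_test, _ = holdout_sizes(n)
    train = order[:n_train]
    test = order[n_train:n_train + n_test]
    val = order[n_train + n_test:]
    return train, test, val

# 6,115 is the Singapore sample size reported in the study.
train_idx, test_idx, val_idx = holdout_indices(6115)
```

Shuffling before cutting keeps the three subsets disjoint while avoiding any ordering effect in the original file.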

Findings
The aim of the study was to determine the variables that affect mathematics literacy in countries that participated in the PISA exam and were at different proficiency levels in terms of the country average. For this purpose, results are reported for the following countries: Singapore at the fourth level, Japan and Norway at the third level, the USA at the second level, Turkey at the first level, and the Dominican Republic below the first level.
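The assignment of country averages to proficiency levels can be illustrated with a short lookup. The lower score limits used here are the PISA mathematics cut-off scores as commonly published by the OECD; they are not given in this paper and should be treated as an assumption to be checked against the PISA technical documentation. The country averages are those reported in this study, and the function name is hypothetical.

```python
from bisect import bisect_right

# Assumed lower score limits of mathematics proficiency levels 1 to 6.
LEVEL_LOWER_BOUNDS = [357.77, 420.07, 482.38, 544.68, 606.99, 669.30]

def proficiency_level(score):
    """Return 0 for 'below level 1', otherwise the level number 1 to 6."""
    return bisect_right(LEVEL_LOWER_BOUNDS, score)

# National mathematics literacy averages reported in this study.
country_means = {
    "Singapore": 564,
    "Japan": 532,
    "Norway": 502,
    "USA": 470,
    "Turkey": 420,
    "Dominican Republic": 328,
}

levels = {name: proficiency_level(mean) for name, mean in country_means.items()}
```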

Results of Singapore Sample
Figure 2 shows the results obtained with the M5P method, one of the data mining decision tree methods, for determining the variables that affect mathematical literacy in Singapore, which had a national average mathematical literacy score of 564 in the PISA 2015 exam and was at the fourth proficiency level.

Figure 2. Decision Tree obtained for Singapore Sample
Regarding Figure 2, 16 different rules were created to predict students' mathematics literacy levels. The top-down analysis of the decision tree showed that the variable with the most impact on mathematics literacy was students' socioeconomic status index (SSI), and this variable was set as the root node. The resulting decision tree was divided into two branches according to the SSI variable, with a cut-off score of 0.065: teacher's fairness (TF) was effective on the mathematics literacy levels of students with an SSI level below 0.065, whereas the SSI variable itself was the most effective predictor for students whose SSI level was above 0.065. At the second level of branching, socio-economic level, teacher's fairness, and mathematics learning time were observed to be the most effective variables. Based on the predictions for a total of 6,115 Singapore students, the review of the whole decision tree revealed that socio-economic level, teacher's fairness, mathematics learning time, and total learning time were the most effective variables, in this order. Matthews correlation coefficient obtained from the WEKA program was 0.464; the mean absolute error was 64.58 and the root mean squared error was 87.72. In order to determine the external validity of the results, the data was divided into three parts: 70% (n = 4,281) training data, 15% (n = 917) test data, and 15% (n = 917) validation data.
The results of the prediction are shown in Figure 3.
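The error measures reported for the decision tree can be computed from actual and predicted scores as sketched below. The arrays are small synthetic values, not PISA data, and the correlation shown is the Pearson coefficient, which is an assumption about how the reported coefficient was computed.

```python
import numpy as np

actual = np.array([500.0, 520.0, 480.0, 610.0, 450.0])     # synthetic scores
predicted = np.array([510.0, 500.0, 490.0, 580.0, 470.0])  # synthetic predictions

errors = predicted - actual
mae = np.mean(np.abs(errors))                # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))         # root mean squared error
corr = np.corrcoef(actual, predicted)[0, 1]  # Pearson correlation (assumed)
```

Because squaring weights large errors more heavily, the root mean squared error is never smaller than the mean absolute error, which matches the pattern in the values reported for each country.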

Figure 3. Prediction Results of Different Data Sets for Singapore Sample
According to the results of the analysis carried out with the three-layer (input, hidden, and output) feed-forward network, shown in Figure 3, the percentages of correct prediction were 93.18% in the training data set, 91.96% in the test data set, and 91.86% in the validation data set, giving an average of 92.8% in the whole data set. The relationship between the predicted results and the actual literacy scores was analyzed and the results are shown in Figure 4. Accordingly, the R² value (the coefficient of determination) of the relationship, defined as y = 0.86x + 77.63, was calculated as 0.861. The RMSE value of the prediction was calculated as 17.57. According to this result, it was concluded that the literacy scores predicted from the input variables were 86.12% consistent.
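The relationship between actual and predicted scores summarised above (a fitted line, its coefficient of determination, and an RMSE) can be sketched as follows. The data are synthetic, and fitting a first-degree polynomial with numpy.polyfit is an assumed stand-in for MATLAB's curve fitting tool, which the study does not describe in detail. Computing the RMSE around the fitted line is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic actual and predicted scores (not PISA data).
actual = rng.normal(500.0, 90.0, size=1000)
predicted = 0.8 * actual + 100.0 + rng.normal(0.0, 20.0, size=1000)

# Fit predicted = slope * actual + intercept.
slope, intercept = np.polyfit(actual, predicted, deg=1)
fitted = slope * actual + intercept

ss_res = np.sum((predicted - fitted) ** 2)
ss_tot = np.sum((predicted - predicted.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot                   # coefficient of determination
rmse = np.sqrt(np.mean((predicted - fitted) ** 2))  # RMSE around the fitted line
r = np.corrcoef(actual, predicted)[0, 1]            # Pearson correlation
```

For a straight-line fit, the coefficient of determination equals the square of the correlation between the two variables, so multiplying it by 100 gives a percentage of the kind reported as consistency.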

Results of Japan Sample
Figure 5 shows the results obtained with the M5P method, one of the data mining decision tree methods, for determining the variables that affect mathematical literacy in Japan, which had a national average mathematical literacy score of 532 in the PISA 2015 exam and was at the third proficiency level.
Regarding Figure 5, 3 different rules were created to predict students' mathematics literacy levels. The top-down analysis of the decision tree showed that the variable with the most impact on mathematics literacy was students' mathematics learning time (MLT), and this variable was set as the root node. At the second level of the tree, the father's education level was observed to be effective on students' mathematics literacy. Matthews correlation coefficient obtained from the WEKA program was 0.502; the mean absolute error was 60.16 and the root mean squared error was 75.84. In order to determine the external validity of the results, the data was divided into three parts: 70% (n = 4,653) training data, 15% (n = 997) test data, and 15% (n = 997) validation data. The results of the prediction are shown in Figure 6.

Figure 6. Prediction Results of Different Data Sets for Japan Sample
Regarding Figure 6, the percentages of correct prediction were 57.56% in the training data set, 55.33% in the test data set, and 50.34% in the validation data set, giving an average of 56.10% in the whole data set. The relationship between the predicted results and the actual literacy scores was analyzed and the results are shown in Figure 7. Accordingly, the R² value (the coefficient of determination) of the relationship, defined as y = 0.31x + 367.00, was calculated as 0.315. The RMSE value of the prediction was calculated as 40.26. According to this result, it was concluded that the literacy scores predicted from the input variables were 31.46% consistent.

Results of Norway Sample
Figure 8 shows the results obtained with the M5P method, one of the data mining decision tree methods, for determining the variables that affect mathematical literacy in Norway, which had a national average mathematical literacy score of 502 in the PISA 2015 exam and was at the third proficiency level.
Regarding Figure 8, 13 different rules were created to predict students' mathematics literacy levels. The top-down analysis of the decision tree showed that the variable with the most impact on mathematics literacy was students' socioeconomic status index (SSI), and this variable was set as the root node. The most effective variables after SSI were total learning time and out-of-school learning time. Regarding the branching structure of the lower level of the tree, teacher's fairness and motivation were observed to be the most effective variables at the second level. Matthews correlation coefficient obtained from the WEKA program was 0.343; the mean absolute error was 62.26 and the root mean squared error was 88.49. In order to determine the external validity of the results, the data was divided into three parts: 70% (n = 3,820) training data, 15% (n = 818) test data, and 15% (n = 818) validation data. The results of the prediction are shown in Figure 9.

Figure 9. Prediction Results of Different Data Sets for Norway Sample
Regarding Figure 9, the percentages of correct prediction were 57.05% in the training data set, 51.73% in the test data set, 47.91% in the validation data set, and accordingly, an average of 54.90% in the whole dataset. The relationship between the predicted results and the actual literacy scores was analyzed and the results are shown in Figure 10.
Accordingly, the R² value (the coefficient of determination) of the relationship, defined as y = 0.29x + 353.20, was calculated as 0.301. The RMSE value of the prediction was calculated as 37.78. According to this result, it was concluded that the literacy scores predicted from the input variables were 30.12% consistent.

Results of the USA Sample
Figure 11 shows the results obtained with the M5P method, one of the data mining decision tree methods, for determining the variables that affect mathematical literacy in the USA, which had a national average mathematical literacy score of 470 in the PISA 2015 exam and was at the second proficiency level.

Figure 11. Decision Tree obtained for the USA Sample
Regarding Figure 11, 7 different rules were created to predict students' mathematics literacy levels. The top-down analysis of the decision tree showed that the variable with the most impact on mathematics literacy was students' socioeconomic status index (SSI), and this variable was set as the root node. Total learning time was observed to be effective on mathematics literacy levels at a lower level of the tree. Regarding the branching structure of the second level of the tree, total learning time was again observed to be the most effective variable. Matthews correlation coefficient obtained from the WEKA program was 0.566; the mean absolute error was 58.20 and the root mean squared error was 73.61. In order to determine the external validity of the results, the data was divided into three parts: 70% (n = 3,998) training data, 15% (n = 857) test data, and 15% (n = 857) validation data. The results of the prediction are shown in Figure 12.

Figure 12. Prediction Results of Different Data Sets for the USA Sample
Regarding Figure 12, the percentages of correct prediction were 64.83% in the training data set, 57.05% in the test data set, and 57.94% in the validation data set, giving an average of 62.58% in the whole data set. The relationship between the predicted results and the actual literacy scores was analyzed and the results are shown in Figure 13. Accordingly, the R² value (the coefficient of determination) of the relationship, defined as y = 0.39x + 287.90, was calculated as 0.392. The RMSE value of the prediction was calculated as 43.20. According to this result, it was concluded that the literacy scores predicted from the input variables were 39.15% consistent.

Results of Turkey Sample
Figure 14 shows the results obtained with the M5P method, one of the data mining decision tree methods, for determining the variables that affect mathematical literacy in Turkey, which had a national average mathematical literacy score of 420 in the PISA 2015 exam and was at the first proficiency level.

Figure 14. Decision Tree obtained for Turkey Sample
Regarding Figure 14, 6 different rules were created to predict students' mathematics literacy levels. The top-down analysis of the decision tree showed that the variable with the most impact on mathematics literacy was students' socioeconomic status index (SSI), and this variable was set as the root node. Mathematics learning time was observed to be effective at the lower level of the tree, while total learning time was effective on the mathematics literacy levels of students whose SSI level was above -0.495. Regarding the branching structure of the second level of the tree, total learning time was observed to be the most effective variable. Matthews correlation coefficient obtained from the WEKA program was 0.458; the mean absolute error was 57.24 and the root mean squared error was 71.98. In order to determine the external validity of the results, the data was divided into three parts: 70% (n = 4,127) training data, 15% (n = 884) test data, and 15% (n = 884) validation data. The results of the prediction are shown in Figure 15.

Figure 15. Prediction Results of Different Data Sets for Turkey Sample
Regarding Figure 15, the percentages of correct prediction were 53.80% in the training data set, 46.31% in the test data set, 45.58% in the validation data set, and accordingly, an average of 51.42% in the whole dataset. The relationship between the predicted results and the actual literacy scores was analyzed and the results are shown in Figure 16.

Figure 16. Prediction Results for Turkey Data
Accordingly, the R² value (the coefficient of determination) of the relationship, defined as y = 0.28x + 298.30, was calculated as 0.264. The RMSE value of the prediction was calculated as 37.74. According to this result, it was concluded that the literacy scores predicted from the input variables were 26.43% consistent.

Results of the Dominican Republic Sample
Figure 17 shows the results obtained with the M5P method, one of the data mining decision tree methods, for determining the variables that affect mathematical literacy in the Dominican Republic, which had a national average mathematical literacy score of 328 in the PISA 2015 exam and was below the first proficiency level.
Regarding Figure 17, 16 different rules were created to predict students' mathematics literacy levels. The top-down analysis of the decision tree showed that the variable with the most impact on mathematics literacy was students' socioeconomic status index (SSI), and this variable was set as the root node. Regarding the branching structure of the second level of the tree, belonging to the school and collaborative work were observed to be the most effective variables. Collaborative work, teacher's fairness, and total learning time were effective at the third level of branching, whereas collaborative work, belonging to the school, and total learning time were effective at the fourth level. Matthews correlation coefficient obtained from the WEKA program was 0.491; the mean absolute error was 46.03 and the root mean squared error was 58.18. In order to determine the external validity of the results, the data was divided into three parts: 70% (n = 3,318) training data, 15% (n = 711) test data, and 15% (n = 711) validation data. The results of the prediction are shown in Figure 18.
Regarding Figure 18, the percentages of correct prediction were 56.01% in the training data set, 51.06% in the test data set, 49.54% in the validation data set, and accordingly, an average of 54.27% in the whole dataset. The relationship between the predicted results and the actual literacy scores was analyzed and the results are shown in Figure 19.
Accordingly, the R² value (the coefficient of determination) of the relationship defined as y = 0.29x + 235.40 was calculated as 0.294, and the RMSE of the prediction was 30.09. According to this result, the literacy scores predicted from the input variables were 29.44% consistent with the actual scores.
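The quantities reported for each country (the fitted line, R², and RMSE) can be computed as in the sketch below. It assumes that y denotes the predicted score and x the actual score, and that RMSE is taken over the differences between predicted and actual scores; the study does not spell out either convention. The function name `fit_summary` and the toy scores are invented for illustration and are not PISA data.

```python
import math


def fit_summary(actual, predicted):
    """Regress predicted scores on actual scores.

    Returns (slope, intercept, r_squared, rmse), where rmse is the root mean
    squared difference between predicted and actual scores.
    """
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear regression, R^2 equals the squared Pearson correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    rmse = math.sqrt(sum((y - x) ** 2 for x, y in zip(actual, predicted)) / n)
    return slope, intercept, r_squared, rmse


# Invented toy scores, used only to show the calculation.
actual = [310.0, 355.0, 402.0, 448.0, 491.0, 530.0]
predicted = [395.0, 380.0, 421.0, 415.0, 450.0, 441.0]
slope, intercept, r_squared, rmse = fit_summary(actual, predicted)
print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r_squared:.3f}, RMSE = {rmse:.2f}")
```

A slope well below 1 combined with a large intercept, as in the reported equations, indicates predictions that are pulled toward the sample mean.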

Comparison of Results from Countries at Different Levels
The effective variables in each level of the decision trees created to determine the variables that affect mathematics literacy scores of the countries at different proficiency levels are shown in Table 4.
Regarding Table 4, [...] was effective for Singapore, and teacher's fairness (TF) was effective for Norway and the Dominican Republic. In terms of the correlation coefficient, one of the criteria for the reliability of the results, the best result was obtained for the Norway sample, followed by Turkey, Singapore, the Dominican Republic, Japan, and the United States. The validity criteria of the decision trees obtained for countries at different proficiency levels are shown in Table 5.
Regarding Table 5, the most consistent predictions were obtained for Singapore, followed by the USA, Japan, Norway, the Dominican Republic, and Turkey. In particular, the correlation coefficient between the actual and predicted values in the Singapore sample was 0.93, which shows that the variables examined are strong determinants of mathematics literacy. Overall, the correlation coefficients between actual and predicted values were high for Singapore and moderate for all other countries.
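The link between the consistency percentages and the correlation coefficients can be made explicit. Assuming the reported consistency values are the R² of a simple linear regression between actual and predicted scores (an interpretation, not stated in these terms by the study), the correlation coefficient is the square root of R². For the 86.10% reported for Singapore and the 26.43% reported for Turkey:

```latex
r_{\text{Singapore}} = \sqrt{R^2} = \sqrt{0.8610} \approx 0.93,
\qquad
r_{\text{Turkey}} = \sqrt{0.2643} \approx 0.51
```

Under this reading, the first value matches the high correlation reported for Singapore and the second falls in the moderate range.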

Conclusion, Discussion and Suggestions
In this study, the mathematical literacy levels of the participating students were predicted using data mining and artificial neural network methods on data from the PISA 2015 exam organized by the OECD, and the variables that affect mathematics literacy were determined. Mathematics literacy levels and the variables that affect them were analyzed separately for each of the countries selected to represent different proficiency levels. In the first stage of the study, data mining prediction methods were used, whereas in the second stage the output variable was predicted from the input variables using artificial neural network methods. The results obtained from the WEKA and MATLAB programs were compared with studies in the literature, similarities and differences were identified, and suggestions were made for future research.
In the first sub-problem of the study, the variable with the most impact on the mathematics literacy of the Singapore sample was found to be students' socioeconomic status index (SSI). In addition, teacher's fairness and mathematics learning time were the most effective variables at the second level, and the most effective variable at the fourth level of the decision tree was total learning time. Singapore came under British rule in 1819, gained self-government in 1959, and separated from Malaysia in 1965 to become a fully independent country (OECD, 2012). In the colonial period, each ethnic and religious group had its own separate education system. After independence, the government identified families in poor economic conditions and provided financial assistance to reduce both the educational and the social achievement gap (Levent & Yazıcı, 2014). Following the economic downturn in 2009, unemployment rates increased, many families experienced economic difficulties, and most families took their children out of school to reduce expenses and directed them to paid work (UNESCO, 2011). These experiences are thought to explain why SSI is the variable with the greatest impact on mathematical literacy in the Singapore sample. In addition, students in Singapore start mathematics in the first grade and science in the third grade, which may explain why mathematics learning time and total learning time are the other effective variables (CIU, 2008). In Singapore, teacher candidates are paid during their undergraduate education, and teacher salaries are high, which encourages talented young people to choose this profession. The emphasis placed on practices aimed at improving teacher quality, and on quality rather than quantity, leads teachers to devote themselves to their jobs (Bakioğlu & Göçmen, 2013). For this reason, the teacher's fairness variable is thought to be effective on mathematical literacy. These results are similar to those of Areepattamannil and Kaur (2013), who found that the variables affecting the mathematical literacy of Singaporean students were SSI and learning time.
In the second sub-problem of the study, the variable with the most impact on the mathematics literacy of the Japan sample was found to be students' mathematics learning time (MLT); of the other variables, only father's education level was effective, at the second level. Accordingly, it was concluded that only two of the 15 predictive variables covered in the study were effective on mathematics literacy. That mathematics learning time was the most effective variable in the Japan sample is thought to be due to shadow education, which has become popular especially in the last 20 years (Dierkes, 2010). Shadow education comprises additional courses or coaching services, organized mostly outside of school, that aim to give students extra help and prepare them for exams. Unofficial individual lessons are available in many East Asian communities, including Hong Kong, Taiwan, Japan, and South Korea (Brown & Baker, 2012). In the Japanese education system, the mother is usually the family member who is responsible for the children, decides on the choice of school, and helps with their homework (Youbi et al., 2019).
In this study, the father's education level was found to be effective on mathematics literacy, while the mother's education level did not have a significant effect, which differs from the literature. However, because shadow education in Japan is also provided by private schools and institutions, access to it depends on the income of the father, who is primarily responsible for the family's livelihood; this may explain why the father's education level is effective on mathematics literacy (Drinck, 2002; Entrich, 2014). In addition, the SSI variable did not have a significant effect on mathematics literacy, which is consistent with findings in the literature (Stacey, 2011; OECD, 2010).
In the third sub-problem of the study, the variable with the most impact on the mathematics literacy of the Norway sample was found to be students' socioeconomic status index; in addition, total learning time and out-of-school learning time were the most effective variables at the second level. The most effective variable at the third level was teacher's fairness, whereas the variable effective at the fourth level of the decision tree was anxiety. That SSI was the most effective variable for Norwegian students is thought to result from the fact that Norwegian families pay more for education than families in other countries, with these payments amounting to approximately 20% of their income (Levin, 2003). Another reason offered for this result is that 40% of children in a developed country such as the United States stated that they were poor, whereas only 4% of Norwegian students described themselves as poor. In Scandinavian countries such as Norway, all students receive the same type of education in the same type of schools, where individualized education is implemented, meaning that education is structured to suit the needs of the student (Baird, Isaacs, Johnson, Stobart, Yu, Sprague, and Daugherty, 2011). In Norway, especially after the 2003 PISA exam, teachers started to give their students a large number of tests even though these do not affect the exams (Ackerman & Kanfer, 2009). The motivation variable is thought to be effective on Norwegian students' mathematics literacy because students are exposed to more tests than normal (Mausethagen, 2013).
In the fourth sub-problem of the study, the variable with the most impact on the mathematics literacy of the USA sample was found to be students' socioeconomic status index, whereas total learning time was the most effective variable at the second level. The same variables were found to be effective at the lower levels of the tree. Accordingly, it was concluded that only two of the 15 predictive variables covered in the study were effective on mathematics literacy. The finding that SSI is the most effective variable on mathematics literacy in a country with a state system, such as the United States, is similar to other studies in the field (Bos et al., 2016). In addition, Carnoy and Rothstein (2013) stated that the socioeconomic status and family characteristics of American students have an important effect on their academic success. Fiszbein and Stanton (2018) found great differences in achievement among six social classes, from the lowest to the highest, both for the USA overall and for individual states. Similarly, the report published by OECD (2011) stated that the difference between social classes in the United States is quite high compared to Germany, England, France, Korea, Finland, and Canada. Darling-Hammond (2014) likewise stated that the increasing poverty rate, discrimination, income inequality, and inequality in school expenditures are the reasons for the performance difference in large-scale exams such as PISA. These results support the finding that SSI is the variable with the most impact on mathematical literacy. Heyneman (2013) stated that although American students' out-of-school learning time is one-seventh of that of Korean students, their time spent in school is higher. Similarly, among the countries participating in the PISA exam, American students rank at the top in total learning time after China, South Korea, and Singapore (OECD, 2015). In this study, total learning time was the second most effective variable on mathematical literacy for the USA sample, which is consistent with the literature.
In the fifth sub-problem of the study, the variable with the most impact on the mathematics literacy of the Turkey sample was found to be students' socioeconomic status index, whereas mathematics learning time and total learning time were the most effective variables at the second level. The same variables were found to be effective at the lower levels of the tree. Accordingly, it was concluded that only three of the 15 predictive variables covered in the study were effective on mathematics literacy. Mutluer and Büyükkıdık (2017) reported parents' education level, enjoyment of mathematics, perseverance (giving up quickly), and self-efficacy as the variables having a significant impact on the mathematical literacy of students in the PISA 2012 Turkey sample. Similarly, Aksu and Güzeller (2016) found that self-efficacy, attitude, working discipline, and anxiety were effective on mathematics literacy in the PISA 2012 exam. The results of this study differ from those in the literature. This difference is thought to arise because the exams were administered in different years. The main subject of PISA 2012 was mathematics, so many affective features related to mathematics were surveyed, whereas mathematics was less prominent in PISA 2015 because the main subject was science. In addition, both of these studies treated the dependent variable as a two-category variable, whereas in this study mathematics achievement was treated as a continuous variable. The findings of this study also differ from those of Aksu, Güzeller, and Eser (2017). In that study, SSI, participation in courses, and participation in training did not have a significant effect on mathematics literacy, whereas in the present study, which used PISA 2015 data, SSI, mathematics learning time, and total learning time were the most effective variables. This difference is thought to be due to differences in the data sets and analysis methods. Similarly, the results differ from those reported by Çetin and Gök (2017), who found self-efficacy, interest, and study ethics to be effective on mathematics literacy. The main reason for the difference between the results of this study and those of studies with similar aims is thought to be the inclusion of different variables. Data on affective characteristics such as mathematics self-efficacy, interest, and attitude were not available in the PISA 2015 exam. This study is thought to contribute to the literature by identifying other variables that have an impact on mathematics literacy.
In the sixth sub-problem of the study, the variable with the most impact on the mathematics literacy of the Dominican Republic sample was found to be students' socioeconomic status index, and belonging to the school was the most effective variable at the second level. The most effective variable at the third level was collaborative work, whereas the variables effective at the fourth level of the decision tree were teacher's fairness and total learning time. Accordingly, it was concluded that only six of the 15 predictive variables covered in the study were effective on mathematics literacy. Unlike in other countries, belonging to the school, the desire to work collaboratively, and motivation were effective on mathematics literacy, which is thought to be due to the country's diverse ethnic groups. The presence of Jewish and Arab communities alongside Chinese and Japanese communities has made the Dominican Republic a multicultural country (Krasnodebski et al., 2012). In addition, the most effective variable on the mathematics literacy of Dominican students was SSI, which suggests that the impact of socioeconomic variables on achievement is the same (13%) as the OECD average (Bos, Elias, Vegas and Zoido, 2016). It was also reported that 80-90% of Dominican students complete primary school, whereas only 25-30% complete secondary school (Jensen, 2010).
When the decision trees were analyzed as a whole, the socioeconomic status index (SSI) was found to be the variable with the most impact on mathematical literacy in all countries except Japan, where mathematics learning time was the most effective variable. From the second level of the trees onward, however, the variables affecting mathematical literacy differ across countries at different proficiency levels. Similarly, the amount of variance explained by the models decreased as the proficiency level decreased. Accordingly, it was concluded that the variables that affect mathematical literacy differ by country, and that the validity of the results tends to decrease as the proficiency level of the country decreases.

Suggestions
Since the variables that affect mathematical literacy are limited to those covered in the study, it is recommended that researchers identify the independent variables that are effective on the dependent variable using the cross-validation method in the variable selection menu of the WEKA program before starting the analysis, and thus perform their analysis with fewer variables. In this way, it is thought that more consistent and less error-prone results can be obtained.
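The idea of screening variables with cross-validation before modeling can be sketched as follows. This is not WEKA's attribute selection procedure; it is a simplified stand-in that ranks each variable by the k-fold cross-validated RMSE of a one-variable linear fit. The function names (`fit_line`, `cv_rmse`, `rank_variables`), the variable names, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y  # constant predictor: fall back to the mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def cv_rmse(xs, ys, k=5, seed=0):
    """k-fold cross-validated RMSE of a one-variable linear fit."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    squared_error = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_error += sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in fold)
    return math.sqrt(squared_error / len(xs))


def rank_variables(columns, target, k=5):
    """Sort variables from lowest to highest cross-validated RMSE."""
    scores = {name: cv_rmse(values, target, k) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Synthetic data: the score depends on "ssi" but not on "unrelated".
rng = random.Random(1)
n = 200
ssi = [rng.gauss(0, 1) for _ in range(n)]
unrelated = [rng.gauss(0, 1) for _ in range(n)]
score = [400 + 40 * s + rng.gauss(0, 10) for s in ssi]
ranking = rank_variables({"ssi": ssi, "unrelated": unrelated}, score)
for name, err in ranking:
    print(f"{name}: cross-validated RMSE = {err:.1f}")
```

Variables whose held-out error is no better than that of an uninformative predictor are candidates for removal before the main analysis.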
Variables that affect mathematics literacy may have more or less impact on achievement depending on the country. For this reason, it is recommended that cross-country comparisons be interpreted more neutrally by considering the effect sizes of these variables instead of focusing only on the variables that directly affect achievement.
In addition to classical regression-based methods, researchers are advised to use methods such as data mining and artificial neural networks, which work easily on large datasets and provide richer reports, when reporting the variables that affect achievement. Since PISA data were used in this study, it is suggested that similar studies be conducted with data from other large-scale exams, such as TIMSS and PIRLS. In this way, it may be possible to determine whether the results depend on the exam applied.
In this study, only the M5P learning algorithm, one of the data mining prediction methods, was used in the analysis. Similar comparisons between countries may be made using different learning methods, and the differences in the results according to the method used can be tested. Although the reliability and validity values differ across countries, evaluating them according to common criteria provides a better interpretation of the results. Therefore, it is recommended to use standardized evaluation criteria for the sake of the reliability of the results. Since the analysis was performed using PISA data, it is recommended that future studies determine the variables that are effective on Turkish students' mathematics literacy based on the results of large-scale exams conducted by ÖSYM (Student Selection and Placing Centre). It is thought that mathematics achievement will be better modeled for Turkish students by using tests that measure affective characteristics along with achievement tests, as in the PISA exam.