A Comparison of IRT Model Combinations for Assessing Fit in a Mixed Format Elementary School Science Test

Open-ended and multiple choice questions are commonly placed on the same tests; however, the effects of mixing item types on test and item statistics are still debated. This study compares model and item fit statistics in a mixed format test in which multiple choice and constructed response items are used together. The items of a 25-item fourth grade science test administered to 2351 students in 35 schools in Turkey were calibrated separately and concurrently using different IRT models. An important aspect of this study is that the effect of the calibration method on model and item fit is investigated on real data. First, the 1-, 2-, and 3-Parameter Logistic models were used to calibrate the binary coded items, and the Graded Response Model and the Generalized Partial Credit Model were used to calibrate the open-ended ones. Then, combinations of dichotomous and polytomous models were employed concurrently. Model comparisons revealed that the combination of the 3PL and the Graded Response Model produced the best fit statistics.


Introduction
Tests play crucial roles in individuals' lives. Exams are used for many reasons, such as selection and placement of individuals, determining which knowledge areas need to be improved, and planning and revising educational programs. Test design, analysis of test scores, and interpretation of test results have been important aspects of measuring examinees' trait levels (Kinsey, 2003). Public concern boosts discussions on tests regarding their reliability and validity, which are affected by many elements, such as test length, item format, and scoring.
Multiple choice (MC) items are the most common item type in tests. Although the MC format is criticized because examinees can guess the correct answer, many tests include only MC items, owing both to budget and time constraints and to the difficulty of defending test scores to the public in plain terms. Although MC items are economically practical and secure objective and reliable marking, it is difficult to measure higher order thinking with them. In addition, as Lissitz, Hou and Slater (2014) stress, if MC items are used exclusively in testing, the focus of instruction and learning will undermine learners' analysis, synthesis and evaluation skills, which in turn risks the loss of the active construction of knowledge. To address these major limitations, constructed response (CR) items can be incorporated in tests. CR items, however, are difficult to score objectively and reliably, even though they are considered to measure examinees' understanding of the content at a deeper level (Kim, Walker & McHale, 2008). Mixed format tests that include both MC and CR items are highly effective measurement tools for teaching and learning because they overcome the limitations stemming from the separate use of each format. When the two formats are combined, more reliable content total scores are obtained and a more precise latent trait is defined (Sykes & Yen, 2000). However, as Hollingworth, Beard and Proctor (2007) state, some educators and policy makers believe that constructed response items and multiple choice items do not measure the same construct when placed on the same tests.
The purpose of the present study was to investigate the applicability of separately and concurrently calibrating the dichotomous and polytomous items in data from a 4th grade science examination using different Item Response Theory (IRT) models. This makes it possible to examine how model and item fit statistics vary when MC and CR items are analyzed separately and together. In addition, it gives insight into which IRT model is the better candidate for further use on achievement test data.
Classical Test Theory (CTT) has been utilized in many testing systems; yet it has many shortcomings, such as the dependence of item statistics (i.e., difficulty and discrimination) on the particular examinee sample, its average ability level, and its range of scores. Another important shortcoming is that a valid comparison of examinees from different groups is possible only when the same or parallel tests are administered. In CTT, test reliability is described in terms of parallel forms, although constructing such forms is not practical in the real world.
IRT has been employed to compute scale scores for achievement tests by most testing agencies throughout the world. When there is a reasonable fit between the selected model and the data, IRT models produce invariant item statistics and ability estimates. As Hambleton and Swaminathan (1991) explained, the IRT estimate of an examinee's ability does not depend on a particular sample of test items. Also, the precision of ability estimates is known and is free to vary from one examinee to another (Baker, 2001). However, as Bergan (2010) reports, IRT model selection is often based solely on philosophical considerations rather than empirical tests. In general, education policies dictate the choice of IRT model, which creates a danger of misinterpreting the data being analyzed because measures of relative fit are ignored (Brown, Templin & Cohen, 2015). Therefore, it is imperative to compare the relative fit of competing models to avoid misleading interpretations of the data and wrong decisions about test takers' performance.

IRT Models
Many different approaches have been developed to calibrate items in the IRT framework. The current study focuses on item calibrations based on the 1-, 2-, and 3-Parameter Logistic Models (1PL, 2PL, 3PL), the Generalized Partial Credit Model (GPCM), and the Graded Response Model (GRM). The foundations of the 1PL model were laid by the Danish mathematician Georg Rasch. He demonstrated that item difficulties and examinee ability are sufficient statistics for measurement and introduced the Rasch Model (Rasch, 1960). In the 1PL model, which was developed based on Rasch's work, the probability of a correct response is modeled as a function of ability:
$$P_i(\theta_j) = \frac{e^{1.7(\theta_j - \beta_i)}}{1 + e^{1.7(\theta_j - \beta_i)}}$$
where $\theta_j$ is the ability and $\beta_i$ is the difficulty parameter. The letter e is the base of natural logarithms (e ≈ 2.718), and the 1.7 in the exponent lets the logistic function approximate the normal ogive function (Warm, 1978). Although the Rasch Model and the 1PL are philosophically different (Andrich, 2004; Linacre, 2005), the differences between them are beyond the scope of the current study. The 1PL model assumes equal discrimination across all items, and a guessing parameter is not included because the model assumes that the ability parameter is a sufficient statistic for comparing individuals taking a particular test (Baghei and Carstensen, 2013). The two-parameter model was developed by Lord (1952) based on the cumulative normal distribution. Birnbaum (1968) replaced the two-parameter normal ogive function with the two-parameter logistic function to model item characteristics (Hambleton, Swaminathan & Rogers, 1991). He modeled the probability of a correct response as a function of difficulty and discrimination parameters:
$$P_i(\theta_j) = \frac{e^{1.7\alpha_i(\theta_j - \beta_i)}}{1 + e^{1.7\alpha_i(\theta_j - \beta_i)}}$$
where $\alpha_i$ is the discrimination parameter. Birnbaum (1968) modified the 2PL model by adding a parameter that represents the contribution of guessing to the probability of a correct response (Baker, 2001). That is, in the 3PL model the probability of a correct response depends on guessing in addition to difficulty and discrimination:
$$P_i(\theta_j) = c_i + (1 - c_i)\,\frac{e^{1.7\alpha_i(\theta_j - \beta_i)}}{1 + e^{1.7\alpha_i(\theta_j - \beta_i)}}$$
where $c_i$ is the guessing parameter. The partial credit model (PCM) was introduced in 1982 by Masters, who decomposed the response to an item into a series of ordered pairs of adjacent categories and then applied a dichotomous model to each pair, assuming equal discrimination across the items (De Ayala, 2009). Muraki (1992), in turn, relaxed the equal discrimination assumption, applied the 2PL model to polytomously scored items, and introduced the GPCM. This model assumes that the probability of choosing the kth category over the (k-1)th category follows the logistic dichotomous response model (Muraki, 1992):
$$P_{jk|k-1,k}(\theta) = \frac{P_{jk}(\theta)}{P_{j,k-1}(\theta) + P_{jk}(\theta)} = \frac{\exp[Z_{jk}(\theta)]}{1 + \exp[Z_{jk}(\theta)]}$$
where k = 2, 3, ..., m indexes the response categories and, in Muraki's notation, the subscript j indexes the item. The GPCM is then written as
$$P_{jk}(\theta) = \frac{\exp\left[\sum_{v=1}^{k} Z_{jv}(\theta)\right]}{\sum_{c=1}^{m} \exp\left[\sum_{v=1}^{c} Z_{jv}(\theta)\right]}$$
and
$$Z_{jk}(\theta) = D a_j(\theta - b_{jk}) = D a_j(\theta - b_j + d_k)$$
where D is a scaling constant (1.7) that puts θ on the same metric as the normal ogive model, $b_{jk}$ is an item category parameter, and $b_j$ is an item location parameter. While $a_j$ represents the slope, $d_k$ is the category parameter (Muraki, 1993).
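As an illustration of the GPCM described above, the following minimal Python sketch computes category probabilities for a single item. It assumes Muraki's parameterization, in which the category logit is Z_k = D·a·(θ − b + d_k) with d_1 fixed at 0; the function name `gpcm_probs` and all parameter values are hypothetical and are not estimates from this study.

```python
import math

D = 1.7  # scaling constant mentioned in the text


def gpcm_probs(theta, a, b, d):
    """Category probabilities for one item under the GPCM.

    theta: examinee ability
    a:     item slope
    b:     item location
    d:     category parameters d_1..d_m (d_1 = 0 by convention)
    Returns a list with one probability per ordered category.
    """
    # Z_k = D * a * (theta - b + d_k) for each category k
    z = [D * a * (theta - b + dk) for dk in d]
    # Running sums of Z_1..Z_k form the exponent of each numerator
    cumulative, running = [], 0.0
    for zk in z:
        running += zk
        cumulative.append(running)
    # Subtract the largest exponent before exponentiating for stability
    top = max(cumulative)
    numerators = [math.exp(c - top) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]


# Hypothetical three-category item (scores 0, 1, 2)
probs = gpcm_probs(theta=0.5, a=1.0, b=0.0, d=[0.0, 0.5, -0.5])
print([round(p, 3) for p in probs])
```

With only two categories the adjacent-category probability reduces to the 2PL form, which is one way to check the sketch against the dichotomous model.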
The GRM was developed by Fumiko Samejima (1969). Within the GRM, the b-parameter for each response category is the point on the θ scale at which an examinee has a 50% probability of scoring x or higher on the CCRF (Tang, 1996). Samejima modeled the probability of a person responding in category k or higher versus responding in categories lower than k as
$$P^*_{ix}(\theta) = \frac{\exp[D a_i(\theta - b_{ix})]}{1 + \exp[D a_i(\theta - b_{ix})]}$$
where $P^*_{ix}(\theta)$ is the cumulative category response function (CCRF), representing the probability of scoring x or above on item i for an examinee with proficiency level θ, $a_i$ is the item slope, and $b_{ix}$ is the location parameter of score x. The probability of each score category is
$$P_{ix}(\theta) = P^*_{ix}(\theta) - P^*_{i,x+1}(\theta)$$
with $P^*_{i0}(\theta) = 1$ and $P^*_{i,m+1}(\theta) = 0$ for an item whose highest score is m, and the score category response function (SCRF) of the GRM can be written as
$$P_{ix}(\theta) = \frac{\exp[D a_i(\theta - b_{ix})]}{1 + \exp[D a_i(\theta - b_{ix})]} - \frac{\exp[D a_i(\theta - b_{i,x+1})]}{1 + \exp[D a_i(\theta - b_{i,x+1})]}$$
The Partial Credit and Generalized Partial Credit Models are generalized from the dichotomous IRT models to describe an examinee's probability of selecting a possible score category among all score categories. Dichotomous IRT models describe how likely individuals at a certain ability level are to reach score category k rather than k-1. Thus, the k and k-1 categories of polytomously scored items can be viewed as dichotomous categories. While the Partial Credit Model assumes that the discrimination indices of all items are constant, the Generalized Partial Credit Model frees them. These differences between the PCM and the GPCM are similar to those between the Rasch or 1PL model and the 2PL model (Tang & Eignor, 1997). The GRM, on the other hand, assumes that the boundary parameters of the categories are ordered. That is, each score category has a point where the probability of that category is highest.
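The dichotomous models and the GRM discussed in this section can be sketched together, because each GRM boundary curve has the same logistic form as a 2PL curve. The Python sketch below is illustrative only: the function names `p_3pl` and `grm_probs` and the parameter values are hypothetical, D = 1.7 is taken from the text, and the 1PL is treated as the special case with a common slope and no guessing.

```python
import math

D = 1.7  # scaling constant mentioned in the text


def p_3pl(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL.

    Setting c = 0 gives the 2PL; additionally holding a equal
    across items gives the 1PL form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def grm_probs(theta, a, boundaries):
    """Score category probabilities under the GRM for one item.

    boundaries: ordered location parameters b_1 < ... < b_m for
    scores 1..m. Returns probabilities for scores 0..m.
    """
    if any(lo >= hi for lo, hi in zip(boundaries, boundaries[1:])):
        raise ValueError("boundary parameters must be strictly increasing")
    # Cumulative curves P*(score >= x); P*_0 = 1 and P*_{m+1} = 0
    cumulative = [1.0] + [p_3pl(theta, bx, a) for bx in boundaries] + [0.0]
    # Each category probability is a difference of adjacent curves
    return [cumulative[x] - cumulative[x + 1] for x in range(len(boundaries) + 1)]


# Hypothetical CR item scored 0, 1, 2
print([round(p, 3) for p in grm_probs(theta=0.0, a=1.0, boundaries=[-1.0, 1.0])])
```

At θ equal to a boundary parameter, the probability of scoring at or above that score is .50, which matches the interpretation of the b-parameter given above.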

Method
To realize the aims of the study, different IRT models were applied to the data collected from 2351 fourth grade students in 35 elementary schools in Turkey. The exam was part of a formative assessment initiative. Twenty-five science items were administered to all participants; 14 were scored dichotomously and the remaining 11 were scored polytomously.
Before the items were written, 10 science teachers were selected as item writers based on school administrators' references and peer ratings of their teaching quality. A two-day training on item writing was provided to the teachers by two educational measurement specialists. Seventy-five items were generated by those item writers, and 25 items were selected based on content validity indicators set in accordance with the 4th grade science curriculum. The two educational measurement specialists participated in the item selection process along with the item writers. After the selection process, answer keys for each item were prepared by three item writers and the two specialists. During this process, possible and plausible answers for the graded items were prepared and a detailed rubric was developed. After the administration of the exam, constructed response items were coded 0 if the answer was incorrect, 1 if it was partially correct, and 2 if it was correct; the code 9 was used to mark unattempted items.
After data collection, all answer sheets were graded by at least two teachers who had also participated in the item writing process. When the ratings of an item differed, the raters convened, discussed, and decided on the final mark. The dichotomous items were then calibrated with the 1PL, 2PL, and 3PL models, and the constructed response items were calibrated with the GPCM and GRM. Next, all items were calibrated concurrently using mixed models. Each estimated model was compared with a competing, more complex and potentially better-fitting model. Model selection was based on RMSEA, -2LL, the number of misfitting items, and item fit statistics, bearing in mind that the more parsimonious model is preferable. Violating the principle of parsimony creates unnecessarily complicated models and weakens predictions about new data sets (Kang, Cohen & Sung, 2009).

Results
Before performing the analyses with IRTPRO (Cai, du Toit & Thissen, 2011), the unidimensionality assumption was tested with a categorical confirmatory factor analysis (Cat-CFA) in Mplus (Muthen & Muthen, 2012). A χ² value of 1549.42 with 275 degrees of freedom indicated a poor fit (p < .01); however, the chi-square statistic is known to be affected by sample size, so this result is not surprising. The TLI (NNFI) (.90) and CFI (.91) results indicated a reasonable fit (Hu & Bentler, 1999). In addition, the obtained RMSEA value of .04 represented a good fit (Steiger, 2007). Therefore, the data set was considered unidimensional.
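As a rough plausibility check on the reported RMSEA, the common point-estimate formula, the square root of max(χ² − df, 0) / (df·(N − 1)), can be applied to the reported chi-square. This is a minimal sketch under the assumption that this textbook formula applies; the software's own computation for categorical data may differ in detail, and the function name `rmsea` is hypothetical.

```python
import math


def rmsea(chi_square, df, n):
    """Point estimate of RMSEA from a model chi-square (textbook formula)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Values reported in the text: chi-square = 1549.42, df = 275, N = 2351
value = rmsea(1549.42, 275, 2351)
print(round(value, 3))  # 0.044
```

The result rounds to .04, which agrees with the value reported above.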
The first step of the IRT analysis in the current study was calibrating the MC and CR items separately and determining the number of misfitting items. Thissen's (2000, 2003) S-X² statistics were computed to evaluate item misfit throughout the study. This statistic was originally developed for dichotomous IRT models and was found to perform better than the traditional item-fit statistics. S-X² was generalized to polytomous models by Chen (2008, 2010). The 14 dichotomously scored items were calibrated with the 1PL, 2PL, and 3PL models. Table 1 includes the results of the analyses. As Table 1 shows, the 1PL, 2PL, and 3PL models fit the data well based on the RMSEA statistics; each has an RMSEA value of .05 or less, indicating a close approximate fit (Kline, 2005). However, the reliability statistics were considered low, which may be due to the small number of items. It is important to note that although marginal reliability in the IRT framework is similar to reliability in the CTT framework in that it is a measure of the overall test, marginal reliability is based on the average conditional standard errors at various levels of the measurement scale (Green, Bock, Linn, & Reckase, 1984). Marginal reliability can be expressed as
$$\bar{\rho} = \frac{\sigma_\theta^2 - \int \sigma_e^2(\theta)\, g(\theta)\, d\theta}{\sigma_\theta^2}$$
in which $\sigma_e^2(\theta)$ is the conditional error variance, $g(\theta)$ is the population density, and $\sigma_\theta^2$ denotes the variance of the latent trait (Florida Department of Education, 2015). The literature suggests that the deviance test based on the -2 log likelihood (-2LL) statistic can be used to assess model improvement. The difference in the -2LL statistic is distributed as a χ² statistic with degrees of freedom equal to the difference in the number of parameters between the two models. If the difference in -2LL is greater than the critical value, the addition of the extra parameters contributes significantly to the fit of the model (Hambleton, Swaminathan & Rogers, 1991). The difference in -2LL between the 1PL and 2PL (χ²(13) = 58.97, p < .05) was statistically significant.
Similarly, the difference between the 2PL and 3PL (χ²(14) = 73.45, p < .05) was also significant. These findings indicate that model fit improves as parameters are added. Furthermore, while 2 of the 14 items showed misfit in the 1PL model, 1 item showed misfit in the 2PL. All of the items fit the 3PL model well. Table 2 includes detailed information regarding the S-X² item-level diagnostic statistics of the 14 dichotomous items. An investigation of the item difficulties helps one see how those values change as the model improves. As noted above, fit improves significantly as parameters are added. Table 3 shows that not only the item difficulties but also the order of the items by difficulty change dramatically. For example, while item 16 is the most difficult item when the 1PL or 2PL is the model of choice, it is the fourth most difficult item in the 3PL model.
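The deviance tests reported for the dichotomous items can be sketched in a few lines of Python. The sketch below uses only the standard library and a closed-form chi-square tail probability for integer degrees of freedom. The parameter counts (15, 28, and 42 for the 1PL, 2PL, and 3PL with 14 items) are an assumption, chosen because they reproduce the reported degrees of freedom if the 1PL estimates one common slope; the function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite series
    root = math.sqrt(x)
    total, term = 0.0, root
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(root / math.sqrt(2)) + 2 * density * total


def deviance_test(deviance_difference, params_simple, params_complex, alpha=0.05):
    """Return (df, p-value, significant?) for two nested models."""
    df = params_complex - params_simple
    p_value = chi2_sf(deviance_difference, df)
    return df, p_value, p_value < alpha


# -2LL differences reported in the text; parameter counts are assumed
comparisons = [("1PL vs 2PL", 58.97, 15, 28), ("2PL vs 3PL", 73.45, 28, 42)]
results = {label: deviance_test(diff, ks, kc) for label, diff, ks, kc in comparisons}
for label, (df, p_value, significant) in results.items():
    print(f"{label}: df={df}, p={p_value:.2e}, significant={significant}")
```

Both comparisons come out significant at the .05 level, in line with the results reported above.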
Considering only the RMSEA values, it might seem reasonable to skip model comparison and conclude that the 1PL fits the data considerably well. Further analysis of the -2LL statistics, however, shows not only that the 3PL model is preferred over the 1PL and 2PL but also that the difficulty values obtained from the first two models are misleading. Recall that, whereas the difficulty parameter in the 1PL represents the proportion of examinees who respond correctly, in the more complex models it represents that proportion after accounting for item-specific discrimination and guessing parameters (Bergan, 2010). After the MC items were analyzed, the remaining 11 CR items in the test were analyzed through Muraki's GPCM and Samejima's GRM. As shown in Table 4, the fit statistics based on those two models are similar. Although the RMSEA was .05, indicating a good overall model fit, three items had poor fit statistics at the .01 level when the GPCM was used. A reliability value of .77 is considered acceptable. When the same 11 CR items were analyzed through Samejima's Graded Response Model (GRM), an RMSEA of .04 and a reliability of .78 indicated a slightly better overall fit than that of the GPCM. Both models had 3 misfitting items. Item statistics are provided in Table 5.
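To make the marginal reliability values reported for these calibrations more concrete, the sketch below approximates marginal reliability as the trait variance minus the density-weighted average conditional error variance, divided by the trait variance. It assumes a standard normal trait, takes the conditional error variance as the reciprocal of the test information, and uses 2PL item information with D = 1.7. The item parameters are hypothetical and are not the estimates from this study.

```python
import math

D = 1.7  # scaling constant mentioned in the text


def info_2pl(theta, a, b):
    """Item information for a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)


def marginal_reliability(test_information, lo=-4.0, hi=4.0, steps=161):
    """Marginal reliability for a N(0, 1) trait via grid quadrature."""
    width = (hi - lo) / (steps - 1)
    nodes = [lo + i * width for i in range(steps)]
    # Unnormalized standard normal weights, normalized by their sum below
    weights = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(weights)
    # Average conditional error variance, with error variance = 1 / information
    mean_error_variance = sum(
        w / test_information(t) for w, t in zip(weights, nodes)
    ) / total
    trait_variance = 1.0
    return (trait_variance - mean_error_variance) / trait_variance


# Hypothetical 14-item test: common slope, difficulties spread over [-2, 2]
items = [(1.0, -2.0 + 4.0 * i / 13) for i in range(14)]
reliability = marginal_reliability(lambda t: sum(info_2pl(t, a, b) for a, b in items))
print(round(reliability, 2))
```

Because the error variance is averaged over the ability distribution, a test that is informative only near the mean can still show modest marginal reliability.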
As the second step of the IRT analyses, all the MC and CR items were calibrated simultaneously and fit indices were examined to compare different models. Table 6 shows that the data have acceptable RMSEA and marginal reliability statistics in all combined models. The 1PL, 2PL, and 3PL models combined with the GRM and GPCM fit the data well based on the RMSEA statistics. That is, when dichotomous and polytomous items are analyzed together in the current achievement test, either the GRM or the GPCM can be chosen. A close look at the differences in the -2LL statistics revealed that fit improves as more parameters are added to the model. When the GRM was used for the CR items, the -2LL difference between the 1PL and the 2PL (χ²(13) = 545.05, p < .05) was significant; however, the difference between the 2PL and the 3PL was not (χ²(14) = 13.62, p > .05). Similarly, when the GPCM was used for the CR items, the -2LL difference between the 1PL and the 2PL (χ²(13) = 287.95, p < .05) was significant, whereas the difference between the 2PL and the 3PL (χ²(14) = 13.42, p > .05) was not. These preliminary results suggest that when dichotomous and polytomous models are combined in the same test, the GRM and GPCM produce similar results. That is, considering the overall model fit statistics, once the polytomous model has been chosen, either the 2PL or the 3PL model can be chosen for the dichotomously scored items. Yet the item statistics should be taken into consideration before making the final decision regarding the model. Table 7 provides item-level fit values for all combined models.
As Table 7 displays, out of 25 items, 8 items misfit the 1PL, 4 items misfit the 2PL, and 3 items misfit the 3PL model when the GRM is the model of choice for the polytomous items. On the other hand, 8 out of 25 items displayed misfit when the 1PL is applied to the dichotomously scored items and the GPCM is the model of choice for the polytomous ones. This number went down to 6 in the 2PL and to 4 in the 3PL model in combination with the GPCM. When the item diagnostics for the MC items are examined, it is seen that items 2 and 7 do not fit under any combined model. Item 6 fits all the models except when the GRM or the GPCM is combined with the 1PL. Item 13 fits all the models except when the GRM is combined with the 1PL, and item 18 fits all the models except when the GRM or the GPCM is combined with the 3PL. There are three CR items displaying misfit under different models. The fit statistics of item 12 appear to be acceptable only when the GRM is combined with the 2PL or the 3PL. Item 14, like item 18, is considered misfitting when the GRM is combined with the 1PL. Item 23 does not fit when the GPCM is combined with the 2PL or the 3PL. Based on the item-level statistics, it can be concluded that the data have the best fit statistics when the 3PL and GRM models are combined. Since the GRM and the GPCM are not nested models, traditional model comparison statistics, such as comparing -2LL differences, are not appropriate for deciding whether a combination of the 3PL and GRM or the 3PL and GPCM models provides a better fit for the data used in this study. On the other hand, it is possible to use Akaike's Information Criterion (AIC: Akaike, 1974) and Schwarz's Bayesian Information Criterion (BIC: Schwarz, 1978) for this purpose (Kang, Cohen, & Sung, 2005). As the GRM and GPCM models have the same number of parameters (Bartolucci, Bacci, & Gnaldi, 2015), it is logical to compare them using AIC and BIC.
Although significance tests are not available with these statistics, they provide estimates of the relative differences between the two options.
AIC and BIC statistics were computed as 76960.97 and 77393.23, respectively, for the combination of the 3PL with the GPCM, whereas an AIC of 76920.13 and a BIC of 77346.63 were obtained when the GRM was combined with the 3PL model for the dichotomous items. These lower values support the conclusion that the combination of the 3PL and GRM models has a better fit. The graph of the test information functions compares the information and corresponding standard errors of the two combinations. The combination of the 3PL and GRM models provides higher information with lower standard errors as the ability of test takers gets closer to the lower and higher ends of the theta distribution. On the other hand, the combination of the 3PL and GPCM models provides slightly more information for students with ability levels close to the mean.
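The information-criterion comparison above can be illustrated with a short Python sketch. It uses the standard definitions AIC = -2LL + 2k and BIC = -2LL + k·ln(N), where k is the number of estimated parameters and N the sample size, and then compares the AIC and BIC values reported in the text. The function and variable names are hypothetical.

```python
import math


def aic(neg2ll, n_params):
    """Akaike's Information Criterion from -2LL and the parameter count."""
    return neg2ll + 2 * n_params


def bic(neg2ll, n_params, n_obs):
    """Schwarz's Bayesian Information Criterion."""
    return neg2ll + n_params * math.log(n_obs)


# AIC and BIC values reported in the text for the two combinations
reported = {
    "3PL+GPCM": {"AIC": 76960.97, "BIC": 77393.23},
    "3PL+GRM": {"AIC": 76920.13, "BIC": 77346.63},
}

preferred = {}
for criterion in ("AIC", "BIC"):
    # Smaller values indicate the preferred model
    best = min(reported, key=lambda model: reported[model][criterion])
    gap = max(v[criterion] for v in reported.values()) - reported[best][criterion]
    preferred[criterion] = (best, round(gap, 2))
    print(f"{criterion}: prefers {best} by {gap:.2f}")
```

Both criteria favor the 3PL and GRM combination, by about 40.84 on AIC and 46.60 on BIC.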

General Discussion
The goal of this study was to assess the changes in fit statistics when dichotomous and polytomous items were calibrated separately and concurrently. The 1PL, 2PL, and 3PL IRT models were applied to dichotomously coded MC items, and it was seen that, in general, as the parameters are added to the model, fit statistics get better. When the GPCM and GRM models are compared, the GRM is the model of choice for the analyzed data due to higher reliability and lower RMSEA and -2LL statistics. The results show that multiple choice and constructed response items can effectively be used in the same test when the data are analyzed through IRT models.
It is seen that 1PL&GRM and 1PL&GPCM have the same number of misfitting items; however, 2PL&GPCM has more misfitting items than 2PL&GRM. In addition, 3PL&GPCM has more misfitting items than 3PL&GRM. RMSEA statistics are the same (.04) for all combinations except for 1PL&GRM (.05). Reliabilities are the same (.82) for all the combined models except for 1PL&GRM and 1PL&GPCM (.80).
Considering the reliability statistics, the change in the number of misfitting items, and the RMSEA statistics, the most promising combination for the data used in this research is 3PL&GRM. The findings support the conclusions reached by Sykes and Yen (2000), who reported substantially more misfitting items when the 1PL, rather than the 3PL, is combined with polytomous response models. On the other hand, the findings of the current study do not fully confirm those of Chon, Lee and Ansley (2007), who stated that the 3PL and GPCM models tended to fit mixed format data best.
This study serves as a promising step in the use of combined models in elementary school tests. More studies are needed to examine the applicability of such analyses in different subjects, such as literacy and mathematics. As indicated previously, the data used in this study are unidimensional. In real situations, data sets are likely to be multidimensional. Therefore, further studies should be conducted on such data sets. Although misfitting items were identified, the reasons for the misfit are beyond the scope of the current study. Further studies that use effect sizes to quantify the misfit and explore its causes are encouraged.