Basic Science Test: An analysis into the methodology of test construction and interpretation of results



Goran L. Damchevski
University of Ljubljana

Introduction

"A mistake made in the beginning of a process is revealed in the final result"

The purpose of this paper is to evaluate the methods of test construction, scoring, and interpretation of test properties from both the classical test theory and the item response theory approaches. To that end, the topic of basic science knowledge was chosen as appropriate, on the presumption that people would be interested in participating in such a test, whereas tests on other topics could fail to elicit wider participation because of their difficulty and tediousness. While constructing the test, multiple factors were considered, all aimed at getting as many participants as possible to take it. These factors include:

Length of the test. The minimal number of items (25) was chosen in order to ensure both completion and enthusiasm on the part of the participants. The downside is that, by the end of the analysis, there is a loss of quality when items deemed inadequate have to be removed.

Type of items. Initially the test consisted of mixed item types: multiple-choice items, true-or-false items, and short-answer items. Because of the low difficulty (in layman's terms) of true-or-false questions and the high variability of answers to short-answer items, both of these types were adapted to the multiple-choice format. This gives the test a uniform appearance, reduces the scope for random guessing, eliminates the interpretation bias of short-answer questions, and makes it easier to regulate the presumed difficulty of items.

Content. The test was based on the Pew Research Center (2015) quiz on public knowledge of science, from which 10 questions were replicated and reformatted to fit the test.
Additionally, questions were replicated from other sources. The rest of the questions were constructed to be similar in difficulty, length, and scope of topic to the questions of the Pew Research Center (2015) quiz.

Test difficulty. The main concern was choosing questions suitable for a diverse general public: content basic enough to motivate participants to complete the test, yet with a range of difficulty wide enough to infer where participants lie on the science knowledge spectrum. For the purpose of motivation, the initial questions were set to be somewhat easier, and the test progressed in difficulty as participants moved forward, with random sets of easier items interspersed to retain motivation.

The main part of the paper is concerned with the methodology of estimating the test's conclusive, predictive, and measurement properties. This was achieved by examining the test from both the classical test theory perspective and the item response theory perspective. The properties examined and compared are:

Latent trait measurement

Specifically, the goal was to see whether the test measures the intended trait of basic science knowledge. To this end, a confirmatory factor analysis was conducted, as opposed to an exploratory factor analysis or a principal components analysis. The reasoning is that an exploratory analysis tends to be biased, in that the researcher has the liberty to decide how many factors fit the model. Such methods are ambiguous and allow different researchers to deem different sets of factors satisfactory for a model. In EFA, each item loads on every factor; in CFA, we must specify a pattern of "load" or "not load", so each item will typically load on one factor, meaning we allow it a relationship to that factor while fixing its relationship to the other factors at zero (Brown, 2015).
In a confirmatory factor analysis, the theorized number of factors can be tested empirically via model fit indicators and a p-value for determining which items fit the factor. Further, EFAs produce unreliable factor loadings because they compute loadings from all items to every factor; this results in factor indeterminacy and produces an infinite number of possible factor scores that all have the same mathematical characteristics (Grice, 2001). CFAs, by contrast, compute loadings only from assigned items to the assigned factor, giving more precise information as to which item belongs to a specific factor. The cutoff value for factor loadings usually ranges from .3 to .4 (+/-). In CFA, we obtain both global and local model fit, e.g., whether a 1-factor model fits better than a 2-factor model, allowing us to see whether the model would fit better if adjusted in some way.

With regard to item psychometric properties, several properties of the items can be easily inferred via CFA, such as difficulty and discrimination. Discrimination refers to the degree to which an item is related to the latent trait. In CTT, discrimination is normally the item-total correlation, because the total is a measure of the trait. In CFA, discrimination is the factor loading, which directly tells us how related an item is to the trait. Difficulty is the location of the item on the latent trait metric. In CTT, difficulty is indexed by the item mean. In CFA, difficulty is indexed by the item intercept, and it is still reverse coded (the higher the index, the easier the item), because easier items have higher intercepts. A good item has a large slope (factor loading) in predicting the item response from the factor; because this slope is linear, the item is assumed to be equally discriminating (equally good) across the entire latent trait.
Similarly, a bad item has a flatter linear slope that is equally bad across the entire range of the latent trait. Item intercepts are irrelevant for evaluating how good an item is, but are critical when testing factor mean differences in any latent factor model (Brown, 2015).

CFA vs. IRT

Item factor analysis is formally equivalent to item response theory. Full-information maximum likelihood takes into account the whole data, while limited-information estimation takes into account only the covariance matrix. The full-information IRT counterpart that additionally assumes tau-equivalence is the Rasch model. The factor scores in IRT are called theta and are assumed to be normally distributed. The difference between CFA and IRT is that CFA approximates a linear regression, while IRT approximates a logistic regression. The probit estimate is the z-score corresponding to the area under the standard normal curve to the left of the probability we are solving for. It has been demonstrated that a probit, multiplied by 1.7, approximates the corresponding logit. The logic of using a probit estimator is that we are not trying to predict the original binary outcome, but its link transformation (from 0 to 1), which is continuous. When comparing a probit estimator to a logit estimator with regard to model fit, the probit will usually give a lower chi-square, meaning that the chances of the ideal and observed models being statistically different are smaller (Brown, 2015).

ML: The goal of maximum likelihood (ML) estimation is threefold. First, to obtain the "most likely" values for each unknown parameter in the model (intercepts, loadings, error variances, factor means, factor variances, factor covariances, etc.); these are the values it produces. Second, to obtain an index of how likely each parameter value is to be true, via the standard error (SE) of the estimates.
Finally, to produce indices of how well the specified model actually describes the data, via model fit indices.

The use of the ML estimator rests on several assumptions:
- Persons and items are conditionally independent.
- Item responses can be missing at random.
- Factor scores have a multivariate normal distribution.
- Item residuals have a multivariate normal distribution; consequently, the original item responses are also assumed to be normally distributed.

Failing to meet these assumptions can result in incorrect SEs and chi-square-based model fit statistics, and in the linear prediction model not working "well". The argument for using an inappropriate estimator that closely approximates the values an appropriate estimator would have produced is present in fields such as the social sciences, but it is problematic, since we know a priori that no matter how closely the produced values approximate the true values, they are certainly wrong, lending a heuristic character to an empirical field.

MLR: The maximum likelihood robust estimator is an accepted substitute for ML when the normality assumptions are not met, making it a viable alternative for latent trait measurement. Other alternatives are IRT or IFA (Brown, 2015).

WLSMV: Weighted least squares means and variances is a model estimator proposed and updated by Muthén (1997), comparable to the GEE method. It allows a combination of binary, ordered polytomous, and continuous outcome variables, as well as multiple-group analysis. Given its generality and statistical performance, it provides a useful practical method for latent variable analysis. The WLSMV method can be applied to both categorical and continuous outcomes.

Model fit statistics

Model fit statistics are a set of measures used to indicate whether the items in a model are an adequate representation of the latent trait.
A good rule of thumb when deciding whether a model fit statistic is "good" or "bad": if the statistic has the word "residual" or "error" in its name, it needs to be small to be good; if it has the word "fit" in its name, it needs to be as close to 1 as possible.

The chi-square baseline statistic is the value for the model that is the most statistically different from the ideal model. It indicates whether there is any correlation among the items; if its p-value is nonsignificant, there is not enough correlation in the dataset, making latent trait estimation impossible.

Number of observations: the number of participants in the test. For a factor analysis, this number should be above 200-300 participants.

Chi². A chi-square value for the comparison between the observed model (H0) and a perfect model (H1). The value should be as small as possible, and the p-value should be nonsignificant, indicating that the perfect model and the observed model are the "same".

Robust Comparative Fit Index (CFI). The comparative fit index analyzes model fit by examining the discrepancy between the data and the hypothesized model, while adjusting for the sample-size issues inherent in the chi-square test of model fit and the normed fit index. CFI values range from 0 to 1, with larger values indicating better fit. Previously, a CFI value of .90 or larger was considered to indicate acceptable model fit. However, later studies have indicated that a value greater than .90 is needed to ensure that misspecified models are not deemed acceptable (Hu & Bentler, 1999).
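The incremental fit indices discussed here can be reproduced directly from the model and baseline chi-square values. A minimal sketch (the chi-square numbers below are illustrative, not taken from this study):

```python
import math

def fit_indices(chisq_m, df_m, chisq_b, df_b, n):
    """CFI, TLI and RMSEA from model (M) and baseline (B) chi-squares."""
    # CFI: 1 minus the ratio of model to baseline noncentrality
    cfi = 1 - max(chisq_m - df_m, 0) / max(chisq_b - df_b, 0)
    # TLI: compares chi2/df ratios, penalizing model complexity
    tli = (chisq_b / df_b - chisq_m / df_m) / (chisq_b / df_b - 1)
    # RMSEA: per-degree-of-freedom misfit, adjusted for sample size
    rmsea = math.sqrt(max(chisq_m - df_m, 0) / (df_m * (n - 1)))
    return cfi, tli, rmsea

# Hypothetical model: chi2 = 50 on 40 df, baseline chi2 = 400 on 45 df, N = 105
cfi, tli, rmsea = fit_indices(50.0, 40, 400.0, 45, 105)
# cfi ≈ 0.972, tli ≈ 0.968, rmsea ≈ 0.049: a fit this good would pass the
# CFI ≥ .95 and RMSEA ≤ .06 cutoffs discussed in this section.
```

Note that the TLI, unlike the CFI, is not bounded above by 1, which is why values slightly above 1 appear in the model comparison tables later in this paper.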
Thus, a CFI value of .95 or higher is presently accepted as an indicator of good fit (Hu & Bentler, 1999).

Robust RMSEA. The root mean square error of approximation (RMSEA) avoids issues of sample size by analyzing the discrepancy between the hypothesized model, with optimally chosen parameter estimates, and the population covariance matrix. The RMSEA ranges from 0 to 1, with smaller values indicating better model fit. A value of .06 or less indicates acceptable model fit. The robust version simply accounts better for data that have not met the normality assumption, as is the case for all the other "robust" model fit statistics.

CI for robust RMSEA. The confidence interval for the RMSEA indicates the probability that the true RMSEA value lies within a given interval. It is recommended that the interval's upper limit be no higher than the desired RMSEA value.

Standardized Root Mean Square Residual (SRMR). The SRMR is an absolute measure of fit, defined as the standardized difference between the observed and predicted correlations. It is a positively biased measure, and that bias is greater for small N and for low-df studies. Because the SRMR is an absolute measure of fit, a value of zero indicates perfect fit. The SRMR carries no penalty for model complexity. A value less than .08 is generally considered a good fit (Hu & Bentler, 1999).

Sample size. The sample size can affect the model fit indices (Bentler & Bonett, 1980). The SRMR fails to adjust for sample size: models with larger samples have smaller values. The TLI and CFI do not vary much with sample size, though they are less variable with larger samples. The RMSEA and the SRMR are larger with smaller samples.

Degrees of freedom. Degrees of freedom indicate how many possible parameters can be estimated.

Reliability analysis

Reliability of the test is also a central aspect of the analysis.
It is the true variance of the sample of test respondents divided by their observed variance, where observed variance = true variance + error variance (Guilford, 1965). A proper coefficient allows a more precise interpretation of the data with respect to the range in which the test is reliable (evaluated with confidence intervals) and the extent to which the results can be generalized beyond the sample population. Reliability conventionally uses a cutoff heuristic of 0.7, as recommended by Nunnally and Bernstein (1994, as cited in Dunn et al., 2014), and is usually reported as a point estimate for the whole test. To improve the reliability estimate, it is advisable to report a statistic with confidence intervals for every subscale or item rather than a single point estimate. Most researchers apply Cronbach's alpha as the preferred reliability coefficient; however, upon examination it turns out to have restrictions and flaws that make it an outdated measure of reliability.

Cronbach's alpha is a lower-bound estimate of internal-consistency reliability that can be safely interpreted as such under the following assumptions: all items are tau-equivalent, i.e., equally related to the true score (all items are equally good), and item errors are uncorrelated (the coefficient can be biased low or high if the errors are correlated; the CTT reasoning is that the only source of covariance among the items is the trait). If these hold, alpha is a reasonable estimate of reliability; if not, the estimate can be biased (Dunn et al., 2014). One of the problems with alpha is that these assumptions are hardly ever met. Violating them can inflate or attenuate alpha's internal-consistency estimates. Additionally, "alpha if item deleted" is also frequently used with the goal of improving a test or scale by removing items that dampen reliability.
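The tau-equivalence point can be made concrete: under a congeneric (one-factor) model, both alpha and McDonald's omega can be computed from the loadings and error variances, and with unequal loadings alpha falls below omega. A minimal sketch; the loadings are made up for illustration, not taken from this study's data:

```python
# Illustrative congeneric model: unequal loadings, uncorrelated errors,
# unit item variances (all numbers invented for demonstration).
lam = [0.9, 0.7, 0.5, 0.3]              # factor loadings
psi = [1 - l * l for l in lam]          # error variances

k = len(lam)
# Variance of the sum score implied by the model:
# sum of all covariances lam_i*lam_j plus the error variances.
total_var = sum(lam) ** 2 + sum(psi)
# Trace of the covariance matrix (sum of item variances, = k here).
trace = sum(l * l + e for l, e in zip(lam, psi))

# Cronbach's alpha, computed from the covariance matrix
alpha = k / (k - 1) * (1 - trace / total_var)

# McDonald's omega, computed from the factor solution
omega = sum(lam) ** 2 / total_var

# alpha ≈ 0.677 < omega ≈ 0.709: alpha understates reliability here.
# With equal loadings (tau-equivalence) the two coefficients coincide.
```

The gap between the two grows as the loadings become more unequal, which is exactly the situation in which alpha's tau-equivalence assumption is violated.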
However, "alpha if item deleted" in a sample does not reflect the impact that item deletion has on the population reliability, restricting the interpretation of the coefficient to the sample (Raykov, 1997a, b, as cited in Dunn et al., 2014). Finally, a point estimate of alpha does not reflect the variance present in the estimation process.

To address these concerns, two methods are proposed. One possibility is to bootstrap alpha and provide confidence intervals for the coefficient (Raykov, 1998; Dunn et al., 2014). The other is to use a more appropriate coefficient, such as omega. When coefficient alpha is used, the measurement model is assumed to be a true-score equivalent (tau-equivalent) model, with factor loadings equal across items. When coefficient omega, hierarchical omega, or categorical omega is used, the measurement model is assumed to be a congeneric model (i.e., a one-factor confirmatory factor analysis model). Coefficient omega assumes that the model fits the data perfectly, so the variance of the composite scores is calculated from the model-implied covariance matrix. Categorical omega is a method for calculating coefficient omega for categorical items (Green & Yang, 2009). Since unidimensionality remains a requisite for all of the considered models, if a scale or test is considered multidimensional, it is recommended that it be split into subscales and that the omega coefficient, along with its confidence intervals, be calculated for each subscale (Sočan, 2000, as cited in Dunn et al., 2014). Overall, the main advantages of omega over alpha can be summarized as follows:

- Omega makes fewer and more realistic assumptions than alpha, particularly in the sphere of the social sciences.
- Omega can be safely used when the tau-equivalence assumption fails.
- Problems associated with inflation and attenuation of internal-consistency estimates are far less likely.
- "Omega if item deleted" in a sample is more likely to reflect the true population reliability after removal of a given item.
- Calculating omega alongside a confidence interval reflects the variability of the estimation process much more closely, providing a more accurate degree of confidence in the consistency of the scale's administration.

Method

The test was administered electronically to 115 volunteer participants via Google Forms. The participants were mostly of mixed Slovenian and Macedonian origin. To reconcile the language differences between the two samples, the test was constructed in English. This can also be considered a possible source of bias, since participants with weaker English may not demonstrate their full capabilities due to language limitations. The age range of participants was divided into four categories; most participants fell into the 18-29 yrs. category (81%), followed by the 30-49 yrs. category (19%). The gender mix was balanced, with 53.4% males (62) and 46.6% females (54). The test also employed an anti-cheating filter on the last item, consisting of the question: "Did you use Google while answering some questions?" Nine participants answered "Yes" and were disqualified from further analysis; one participant openly admitted to answering completely at random and was also disqualified, making the total analyzed sample 105 participants.

Statistical analysis methods

In order to conduct an item psychometric analysis, a latent trait measurement model was constructed. The goal of the test was to measure one dimension, namely basic science knowledge in a certain population. To this end, a CFA (confirmatory factor analysis) was conducted.
The CFA was conducted in RStudio with the lavaan package, using the MLR and WLSMV estimators. After the CFA, the items were subjected to item-psychometric analyses, including difficulty, adjusted difficulty, discrimination, discrimination adjusted for overlap, reliability (alpha with CI, omega with CI), the Spearman-Brown prophecy coefficient, and strata levels, as known from the CTT (classical test theory) approach. Norming was also conducted with two types of transformation, percentile and z-values, which allows the position of people on the scale to be evaluated for both layman and academic audiences. The link to the RStudio code with which the analysis was conducted is listed in the references.

ALERT! The 25 test questions are listed below; if you wish to take the test before looking at its structure, you can access it directly HERE.

The test consists of 25 multiple-choice items, which are as follows:

Table 1. A list of all of the test questions and possible answers. Note: the right answers are colored. Some items also contain images; for the full version of the test, see the Appendix.

1. This picture shows an object in space that has an icy core with a tail of gas and dust that extends millions of miles. What is this?
   A) A star  B) A comet  C) An asteroid  D) A moon
2. The center of the Earth is:
   A) Very hot  B) Very cold  C) Hollow  D) None of the above
3. The continents on which we live have been:
   A) Moving their locations for millions of years and will continue to move  B) Moving their locations for millions of years but are not moving anymore  C) Occasionally moving but are usually still  D) Completely still
4. Electrons are smaller than:
   A) Protons  B) Neutrons  C) Atoms  D) All of the above
5. The cell's DNA is (mostly) located in:
   A) The cytoplasm  B) The cell nucleus  C) The Golgi apparatus  D) The ribosomes
6. Lasers work by focusing:
   A) Sound waves  B) Light waves  C) Radio waves  D) Gamma waves
7. The universe began with:
   A) A huge explosion  B) Randomly appearing matter  C) The creation by God  D) There is no way to know how the universe began
8. Which chromosome (genes) decide if a baby is born a boy?
   A) The sex chromosome "X" added by the father  B) The sex chromosome "X" added by the mother  C) The sex chromosome "Y" added by the mother  D) The sex chromosome "Y" added by the father
9. Antibiotics kill:
   A) Viruses  B) Bacteria  C) Certain kinds of viruses and certain kinds of bacteria  D) Viruses as well as bacteria
10. Human beings, as we know them today, developed from:
   A) Neanderthals  B) Chimpanzees  C) God  D) None of the above
11. One of the following is NOT a function of bones:
   A) Provides a place for muscle attachment  B) Protection of vital organs  C) Secretion of hormones for calcium regulation in blood  D) Production of blood corpuscles
12. Which kind of waves are used to make and receive cellphone calls?
   A) Radio waves  B) Visible light waves  C) Sound waves  D) Gravity waves
13. Which of these is the main way that ocean tides are created?
   A) The rotation of the Earth on its axis  B) The gravitational pull of the moon  C) The gravitational pull of the sun  D) None of the above
14. What does a light-year measure?
   A) Brightness  B) Time  C) Distance  D) Weight
15. Denver is at a higher altitude (height) than Los Angeles. Which of these statements is correct?
   A) Water boils at a lower temperature in Denver than Los Angeles  B) Water boils at a higher temperature in Denver than Los Angeles  C) Water boils at the same temperature in both Denver and Los Angeles  D) The differences in humidity impact the temperature at which water boils more than altitude
16. Which of these pictures best illustrates what happens when light passes through a magnifying glass?
   A) 1  B) 2  C) 3  D) 4
17. The loudness of a sound is determined by what property of a sound wave?
   A) Frequency  B) Wavelength  C) Velocity or rate of change  D) Amplitude or height
18. Which of the following statements best describes the data in the graph below?
   A) In recent years, the rate of cavities has increased in many countries  B) In some countries, people brush their teeth more frequently than in other countries  C) The more sugar people eat, the more likely they are to get cavities  D) In recent years, the consumption of sugar has increased in many countries
19. Which of these elements is needed to make nuclear energy and nuclear weapons?
   A) Sodium chloride  B) Uranium  C) Nitrogen  D) Carbon dioxide
20. Which of these people developed the polio vaccine?
   A) Marie Curie  B) Isaac Newton  C) Albert Einstein  D) Jonas Salk
21. What effect does adrenaline have on the heart rate?
   A) No effect  B) It raises the heart rate  C) It slows it by 10%  D) It slows it by 50%
22. From what is the Jurassic period named?
   A) A kind of dinosaur  B) The French word for "Day"  C) A mountain range  D) A valley in South America
23. Which is most acidic?
   A) White vinegar  B) Lemon juice  C) Your stomach acid  D) Cola drinks
24. Australian scientists Barry Marshall and Robin Warren won a Nobel Prize for showing that most peptic (stomach) ulcers are caused by what?
   A) A bacterium  B) Stress  C) Gluten  D) Junk food and spicy food
25. Which of these contains keratin?
   A) Human hair  B) Horse hooves  C) Cat claws  D) All of the above

Results

In this section, we review the unidimensionality of the models from both the CFA and IRT perspectives, the item psychometric properties of the retained model items, an in-depth reliability analysis, and an analysis of the resulting scale, with an inquiry into how to
improve it.

CFA model fit

Here we compare the models tested for unidimensionality with CFA; the model fit results for each tested model are displayed in the table below:

Table 2. CTT model comparison

Model                         | Items | Items misfit | Chi²   | CFI   | TLI   | RMSEA***                   | SRMR                 | Alpha | Var** | Status
Full mlr                      | 25    | 12           | 347.4* | 0.678 | 0.649 | 0.049 (L: 0.031; U: 0.065) | 0.082                | 0.724 | 5.9%  | Does not fit
Reduced 1 mlr                 | 14    | 0            | 81.9   | 0.961 | 0.954 | 0.025 (L: 0; U: 0.062)     | 0.063                | 0.733 | 7.7%  | Does fit (low var)
Categorical wlsmv             | 14    | 0            | 71.8   | 1     | 1.032 | 0 (L: 0; U: 0.048)         | 0.117 (WRMR: 0.694)  | 0.733 | 59.5% | Does not fit
Categorical reduced wlsmv     | 13    | 0            | 61     | 1     | 1.29  | 0 (L: 0; U: 0.052)         | 0.111 (WRMR: 0.688)  | 0.733 | 51.5% | Does not fit
Categorical reduced wlsmv "2" | 10    | 0            | 29.7   | 1     | 1.048 | 0 (L: 0; U: 0.054)         | 0.094 (WRMR: 0.593)  | 0.704 | 53.6% | Does not fit
Categorical parsimonic wlsmv  | 8     | 0            | 14.3   | 1     | 1.075 | 0 (L: 0; U: 0.053)         | 0.074 (WRMR: 0.497)  | 0.673 | 54.3% | Does fit

* Chi² is significant at the 0.05 level.
** Var refers to the percentage of variance explained by the factor.
*** L = lower bound of the confidence interval; U = upper bound of the confidence interval.

Comparing all the models tested for unidimensionality, we see that only the "Categorical parsimonic wlsmv" model (barely) fits the assumption. The model names refer to the item and test properties of the analysis. A short description of the tested models follows:

Full mlr: All 25 items were tested using the maximum likelihood robust estimator in RStudio. The test shows multiple item misfits at the p < 0.05 level and inadequate model fit statistics. The explained variance is also too small, indicating a problem in the way the model is set up. Alpha is satisfactory by the 0.7 rule of thumb, but the model cannot be retained because it fails the unidimensionality criteria.

Reduced 1 MLR: The items were reduced to 14 by eliminating the worst items from the analysis. Item removal was based on modification indices, item-factor misfit p-values > 0.05, low factor loadings, and R² coefficients < 0.100.
The model fits, but produces a very small portion of explained variance (7.7%).

Categorical reduced WLSMV: Retains the 14 items from the previous model, but switches the model estimator to weighted least squares means and variances. The result is an overall better, but still unsatisfactory, model fit, with a large improvement in explained variance (from 7.7% to 59.5%). Additionally, the way in which R reads the dataset was changed from continuous to categorical, which makes MLR estimation impossible in R using the lavaan package. (Mplus can conduct a confirmatory factor analysis with MLR and categorical variables, but while conducting the analysis in Mplus a degrees-of-freedom estimate of > 2000 was shown, so the analysis was limited to R with the WLSMV estimator.)

Categorical reduced WLSMV 2: The items were reduced to 10, resulting in a better but still unsatisfactory model fit (SRMR > 0.08). This is the last tested model with an alpha > 0.7. However, since unidimensionality supersedes internal-consistency reliability, another model test was conducted.

Categorical parsimonic WLSMV: The final tested CFA model resulted in an acceptable measure of unidimensionality, with the following parameters: robust Chi² of 12.250; degrees of freedom = 20; p-value for Chi² of p = .817 (in CFA, a nonsignificant p-value means the tested model is not different from the "ideal" desired model, in our case a 1-factor model); CFI and TLI both > 0.95, which also give a sense of the normality of the distribution of scores (a coefficient > 1 indicates skew); RMSEA = 0, with a confidence interval from 0 to 0.053, well within the acceptable bound of < 0.06; and SRMR = 0.074, within the acceptable range of SRMR < 0.08.
The WLSMV estimator also computes an experimental model fit statistic, WRMR = 0.497, a possible substitute for SRMR with a desired range of < 1; however, this coefficient is not frequently used in model-testing practice, so we omit it from the decision on whether the model fit is acceptable. Despite the satisfactory model fit statistics, this 8-item model results in an undesirable alpha of 0.673; a further, more detailed analysis of reliability is given in the "Reliability" section. Overall, the last model is retained, and its items are subjected to further analysis. Despite the satisfactory model fit, there are numerous possibilities for improvement in the overall model and the individual items; a replication study could easily result in an unsatisfactory model fit if the test is not carefully reconstructed. In the next section, we look at the properties of the accepted model.

CFA model analysis

In this section, we present the model properties, item factor loadings, and item R² values. First, we present the SEM plot generated for this model, in which we can graphically see the item loadings as well as the first constrained variable.

Figure 1. SEM plot for the retained model.

We can see the extent to which the latent variable, which we have named "basic science knowledge", explains the responses to each item. Next, we show the factor loadings for each item, their p-value fit to the latent variable, and their respective R² values. The item names represent their corresponding position in the test; e.g., X13 is the thirteenth question in the test.

Table 3.
Item properties within the 1-factor model

Item | Corresponding item question                                                   | Factor loading | Standard error | p-value | R-square
X13  | Which of these is the main way that ocean tides are created?                  | 1              | /              | /       | 0.543
X4   | Electrons are smaller than:                                                   | 0.796          | 0.246          | 0.001   | 0.344
X17  | The loudness of a sound is determined by what property of a sound wave?       | 0.848          | 0.244          | 0.001   | 0.391
X8   | Which chromosome (genes) decide if a baby is born a boy?                      | 0.793          | 0.252          | 0.002   | 0.342
X11  | One of the following is NOT a function of bones:                              | 0.754          | 0.221          | 0.001   | 0.309
X15  | Denver is at a higher altitude than Los Angeles. Which statement is correct?  | 0.767          | 0.235          | 0.001   | 0.320
X18  | Which of the following statements best describes the data in the graph below? | 0.817          | 0.272          | 0.003   | 0.363
X25  | Which of these contains keratin?                                              | 0.692          | 0.237          | 0.003   | 0.260

Although all the items load properly on the latent variable with p < 0.05, we observe weak R² values, mostly ranging from 0.300 to 0.400; one item has an R² < 0.300, which is acceptable but explains a very small portion of variance. Considering all of this, we can conclude that even though the model fits, it has low utility for practice, and it would be advisable to look into its shortcomings and reconstruct a better model. Next, we look at the IRT model analysis and compare the differences between the CFA and IRT models.

IRT model fit

Here we compare the model fit that resulted from IRT model testing for unidimensionality. The testing consisted of comparing the results of two R packages ("tpm" and "mirt") for 2PL and 3PL models. The results are shown in the table below:

Table 4. IRT model testing and comparison

Model        | Items | Items misfit | Log.Lik* | AIC**  | BIC*** | Status
2PL "tpm"    | 8     | 2            | -491.14  | 1014.3 | 1056.8 | Does not fit
3PL "tpm"    | 9     | 4            | -490.66  | 1029.3 | 1093   | Does not fit
2PL "mirt"   | 8     | 1            | -491.14  | 1014.3 | 1056.8 | Does not fit
3PL "mirt"   | 9     | 1            | -424.90  | 891.81 | 947.5  | Does not fit
2PL "tpm" 2  | 9     | 2            | -530.7   | 1097.4 | 1145.2 | Does not fit
3PL "tpm" 2  | 9     | 1            | -530.27  | 1114.5 | 1186.2 | Does not fit
2PL "mirt" 2 | 10    | 0            | -530.7   | 1097.4 | 1145.2 | Does fit
3PL "mirt" 2 | 10    | 1            | -530.3   | 1114.5 | 1186.2 | Does not fit

* Log.
Lik refers to the log-likelihood fit statistic.
** AIC refers to the Akaike information criterion.
*** BIC refers to the Bayesian information criterion.

All tested models initially consisted of the 8 items previously retained from the CFA analysis, so that the CFA and IRT models could later be compared. Model names refer to the number of parameters the model consists of and the package used for the testing. "Items misfit" refers to the number of items that do not load properly on the latent variable. Log.Lik, AIC (Akaike information criterion), and BIC (Bayesian information criterion) are model fit statistics; all three are relative and should be compared to a corresponding baseline statistic in order to obtain a qualified estimate of model fit. We retained the 2PL model tested with the R package "mirt" on the second attempt, the "2PL mirt 2" model. Its fit statistics are as follows: Chi² = 25.41, with p = 0.882, indicating that the ideal model is not different from the observed model; RMSEA = 0, with CI05 = 0 and CI95 = 0.035; SRMR = 0.062; TLI = 1.072; CFI = 1. The fit statistics indicate a good model fit. Reliability is above the acceptable threshold of alpha > 0.7 (alpha = 0.707). The model has two more items than the CFA model. The comparison of the CFA and IRT models is done in a later section.

IRT model characteristics

The retained model consists of 10 items, predicted by the "basic science knowledge" latent factor. The explained variance for the "2PL mirt 2" model is 36.8%. Item alphas and betas (discrimination and difficulty) are presented in the table below, alongside a p-value for each item's fit to the latent variable. Note that the "mirt" R package computes factor loadings with EFA (exploratory factor analysis), so factor loadings should be interpreted with caution, as explained in the introduction.
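As a sanity check on the information criteria in Table 4, AIC and BIC follow directly from a model's log-likelihood and its parameter count (a 2PL model estimates two parameters per item). A sketch reproducing the first row, assuming N = 105 as in the Method section:

```python
import math

def aic_bic(loglik, n_params, n_obs):
    """Akaike and Bayesian information criteria from a log-likelihood."""
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    return aic, bic

# 2PL "tpm" row of Table 4: 8 items x 2 parameters, log-likelihood -491.14
aic, bic = aic_bic(-491.14, 8 * 2, 105)
# aic ≈ 1014.3 and bic ≈ 1056.7, matching the reported 1014.3 / 1056.8
# (the small BIC discrepancy comes from the rounded log-likelihood).
```

Since BIC's penalty grows with log(N) while AIC's does not, BIC favors the smaller models more strongly as the sample grows.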
Graphs for the 2PL IRT model are also displayed below, showing the item characteristic curves, the test information curve and the expected total scores.

Table 5. IRT item characteristics
| Item | Corresponding item question | Item alpha (discrimination) | Item beta (difficulty) | P-value |
| X3 | The continents on which we live have been: | 1.146 | -1.846 | 0.562 |
| X13 | Which of these is the main way that ocean tides are created? | 2.02 | -1.151 | 0.195 |
| X4 | Electrons are smaller than: | 1.4 | -0.384 | 0.342 |
| X17 | The loudness of a sound is determined by what property of a sound wave? | 1.324 | 0.108 | 0.167 |
| X8 | Which chromosome (genes) decide if a baby is born a boy? | 0.903 | -0.217 | 0.092 |
| X11 | One of the following is NOT a function of bones: | 1.011 | -0.152 | 0.432 |
| X15 | Denver is at a higher altitude (height) than Los Angeles. Which of these statements is correct? | 1.405 | 0.98 | 0.542 |
| X18 | Which of the following statements best describes the data in the graph below? | 1.119 | -1.117 | 0.199 |
| X19 | Which of these elements is needed to make nuclear energy and nuclear weapons? | 2.095 | -1.811 | 0.103 |
| X25 | Which of these contains keratin? | 0.892 | -0.846 | 0.082 |

Most of the item IRT alphas are > 1, indicating good discrimination values, and the item-fit p-values (all > 0.05) indicate no significant misfit. The item difficulties are mostly low (negative betas), so the test mostly measures respondents located below 0 on the latent trait. The item with the best discrimination value is X19, with a = 2.095; however, this item is relatively easy, and respondents located at -1.811 SD on the trait already have a 50% chance of answering it correctly. The item with the best difficulty value is X15, with b = 0.98, which also has a good discrimination value (a > 1) of a = 1.405. The item that discriminates the least is X25, with a = 0.892, which also barely fits the model (p = 0.082). The decision to retain X25 was based on the need for the model to attain a sufficient reliability coefficient.
The item trace lines graph shows the probability of answering an item correctly given a person's level of the measured latent trait. A theta of 1 on the X axis indicates that the person is 1 standard deviation above the mean on the latent trait; the curve then gives the probability of that person answering the item correctly. The conventional cutoff is a probability of 0.5, i.e. the theta at which a person has a 50% chance of answering correctly. We can see from the graph that most items are similar, with their 0.5-probability points ranging from about -2 SD to 1 SD. This means that the test predominantly measures the average and below-average population, for which it is also intended. The test information curve is also shown on a graph, with a maximum information value of ≈ 3.6 at theta Θ ≈ -1.3; this is generally acceptable for a 10-item test. The expected total score graph shows the score that a person with a given theta is expected to obtain. We see that a person with theta = 0 will score approximately 6 points (60%) on the test. This pushes the criterion for discriminating between the population that demonstrates basic science knowledge and the population that does not to 7 points (70%), which means that, in practice, if the test were conducted on a pass/fail basis, a respondent would need to obtain a score of 7 to pass. However, the points each item awards should be weighted according to the difficulty of that item. In conclusion, the IRT analysis produces a 10-item 2PL model with acceptable model fit and minimally acceptable reliability. The item statistics appear sufficient, but if such a test is to be used on a wider population, a good deal of review and reconstruction is advisable. The most important points to consider in a review are: the language of the test should be appropriate for the tested population.
If the test is given to a population that speaks a different language, the items must be translated into that population's native language, to ensure that we are testing the knowledge criteria and not language capabilities. The item difficulties are mostly low; a better mix of harder items is therefore advisable, in order to better distinguish between populations with different levels of the latent trait. A 2-level multiple-factor structure should be considered, with different items loading on multiple factors, which in turn all load on a higher-order factor that reflects the basic science knowledge of a population. Finally, an analysis of the "bad" items should be carried out.

Figure 2. Item trace lines (characteristic curves) for the retained model
Figure 3. Test information curve
Figure 4. Test expected total score

CFA and IRT model comparison
The goal of the CFA vs. IRT comparison is to evaluate which model has the better psychometric characteristics and should be included in the further analysis. The compared characteristics are shown in the table below.

Table 6. CFA and IRT model comparison
| Model | Items | χ² | CFI | TLI | RMSEA | SRMR | Alpha | Var* |
| 2PL "mirt" 2 | 10 | 25.41 | 1 | 1.072 | 0 (CI: 0 to 0.035) | 0.062 | 0.707 | 36.8% |
| Categorical parsimonic WLSMV | 8 | 14.3 | 1 | 1.075 | 0 (CI: 0 to 0.053) | 0.074 (WRMR: 0.497) | 0.673 | 54.3% |
*Var: percentage of variance explained by the model.

There are numerous issues with the presented models. The sample size for conducting a 2PL IRT model is low, since 2PL requires at least 200 participants (depending on the number of questions k). The CFA analysis, on the other hand, yielded a reliability coefficient that is considered "poor" but not unsatisfactory, and even though the model fits, there is a good deal of possible improvement to be considered. The IRT model relies on an exploratory factor analysis, which is a problematic approach, especially when attempting to estimate multiple factors. However, the IRT model can still be taken into account, since the goal was to allocate variance to only one factor.
Concerning the amount of explained variance, the CFA approach yielded the better result, although a value of ≥ 60% is preferable. Concerning the model fit statistics, both models show acceptable values which are not significantly different from each other. The CFA model showed less desirable model fit indices, but also used a more conservative estimator; hence we can be more confident that the CFA model fit statistics approximate their true values.

Reliability analysis of the CFA and IRT models
The difference in the number of items and in item qualities such as item-total correlations resulted in different reliability coefficients for the two models. We will analyze a number of reliability estimates and compare them for both models. The reliability analysis includes:
- Cronbach's Alpha internal consistency coefficient, with standard error estimates and bootstrapped 95% confidence intervals. The bootstrapping was done with the "bca" method in the R package "MBESS".
- "Alpha if item removed" coefficients. Although we discussed in the introduction why these are of limited value, especially for generalizing results outside the sample, we still include them for purposes of evaluation.
- Omega internal consistency coefficient, with the above-mentioned properties.
- Omega for categorical data; our data is binary, making the categorical Omega a proper estimate of internal consistency reliability.
- KR-20 and KR-21, variants of Alpha used for binary items, with KR-21 being a negatively biased estimate.
- Split-half reliability coefficients: Lambda 4, Lambda 6, the average split-half, Lambda 3 and Beta.

All the mentioned reliability coefficients are listed below with their accompanying statistics. The "Alpha without item" values are listed in a separate table.

Table 7. Reliability of the scale
| Type | CFA parsimonic model | Conf. int. (95%) | 2PL "mirt" 2 | Conf. int. (95%) |
| Alpha | 0.673 | L: 0.566; U: 0.756 | 0.707 | L: 0.614; U: 0.790 |
| Standard error (Alpha) | 0.048 | | 0.046 | |
| Omega | 0.674 | L: 0.550; U: 0.745 | 0.707 | L: 0.611; U: 0.788 |
| Standard error (Omega) | 0.046 | | 0.048 | |
| Omega (categorical) | 0.691 | L: 0.552; U: 0.761 | 0.762 | L: 0.483; U: 0.818 |
| Standard error (Omega categorical) | 0.049 | | 0.047 | |
| KR-20 | 0.678 | | 0.712 | |
| KR-21 | 0.626 | | 0.640 | |
| Lambda 4* | 0.75 | | 0.81 | |
| Lambda 6** | 0.66 | | 0.72 | |
| Average split-half | 0.66 | | 0.71 | |
| Lambda 3*** | 0.68 | | 0.71 | |
| Beta**** | 0.62 | | 0.65 | |
*Lambda 4 refers to the maximum split-half reliability coefficient; **Lambda 6 refers to the Guttman split-half coefficient; ***Lambda 3 refers to the Guttman lambda 3 (alpha) coefficient; ****Beta refers to the minimum split-half reliability.

The results indicate low reliability for both models, with the IRT model having a minimally acceptable reliability. The split-half coefficients are consistent with the internal consistency estimates, indicating a similar average reliability across items. The Omega coefficient was also computed and is considered the better reliability value here, since both scales have problems with normality and we cannot be certain that the covariance among items is due only to the trait. There is an issue with the Omega coefficient: the sample size was relatively low, not allowing for estimation of all the possible outcomes. Despite this, the coefficient was still successfully calculated, but it should be treated with caution because of the possibility of bias.

The chosen model
Finally, the model chosen for the psychometric analysis is the CFA-based model ("Categorical parsimonic WLSMV").
The reasons for this are as follows: The CFA-based model explains more variance with fewer items than the IRT-based model. Even though the IRT model attains a reliability of > 0.7, its model fit estimation uses a looser procedure that can positively overestimate the fit statistics and indicate a model fit where there actually is none. It would be easier to upgrade and modify a more stable model, whose items we are confident have good properties, than to replicate the IRT model on a bigger sample. The CFA analysis computes proper factor loadings, and the estimated model fit relies on more conservative computing methods, ensuring that the presented values properly approximate the real values. The IRT-based 2PL model should be treated with reserve, since we did not test it on an adequate sample size of at least 200 participants. From a test participant's perspective, it would be more attractive and easier to take a test with fewer questions, or in our case to add a few more questions to attain a reliability of at least Alpha = 0.7, than to take the 10-question IRT-model test, which still has to be upgraded, and end up with a biased result and a test with more questions. In the next section, we take a look at the Spearman-Brown prophecy coefficient, the H index and the item psychometric properties of the retained model.

Spearman-Brown prophecy coefficient
The Spearman-Brown prophecy formula was also applied, showing by what factor the scale would need to be lengthened to attain a reliability coefficient of 0.95. The obtained lengthening factor was 9.127, which, if implemented, would result in a scale of roughly 100 items. This result allows us to rethink our scale: since it is highly improbable to obtain a 100-item unidimensional scale, a better approach is to reconstruct the scale with more and better items – e.g.
items with a higher reliability.

H-index
The H index was calculated in order to determine the ability of the test to discriminate between levels of performance in the sample. The H-strata index was calculated from a standardized alpha reliability of 0.676. The result is an index of 2.257, which, when rounded, shows that the test can properly distinguish between 2 groups of performance.

Item psychometric properties
Items were scored 1 for every correct and 0 for every incorrect answer, the maximum total score being 8. The frequency distribution of the scores is described in the graph below.

Table 8. Distribution of scores

Following Classical Test Theory, multiple item indices were analyzed, including the item difficulty index, the item discrimination index (in the form of the item-total correlation), the adjusted item discrimination index and their corresponding confidence intervals. The standard error of measurement for the total scores is SE = 1.188. The presented scale has a mean of M = 4.62 and a standard deviation of SD = 2.07.

Table 9. Item characteristics
| Item | Difficulty | Adjusted discrimination | 90% confidence interval | Item-total correlation (discrimination) | Alpha "if item deleted" |
| X4 | 0.600 | 0.380 | L: 0.233; U: 0.510 | 0.392 | 0.639 |
| X8 | 0.543 | 0.373 | L: 0.225; U: 0.504 | 0.436 | 0.641 |
| X11 | 0.533 | 0.356 | L: 0.205; U: 0.488 | 0.436 | 0.645 |
| X13 | 0.811 | 0.417 | L: 0.274; U: 0.542 | 0.449 | 0.633 |
| X15 | 0.267 | 0.322 | L: 0.169; U: 0.460 | 0.461 | 0.652 |
| X17 | 0.476 | 0.397 | L: 0.251; U: 0.525 | 0.476 | 0.634 |
| X18 | 0.733 | 0.355 | L: 0.205; U: 0.488 | 0.494 | 0.645 |
| X25 | 0.657 | 0.319 | L: 0.165; U: 0.456 | 0.537 | 0.654 |

From the table and the graph we can see that all items are within acceptable ranges of difficulty; items X4, X8, X11 and X17 are in the recommended range of 0.3 – 0.7. Item discriminations are all above the minimal value of 0.2, and most items are above the recommended discrimination value for binary data of > 0.3.
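The classical statistics reported in this section can be approximately reproduced from the summary values alone. Below is a minimal sketch (in Python rather than the paper's R code) assuming the reported M = 4.62, SD = 2.07, raw alpha = 0.673 and standardized alpha = 0.676; small deviations from the reported SEM (1.188), KR-21 (0.626), prophecy factor (9.127) and H-strata index (2.257) are expected, since the exact alpha values used in RStudio are not stated.

```python
from math import sqrt

# Assumed summary values for the 8-item CFA scale, taken from the text.
k, mean, sd = 8, 4.62, 2.07
alpha_raw = 0.673   # Cronbach's alpha of the CFA model
alpha_std = 0.676   # standardized alpha used for the H-strata index

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
sem = sd * sqrt(1 - alpha_raw)

# KR-21: a lower-bound reliability estimate for binary items that
# assumes equal item difficulties.
kr21 = (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))

# Spearman-Brown prophecy: lengthening factor needed to reach a target
# reliability from the current one.
target = 0.95
factor = target * (1 - alpha_std) / (alpha_std * (1 - target))

# H-strata: number of statistically distinguishable performance levels,
# via the separation index sqrt(r / (1 - r)).
separation = sqrt(alpha_std / (1 - alpha_std))
strata = (4 * separation + 1) / 3
```

Rounded, these give SEM ≈ 1.18, KR-21 ≈ 0.62, a lengthening factor of ≈ 9.1 and ≈ 2.26 strata, in line with the values reported above.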
Item difficulty, adjusted discrimination and item-total correlation (discrimination) are also displayed in graphs, which illustrate where the items are positioned with regard to the recommended and minimal values.

Figure 5. Test item difficulty
*Item number refers to the chronological position of the items in the CFA model

The item discriminations are also presented graphically below. The values were calculated with the point-biserial coefficient, i.e. an item-discrimination estimate that accounts for item overlap and is appropriate for binary data. The 90% confidence intervals were calculated in RStudio with the "psychometric" package, using the "CIr" command.

Figure 6. Test item discriminations
*Item number refers to the chronological position of the items in the CFA model

The item discriminations are above the acceptable threshold of 0.2. Item discrimination in CFA is comparable to item information in IRT. Item-total correlations are similar to item discriminations; the difference is that the item-total correlations were calculated with the biserial coefficient, while the discrimination coefficients used the point-biserial coefficient. The graph below shows the item-total correlations.

Figure 7. Item-total correlation
*Item number refers to the chronological position of the items in the CFA model

Correlation matrices are also presented. The initial purpose of correlation and covariance matrices is to look for negative correlations and flag them as possible problems during model estimation. Since only the correlation matrix for the retained items is presented, we can evaluate the strength of the relationships between items. Although all items correlate positively, the relationships are weak, with almost no correlations above r = 0.3.
This is a problem especially for the reliability of the scale, and we should investigate whether the items are too complex, whether they are misunderstood due to language differences, or whether some external factor is affecting the response patterns.

Table 10. Item correlation matrix
|     | X4    | X8    | X11   | X13   | X15   | X17   | X18   | X25 |
| X4  | 1     |       |       |       |       |       |       |     |
| X8  | 0.265 | 1     |       |       |       |       |       |     |
| X11 | 0.210 | 0.291 | 1     |       |       |       |       |     |
| X13 | 0.198 | 0.188 | 0.227 | 1     |       |       |       |     |
| X15 | 0.273 | 0.121 | 0.132 | 0.183 | 1     |       |       |     |
| X17 | 0.195 | 0.147 | 0.280 | 0.268 | 0.244 | 1     |       |     |
| X18 | 0.167 | 0.311 | 0.127 | 0.311 | 0.120 | 0.230 | 1     |     |
| X25 | 0.188 | 0.142 | 0.129 | 0.263 | 0.209 | 0.207 | 0.154 | 1   |

Norming
The final segment is the norming table. Norming transformations allow us to interpret the data more clearly and give each participant the ability to compare themselves to the general population. Norming was conducted based on the mean and standard deviation of the sample, M = 4.62, SD = 2.07. The scores distribution with the previously mentioned properties is also presented. Norming was done by converting the raw scores to Z-scores and percentiles. The results are shown below.

Table 11. Transformed scores
| Score | Frequency | Proportion | Cumulative proportion | Z-score | 95% confidence interval (z) |
| 0 | 3 | 0.0286 | 0.0286 | -2.224 | L: -2.420; U: -2.030 |
| 1 | 8 | 0.0762 | 0.1048 | -1.742 | L: -1.936; U: -1.548 |
| 2 | 9 | 0.0857 | 0.1905 | -1.261 | L: -1.455; U: -1.067 |
| 3 | 9 | 0.0857 | 0.2762 | -0.779 | L: -0.975; U: -0.585 |
| 4 | 15 | 0.1429 | 0.4190 | -0.298 | L: -0.492; U: -0.104 |
| 5 | 21 | 0.2000 | 0.6190 | 0.183 | L: -0.012; U: 0.377 |
| 6 | 17 | 0.1619 | 0.7810 | 0.665 | L: 0.471; U: 0.859 |
| 7 | 19 | 0.1810 | 0.9619 | 1.146 | L: 0.952; U: 1.340 |
| 8 | 4 | 0.0381 | 1.0000 | 1.628 | L: 1.434; U: 1.821 |

Figure 8. Probability density distribution of scores

References
Pew Research Center (2015). A Look at What the Public Knows and Does Not Know About Science.
Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). New York, NY: Guilford. Chapters 2-5.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness-of-fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-600.
Dunn, T. J., Baguley, T., & Brunsden, V. (2014).
From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105(3), 399-412.
Muthén, B. O., et al. (1998). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes.
Guilford, J. P. (1965). Fundamental statistics in psychology and education (4th ed.). New York: McGraw-Hill.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.