Reliability - University of Tennessee at Chattanooga



PSY 513 – Lecture 1

Reliability

Characteristics of a psychological test or measuring procedure

Answers to the questions: How do I know if I have a good test or not? What makes a good test? Is this a good test?

There are (only) two primary measures of test quality.

1. Reliability – The extent to which a test, instrument, or measuring procedure yields the same score for the same person from one administration to the next.

2. Validity – The extent to which scores on a test correlate with some valued criterion. The criterion can be other measures of the same construct, other measures of different constructs, or performance on a job or task.

So, we ask about any test: Is it reliable? Is it valid? These are the two main questions.

Other, less important characteristics, considered only for reliable and valid tests:

3. Reading level

4. Face validity – Extent to which the test appears to measure what it measures.

5. Content validity – Extent to which test content corresponds to the content of what it is designed to measure or predict.

6. Cost

7. Length – time required to test.

So what is a good test?

A good psychological test is reliable, valid, has a reading level appropriate to the intended population, has acceptable face and content validity, is cheap, and doesn't take too long. This will likely be a test question.

Scoring psychological tests

Most tests have multiple items. The test score is usually the sum or average of responses to the multiple items.

If the test is one of knowledge, the score is typically the number of correct responses. But newer methods based on Item Response Theory use a different type of score.

If the test is a measure of a personality characteristic, the score is often the sum or mean of numerically coded responses, e.g., 1s, 2s, . . . 5s.
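As a minimal sketch of the two scoring rules just described (the item responses below are made up, not from any real scale):

```python
# Scoring a multi-item personality measure: the scale score is the sum
# (or the mean) of the numerically coded item responses.
# These five responses are hypothetical 1-5 agreement ratings.

responses = [4, 2, 5, 3, 4]

total_score = sum(responses)                  # sum scoring
mean_score = sum(responses) / len(responses)  # mean scoring

print(total_score, mean_score)  # 18 3.6
```

Either expression carries the same information; the mean is just the sum rescaled to the response scale of the items.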
But I'll argue later that methods based on Item Response Theory or Factor Analysis may be better.

Sometimes subtest scores are computed, and the overall score will be the sum of scores on subtests.

Occasionally, the overall score will be the result of performance on some task, such as holding a stylus on a revolving disk, as in the Pursuit Rotor task, or moving pegs from holes in one board to holes in another, as in the Pegboard dexterity task. But most psychological tests are "paper and pencil" or the computer equivalent of paper and pencil.

Invariably the result of the "measurement" of a characteristic using a psychological test is a number – the person's score on that test, just as the result of measurement of weight is a number – the score on the face of the bathroom scale.

Give Thompson Big Five Minimarkers for in-class administration (MDBR\Scales) here. Dimensions are Extraversion, Openness to Experience, Stability, Conscientiousness, Agreeableness.

Below are summary statistics from a group of 206 UTC students, mostly undergraduates. Plot your mean responses on the following graph to get an idea of your Big Five profile.

[Summary statistics and Big Five profile graph appeared here.]

Reliability

Working Definition: The extent to which a test or measuring procedure yields the same score for the same person from one administration to the next in instances when the person's true amount of whatever is being measured has not changed from one time to the next.

Consider the following hypothetical measurements of IQ.

             Highly Reliable Test          Test with Low Reliability
Person   IQ at Time 1   IQ at Time 2    IQ at Time 1   IQ at Time 2
  1          112            111             112            105
  2          140            141             140            128
  3           85             86              85             92
  4          106            108             106            100
  5          108            107             108            116
  6           95             93              95            105
  7          117            118             117            110
  8          120            121             120            126
  9          135            134             135            130

High reliability: Persons' scores will be about the same from measurement to measurement.

Low reliability: Persons' scores will be different from measurement to measurement.

Note that there is no claim that these IQ scores are the "correct" values for Persons 1-9. That is, this is not about whether or not they are valid or accurate measures.
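The contrast in the table can be checked numerically. The sketch below computes the time-1/time-2 Pearson correlation for both hypothetical tests; the correlation-based view of reliability is developed formally later in these notes.

```python
# Test-retest agreement for the two hypothetical IQ tests in the table.
# A plain Pearson correlation, written out so nothing is hidden.

from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

time1            = [112, 140, 85, 106, 108, 95, 117, 120, 135]
time2_reliable   = [111, 141, 86, 108, 107, 93, 118, 121, 134]
time2_unreliable = [105, 128, 92, 100, 116, 105, 110, 126, 130]

print(round(pearson(time1, time2_reliable), 3))    # very close to 1
print(round(pearson(time1, time2_unreliable), 3))  # noticeably lower
```

Because the scores barely move from Time 1 to Time 2 on the reliable test, its correlation is nearly 1; the unreliable test's correlation is visibly lower.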
It's just about whether whatever measures we have are the same from one time to the next.

Lay people often use the word "reliability" to mean validity. Don't be one of them. Reliability means simply whether or not the scores of the same people stay the same from one measurement to the next, regardless of whether those scores represent the true amount of whatever the test is supposed to measure.

Why do we care about reliability?

Think about your bathroom scale and the number it gives you from day to day. What would you prefer – a number that varied considerably from day to day, or a number that, assuming you haven't changed, was about the same from day to day?

Obviously, we're mostly interested in the validity of our tests. But we first have to consider reliability. The need for high reliability is a technical issue – we have to have an instrument that gives the same result every time we use it before we can consider whether or not the result is valid.

Classical Test Theory: A way of thinking about test scores

Basic Assumption: Each observed score is the sum of a True Score and an Error of Measurement. True scores are assumed to be unchanged from one time to the next. Errors of measurement are assumed to vary randomly and independently from one time to the next.

Observed score: The score of a person on the measuring instrument.

True score: The actual amount of the characteristic possessed by an individual. It is assumed to be unchanged from measurement to measurement (within reason).

Error of measurement: An addition to or subtraction from the true score which is random and unique to the person and time of measurement.
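The three quantities just defined can be sketched in a small simulation. Everything here is an assumption for illustration: a made-up population of true scores (mean 100, SD 15) and a made-up error SD of 3.

```python
# Classical Test Theory as a simulation: a fixed true score T per person,
# plus a fresh random error E_j at each administration.

import random
from math import sqrt

random.seed(1)

# T: each person's fixed true score (hypothetical population: mean 100, SD 15)
true_scores = [random.gauss(100, 15) for _ in range(2000)]

def administer(error_sd=3.0):
    """One administration: observed score = T + a fresh random error."""
    return [t + random.gauss(0, error_sd) for t in true_scores]

time1 = administer()
time2 = administer()  # same true scores, brand-new errors

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Small errors relative to the spread of true scores -> scores barely
# change from time 1 to time 2, i.e., the simulated test is reliable.
print(round(pearson(time1, time2), 3))
```

Raising `error_sd` in this sketch lowers the time-1/time-2 correlation, which is exactly the connection between error of measurement and reliability developed in the next section.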
In Classical Test Theory, the observed score is the sum of the true score and the error of measurement. Symbolically:

Observed Score at time j = True Score + Error of Measurement at time j

Xj = T + Ej

where j represents the measurement time. Note that T is not subscripted because it is assumed to be constant across times of measurement.

It is assumed that if there were no error of measurement, the observed score would equal the true score. But typically, error of measurement causes the observed score to be different from the true score. This means that everyone who measures anything hates error of measurement. At last, something we can agree on.

So, for a person,

Observed Score at time 1 = True Score + Measurement Error at time 1.
Observed Score at time 2 = True Score + Measurement Error at time 2.

Note again that the true score is assumed to remain constant across measurements.

Implications for Reliability

Notice that if the measurement error at each time is small, then the observed scores will be close to each other and the test will be reliable – we'll get essentially the same number each time we measure. So reliability is related to the sizes of measurement errors – smaller measurement errors mean higher reliability. This means that unreliability is the fault of errors of measurement. If it weren't for errors of measurement, all psychological tests would be perfectly reliable – scores would not change from one time to the next.

Two Ways of Conceptualizing Reliability

Two possibilities, both requiring measurement at two points in time.

1. Conceptualizing reliability as differences between scores from one time to another

This is the conceptualization that follows directly from the Classical Test Theory notions above. Consider just the differences between measures.

             Highly Reliable Test           Test with Low Reliability
Person   Time 1   Time 2   Difference    Time 1   Time 2   Difference
  1        112      111         1          112      109         3
  2        140      140         0          140      128        12
  3         85       86        -1           85       92        -7
  4        106      108        -2          106      100         6
  5        108      107         1          108      116        -8
  6         95       93         2           95      105       -10
  7        117      118        -1          117      110         7
  8        120      120         0          120      123        -3
  9        135      135         0          135      130         5

[Histograms of the two distributions of difference scores appeared here.]

A measure of variability of the differences could be used as a summary of reliability. One such measure is the standard deviation of the difference scores obtained from two applications of the same test. The smaller the standard deviation, the more reliable the test.

Advantages

1) This conceptualization stems naturally from the Classical Test Theory framework – it is based directly on the variability of the Ejs in the Xj = T + Ej formulation. Small Ejs mean less variability.

2) So it's easy to understand, kind of.

Problems:

1) It's a golf score; smaller is better.
Some nongolfers have trouble with such measures.

2) The standard deviation of difference scores depends on the response scale. Tests with a 1-7 scale will have larger standard deviations than tests that use a 1-5 scale, even though the test items might be identical.

3) It requires that the test be given twice, with no memory of the first test when participants take the 2nd test – a situation that's hard to create.

It is useful, however, for assessing how much one could expect a person's score to vary from one time to another. For example: Suppose you miss the cutoff for a program by 10 points. If the standard deviation of differences is 40, then you have a good chance of exceeding the cutoff next time you take the test. If the standard deviation of differences is 2, then your chances of exceeding the cutoff by taking the test again are much smaller.

2. Conceptualizing reliability as the correlation between measurements at two time periods

This conceptualization is based on the fact that if the differences between scores on two successive measurements are small, then the correlation between those two sets of scores will be large and positive.

[Scatterplot of Score at Time 2 against Score at Time 1 for a highly reliable test appeared here. If the scores on two administrations are nearly the same, then the correlation between the paired scores will be positive and large.]

[Scatterplot of Score at Time 2 against Score at Time 1 for a test with low reliability appeared here. If the scores on two administrations are not nearly the same, then the correlation between the paired scores will be close to zero.]

If the measurements are identical from time 1 to time 2, indicating perfect reliability, r = 1. If there is no correspondence between measures at the two time periods, indicating the worst possible reliability, r = 0.

Advantages of using the correlation between two administrations as a measure of reliability

1) It's a bowling score – bigger r means higher reliability.

2) It is relatively independent of response scale – items responded to on a 1-5 scale are about as reliable as the same items responded to on a 1-7 scale.

3) The correlation is a standardized measure ranging from 0 to 1, so it's easy to conceptualize reliability in an absolute sense – close to 1 is good; close to 0 is bad.

Disadvantages

1) Nonobvious relationship to Classical Test Theory – it requires some thought.

2) Assessment as described above requires two administrations.

Conclusion

Most common measures of reliability are based on the conception of reliability as the correlation between successive measures.

Definition of reliability

The reliability of a test is the correlation between the population of values of the test at time 1 and the population of values at time 2, assuming constant true scores and no relationship between errors of measurement on the two occasions.
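The two conceptualizations can be put side by side on the hypothetical IQ data from the earlier table; a sketch:

```python
# Two summaries of reliability for the hypothetical IQ data:
# (1) the standard deviation of the difference scores (a "golf score"), and
# (2) the correlation between the two administrations (a "bowling score").

from math import sqrt
from statistics import stdev

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

time1            = [112, 140, 85, 106, 108, 95, 117, 120, 135]
time2_reliable   = [111, 141, 86, 108, 107, 93, 118, 121, 134]
time2_unreliable = [105, 128, 92, 100, 116, 105, 110, 126, 130]

for label, time2 in [("reliable", time2_reliable),
                     ("unreliable", time2_unreliable)]:
    diffs = [a - b for a, b in zip(time1, time2)]
    print(label, round(stdev(diffs), 2), round(pearson(time1, time2), 3))
```

The reliable test shows a small SD of differences and a correlation near 1; the unreliable test shows the reverse, so the two summaries agree in their ordering of the tests.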
It is symbolized as Population rXX' or simply as rXX'. This is pronounced "r sub X, X-prime."

As is the case with any population quantity, such as the population mean or population variance, the definition of reliability refers to a situation that most likely is not realizable in practice.

1) If the population is large, vague, or infinite, as most are, then it will be impossible to access all the members of the population.

2) The assumption of no carry-over from Time 1 to Time 2 is very difficult to realize in practice, since people remember how they performed or responded on tests. For this reason, it is often (though not always) not feasible in practice to test people twice to measure reliability.

The bottom line is that the true reliability of a test is a quantity that we'll never actually know, just as we'll never know the true value of a population mean or a population variance. What we will know is the value of one or more estimates of reliability.

You'll hear people speak about "the reliability of the test." You should remember that they should say "the estimate of the reliability of the test." I'll use the phrase "true reliability" or "population reliability" to refer to the population value. I'll try to remember to use "estimate of reliability" when referring to one of the estimates.

Some facts about reliability if Classical Test Theory is true, in case you're interested . . .

1. Variance of Observed scores = Variance of True scores + Variance of Errors of Measurement

   σ²X = σ²T + σ²E

2. True reliability = Variance of True scores / Variance of Observed scores

   rXX' = σ²T / σ²X

Neither of these is of particular use in practice, though. They're presented here for completeness.

Estimates of Reliability

As said above, we never know the true reliability of a test, so we have to get by with estimates of reliability.

Test-retest estimate – the most acceptable estimate if you can meet the assumptions

Operational Definition

1. Give the test to a normative group.
2. Minimize memory/carryover from the first administration.
3. Ensure that there are no changes in true values of what is being measured.
4. Give the test again to the same people.
5. Compute the correlation between scores on the two administrations.

This is the most straightforward estimate – it fits nicely with the conceptual definition of true reliability.

Disadvantages

Requires two administrations of the test – more time.
May be inflated by memory/carryover from the first administration to the second.
May be deflated by changes in true scores from the first to the second administration.

Advantages

Has good "face" validity.
Essentially always acceptable if the assumptions are met – regardless of the nature of the test.
For performance tests, the test-retest method may be the only feasible method.
For single-item scores, it may be the only feasible method.

Bottom Line

You should always compute and report the test-retest reliability estimate if you can. If you can meet the necessary assumptions, it is the most generally accepted estimate of reliability, in my view.

Practicing what I preach. Excerpt from a recent paper . . .

[Excerpt appeared here.]

Parallel Forms estimate

Solving the "memory for previous responses to the same items" problem.

Operational Definition

1. Develop two equivalent forms of the test. They should have the same mean and variance.
2. Give both forms to the normative group.
3. Compute the correlation between paired scores.
That correlation is the reliability estimate of each form.

Note that this definition has introduced a new notion – the notion that an equivalent form can "stand in" for the original test when computing the correlation that is the estimate of reliability. If we give the same test twice to compute a test-retest estimate, we can be reasonably sure it's the same test on the second administration as it was on the first. But giving an equivalent form of the test requires a leap of faith – that the 2nd form is interchangeable with the original form on that second administration.

The key to the success of the parallel forms method is that the two forms be equivalent. Equal means and variances are necessary for that equivalence.

Advantages

Don't have to worry about memory/carryover between two administrations.
Having two forms that can be used interchangeably may be useful in practice – a bonus.
One reliability estimate, if high enough, can be applied to TWO tests.

Disadvantages

It takes more time to develop two forms than it does one.
It may not be possible to develop alternative, equivalent forms.
A low reliability estimate, i.e., a low r between forms, has two interpretations:
1. It could be due to low reliability of one or both of the forms.
2. It could be that the forms are not equivalent.

The idea represented here – that the correlation between equivalent measures of the same thing can be used to assess the reliability of each – is a profound one, one that has had important implications for the estimation of reliability, as we'll soon see.

Bottom Line

If you can develop two alternative and equivalent forms of the test, then by all means use them and report the correlation of the two as the reliability estimate of each.

Split-half estimate

"Halving your test and using it, too." The lazy person's answer to parallel forms.

Operational Definition

1. Identify two equivalent halves of the test with equal means and variances.
2. Give the test once.
3. Score the halves separately, so that you have two scores for each person – a score on the 1st half and a score on the 2nd half.
4. Compute the correlation between the 1st-half and 2nd-half scores. Call that correlation rH1,H2. That correlation is the parallel forms reliability estimate of each half.

But we want the reliability of the sum of the two halves, not of each half separately. What to do????

(Dial a statistician. Ring Ring. "Hello, Spearman here. Hmm, good question. Let me ask Brown about it, and I'll get back with you.")

5. Plug the correlation into the following Spearman-Brown Prophecy Formula:

   Split-half reliability estimate of whole test = (2 * rH1,H2) / (1 + rH1,H2)

Trust me: the higher the correlation between the two halves, the larger the estimated reliability.

This is what is called an internal consistency estimate – assuming that if the two halves are highly correlated, the whole test would correlate highly with itself, if given twice. The split-half method is the simplest example of what are called internal consistency estimates of reliability. They're called internal consistency estimates because they rely on the consistency (correlation) of the two halves, both of which are internal to the test. The greater the consistency – correlation – of the two halves, the higher the reliability.

Advantages

1. It allows you to estimate reliability in a single sitting – a major contribution to reliability estimation.
2. Very computerizable. The program that scores the whole test can be programmed to score the two halves and compute a reliability estimate at the same time.

Disadvantages

1. The test may not be splittable – e.g., single-item tests or performance tests.
2. It requires equivalent halves. This may be hard to achieve.
3. A low reliability estimate may be the result of either 1) low reliability of one or both halves or 2) nonequivalence of the halves.
4. Different halving techniques give different estimates of reliability.
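Steps 3-5 of the split-half procedure can be sketched in a few lines. The half-test scores below are hypothetical, invented only to run the arithmetic.

```python
# Split-half estimate: correlate the two half-test scores, then step the
# half-test correlation up to the whole test with Spearman-Brown.

from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

half1 = [10, 14, 8, 12, 15, 9, 13, 11]  # each person's score on the 1st half
half2 = [11, 13, 9, 13, 14, 8, 12, 12]  # same persons, 2nd half

r_h1h2 = pearson(half1, half2)

# Spearman-Brown prophecy formula for the full-length test
split_half_estimate = 2 * r_h1h2 / (1 + r_h1h2)

print(round(r_h1h2, 3), round(split_half_estimate, 3))
```

Note that for any positive rH1,H2 the stepped-up value is larger than the half-test correlation, reflecting the fact that the whole test is twice as long as either half.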
Cronbach's Coefficient Alpha estimate

Coefficient alpha takes the notion introduced by the split-half technique to its logical conclusion.

Logic

The split-half method uses the consistency of two halves to estimate the reliability of the whole – the sum of the two halves. But it's surely the case that the particular halves chosen will affect the estimate of reliability. Some splits will lead to lower estimates; other possible halves might lead to larger estimates of reliability. So, the logic goes, why not look at all possible halves, compute a reliability estimate for each possible split, and then average all those reliability estimates?

Coefficient alpha essentially does this, although it is not based directly on halving the test. Instead, alpha is based on splitting the test into as many pieces as you can – usually into as many items as there are on the test – and computing the correlations between all of the pairs of pieces. The basic idea is that if all the pieces are correlated with each other, the total of those pieces will be reliable from one administration to the next.

Operational Definition of Standardized Cronbach's Alpha

1. Identify as many equivalent pieces of the test as possible. Let K be the number of pieces identified. Each piece is usually one item, so K is usually the number of items on the test.
2. Compute the correlations between all possible pairs of pieces. You'll compute K*(K-1)/2 correlations.
3. Compute the mean (arithmetic average) of the correlations. Call it r-bar. (r for correlation; bar for mean.)
4. Plug K (number of items) and r-bar (mean of the K*(K-1)/2 correlations) into the following formula:

   Standardized alpha of whole test = α = (K * r-bar) / (1 + (K-1) * r-bar)

Relationship to split-half reliability

Coefficient alpha is simply an extension of split-half reliability to more than two pieces. Note that if K = 2, then there is only one correlation – the correlation between the two halves. So if there were only two pieces, r-bar would be simply rH1,H2, the correlation between the two halves of the test, and the formula for alpha reduces to 2*rH1,H2 / (1 + rH1,H2). This is the split-half formula.

"Regular" Cronbach's Alpha

There is another formula, based on the variances of the pieces and the covariances between them, that is typically computed and reported. If you see alpha reported, it will likely be the variance-based version. I presented the standardized version here because 1) its formula is easier to follow than the variance-based formula and 2) its value is typically within .02 of the variance-based version. SPSS used to report both. Now, I believe, it reports only "regular" alpha.

Hand Computation of Standardized Coefficient Alpha

Suppose a scale of job satisfaction has four items.

Q1: I'M HAPPY ON MY JOB.
Q2: I LOOK FORWARD TO GOING TO WORK EACH DAY.
Q3: I HAVE FRIENDLY RELATIONSHIPS WITH MY COWORKERS.
Q4: MY JOB PAYS WELL.

Suppose I gave this "job satisfaction" instrument to a group of 100 employees. Each person responded with extent of agreement to each item on a scale of 1 to 5.
Total score, i.e., the observed amount of job satisfaction, is either the sum of the responses to the four items or the mean of the four items. The data matrix might look like the following, with two different expressions of scale scores:

Person   Q1   Q2   Q3   Q4   Total   Mean
  1       3    4    3    3     13    3.25
  2       5    4    5    5     19    4.75
  3       1    2    1    1      5    1.25
  4       3    2    3    3     11    2.25
  5       4    5    4    3     16    4.00
  6       4    4    3    2     13    3.25
 etc     etc  etc  etc  etc   etc     etc

Suppose the correlations between the items were as follows. (Obviously, each item correlates perfectly with itself, so the 1's on the diagonal will not be used in the computation of alpha.)

       Q1    Q2    Q3    Q4
Q1      1
Q2     .4     1
Q3     .5    .4     1
Q4     .3    .4    .5     1

The average of the interitem correlations, r-bar, is

r-bar = (.4 + .5 + .3 + .4 + .4 + .5) / 6 = 2.5 / 6 = .417

Standardized coefficient alpha is

             K * r-bar              4 * .417          1.668
Alpha = ------------------- = ------------------ = --------- = .74
         1 + (K-1) * r-bar     1 + (4-1) * .417      2.251

Notes:

1. Alpha is merely a re-expression of the correlations between the items. The more highly the items are intercorrelated, the larger the value of alpha.

2. Alpha can be increased by adding items, as long as adding them does not decrease the average of the interitem correlations, r-bar. So any test can be made more reliable by adding relevant items – items which correlate with the other items.

3. Just as with the split-half reliability estimate, alpha depends on the consistency (correlations) of the pieces of the test, all of which are internal to, i.e., part of, the test. So it's an internal consistency estimate. The more consistent the responses to the items, the higher the reliability.

The SPSS RELIABILITY PROCEDURE

Example Data: Items of a Job Satisfaction Scale. 60 respondents. 1=Dissatisfied; 7=Satisfied.
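Before turning to the SPSS output, the standardized-alpha arithmetic from the hand computation above can be checked in a few lines; the six inter-item correlations are the ones given in that example.

```python
# Standardized Cronbach's alpha from the K*(K-1)/2 inter-item correlations.
# The six correlations are those in the 4-item job satisfaction example.

interitem_rs = [.4, .5, .3, .4, .4, .5]
k = 4  # number of items

r_bar = sum(interitem_rs) / len(interitem_rs)
alpha = (k * r_bar) / (1 + (k - 1) * r_bar)

print(round(r_bar, 3), round(alpha, 2))  # 0.417 0.74
```

The same two lines of arithmetic work for any number of items: supply all K*(K-1)/2 correlations and the item count K.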
 Q27  Q32  Q35  Q37  Q43  Q45  Q50  OVSAT
1.00 5.00 2.00 2.00 1.00 1.00 2.00 2.00
1.00 7.00 6.00 4.00 6.00 2.00 6.00 4.57
7.00 7.00 1.00 7.00 7.00 6.00 7.00 6.00
4.00 6.00 6.00 6.00 6.00 6.00 6.00 5.71
1.00 6.00 5.00 2.00 1.00 1.00 3.00 2.71
3.00 3.00 7.00 6.00 7.00 1.00 6.00 4.71
6.00 7.00 7.00 6.00 6.00 6.00 7.00 6.43
2.00 7.00 3.00 3.00 3.00 1.00 3.00 3.14
6.00 6.00 7.00 6.00 6.00 6.00 6.00 6.14
4.00 6.00 5.00 4.00 4.00 3.00 3.00 4.14
1.00 3.00 6.00 5.00 5.00 6.00 5.00 4.43
1.00 5.00 1.00 1.00 1.00 1.00 1.00 1.57
1.00 5.00 1.00 1.00 1.00 5.00 1.00 2.14
1.00 7.00 2.00 2.00 3.00 3.00 3.00 3.00
7.00 7.00 6.00 7.00 7.00 7.00 7.00 6.86
6.00 4.00 4.00 7.00 6.00 7.00 7.00 5.86
7.00 7.00 7.00 7.00 5.00 7.00 7.00 6.71
7.00 7.00 4.00 4.00 7.00 7.00 6.00 6.00
6.00 5.00 7.00 7.00 6.00 5.00 6.00 6.00
7.00 5.00 5.00 6.00 6.00 2.00 9.00 5.17
3.00 6.00 6.00 3.00 5.00 5.00 5.00 4.71
3.00 7.00 6.00 7.00 4.00 3.00 7.00 5.29
6.00 6.00 7.00 7.00 7.00 6.00 7.00 6.57
3.00 7.00 7.00 7.00 6.00 1.00 7.00 5.43
5.00 7.00 6.00 6.00 7.00 6.00 6.00 6.14
4.00 6.00 6.00 6.00 6.00 3.00 6.00 5.29
5.00 5.00 6.00 5.00 5.00 1.00 5.00 4.57
3.00 6.00 2.00 5.00 6.00 6.00 5.00 4.71
4.00 4.00 2.00 3.00 3.00 2.00 2.00 2.86
7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00
5.00 6.00 6.00 4.00 7.00 6.00 6.00 5.71
7.00 6.00 4.00 7.00 7.00 5.00 7.00 6.14
4.00 5.00 4.00 5.00 7.00 5.00 7.00 5.29
3.00 7.00 7.00 7.00 6.00 6.00 7.00 6.14
6.00 6.00 6.00 6.00 5.00 5.00 6.00 5.71
4.00 5.00 7.00 4.00 6.00 4.00 7.00 5.29
7.00 7.00 6.00 7.00 7.00 6.00 7.00 6.71
6.00 5.00 2.00 7.00 6.00 6.00 7.00 5.57
3.00 6.00 7.00 5.00 3.00 7.00 6.00 5.29
6.00 6.00 7.00 7.00 6.00 6.00 7.00 6.43
6.00 4.00 5.00 7.00 6.00 6.00 6.00 5.71
4.00 4.00 4.00 6.00 4.00 1.00 2.00 3.57
5.00 5.00 6.00 6.00 7.00 5.00 6.00 5.71
4.00 6.00 6.00 6.00 6.00 6.00 6.00 5.71
5.00 6.00 6.00 6.00 6.00 6.00 6.00 5.86
2.00 2.00 2.00 2.00 2.00 2.00 3.00 2.14
5.00 6.00 6.00 5.00 5.00 6.00 6.00 5.57
2.00 6.00 6.00 5.00 3.00 5.00 6.00 4.71
5.00 6.00 2.00 5.00 5.00 6.00 4.00 4.71
5.00 6.00 7.00 6.00 6.00 7.00 7.00 6.29
1.00 6.00 6.00 2.00 5.00 1.00 5.00 3.71
5.00 6.00 7.00 6.00 6.00 3.00 7.00 5.71
6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00
7.00 7.00 7.00 7.00 7.00 6.00 7.00 6.86
7.00 1.00 6.00 7.00 3.00 5.00 6.00 5.00
4.00 6.00 6.00 5.00 5.00 6.00 6.00 5.43
1.00 5.00 5.00 5.00 1.00 2.00 5.00 3.43
1.00 6.00 5.00 3.00 5.00 5.00 3.00 4.00
7.00 7.00 7.00 7.00 7.00 5.00 7.00 6.71
4.00 6.00 7.00 7.00 7.00 5.00 7.00 6.14

Analyze -> Scale -> Reliability Analysis …

[Screenshot of the Reliability Analysis dialog appeared here, with a callout: Click on this button.]

The syntax for this output, if you're interested:

RELIABILITY
  /VARIABLES=Q27 Q32 Q35 Q37 Q43 Q45 Q50
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE SCALE
  /SUMMARY=TOTAL MEANS VARIANCE COV CORR.

Reliability

Scale: ALL VARIABLES

Case Processing Summary
                  N      %
Cases  Valid      60   100.0
       Excluded    0      .0
       Total      60   100.0
a. Listwise deletion based on all variables in the procedure.

Reliability Statistics
Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
      .866                          .862                               7

Item Statistics
        Mean    Std. Deviation    N
Q27    4.3167      2.06251       60
Q32    5.7000      1.27957       60
Q35    5.2500      1.86516       60
Q37    5.2833      1.76685       60
Q43    5.2000      1.81145       60
Q45    4.5500      2.04546       60
Q50    5.6000      1.75827       60

[Note: All items should have approximately equal standard deviations. Item 32 is suspect here. In general, items with small standard deviations will tend to suppress reliability.]

In the correlation matrix below, look for items with small or negative correlations with the other items. They'll be the most likely candidates for exclusion from the scale.
Item 32's correlations have been highlighted.

Inter-Item Correlation Matrix
        Q27     Q32     Q35     Q37     Q43     Q45     Q50
Q27   1.000    .139    .296    .738    .659    .565    .648
Q32    .139   1.000    .217    .143    .326    .245    .277
Q35    .296    .217   1.000    .534    .477    .252    .631
Q37    .738    .143    .534   1.000    .686    .495    .801
Q43    .659    .326    .477    .686   1.000    .514    .760
Q45    .565    .245    .252    .495    .514   1.000    .496
Q50    .648    .277    .631    .801    .760    .496   1.000

Summary Item Statistics
                          Mean   Minimum  Maximum   Range   Maximum/Minimum  Variance  N of Items
Item Means               5.129    4.317    5.700    1.383        1.320          .264        7
Item Variances           3.293    1.637    4.254    2.617        2.598          .761        7
Inter-Item Covariances   1.583     .324    2.688    2.365        8.305          .626        7
Inter-Item Correlations   .471     .139     .801     .662        5.747          .043        7

Item-Total Statistics
       Scale Mean if   Scale Variance if   Corrected Item-    Squared Multiple   Cronbach's Alpha
       Item Deleted      Item Deleted      Total Correlation    Correlation      if Item Deleted
Q27      31.5833           62.484               .699               .644               .839
Q32      30.2000           81.417               .280               .160               .884
Q35      30.6500           69.926               .516               .442               .864
Q37      30.6167           63.901               .796               .739               .826
Q43      30.7000           63.536               .786               .649               .827
Q45      31.3500           66.401               .568               .374               .859
Q50      30.3000           62.959               .841               .767               .820

[Note: Use the rightmost column to identify items whose inclusion makes alpha smaller than it would be without the item.]

Scale Statistics
   Mean    Variance   Std. Deviation   N of Items
 35.9000    89.515        9.46125           7

I've reproduced the display of alpha for the whole scale to make it easier to use the values in the rightmost column above.

Reliability Statistics
Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
      .866                          .862                               7

Reliability Example

Tests with Right/Wrong Answers

The example below illustrates how reliability analysis would be performed on a multiple-choice test in which there was a right and a wrong answer to each item. I chose to enter the raw responses to the items into SPSS from within a Syntax Window. The DATA LIST command tells SPSS the names of the variables (q1, q2, . . ., q36) and where each is located within a line (columns 1-36).
For this example, q36 was an essay question and was not included in the reliability analysis.

The values represent responses marked by test takers as follows: 1 = a, 2 = b, 3 = c, 4 = d, 9 = no answer provided.

DATA LIST /q1 to q36 1-36.
BEGIN DATA.
333331112113322114241221114421423122
311333432212311114341422321112133224
323333431212311424242411331222413225
333333441212321421242411322921423223
313333441232311121241411324121423225
323333141212311121242412321221423225
111411431212213434342421111222433225
211321413212333314142413224121443124
333311112313122412341411222122423221
332333133212311414142411224222423220
323333431212311414242441321221433223
213332131212311134142211221221433225
313333441312322114122411214222333325
323333431212314121242411324221422223
331332412212312114241111311422413221
333133413232311124142214131121433224
313333431212311414242411221121423225
313321432313341324342431311221433224
312321431333212424232111223221433223
323332131213311414242411321421433224
323333441212311124242411321221423225
313333431212311123242411321221421225
331333431212311121242411214312423293
113333412313313121242241221321423324
END DATA.

The following syntax commands "score" each response and put the score for each question into a new variable.

RECODE q1 (3=1) (ELSE=0) INTO q1score.
RECODE q2 (2=1) (ELSE=0) INTO q2score.
RECODE q3 (3=1) (ELSE=0) INTO q3score.
RECODE q4 (3=1) (ELSE=0) INTO q4score.
RECODE q5 (3=1) (ELSE=0) INTO q5score.
RECODE q6 (3=1) (ELSE=0) INTO q6score.
RECODE q7 (4=1) (ELSE=0) INTO q7score.
RECODE q8 (3=1) (ELSE=0) INTO q8score.
RECODE q9 (1=1) (ELSE=0) INTO q9score.
RECODE q10 (2,3=1) (ELSE=0) INTO q10score.
(This question had two correct answers.)
RECODE q11 (1=1) (ELSE=0) INTO q11score.
RECODE q12 (2=1) (ELSE=0) INTO q12score.
RECODE q13 (3=1) (ELSE=0) INTO q13score.
RECODE q14 (1=1) (ELSE=0) INTO q14score.
RECODE q15 (1=1) (ELSE=0) INTO q15score.
RECODE q16 (1=1) (ELSE=0) INTO q16score.
RECODE q17 (2=1) (ELSE=0) INTO q17score.
RECODE q18 (1=1) (ELSE=0) INTO q18score.
RECODE q19 (2=1) (ELSE=0) INTO q19score.
RECODE q20 (4=1) (ELSE=0) INTO q20score.
RECODE q21 (2=1) (ELSE=0) INTO q21score.
RECODE q22 (4=1) (ELSE=0) INTO q22score.
RECODE q23 (1=1) (ELSE=0) INTO q23score.
RECODE q24 (1=1) (ELSE=0) INTO q24score.
RECODE q25 (2,3=1) (ELSE=0) INTO q25score.
(This question had two correct answers.)
RECODE q26 (2=1) (ELSE=0) INTO q26score.
RECODE q27 (1=1) (ELSE=0) INTO q27score.
RECODE q28 (2=1) (ELSE=0) INTO q28score.
RECODE q29 (2=1) (ELSE=0) INTO q29score.
RECODE q30 (1=1) (ELSE=0) INTO q30score.
RECODE q31 (4=1) (ELSE=0) INTO q31score.
RECODE q32 (2=1) (ELSE=0) INTO q32score.
RECODE q33 (3=1) (ELSE=0) INTO q33score.
RECODE q34 (2=1) (ELSE=0) INTO q34score.
RECODE q35 (2=1) (ELSE=0) INTO q35score.

The following is a listing of the newly created "score" variables. (In the original listing the variable names q1score through q35score were printed vertically above the columns.) The columns are q1score through q35score, in order; the last column is TOTSCORE, the person's total score.

1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1  16
1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1  21
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1  30
1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1  29
1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1  29
1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1  32
0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1  18
0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 0 1  17
1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 1 1 1 1 1  17
1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1  25
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1  30
0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1  26
1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1  21
1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1  32
1 0 0 1 1 0 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 1 0 1 0 1 0 1 0 1 1 1  21
1 0 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1  22
1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1  30
1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1  23
1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1  21
1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1  27
1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  33
1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1  32
1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0  27
0 0 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1  25

The RELIABILITY procedure was invoked with the following syntax command. (Obviously, it can also be invoked from a pull-down menu.) Note that the variables which are assessed are the 1/0 "score" variables, not the original responses.

RELIABILITY
 /VARIABLES=q1score q2score q3score q4score q5score q6score q7score q8score q9score
  q10score q11score q12score q13score q14score q15score q16score q17score q18score
  q19score q20score q21score q22score q23score q24score q25score q26score q27score
  q28score q29score q30score q31score q32score q33score q34score q35score
 /FORMAT=NOLABELS
 /SCALE(ALPHA)=ALL
 /MODEL=ALPHA
 /STATISTICS=DESCRIPTIVE SCALE
 /SUMMARY=TOTAL CORR .

Reliability output from a previous version of SPSS.

****** Method 2 (covariance matrix) will be used for this analysis ******

R E L I A B I L I T Y   A N A L Y S I S  -  S C A L E   (A L P H A)

                 Mean     Std Dev   Cases
  1. Q1SCORE     .8333    .3807     24.0
  2. Q2SCORE     .2500    .4423     24.0
  3. Q3SCORE     .7083    .4643     24.0
  4. Q4SCORE     .9167    .2823     24.0
  5. Q5SCORE     .7917    .4149     24.0
  6. Q6SCORE     .6250    .4945     24.0
  7. Q7SCORE     .7500    .4423     24.0
  8. Q8SCORE     .5417    .5090     24.0
  9. Q9SCORE     .6250    .4945     24.0
 10. Q10SCORE    .9583    .2041     24.0
 11. Q11SCORE    .8750    .3378     24.0
 12. Q12SCORE    .7500    .4423     24.0
 13. Q13SCORE    .8750    .3378     24.0
 14. Q14SCORE    .7500    .4423     24.0
 15. Q15SCORE    .6250    .4945     24.0
 16. Q16SCORE    .5417    .5090     24.0
 17. Q17SCORE    .5000    .5108     24.0
 18. Q18SCORE    .2500    .4423     24.0
 19. Q19SCORE    .6250    .4945     24.0
 20. Q20SCORE    .9167    .2823     24.0
 21. Q21SCORE    .7917    .4149     24.0
 22. Q22SCORE    .7500    .4423     24.0
 23. Q23SCORE    .7500    .4423     24.0
 24. Q24SCORE    .8333    .3807     24.0
 25. Q25SCORE    .8750    .3378     24.0
 26. Q26SCORE    .6667    .4815     24.0
 27. Q27SCORE    .5833    .5036     24.0
 28. Q28SCORE    .5000    .5108     24.0
 29. Q29SCORE    .9167    .2823     24.0
 30. Q30SCORE    .6667    .4815     24.0
 31. Q31SCORE    .9167    .2823     24.0
 32. Q32SCORE    .5000    .5108     24.0
 33. Q33SCORE    .9167    .2823     24.0
 34. Q34SCORE    .8333    .3807     24.0
 35. Q35SCORE    .9583    .2041     24.0

(The warning below will be printed whenever the number of variables exceeds the number of persons. Alpha is not affected.)

 * * * Warning * * *  Determinant of matrix is zero.
                      Statistics based on the inverse matrix for scale
                      ALPHA are meaningless and printed as "."

N of Cases = 24.0

                                                     N of
Statistics for     Mean     Variance   Std Dev    Variables
      Scale      25.1667    28.7536     5.3622        35

Inter-item
Correlations     Mean    Minimum   Maximum    Range    Max/Min   Variance
                .0972    -.3780     .7977    1.1757   -2.1106     .0454

Item-total Statistics

             Scale      Scale       Corrected
             Mean       Variance    Item-        Squared      Alpha
             if Item    if Item     Total        Multiple     if Item
             Deleted    Deleted     Correlation  Correlation  Deleted
Q1SCORE     24.3333    27.6232      .2463        .            .8050
Q2SCORE     24.9167    26.0797      .5486        .            .7937
Q3SCORE     24.4583    26.6938      .3844        .            .7999
Q4SCORE     24.2500    27.9348      .2477        .            .8051
Q5SCORE     24.3750    26.3315      .5285        .            .7950
Q6SCORE     24.5417    25.4764      .6075        .            .7901
Q7SCORE     24.4167    28.2536      .0647        .            .8119
Q8SCORE     24.6250    27.7228      .1440        .            .8101
Q9SCORE     24.5417    25.5634      .5890        .            .7909
Q10SCORE    24.2083    27.9982      .3304        .            .8042
Q11SCORE    24.2917    28.5634      .0211        .            .8113
Q12SCORE    24.4167    27.0362      .3308        .            .8021
Q13SCORE    24.2917    27.1721      .4166        .            .8001
Q14SCORE    24.4167    26.5145      .4486        .            .7976
Q15SCORE    24.5417    25.6504      .5707        .            .7917
Q16SCORE    24.6250    28.1576      .0624        .            .8135
Q17SCORE    24.6667    26.1449      .4495        .            .7969
Q18SCORE    24.9167    26.9493      .3503        .            .8013
Q19SCORE    24.5417    25.8243      .5342        .            .7933
Q20SCORE    24.2500    28.1087      .1888        .            .8065
Q21SCORE    24.3750    27.0272      .3604        .            .8011
Q22SCORE    24.4167    27.2101      .2921        .            .8035
Q23SCORE    24.4167    27.3841      .2536        .            .8050
Q24SCORE    24.3333    28.1449      .1148        .            .8092
Q25SCORE    24.2917    27.1721      .4166        .            .8001
Q26SCORE    24.5000    26.9565      .3130        .            .8028
Q27SCORE    24.5833    27.4710      .1949        .            .8079
Q28SCORE    24.6667    27.1884      .2449        .            .8058
Q29SCORE    24.2500    28.6304      .0144        .            .8106
Q30SCORE    24.5000    27.1304      .2774        .            .8042
Q31SCORE    24.2500    28.1087      .1888        .            .8065
Q32SCORE    24.6667    26.8406      .3122        .            .8029
Q33SCORE    24.2500    30.0217     -.4356        .            .8208
Q34SCORE    24.3333    27.0145      .4028        .            .7999
Q35SCORE    24.2083    28.9547     -.1105        .            .8117

Reliability Coefficients    35 items

Alpha = .8080          Standardized item alpha = .7902

The logic behind coefficient alpha

Coefficient alpha is based on a premise that originated with the use of parallel-forms estimates and continued with the use of the Spearman-Brown split-half estimate: if different tests, or different parts of a test, correlate highly with each other, then they would be likely to correlate highly with themselves at some different time. That's because, according to classical test theory, such items have small errors of measurement; T is much bigger than E for each item.

Coefficient alpha is best for instruments with the following characteristics . . .
1. The test is comprised of multiple items.
2. Items have essentially equal means and standard deviations and are unidimensional.
3. All items fit classical test theory with the same T value.

Relationships between Test-retest and Coefficient Alpha

Here are test-retest reliability and coefficient alpha values for four common measures of affect. Sample size is 1195. All participants responded to each scale twice – once when completing a "HEXACO" Sona project, the other when completing a "NEO" Sona project.

Scale                                 Test-retest   Alpha 1   Alpha 2
Rosenberg Self-esteem Scale           .843          .912      .910
Costello & Comrey Depression scale    .829          .947      .946
Watson PANAS PA scale                 .760          .895      .887
Watson PANAS NA scale                 .786          .890      .870

Note that the coefficient alpha estimates of reliability are all slightly larger than the test-retest estimates involving the same variables.
This is common.

Which estimate of reliability should you use?

If you can compute and defend your method, a test-retest estimate is probably the most defensible. If your instrument meets the assumptions listed above for alpha, alpha should be reported. Parallel-forms and split-half estimates are alternative estimates in special circumstances. You must report SOME estimate of reliability.

Acceptable Reliability

How high should reliability be? How tall is tall? How tall should you be to play in the NBA?

Some very general guidelines:

Reliability Range   Characterization
0 - .6              Poor
.6 - .7             Marginally Acceptable
.7 - .8             Acceptable
.8 - .9             Good
.9+                 Very Good
.95+                Too good – this IS psychology, after all.

Factors affecting estimates of reliability and their relationships to population reliability

There are at least three major factors that will affect the relationship of a reliability estimate to the true reliability of the test in the population in which the test will be used. Let's call the sample of persons upon whom the reliability estimate is based the reliability sample.

1. Variability of the people in the reliability sample relative to variability of people in the population in which the instrument will be used.

If the reliability sample is more homogeneous than the population on which the test will be used, the reliability estimate from the sample will be smaller than the true reliability for the whole population. On the assumption that you want to report as high a reliability coefficient as possible, this suggests that you should make the sample from whom you obtain the estimate of reliability as heterogeneous as possible. The sample should definitely be at least as variable as the population.

2. Errors of measurement specific to the reliability sample.

Guessing is represented in Classical Test Theory by large errors of measurement. Suppose the test requires the reading level of a college graduate and will be used with college graduates, but you include persons not in college in the reliability sample. This means that some of the people won't understand some of the items and will guess. So test characteristics such as inappropriate reading level or poor wording that cause large errors of measurement reduce reliability and estimates of reliability.

3. Consistency of the people making up the reliability sample.

The specific people making up the sample may contribute to the errors of measurement referred to in 2 above. Some people are more careless (?) inconsistent (?) than others. If the reliability sample is composed of a bunch of careless respondents, the reliability estimates will be smaller than if the reliability sample were composed of consistent responders.

Reddock, Biderman, & Nguyen (International Journal of Selection and Assessment, 2011) split a sample into two groups based on the variability of their responses to items within the same Big Five dimension. Here are the coefficient alpha reliability estimates from the two groups . . .

Group               Extraversion  Agreeableness  Conscientiousness  Stability  Openness
Consistent Group    .92           .83            .84                .90        .85
Inconsistent Group  .85           .69            .79                .83        .76

So the bottom line is that the more consistent the respondents in the reliability sample, the higher the estimate of reliability.

4. For multiple-item tests, the average of the interitem correlations determines reliability estimates.

The greater the mean of the interitem correlations, the higher the reliability estimate, all other things, e.g., K, being equal. Recall . . .

            K * r-bar
Alpha = -------------------
         1 + (K-1) * r-bar

Trust the mathematicians – alpha gets bigger as r-bar gets bigger.

5. For multiple-item tests, the number of items making up the test affects reliability.
As K increases, alpha increases. Longer tests have higher reliability, all other things, e.g., r-bar, being equal.

[Figure: Alpha = (K * r-bar) / (1 + (K-1) * r-bar) plotted as a function of K for three different values of r-bar. Y-axis: alpha; X-axis: K, the number of items. Note that the point of diminishing returns is at about K equal 7 or 8. After that, each additional item contributes less and less to alpha.]

So, a 5-item scale with mean of interitem correlations equal to .3 would have reliability = .68. Adding 5 items to make a 10-item scale would increase reliability to .81. A 5-item scale with mean of interitem correlations equal to .5 would have reliability = .83. Adding 5 items to make a 10-item scale would increase reliability to .91.

So, heterogeneous samples, items understandable by all respondents, consistent responders, closely related items, and lots of items lead to high reliability estimates.

Why be concerned with reliability? The Reliability Ceiling (Start here on 1/23/18)

Goal of research: To find relationships (significant correlations) between independent and dependent variables.
By the way, a significant difference between means counts as a significant relationship. If we find significant correlations, our work is lauded, published, rewarded. If we don't find significant correlations, our work is round-filed, and we end up homeless. So, most of the time we want large correlations between the measures we administer.

Basic Question: Of all the tests out there, with which test will your test correlate most highly?

The answer is that any test will correlate most highly with itself. It cannot correlate more highly with any other test than it does with itself. And reliability is the extent to which a test would be expected to correlate with itself on two administrations. So test reliability is the best you can do if you're looking for large correlations.

If reliability is low, that means that a test won't even correlate highly with itself. If a test won't correlate with itself, how could we expect it to correlate highly with any other test? And the answer is: we couldn't. If a test can't be expected to correlate highly with itself, it can't be expected to correlate highly with any other test.

The fact that the reliability of a test limits its ability to correlate with other tests is called the reliability ceiling associated with a test.

Reliability Ceiling Formula

Suppose X is the independent variable and Y is the dependent variable in the relationship being tested. Let rXX' and rYY' be the true reliability of X and Y respectively. Let rTxTy be the correlation between the true scores on the X dimension and true scores on the Y dimension. Then

rXY <= rTxTy * sqrt(rXX' * rYY')

The correlation between observed X and Y scores can be expected to be no higher than the true correlation between X and Y times the square root of the product of the two reliabilities.
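The ceiling formula can be made concrete with a quick numeric sketch. The reliabilities and true correlation below are made-up illustration values, not data from the lecture.

```python
import math

# Hypothetical values for illustration only:
r_txty = 0.50  # correlation between true scores on X and Y
r_xx = 0.70    # reliability of X
r_yy = 0.80    # reliability of Y

# The observed correlation can be expected to be no higher than this ceiling.
ceiling = r_txty * math.sqrt(r_xx * r_yy)
print(round(ceiling, 3))  # -> 0.374
```

Even with a true correlation of .50, the observed correlation is expected to top out around .37 given these reliabilities.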
Unless reliabilities are 1, this means that the observed correlation is expected to be less than the true correlation. This means that low reliability is associated with smaller chances of significant correlations. Let's all hate low reliability. Something else we can agree on.

Turning the Reliability Ceiling Around – Estimating rTxTy

If rXY <= rTxTy * sqrt(rXX' * rYY'), then, using the algebra you learned in 7th grade . . .

                rXY
rTxTy >= --------------------
          sqrt(rXX' * rYY')

If the reliabilities of two tests are known, then a "purer" estimate of the correlation between true scores can be obtained by dividing the observed correlation by the square root of the product of the reliabilities.

So what? Estimates of the true correlation are more consistent across different scales of the same construct than are observed correlations. Estimates of the true correlation give us a better perspective on how related or unrelated different scales are.

Industrial-Organizational Psych Example

Are job satisfaction and job commitment different characteristics?

Le, H., Schmidt, F. L., Harter, J. K., & Lauver, K. J. (2010). The problem of empirical redundancy of constructs in organizational research: An empirical investigation. Organizational Behavior and Human Decision Processes, 112, 112-125.

Question asked in this research: Are satisfaction and commitment different constructs?

Le et al. correlated Satisfaction with Commitment. The correlation was rXY = .72 averaged over two measurement periods (Table 1, p. 120). This is pretty high, but not so high as to cause us to treat them as the same construct. But after adjusting for reliability of the measures using the above formula, the true correlation adjusted for unreliability = .90.

This suggests that the two constructs – job satisfaction and job commitment – are essentially identical. Even though the questionnaires seem different to us, they're responded to essentially identically by respondents. Le et al.
(2010) argued that the Satisfaction literature and the Commitment literature may be redundant.

Affect Example from UTC research

Many consider there to be two separate types of affect. Self-esteem is typically viewed as a type of positive affect. Depression is typically viewed as a type of negative affect. For many people the two types of affect are viewed as distinct characteristics.

Here is the observed correlation, rXY, of Rosenberg Self-esteem scale scores (Rosenberg, 1965) with Costello and Comrey Depression scores: -.805.

The test-retest reliability of the Rosenberg scale is .843. (See above, p. 22)
The test-retest reliability of the Depression scale is .829. (See above, p. 22)

The estimate of the true-score correlation is

             rXY               -.805            -.805
rTxTy = -------------- = ------------------- = ------- = -.963
        sqrt(rXX'*rYY')   sqrt(.843 * .829)     .836

If self-esteem and depression were truly distinct and independent characteristics, the correlation between the two would be zero: 0. If self-esteem and depression were different views of the same affective state, the adjusted correlation would be -1.00. It's certainly not 0. And it's not -1, but it's getting very, very close to -1.

This suggests that these two constructs, which are typically measured separately and are parts of very different literatures, are in fact very highly related to each other, so much so that they must certainly involve many of the same underlying biological structures. The bottom line is that if you're going to study the self-esteem literature, you will probably benefit from reviewing the depression literature, and vice versa.

Reasons for nonsignificant correlations between independent and dependent variables

We've now covered three reasons for failure to achieve statistically significant correlations.

1.
Sucky Theory: The correlation between true scores, rTxTy, is 0, i.e., X and Y really are not related to each other. This means that our theory, which predicts a relationship between X and Y, is wrong. We must revise our thinking about the X-Y relationship. From a methodological point of view, this is the only excusable reason for a nonsignificant result.

2. Low power (from last semester).

There could really be a relationship between True X and True Y, i.e., rTxTy is different from 0, but our sample size is too small for our statistical test to detect it. This is inexcusable. (Ring, ring. "Hello, who is this?" "This is Arnold Schwarzenegger. You're terminated!!") We should always have sufficient sample size to detect the relationship we expect to find.

3. Low reliability (new – from this semester).

There could really be a relationship between True X and True Y, i.e., rTxTy is different from 0, but our measures of X and Y are so unreliable that even though True X and True Y may be correlated, the observed correlation is not significant. This is inexcusable. (Ring, ring. "Hello. Wait, I only just learned about reliability." "OK, but if this happens next week, you're fired!") We should always have measures sufficiently reliable to allow us to detect the relationship we expect to find.

The above is a good candidate for an essay test question or a couple of multiple choice questions.

Introduction to Path Diagrams

Symbols

Observed variables are symbolized by squares or rectangles.

[Diagram: a rectangle labeled "Observed Variable", shown with example scores 103, 84, 121, 76, . . ., 97, 81]

Theoretical constructs, also called latent variables, are symbolized by circles or ellipses.

[Diagram: an ellipse labeled "Latent Variable / Theoretical Construct", shown with example scores 106, 78, 115, 80, . . .,
93, 83]

Correlations between variables are represented by double-headed arrows.

[Diagram: pairs of observed variables (rectangles) and latent variables (ellipses) connected by double-headed "correlation" arrows]

"Causal" or "predictive" relationships between variables are represented by single-headed arrows.

[Diagram: rectangles and ellipses connected by single-headed "causal" arrows, e.g., a latent variable pointing to an observed variable]

Representation of Classical Test Theory

In equation form: Observed Score = True Score + Error of Measurement

Xi = T + Ei

[Diagram: T (True Scores) and E (Errors of Measurement) each send a single-headed "causal" arrow to X (Observed Scores)]

That is, every observed score is the sum of a person's true position on the dimension of interest plus whatever error occurred in the process of measuring. The relationship between T and X is one in which the observed score is said to be a reflective indicator of the true amount.

In terms of the labels of the diagrams . . .
[Diagram: the latent variables True Score and Error of Measurement each send a single-headed "causal" arrow to the rectangle Observed Variable; the error values -3, +6, +6, -4, . . ., +4, -2 combine with true scores to produce the observed scores]

The relationship of observed correlations to true-score correlations using path notation

Symbolically, rXY <= rTxTy * sqrt(rXX' * rYY')

[Diagram: What we observe – the observed correlation rXY between Wonderlic scores and GPAs, each also influenced by its own Error of Measurement. What we want – the true correlation rXY between the latent variables Intelligence and Academic Ability.]

Example of how reliability affects estimates of rXY

From a recent study in which Intelligence was measured by the paper-and-pencil Wonderlic and Academic Ability was measured by academic-record GPAs taken from Banner: rXY = 0.299.

From a more recent study in which Intelligence was measured by an unproctored web-based short form of the Wonderlic and Academic Ability was measured by self-reported GPA: rXY = 0.180.

Why is the 2nd correlation so much smaller than the first?
Perhaps because the unproctored web-based short form of the Wonderlic is less reliable than the paper-and-pencil form.
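The correction for attenuation used in the examples above can be sketched as a one-line function. The function name is my own; the numbers are the self-esteem/depression values from this handout.

```python
import math

def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimate the correlation between true scores by dividing the
    observed correlation by the square root of the product of the
    two reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Self-esteem vs. depression example from above:
r_txty = disattenuate(-0.805, 0.843, 0.829)
print(round(r_txty, 3))  # -> -0.963
```

The Wonderlic/GPA comparison is the same idea running in the other direction: the lower the reliability of either measure, the further the observed correlation is dragged below the true correlation.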

