HANDOUT ON RELIABILITY


Reliability refers to the consistency and stability of the results of a test or scale. A test is said to be reliable if it yields similar results in repeated administrations when the attribute being measured is believed not to have changed in the interval between measurements, even though the test may be administered by different people and alternative forms of the test are used. For example, if you weighed yourself twice in a row and the scale read 130 lbs. the first time and 140 lbs. the second, we would say that the scale was an unreliable measure of weight. In addition, to be reliable, an instrument or test must be confined to measuring a single construct along only one dimension. For example, if a questionnaire designed to measure anxiety simultaneously measured depression, the instrument would not be a reliable measure of anxiety. In short, a reliable instrument or test must meet two conditions: it must have small random error, and it must measure a single dimension.

One major source of inconsistency in test results is random measurement error. A primary concern of test developers and test users is therefore to determine the extent to which random measurement errors influence test performance. The classical true score model provides a useful theoretical framework for defining reliability and for the development of practical reliability investigations. In the classical true score model, an examinee's or subject's observed score on a particular test is viewed as a random sample of one of the many possible test scores that the person could have earned under repeated administrations of the same test; the observed score (X) is envisioned as the composite of two hypothetical components - a true score (T) and a random error component (E). T is defined as the expected value of the examinee's test scores over many repeated testings with the same test, and E is the discrepancy between an examinee's observed score and his/her true score. The following equation summarizes the relationship between X, T and E:

X = T + E

An important question which follows from the above is: how closely related are the examinees' true and observed scores on a particular test or instrument? Based on the classical true score model¹, two indices are derived to measure the relationship between true and observed scores.

1. Reliability coefficient - defined as the correlation between parallel measures². This coefficient (ρXX') can be shown to equal the ratio σ²T/σ²X, the proportion of observed score variance due to true score variance.

2. Reliability index - defined as the correlation between true and observed scores on a single measure (i.e. ρXT), which is equivalent to σT/σX.

¹ "X = T + E" is only one of the assumptions of the classical true score theory. Please consult texts on measurement/test theory for the other assumptions in the model, as well as for how the reliability coefficient and the reliability index are derived from the model.

² According to classical true score theory, two measures/tests are defined as parallel when 1) each examinee or subject has the same true score on both measures/tests, and 2) the error variances of the two measures/tests are equal. Based on this definition, it is sensible to assume that parallel tests are matched in content.

However, in reality, we rarely know the true scores. Moreover, the reliability coefficient defined above is a purely theoretical concept, because it is not possible to verify that two tests are truly parallel. The reliability of a test therefore has to be estimated using other methods.
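Although true scores cannot be observed in practice, the two indices are easy to verify in a simulation where the true scores are generated directly. The following sketch (not part of the original handout; all values are illustrative) simulates the model X = T + E for two parallel measures and checks both indices against their theoretical values:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 100_000                            # number of simulated examinees

sigma_T, sigma_E = 10.0, 5.0           # true-score and error standard deviations
T = rng.normal(50.0, sigma_T, n)       # true scores
X1 = T + rng.normal(0.0, sigma_E, n)   # observed scores on the first measure
X2 = T + rng.normal(0.0, sigma_E, n)   # parallel measure: same T, same error variance

# Theoretical reliability: sigma_T^2 / sigma_X^2 = 100 / 125 = 0.8
print(sigma_T**2 / (sigma_T**2 + sigma_E**2))

# Reliability coefficient: correlation between the parallel measures (about 0.8).
print(np.corrcoef(X1, X2)[0, 1])

# Reliability index: correlation between true and observed scores (about sqrt(0.8) = 0.894).
print(np.corrcoef(T, X1)[0, 1])
```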

Methods of Estimating Reliability:

The methods of estimating reliability can be roughly categorized into two groups: those that require two separate test administrations, and those that use a single test administration.

1. Methods Requiring Two Separate Test Administrations:

a. Test-Retest Method -

The test-retest method yields a reliability estimate, r12, based on testing the same examinees/subjects twice with the same test/scale and then correlating the results. If each examinee/subject receives exactly the same observed score on the second testing as he/she did on the first, and if there is some variance in the observed scores among examinees/subjects, then the correlation is 1.0, indicating perfect reliability. The correlation coefficient obtained from this test-retest procedure is called the coefficient of stability; it measures how consistently examinees/subjects respond to the test/scale at different times.
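Computationally, the coefficient of stability is simply a Pearson correlation between the two sets of observed scores. A minimal sketch in Python (the scores are hypothetical, for illustration only); the same computation yields the coefficient of equivalence in the alternate-forms method below when the second set of scores comes from the alternate form:

```python
import numpy as np

# Hypothetical scores of the same eight examinees on two administrations.
time1 = np.array([12, 15, 9, 20, 18, 14, 11, 17])
time2 = np.array([13, 14, 10, 19, 17, 15, 10, 18])

r12 = np.corrcoef(time1, time2)[0, 1]   # coefficient of stability
print(f"test-retest reliability estimate: {r12:.3f}")
```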

b. Alternate-Forms Method -

This method involves constructing two similar forms of a test/scale (i.e. both forms have the same content) and administering both forms to the same group of examinees within a very short time period. The correlation between observed scores on the two alternate test/scale forms (i.e. rXY, computed using the Pearson product moment formula) is an estimate of the reliability of either one of the alternate forms. This correlation coefficient is known as the coefficient of equivalence.

c. Test-Retest with Alternate Forms Method -

This method is a combination of the test-retest and alternate-forms methods. In this case, the procedure is to administer form 1 of the test/scale, wait, and then administer form 2. The correlation coefficient between the two sets of observed scores is an estimate of the reliability of either one of the alternate forms and is known as the coefficient of stability and equivalence.

2. Methods Using One Test Administration:

There are many situations in which a single form of a test/scale will be administered only once to a group of examinees/subjects. The following are methods of estimating reliability based on scores from a single test administration. These methods focus mainly on how consistently the examinees/subjects performed or scored across items, or subsets of items, on the single test/scale form. The reliability estimates generated by these methods are usually called coefficients of internal consistency.

These methods of estimating reliability are based on the argument that if the scores of the subjects/examinees are consistent across items or subsets of items on the single test/scale form, then it is reasonable to think that these items or subsets of items came from the same content domain and were constructed according to the same specifications. In addition, if the examinees/subjects' performance is consistent across subsets of items within a test/scale, the test/scale administrator can also have some confidence that this performance would generalize to other possible items in the content domain.

a. Reliability Estimates Based on Item Variances: Calculation of Cronbach's Alpha -

This is the most widely used method of estimating reliability using a single test administration. Cronbach's alpha (α) is calculated based on the following formula:

" = k / k -1 ( { 1 - E F2i } / F2x )

where k is the number of items on the test/scale, σ²i is the variance of item i, and σ²X is the total test variance. Cronbach's α can actually be conceived of as the average of all the possible split-half reliabilities (calculation of split-half reliabilities is discussed in the following section) estimated on the single test/scale. However, unlike the split-half methods, Cronbach's α is not affected by how the items are arranged in the test/scale.
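As a computational sketch (not part of the handout), the formula above translates directly into code; the examinee-by-item matrix below is hypothetical:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an examinee-by-item score matrix."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sigma^2_i for each item
    total_var = scores.sum(axis=1).var(ddof=1)   # sigma^2_X of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 examinees x 4 items.
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 4],
])
print(f"alpha = {cronbach_alpha(scores):.3f}")
```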

b. Split-Half Method -

Under this method, test/scale developers divide the test/scale into two halves, so that the first half forms the first part of the entire test/scale and the second half forms the remaining part. The two halves are normally of equal length and are designed in such a way that each is an alternate form of the other. Estimation of reliability is based on correlating the results of the two halves of the same test/scale. If the two halves of the test/scale are parallel forms of one another, the Spearman-Brown prophecy formula is used to estimate the reliability coefficient of the entire test/scale. The Spearman-Brown prophecy formula is:

ρXX' = 2ρYY' / (1 + ρYY')

where ρXX' is the reliability projected for the full-length test/scale, and ρYY' is the correlation between the half-tests. ρYY' is also an estimate of the reliability of a test/scale that contains the same number of items as the half-test.

If the two halves of the test/scale are not parallel, the reliability of the full-length test/scale is calculated using the formula for coefficient α for split halves:

" = 2 [ F2x - ( F2Y1 + F2 Y2) ] 1 / F2x

where σ²Y1 and σ²Y2 are the variances of scores on the two halves of the test, and σ²X is the variance of the scores on the whole test, with X = Y1 + Y2.
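Both split-half estimates are straightforward to compute by hand or in code. A sketch in Python under the same hypothetical-data caveat, showing the Spearman-Brown projection for parallel halves and coefficient α for non-parallel halves:

```python
import numpy as np

# Hypothetical half-test total scores for 8 examinees.
y1 = np.array([10, 7, 12, 5, 9, 11, 6, 8])   # first half
y2 = np.array([9, 8, 11, 6, 10, 12, 5, 7])   # second half

# Parallel halves: step the half-test correlation up with Spearman-Brown.
r_yy = np.corrcoef(y1, y2)[0, 1]
rho_xx = 2 * r_yy / (1 + r_yy)

# Non-parallel halves: coefficient alpha for split halves.
x = y1 + y2                                  # whole-test scores, X = Y1 + Y2
alpha = 2 * (x.var(ddof=1) - (y1.var(ddof=1) + y2.var(ddof=1))) / x.var(ddof=1)

print(f"Spearman-Brown estimate: {rho_xx:.3f}")
print(f"split-half alpha:        {alpha:.3f}")
```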

In the SPSS program, the "SPLIT-HALF" model for reliability analysis is conducted on the assumption that the two halves of the test/scale are parallel forms. Hence, when the halves are not parallel, coefficient α for split halves has to be obtained by hand calculation.

It must also be noted that the split-half reliability estimate is contingent upon how the items in the test/scale are arranged. Reordering and/or regrouping of the items in the test/scale can result in different reliability estimates under the split-half method. Hence, the reliability estimate obtained from the even/odd method (a similar method, described below) on the same test/scale will most likely differ from the estimate obtained using the split-half method.

c. Even/Odd Method -

The even/odd method is similar to the split-half method, except that the estimation of reliability for the entire test/scale is no longer based on correlating the first half of the test/scale with the second half; instead, it is based on correlating the even-numbered items with the odd-numbered items.
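Given an examinee-by-item score matrix, the even/odd split is only a small change from the split-half computation. A sketch (hypothetical data; the Spearman-Brown step-up assumes the two half-scores can be treated as parallel):

```python
import numpy as np

# Hypothetical data: 8 examinees x 6 items.
scores = np.array([
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
])

odd = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5 (0-based columns 0, 2, 4)
even = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6

r = np.corrcoef(odd, even)[0, 1]
rho_xx = 2 * r / (1 + r)             # Spearman-Brown step-up
print(f"even/odd reliability estimate: {rho_xx:.3f}")
```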

Determining Reliability Using SPSS:

Example 1:


The following illustrative example contains six items extracted from a scale used to measure adolescents' attitudes towards the use of physically aggressive behaviours in their daily lives. Each item in the scale refers to a situation where physically aggressive behaviour is or is not used. Adolescents are asked whether they agree or disagree with each item on the scale. Adolescents' responses to the items are converted to scores of either 1 or 0, where 0 represents endorsement of the use of physically aggressive behaviours and 1 represents disapproval of the use of physically aggressive behaviours. Below are the contents of the six items, as well as the scores of 14 adolescents on these six items:

Item No.  Content

1.  When there are conflicts, people won't listen to you unless you get physically aggressive.
2.  It is hard for me not to act aggressively if I am angry with someone.
3.  Physical aggression does not help to solve problems, it only makes situations worse.
4.  There is nothing wrong with a husband hitting his wife if she has an affair.
5.  Physical aggression is often needed to keep things under control.
6.  When someone makes me mad, I don't have to use physical aggression. I can think of other ways to express my anger.
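The 14 adolescents' score matrix is cut off in this copy of the handout. As a stand-in, the sketch below uses a hypothetical 0/1 response matrix of the same form, merely to show how such data would be checked for internal consistency; in SPSS, the corresponding analysis would be run with the RELIABILITY procedure (ALPHA model):

```python
import numpy as np

# Hypothetical 0/1 responses (NOT the original 14 adolescents' data, which is
# cut off in this copy): rows = adolescents, columns = items 1-6.
# 1 = disapproval of physically aggressive behaviour, 0 = endorsement.
responses = np.array([
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
])

k = responses.shape[1]
item_vars = responses.var(axis=0, ddof=1)
total_var = responses.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha for the six items: {alpha:.3f}")
```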
