


NATIONAL RESEARCH AND DEVELOPMENT CENTRE FOR ADULT LITERACY AND NUMERACY

The Skills for Life Survey: a National Needs and Impact Survey of Literacy, Numeracy and ICT Skills

DRAFT RESPONSE

FOR DISCUSSION AND DEVELOPMENT

Introduction

1. NRDC has been asked by DfES to offer comment on the Skills for Life national needs survey, which was published in October 2003. Prior to this, NRDC produced a paper in June 2003, “Taking Part in international adult literacy and numeracy surveys: the issues”, which analysed the benefits and issues relating to engagement in international adult literacy and numeracy surveys, and which was intended to inform the DfES’s decision about the UK’s possible involvement in the OECD’s Adult Literacy and Lifeskills (ALL) survey. The DfES is currently considering whether to take the process a stage further by attempting a conceptual mapping of the Skills for Life Survey onto the ALL survey. The proposed next stages are set out in paragraph 24 of this report.

2. NRDC’s view is that the attempt to assess the acquisition of literacy, numeracy and English language in the population as a whole is worthwhile in research terms largely because, as well as providing a broad outline of likely problems and needs, it supplies a rank ordering of the population on dimensions of literacy and numeracy performance. Such ordering is important for research in basic skills. It also helps to investigate the origins of low basic skills and their application, and provides an understanding of the labour market and other outcomes of low levels of basic skills acquisition in adults. National or international surveys repeated at intervals can also provide a picture of changes in basic skills in the population across time. In particular they allow an in-depth analysis of the relationships between literacy and social and personal factors. We make a recommendation about this at the end of this paper.

3. The survey confirms low levels of adult literacy and numeracy in England. We would caution, however, that such findings are heavily dependent on the precise way in which literacy and numeracy are defined. We would propose further research in this area, following the work carried out in the re-analysis of the IALS survey. Low levels of education and skills are clearly damaging to individuals, families, society and the economy. It is therefore important to address the outcomes of the research appropriately, both in policy development and in improving the provision of learning opportunities.

4. NRDC’s view is that the report is very clear in listing the issues which make comparing different surveys difficult, such as the use of different test items, different scoring processes, different age profiles, different populations, different reach (e.g. UK or England). These are listed on p.141, with additional information about the population provided earlier in the report, e.g. that people in institutions were excluded from the survey. Less information is available on the characteristics of non-respondents and the likely effect of a rather high non-response rate on the survey findings.

5. As discussed in some detail below, it is also the view of NRDC that the estimates given for performance in terms of levels must be treated with caution. A number of factors, relating to, for example, the lack of time to pilot items, and the adaptive approach used, make it possible that a re-run of the survey, using the same items and format, could result in very different estimates (which could imply ‘worse’ or ‘better’ performance). Small changes in the scoring approach could have the same outcome.

6. Given these differences, it is not wise to compare surveys. Nor would it be advisable to use the survey as an apparently precise baseline against which to measure the success of Skills for Life in the future. While the sensitivity of results to minor changes in sampling, item format and scoring is well known to assessment professionals, the general public (including media commentators) tends to ignore caveats when presented with neat comparisons of figures or diagrammatic presentations of them.

Technical Analysis

7. The Skills for Life Survey goes much further than the aims and rationale of research set out in paragraphs 2 and 3 above. It aims to establish performance at a number of levels matched to the national basic skills curricula. This involves identifying test items that will match curriculum content at a number of levels of increasing difficulty as defined by the national standards. Such an approach has a number of assumptions and procedures built into it, which raises questions about the meaning of the findings and the limits to the value of the exercise. A number of these are listed below.

a) In the view of most NRDC researchers, the ‘adaptive tests’ used (Appendices 2 and 3) assume that failure to pass test items at a particular curriculum level means that skills above that level cannot be achieved. This is at variance with the idea of “spiky profiles”, as discussed in chapter 9, which are particularly common in numeracy performance. Such a profile is evident when a respondent fails items at a lower level and yet can pass items at higher levels. The adaptive tests close off such performances except in a limited way: once a person’s level is determined on the basis of failure at a given level, further competence may be impossible to demonstrate. Notably, chapter 9 shows the interesting result that ‘spikiness’ increases the lower the overall level of performance. In other words, the lower a person’s location in terms of the test levels, the more likely they are to be inaccurately allocated to a level. Yet it is at the lowest levels that the ‘Skills for Life’ Strategy is directed, and where accuracy of allocation in curriculum terms might be thought most important.

b) One NRDC research partner pointed out that, although the matrix for the literacy test (p.222) does show several places where people being tested have the chance to move up a Level on the basis of success at the lower level (and the converse), and much the same is true of the ICT awareness items (p.249), it is unclear whether numeracy had an equivalent system. It was different from literacy, but the description of the numeracy system is hard to follow. It appears to mean that the Level boundaries were in effect set by inspection of borderline cases. This is a widespread procedure, but it is not adequately described here for readers to be sure how exactly it was done and whether, therefore, the process was sound.

c) Determining whether a respondent has passed a particular level depends to a certain extent on an arbitrary judgement that they have to have passed a certain proportion of items at that level. The particular cut-off used in the national test is approximately 70% of items passed, but it could just as easily be 60% or 80%. Failure, which determines dropping to a lower level, is indicated when the respondent passes 30% or fewer of the items; again, it could just as easily be 20% or 40%. The importance of these cut-offs is twofold (a simple numerical illustration is given after item e below):

i. They determine whether a person progresses further up (or down) the levels of the test,

ii. They can have a major impact on the distribution of respondents across the levels. Tom Sticht quotes evidence relating to IALS that a shift from 80% to 70% of ‘level 1’ items passed produced a drop from 20% to 15% in the percentage classified as ‘level 1’. Thus the reported distribution, which may then be placed in international league tables for comparison, has a highly arbitrary aspect to it. (In fact it would be quite easy to fix the distribution to match a wide range of desired outcomes.)

d) The effective construction of an educational test is based on the results of item analysis. Such item analysis depends on determining the pass rate for each test item, which then gives the opportunity for rank ordering the items in terms of the ‘facility’ (or its obverse, ‘difficulty’) index. Item analysis also gives a ‘discrimination index’ – basically a measure of internal consistency among the test items, that is to say, the extent to which each item is aligned with a single overall dimension of achievement underlying the test. The Skills for Life Survey report, however, gives no item analysis information at item level, only facilities for each of the basic skills levels computed from all the individual items making up the level. We thus have no way of knowing to what extent the items are genuinely working together in determining performance (comparable facilities and discrimination indices) at each level. All we do know is that, on average, across all items, the levels broadly correspond to a dimension of difficulty, i.e. the average facility index is ranked broadly, as predicted, across the levels.

e) The reason why individual item information is not supplied is partly the nature of adaptive testing. As noted above, this assumes that if a person is not required to attempt a particular test item, it is either because the item is too easy, in which case it is assumed to be passed (score 1), or because it is too difficult, in which case it is assumed to be failed (score 0). Strong assumptions are thus made about the nature of performance, so that classical item analysis is not appropriate. Nevertheless, it is possible to study the reasonableness of this assumption, and we think that this should have been attempted; it is probably best done in a pilot study.
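To make the points in c) and d) above concrete, the sketch below uses simulated item responses - not the survey’s items, scoring rules or data, all of which are assumptions made purely for illustration - to show how facility and discrimination indices are computed in a classical item analysis, and how sensitive the proportion classified at a level is to the choice of pass cut-off.

```python
# Illustrative sketch only: simulated responses, not the survey's items or data.
# (i) Classical item analysis: facility (pass rate) and discrimination (item-rest
#     correlation) for each item.
# (ii) Cut-off sensitivity: how the share classified 'at the level' moves as the
#     pass threshold changes.
import numpy as np

rng = np.random.default_rng(0)
n_respondents, n_items = 1000, 10

# Assumed underlying ability and item difficulties (a simple logistic model).
ability = rng.normal(0.0, 1.0, n_respondents)
difficulty = np.linspace(-1.5, 1.5, n_items)
p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_respondents, n_items)) < p_correct).astype(int)  # 1 = pass

# Facility index: proportion of respondents passing each item.
facility = responses.mean(axis=0)

# Discrimination index: correlation of each item with the total on the other items.
total = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1] for j in range(n_items)
])
print("facility:      ", np.round(facility, 2))
print("discrimination:", np.round(discrimination, 2))

# Cut-off sensitivity: proportion reaching the level under different pass thresholds.
for cutoff in (0.6, 0.7, 0.8):
    share = (responses.mean(axis=1) >= cutoff).mean()
    print(f"pass cut-off {cutoff:.0%}: {share:.1%} classified at or above the level")
```

Even with these well-behaved simulated items, moving the pass cut-off from 70% to 80% of items correct visibly reduces the proportion classified at the level - the same kind of shift Sticht reports for IALS - and it is precisely this item-level information (facility and discrimination) that the survey report does not provide.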

8. A response rate of 60% is low by the standards of national surveys, but need not be a problem if those not participating are similar to those who do, and are not therefore biasing the distribution of performance in one direction or another. However, with a survey of basic skills, as has been argued in the international comparisons, there are two kinds of potentially biasing non-response, both arising from the nature of the task that is the focus of the survey. First, people who are very poor at literacy, and especially numeracy, may refuse to take part once they know the nature of the survey, or they may refuse to do part of one of the tests: some will do a literacy test, for example, but refuse to do a numeracy test. (Are they allocated the lowest score on the test or are they excluded from the distribution?) The second form of potentially biasing non-response relates to respondents’ belief that the assessment is a waste of time because they have no basic skills problems. Well-educated respondents may refuse the literacy test because they think the task is beneath them, consequently biasing the distribution in the direction of the poor performers. This is considered to be one of the reasons behind the poor performance exhibited in the early results for France in IALS.

The rate of refusal is one significant element of the response rate, and the refusal rate in the Skills for Life survey is considered to be high. If people have refused on the grounds that the tests are too easy, the sample is potentially distorted towards those whose skills are lower; if refusal is on grounds of inability to do the tests, this would bias the results upwards. Most NRDC commentators feel that, on the basis of past surveys, refusals are more likely to bias the sample downwards, because respondents do not feel that literacy tests in particular are relevant to them.
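A minimal numerical sketch of this mechanism is given below; the level shares and response rates are invented for illustration and are not estimates from the survey.

```python
# Illustrative sketch only: the level shares and response rates below are invented,
# not figures from the survey. It shows how differential refusal can shift the
# observed skills distribution away from the true one.
true_shares = {"Entry level or below": 0.20, "Level 1": 0.40, "Level 2 or above": 0.40}

# Assumed response rates: here the better-skilled are more likely to refuse a test
# they regard as beneath them, so low-skill groups end up over-represented.
response_rates = {"Entry level or below": 0.65, "Level 1": 0.62, "Level 2 or above": 0.50}

responding = {k: true_shares[k] * response_rates[k] for k in true_shares}
total = sum(responding.values())
observed = {k: share / total for k, share in responding.items()}

for level in true_shares:
    print(f"{level}: true {true_shares[level]:.0%}, observed among respondents {observed[level]:.1%}")
```

With these assumed figures the lowest-skill group is over-represented among respondents, so the survey would overstate the extent of poor skills; reversing the assumed refusal pattern produces the opposite bias.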

9. Chapter 10 of the report supplies comparisons between a number of basic skills surveys, based on the performance of individuals categorised in terms of the basic skills standards. The results show, for example, that IALS indicates poorer performance at entry level than do the other surveys. However, the linking across surveys cannot be justified because, in the case of those carried out in the cohort studies, for example, there was no real attempt to classify cohort members in terms of levels - simply to split the distribution of their scores into those who were ‘very poor’, ‘poor’, ‘average’ or ‘good’ on the particular skill assessed. Cut-offs in this case were based on natural breaks in the distribution of scores, as revealed by associations between location in the categories determined by the cut-offs and other attributes such as poor examination performance, poor functioning in the labour market, signs of social exclusion and so on. The cut-offs may therefore be seen as delineating socially meaningful and useful categories.

10. As noted above, the National Survey approach to testing attempts to locate individuals at different levels of curriculum content. However, to do this effectively would require a substantial number of items at each level - far more, for example, than are currently involved in the adaptive test in the National Survey. The ‘reliability of a test’, from which the ‘error of measurement’ (or rate of misclassification) can be derived, is a function of the number of items used to assess each level. The handful of items used for each level in the national test is capable of producing only a crude classification. Therefore the idea of the test as an instrument for the effective assessment of needs, at the level of detail required by a national curriculum, is questionable in research terms.
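The dependence of reliability on test length can be illustrated with the standard Spearman-Brown prophecy formula; the reliability figure used below is hypothetical, not an estimate from the survey:

\[
\rho_k \;=\; \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}
\]

where \(\rho_1\) is the reliability of the existing items at a level and \(k\) is the factor by which the test is lengthened with comparable items. If, say, the handful of items at one level achieved a reliability of 0.60, doubling their number (k = 2) would be expected to raise reliability only to 2(0.60)/(1 + 0.60) = 0.75, and the associated standard error of measurement, \(\mathrm{SEM} = \sigma\sqrt{1-\rho}\), would still allow appreciable misclassification near level boundaries.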

11. There is a lack of consensus about precisely what surveys, including the Skills for Life survey, are actually measuring. The use of what are known as ‘real world’ tasks recognises that literacy, language and numeracy practices are embedded in the contexts in which they take place and may often be invisible, or not recognised as numeracy or literacy. This makes it difficult to establish precisely what adults can and cannot do, and may be particularly true of numeracy. In his response to the Skills for Life Survey, Sticht has argued that the use of ‘real world’ tasks in a test situation actually tests a number of skills in the same exercise, which makes it problematic to establish precisely which skills were performed correctly or incorrectly. Sticht also argues that the ‘real world’ tasks approach exposes the lack of a ‘clearly specified theory of literacy’, for example ‘literacy as a psychological construct’. Without such underpinning knowledge it is difficult to differentiate in survey testing between skills of literacy and numeracy and other abilities such as ‘problem solving, reasoning… management of test-taking anxiety, interpersonal skills, motivation, impulsivity, etc.’ (Sticht, 2003).

12. The design of test items also raises issues about the extent to which questions are abstract or contextualised, and about the media through which they are presented (spoken or written words; numbers; pictures; photographs; etc.). The way questions are set or asked introduces further complexity. For example, a multiple-choice format could yield different results from short-answer versions of the same question: the former allows for guessing but the latter does not, suggesting that results from multiple-choice items might need to be adjusted for guessing in the survey findings. The report does not mention the need to do this, or whether it was in fact done. Furthermore, different groups of the population may be more or less familiar with particular contexts, thereby introducing further relative biases.
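For reference, one conventional correction for guessing (a standard textbook formula, not necessarily the one that would be appropriate for these tests) scores a test of k-option multiple-choice items as

\[
S \;=\; R - \frac{W}{k-1}
\]

where \(R\) is the number of items answered correctly and \(W\) the number answered incorrectly (omitted items are ignored). For example, with four-option items a respondent with 14 correct and 6 incorrect answers would receive a corrected score of 14 − 6/3 = 12 rather than 14.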

13. It is known from a wide variety of research not only that apparently similar questions often differ greatly in their difficulty, but also that the adult population is very heterogeneous in terms of the sort of context effects that make questions harder or easier for a respondent. An important way of ensuring that these issues are addressed in the final research design and test items is to pilot questions for a national sample very carefully with the sorts of populations the survey will be used with. It is not clear to what extent the items for the Skills for Life survey were piloted. P.15 and pp.226-7 refer to piloting literacy items and p.233 mentions numeracy piloting; no mention could be found of piloting ICT items (either set). However, it is our view that the piloting of literacy items was not sufficient, was on too small a scale and did not really allow time for revision.

There is a question about revalidation, as the Level 1 and 2 literacy and numeracy items were taken from the Key Skills tests, and in the process adapted from paper-based to computer-based format. Also, some Entry level items were adaptations of Level 1 items from the Key Skills tests – hence these items were doubly adapted, not only from paper-based to computer-based format, but also from one level to another. It is our view that these items should have been put through extensive trialling, but the report calls the adapted Level 1 and 2 items ‘tried and tested’ (p.14) – which they were not, in their new format – and for the literacy items states that ‘it was assumed that all Level 1 and 2 items were valid and reliable’ (p.223).

14. The tests used were those initially developed for Key Skills. Piloting the items would therefore be particularly important with the bulk of respondents, namely those not currently in learning provision, as distinct from those in education or training and taking Key Skills tests.

15. In view of this, it seems undesirable to express the results in terms of levels, or in terms of comparison with IALS results. The presentation of the Survey in the public domain initially implied that population improvements had happened without any formal learning taking place, which could potentially undermine the Skills for Life strategy.

16. On numeracy, we do not think it advisable to suggest comparatively that the results are ‘worse’ than for literacy, because one is not comparing like with like in any absolute sense. Numeracy items are in our view more context sensitive than literacy ones and are often ‘harder’ as they can combine difficult reading tasks with using numerical skills at the same time. It makes more sense to discuss the results in terms of a lot of people not being able to do tasks which informed observers and stakeholders think they should be able to do in our society, as citizens and workers, rather than discussing numeracy skills in terms of being ‘worse’ than literacy.

17. The estimates for very low numeracy (except for that from Older and Younger) appear to cluster fairly well, in the range 20%-23% - though within that group the Needs survey estimate is supposedly based on a higher criterion. This set of estimates for very low numeracy also overlaps slightly with the bottom end of the range of estimates for low numeracy, which are oddly scattered, 23%-57%. Also odd is the fact that the criterion for BCS70, Older and Younger, NCDS5 and BSA (first version) is supposedly lower than that for IALS and BSA (second version), yet the supposedly lower criterion produces higher estimates, 33%-46% vs 23%-24%. The most probable explanation is shifting application of the criteria. In these circumstances, there seems to be little justification for mapping the numeracy section of this survey to IALS differently from the literacy section (pp. 140-1).

18. In fact, over the years there has clearly been a raising of what it is reasonable to expect of ‘competent’ adults. The Watts-Vernon test used in the 1972 sweep of NSHD would now seem much too simple as a measure of very low literacy attainment, and within numeracy the same is true of the very simple arithmetic test used in the ACACE survey of 1981. The same process is probably at work more recently, but is not yet so visible.

19. The survey, like previous surveys, does not investigate writing skills. Spelling skills are tested reactively, but the tests do not include the generation of text. Since writing skills, increasingly through the medium of ICT, are required in work settings as well as in daily life, they are a critical set of vocational and social skills. Instruments have been developed for research purposes, and NRDC would argue strongly that priority be given to developing appropriate instruments for assessing writing skills, so that all future surveys can give proper weight to writing skills assessment.

20. While the Skills for Life survey reveals a serious and persistent problem of adult innumeracy, there is still no consensus about what surveys should measure or how best to measure it. Numeracy is a deeply contested concept, and conceptions of it have varied between successive surveys.

Whatever the definition of numeracy used, it is difficult to establish what adults can do, rather than to demonstrate what they cannot do under test conditions.

The addition of language to Bynner and Parsons’ typology gives an indication of the further complexity involved when language deficit and competence are taken into account.

21. The inconsistency between the survey findings and the participants’ self-assessment of literacy and numeracy is a key area of concern. It suggests that even if their skills are as low as the survey findings propose, they are unlikely to be easily persuaded in large numbers of the need to do anything about it. There is a major challenge in motivating learners, particularly in numeracy.

22. Noss (1997:5) points out that “mathematics is not always visible, it lies beneath the surface of practices and cultures”, making it easy to miss. This may go some way to account for the paradox that most people seem to manage the mathematical demands of their lives relatively well, while surveys show large numbers with weak skills. Many adults also feel great anxiety about mathematics (Buxton, 1981), evidenced, for example, by the high refusal rate in a survey for the Basic Skills Agency (BSA, 1997:20). Anxiety is likely to be exacerbated under test conditions, depressing performance in numeracy surveys.

23. The survey confirms that little has changed in the relationship between social class and skills levels. This problem is deep-rooted and there is much evidence that it can only be tackled intergenerationally and across a wide spectrum of policy. Education can play a huge role, but cannot alone address issues of policy, social exclusion and inequality. The implication for policy development is a strong focus on long-term change in the early years, schooling and family learning, as well as a continued drive on Skills for Life. The lowest achievers and least motivated groups could be high priorities for the 2004-2007 period.

Proposed next stages

24. It has been agreed with DfES that:

a) The DfES Analytical Services Basic Skills team will explore with NRDC how the latter can perform a conceptual comparison of the frameworks used in ALL and the Needs Survey. This is to check whether the conceptions of literacy and numeracy are similar enough to attempt putting the two tests on the same scales. NRDC will consult with Statistics Canada as necessary about how best to carry out this conceptual mapping, including comparison of items.

By conceptual mapping, we mean:

• Comparing the domain specifications of the two surveys – the Literacy and Numeracy Standards in the case of the Basic Skills Needs survey in England, and whatever specifications ETS, Statistics Canada, etc. used in the case of ALL.

• Comparing the items used in the two surveys to operationalise the domain specifications.

b) Once the conceptual mapping is complete, its outcome will be considered and if it suggests that the surveys, or elements of them (e.g. numeracy), can be linked convincingly then around 1,000 individuals who have taken the Needs Survey tests could be recontacted. They would then be asked to take the ALL survey. This might form part of the NRDC research programme in 2004/5.

c) The data would then be linked, possibly via something similar to the sophisticated software used to link PISA data to provincial reading performance standards for British Columbia (one standard linking technique is sketched after item e below). NRDC will be asked to consider doing this should the conceptual mapping prove promising.

d) The conceptual mapping of this survey and ALL will continue. Within this exercise, however, an expert panel should be asked to judge the validity of the Needs survey items and tests against the National Standards and Curricula.

e) The DfES might wish to commission a technical report from CDELL/BMRB covering all the gaps in reporting mentioned in paragraph 25. Only when this is available will it be possible to take a fully considered view of the survey; in particular, until such a report is available it is impossible to judge adequately the survey’s reliability and validity, or its suitability as a baseline for possible future surveys.
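No particular linking method is implied in item c) above, and the PISA/British Columbia software is not described here. By way of illustration only, one standard technique for placing two tests on a common scale is equipercentile linking, sketched below with simulated scores; the scales and figures are assumptions for illustration, not a description of the software actually used.

```python
# Illustrative sketch only: equipercentile linking of two tests with simulated scores.
# This is one standard way of expressing scores from one instrument on the scale of
# another; it is not a description of the PISA/British Columbia linking software.
import numpy as np

rng = np.random.default_rng(1)

# Simulated scores for the ~1,000 respondents who would take both instruments.
needs_scores = rng.normal(50, 10, 1000)   # hypothetical Needs Survey scale
all_scores = rng.normal(270, 40, 1000)    # hypothetical ALL-style scale

def equipercentile_link(x, y, new_x):
    """Map values of x onto the scale of y by matching percentile ranks."""
    percentiles = np.linspace(0, 100, 101)
    x_q = np.percentile(x, percentiles)
    y_q = np.percentile(y, percentiles)
    # Find the percentile rank of new_x on the x scale, then read off the
    # y-scale score at the same percentile.
    rank = np.interp(new_x, x_q, percentiles)
    return np.interp(rank, percentiles, y_q)

# Example: a hypothetical Needs Survey score of 45 expressed on the ALL-style scale.
print(round(float(equipercentile_link(needs_scores, all_scores, 45)), 1))
```

In practice, of course, the adequacy of any such link would itself need to be evaluated, which is one reason the conceptual mapping in a) is a precondition for the exercise.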

25. Conclusions

• The survey has confirmed that a large number of adults in England have poorer literacy and numeracy skills than are desirable today for them as citizens, family members and employees, and for their own purposes.

• The ICT survey was highly innovative and should be built on in future work.

• The ‘literacy’ survey was almost entirely a reading survey, as no attempt was made to assess written composition.

• There are gaps and inadequacies in the technical reporting of the survey, namely:

➢ Reporting of the piloting.

➢ Analysis of the representativeness of the achieved sample against the target population.

➢ Analysis of the validity of the items (tabulating their target statements in the National Standards and Curricula provides no basis for judging appropriateness).

➢ A clear account of boundary-setting for numeracy.

➢ Mention of correction for guessing.

➢ Item statistics.

• The comparisons with previous surveys do not provide a sufficient perspective on the history of attempts to estimate adult literacy and numeracy levels. In particular, no account is taken of probable variability in the match between survey items and domain specifications.

26. Recommendations for future surveys

• The results should be made available for in-depth analysis, which would further the research aims outlined at the start.

• Further exploration of the survey might be carried out with a view to finding and understanding detailed relationships between literacy and social and personal factors.

• The Strategy Unit at DfES might commission this exploration, which could also include recommendations for further work. In particular, detailed consideration might be given to the use of adaptive testing, whether and when it is appropriate and what alternative methods might be used.

• The survey should not be re-run before a number of questions about its present form have been discussed and addressed.

• The DfES might wish to plan for a further survey in about 2007. However, the 2002-3 survey should only be used as a baseline for that survey (i.e. as the basis for comparisons over time) if tests used in 2002-03 are shown to be sufficiently valid, reliable and suitable for re-use. Otherwise, the 2007 survey could be mounted as a baseline.

• If a further survey is commissioned, it will need either completely new tests or enough new items to be aligned with the 2002-03 items (in order to replace those before they age too far). In either case, sufficient development time should be allowed. This would imply starting work in 2005.

• If work is to continue on the assessment of basic skills in the adult population, it might be useful to view the present baseline needs survey as a pilot for a soundly based series of baseline surveys. Although we may question allocation of individuals to curriculum levels in the way the report does, the survey itself potentially supplies a lot of information about performance on a range of basic skills tasks at different levels of difficulty, which could be used to establish more reliable and valid general measures for future use. This would involve gaining access to the whole dataset on which the report was based and, through the use of such techniques as confirmatory factor analysis, evaluating the psychometric properties (variable difficulties and strong discrimination, lack of gender bias and so on) of the test items.

• Apart from offering the opportunity to reconstruct the overall test scores on a sounder base, such data could also supply the basis for selecting items to form a new numeracy test and a new literacy test. Alongside this reconstruction a wider ‘item pool’ could also be constructed, and comprehensively piloted, to ensure that items with comparable properties are available for the construction of parallel tests. Such tests would abandon the ‘adaptive principle’ and instead use fairly liberal stopping rules - say, eight constructive tasks failed, in ascending order of difficulty, before the test terminates. All rules of this kind would be decided by an expert panel, which would have overall responsibility for the test design strategy.

Ursula Howard

22 January 2004

Contributors:

David Barton

Greg Brooks

John Bynner

Diana Coben

Harvey Goldstein

Ursula Howard

Alison Wolf
