Development of Non-Objective (Semi-Subjective) Assisted ...



TDF Evaluation Report (2004/05)

Project Title:

Review of CAA for both Objective and Subjective Assessment

Dr Peter Dawson

Lecturer in Economics

Department of Economics and International Development

Email: P.M.Dawson@bath.ac.uk

Phone: +44(0)1225 383074

Abstract / Background

Growing student numbers and advancements in technology have led many academic disciplines to consider alternative forms of assessment. Technology has been instrumental in improving the student learning experience through, for example, virtual learning environments such as Blackboard and Moodle. Technology can also be utilised to maintain the integrity and timing of feedback[1].

A significant area of development has been in the use of computer-assisted assessment (CAA). Today there are well over forty computer-based systems available on the market. CAA is well developed as a mechanism for objective testing (e.g. multiple-choice questions, completion questions, labelling and building questions and true/false questions) and in the collation and analysis of optically-captured data. These forms of objective testing, whilst useful in assessing knowledge and comprehension, generally fail to assess higher order critical thinking skills (Bull and Collins, 2001). Consequently, attempts have been made to extend CAA to the assessment of free-text assignments and essays.

The aim of this project was first to conduct a review of CAA activity within the economics discipline in the UK. Its second aim was to discuss developments in subjective assessment testing (commonly referred to in this context as the automated assessment of free text) and to assess the appropriateness of this to economics.

Keywords: Computer-assisted assessment, objective testing, free text, automated essay scoring systems.

Evaluation

CAA Activity in the Economics Discipline (Objective Testing)

The benefits of CAA for objective testing (i.e. where the answer is pre-determined) are well documented and include the ability to increase the frequency and range of assessment methods, thereby motivating students to learn and encouraging them to practise skills. Moreover, from a lecturer's viewpoint it has the potential to decrease the time taken to grade assignments. Table 1 provides a list of some of the commonly used systems in this area.

Table 1: Some Common CAA Objective Testing Systems

|System |Description |
|QuestionMark |A widely used commercial product (used by a number of departments at Bath). One of the first software systems developed to create tests, quizzes and on-line surveys. One example, outside of the UK, where it is being used is the Economics Department of the 'Hanzehogeschool' in Groningen (the Netherlands). |
|i-assess |Another commercial product. Can handle a variety of assessment mechanisms, principally multiple-choice, multiple-response, hot-spot, word-match and gap-fill. |
|TRIADS (Tripartite Interactive Assessment Delivery System) |A joint project between the Earth Science departments at the University of Liverpool, the University of Derby and the Open University. Can handle a variety of question styles. |
|Hot Potatoes |Developed by the Research and Development team at the University of Victoria Humanities Computing and Media Centre, Canada. Free of charge for publicly funded, non-profit, educational users who make their pages available on the web. |
|CASTLE (Computer Assisted Teaching and Learning) |Supplied free of charge for non-commercial use, but compared to other packages it has only a limited number of question styles (predominantly multiple-choice based). |
|TAL (Test and Learn) |Produced by Bristol University in an attempt to introduce computers into the assessment process. TAL runs over the internet; tutors choose which questions to set for the tests and students sit the tests using an ordinary web browser. |
|TOIA (Technologies for Online Interoperable Assessment) |Funded by JISC as part of the Exchange for Learning (X4L) Programme. Can handle a variety of question types, is completely web-based and has a customisable user interface. |

Note: A number of publishing companies also provide on-line learning centres offering web-based content which includes instructor resources, data-sets and other digital supplements. Sometimes on-line tests and question banks are available too. McGraw-Hill, for example, developed a desktop generator called EZ Test, which is used for creating paper tests. A similar package, called TestGen, is offered by Pearson Education. Both support a variety of question types (including true or false, multiple choice, check all that apply, numeric response, matching and ranking). As well as product features, another important criterion in the choice of CAA system is the ability to transfer the activity into a virtual learning environment (this is commonly referred to as interoperability).

Turning specifically to the economics sector, recent studies have found evidence[2] to suggest that a variety of traditional (essay / exam-based), innovative (group-based projects) and time-saving (multiple-choice, peer-review) assessments are being used in an attempt to deal with the pressures mentioned above. To date, however, the use of technology has been limited, tending to be confined to the use of virtual learning environments (VLEs) or departmental websites. The latest Bi-Annual Lecturer's Survey carried out by the Economics Network in 2005 found that 87% of respondents use either a VLE or a departmental website in their teaching, but less than a quarter said they use these media for the delivery of formative assessments and less than 10% for summative assessments. This is also consistent with the use of technology within the Economics and International Development department at Bath.

One major reason for this seems to be the investment of time required to implement CAA activities. Indeed, a lack of time and a lack of support for implementation were the main reasons respondents gave for not having used CAA. One respondent commented, “I have invested a good deal of time on developing technological learning environments in the past. They are VERY time consuming and the payoff is low.” Another simply said, “Online assessment - unconvinced it offers added advantages at present.”

Although the 2005 Survey figures are an improvement on those from the 2003 Survey, they remain some way short of the take-up rate in other disciplines (as is evident in the range of disciplines commonly showcasing CAA activities at e-learning conferences). A search of the EconLit database returned no results for academic articles based on keyword searches relating to “computer assessment” and “automated assessment”. Since the inception of the International Computer-Assisted Assessment Conference in 1997, no study has been presented using economics as a case study.

This apparent inertia in exploring new methods is not, however, universal. Some objective-testing resources are available, such as online quizzes accessible through the Economics Network of the Higher Education Academy. The most prominent of these resources include:

• Biz/Ed question bank.

Questions on a variety of topics, mainly relating to microeconomics and macroeconomics.

• Maths for economists: interactive multiple-choice tests by Hilary Lamaison, Brunel University.

• The MathCentre Resources for Economics.

• Simple Regression Quiz by Guy Judge, University of Portsmouth.

• Labour economics and education economics multiple choice quizzes by Geraint Johnes, Lancaster University.

• The Economics Network of the Higher Education Academy provides an extensive question bank of essay, multiple-choice questions and problem-sets for use in core economic subjects. Most of these materials are aimed at undergraduate level.

Another noted example of CAA which has been used within the economics discipline is the work of David Whigham and John Houston at Glasgow Caledonian University who have developed a software package called Excel Assess. The package can be used for objective testing in microeconomics, macroeconomics and statistics as well as for testing basic Excel skills. In total, there are around 50 questions and the package is freely available to staff at educational institutions.

CAA: Developments in subjective assessment testing (automated assessment of free text)

Whilst the development of computer and web-based systems for the delivery of automated assessment for objective testing has generated various benefits, many feel that the ability to automate essay grading would be particularly useful. There are several reasons for this. Firstly, essays, in contrast to objective tests such as multiple-choice, are considered the most useful tool to assess higher order critical thinking skills as identified in Bloom’s (1956) taxonomy. Secondly, across many social science disciplines, including economics, a substantial amount of time is devoted to the marking of essays and projects. Thirdly, CAA has the potential to overcome the problems of perceived unfairness that students claim when work is marked by different (human) assessors.

As part of the investigation of developments in automated marking of free text, I attended the 9th International Computer-Assisted Assessment Conference in July 2005. This was subsequently followed up by a review of available software in this field.

Though the majority of CAA systems available are based on the use of objective-type questions (as already documented in Table 1), in more recent times a number of automated essay scoring (AES) systems have been developed (and, in many cases, continue to be developed). A list of some of the systems currently available is provided below in Table 2.

All AES systems need to be trained on a large number of sample essays. Providing a sufficiently large sample enables the system to learn, and mimic, the judgement of human scorers. It must be stressed, however, that computers cannot score essays in the same way as humans. Whereas humans evaluate passages according to content knowledge and literary experience, computers typically count surface features (such as word order and essay length), identify stop-words, parse each sentence and examine sentence-to-sentence relatedness. Significantly, they can also compare each new essay with hundreds of pre-scored essays, thereby minimising the variance that arises across different human assessors.
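To make the comparison step concrete, the sketch below scores a new essay by extracting simple surface features (word counts after stop-word removal) and averaging the human marks of the most similar pre-scored essays. It is a deliberately minimal illustration of the general idea, not a reproduction of any of the commercial systems discussed in this report; the essay texts, stop-word list and marks are invented placeholders.

```python
from collections import Counter
import math

# Illustrative stop-word list and pre-scored training essays (invented placeholders).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "due"}

SCORED_ESSAYS = [  # (essay text, human mark on a six-point scale)
    ("demand curves slope downwards due to the substitution and income effects", 5),
    ("price is where demand and supply meet in a market", 3),
    ("the demand curve is a line on a graph", 1),
]

def features(text):
    """Count content words (simple surface features) after dropping stop-words."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def score(new_essay, k=2):
    """Average the human marks of the k pre-scored essays most similar to the new one."""
    target = features(new_essay)
    ranked = sorted(SCORED_ESSAYS, key=lambda pair: cosine(target, features(pair[0])), reverse=True)
    nearest = ranked[:k]
    return sum(mark for _, mark in nearest) / len(nearest)

print(score("demand slopes downwards due to income and substitution effects"))
```

Real systems use far richer features and hundreds of training essays; the point is simply that the machine's mark is anchored to prior human judgements rather than generated independently.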

AES systems employ various techniques to provide feedback and scoring. Some use a single mathematical model, such as Natural Language Processing (C-rater), Latent Semantic Analysis (e.g. the Intelligent Essay Assessor (IEA)) or statistical methods (e.g. the Bayesian Essay Test Scoring sYstem (BETSY)). Others, such as E-rater and IntelliMetric, use a combination of models. A full discussion of these models is beyond the scope of this report; a good overview is provided in Dikli (2006). Each AES system essentially arrives at a score through a multi-stage process, which typically involves preparing the text for processing, some form of text parsing (syntax analysis and feature extraction), computational analysis (based on one or more of the models mentioned above), and the production of a final score.
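As an illustration of one of these models, the sketch below applies a bare-bones form of Latent Semantic Analysis: it builds a term-by-essay matrix from a handful of pre-scored essays, reduces it with a singular value decomposition, and scores a new essay by its similarity to the training essays in the reduced "semantic" space. This is a toy version of the idea behind systems such as IEA, not their actual implementation, and the essays and marks are invented.

```python
import numpy as np

# Invented pre-scored essays and marks (placeholders).
essays = [
    "inflation is a sustained rise in the general price level",
    "inflation erodes the real value of money over time",
    "the price of a single good rose last year",
]
marks = np.array([5.0, 4.0, 2.0])

# 1. Build a term-by-essay count matrix.
vocab = sorted({w for e in essays for w in e.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(essays)))
for j, e in enumerate(essays):
    for w in e.split():
        A[index[w], j] += 1

# 2. Reduce the matrix to k latent dimensions with a singular value decomposition.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def project(text):
    """Fold a new essay into the latent semantic space."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1
    return v @ Uk / sk

def score(text):
    """3. Score a new essay with the mark of its most similar training essay."""
    q = project(text)
    train = Vt[:k, :].T * sk                    # training essays in the latent space
    sims = train @ q / (np.linalg.norm(train, axis=1) * np.linalg.norm(q) + 1e-12)
    return marks[int(np.argmax(sims))]

print(score("a sustained rise in the general price level erodes real money"))
```

In practice the training corpus runs to hundreds of essays and the number of latent dimensions is tuned empirically; with only three toy essays the reduction here is purely illustrative.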

AES systems can be classified into two groups: those which evaluate free text on the basis of style (i.e. writing quality) and those which evaluate it on the basis of content (e.g. keywords). Some of the earlier systems tended to be based on style only. The first example of this is Project Essay Grade (PEG), initially developed by Page in the 1960s. More recent contributions have attempted to incorporate both dimensions (e.g. BETSY and IEA). The performance of AES systems is generally favourable, particularly for those which evaluate both style and content. For example, the adjacent agreement (i.e. scores within one scale point of each other on, say, a six-point scale) between IEA and (expert) human assessors is typically in the 80-90% range. Similar results have been found for BETSY, IntelliMetric and E-rater.
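The agreement figures quoted here are straightforward to compute: given a machine mark and a human mark for each essay on the same scale, exact agreement counts identical marks and adjacent agreement counts marks within one scale point. The marks in the short sketch below are invented purely to show the calculation.

```python
# Invented machine and human marks on a six-point scale, purely to illustrate the calculation.
machine = [4, 5, 2, 6, 3, 4, 1, 5, 3, 2]
human   = [4, 4, 2, 4, 3, 5, 2, 5, 4, 2]

pairs = list(zip(machine, human))
exact = sum(m == h for m, h in pairs) / len(pairs)
adjacent = sum(abs(m - h) <= 1 for m, h in pairs) / len(pairs)

print(f"exact agreement:    {exact:.0%}")     # identical marks
print(f"adjacent agreement: {adjacent:.0%}")  # within one scale point
```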

Table 2: Systems for Automated Assessment of Free Text Answers

|System |Description |Used by |
|PEG (Project Essay Grade) |Measures surface linguistic features (such as essay length). Assesses style rather than content of the essay. Early prototypes were criticised for not considering the organisation and content of the essay. |English entrance exams |
|IEA (Intelligent Essay Assessor) |Latent Semantic Analysis. Focus is more on content. Can detect plagiarism. Only requires 100 pre-scored essays to train. |Psychology, Medicine and History |
|Criterion, E-rater and C-rater tools |E-rater: hybrid approach; provides holistic scores for essays. C-rater: Natural Language Processing; provides automated analysis of conceptual information in short-answer, free responses. Criterion: automatically evaluates essay responses using E-rater and the Critique1 writing analysis tools. |E-rater: non-native English writing. Criterion: Science and Engineering |
|BETSY (Bayesian Essay Test Scoring sYstem) |Statistical techniques based on Bayesian networks. Requires 1000 texts to train the system. Research on the performance of BETSY is limited. |High School Biology |
|IntelliMetric |Analyses over 300 semantic-, syntactic- and discourse-related features. Uses artificial intelligence and natural language processing. Able to evaluate essay responses in a variety of languages. |ACT Online Preparation. SATPrep application. TOEFL tests. Creative writing essays |
|MY Access! |Uses the IntelliMetric system, adapted for formative testing. Like IntelliMetric, can handle a variety of languages. |English Language courses |

Notes: Adapted from Marin (2004) and on-line web material. Other less popular systems, which are not listed above, include Auto-marking, CarmelTC, Paperless School Marking Engine (PS-ME) and Schema Extract Analyse and Report (SEAR); it was not possible to find web addresses or further details for these.

1 Critique provides real-time feedback about grammar, usage, mechanics and style, and organization and development.

A summary of the strengths and weaknesses of AES systems, including some issues specifically related to the economics discipline (identified following discussion with practitioners in the field), is given below.

Strengths:

• Provides immediate feedback and scoring thereby improving the learning process.

• Cost-effective (large groups often require multiple markers and each piece of work might need to be marked by more than one marker to minimise individual scorer bias).

• Reliability (can be used as a check to determine whether there is any bias created by different human assessors).

• Can be used to detect plagiarism.

• High correlation between automated systems and human raters (typical claims are in the region of 85-95%).

Weaknesses:

• Often require a large sample of essays to initially train the system (typical range is between 300 and 1000).

• Student-Tutor relationship can get lost.

• Tend to be developed for summative rather than formative assessment (although MY Access! is a notable exception).

• From the perspective of economics, it is not clear whether, and indeed how, such systems could work for assignments which include statistical analysis such as regression equations.

• Diagrams are also an integral part of economic assignments. Again, it is not clear whether any of these existing systems can accommodate diagrams and charts.

• With the exception of BETSY, most systems are commercial packages and tend to be expensive.

• Many systems are not suited to assessments where the student has hand-written the assignment.

Whilst there appears to be merit in the use of automated assessment of free text, certainly on very large courses, it is likely that the outstanding issues will not be resolved until one of these systems is trialled within the economics discipline. In terms of the choice of system, none appears to be best under all comparisons. However, given that it is freely available, includes evaluation based on both style and content, and performs well in comparison with human scoring, my preference would be to conduct a trial using BETSY[3].
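To indicate what such a trial would involve technically, the sketch below implements the core of the Bayesian idea in its simplest "bag of words" form: word probabilities are estimated for each score band from pre-marked essays, and a new essay is assigned to the band with the highest posterior probability. It is a minimal naive Bayes classifier written for illustration only, not BETSY's own code, and the training essays and score bands are invented.

```python
import math
from collections import Counter

# Invented training essays grouped into score bands (placeholders).
TRAINING = {
    "high": ["supply and demand determine the market price",
             "a price ceiling below equilibrium creates a shortage"],
    "low":  ["the price is decided by the shop",
             "markets are places where people go shopping"],
}

def train(data):
    """Estimate per-band word counts, word totals and prior probabilities."""
    counts = {band: Counter(w for e in texts for w in e.lower().split())
              for band, texts in data.items()}
    totals = {band: sum(c.values()) for band, c in counts.items()}
    n_texts = sum(len(texts) for texts in data.values())
    priors = {band: len(texts) / n_texts for band, texts in data.items()}
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, priors, vocab

def classify(text, counts, totals, priors, vocab):
    """Assign the band with the highest log posterior, using add-one smoothing."""
    best_band, best_logp = None, -math.inf
    for band in counts:
        logp = math.log(priors[band])
        for w in text.lower().split():
            logp += math.log((counts[band][w] + 1) / (totals[band] + len(vocab)))
        if logp > best_logp:
            best_band, best_logp = band, logp
    return best_band

model = train(TRAINING)
print(classify("the equilibrium price is set by supply and demand", *model))
```

BETSY itself works with richer features and several scoring rules, but the training requirement noted above (roughly 1,000 pre-marked texts) would be the main practical constraint on a departmental trial.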

Financial Summary of the TDF Project

|Item |Amount spent |Amount committed |% of total allocation |
|Conference attendance¹ and travelling expenses |£870 |£1350 |47.54% |
|Research assistant 1² |£480 |£480 |26.23% |
|Research assistant 2³ |£480 | |26.23% |

¹ 9th International Computer Assisted Assessment Conference, 5-6 July 2005.
² Employed to carry out a review of objective testing systems.
³ Employed to carry out a review of automated essay grading systems.

Due to the costs involved in conducting the review of both systems, the research has not yet been disseminated at any national or international conferences.

-----------------------

[1] The issue of feedback continues to be one of the main areas of student concern (as documented in the National Student Survey 2005, 2006 and 2007).

[2] Census of UK Economic Departments conducted by Jessica Thompson of the Economics Network and by Paul Reithmuller of the University of Queensland (conducted during the 2006-2007 academic session).

[3] A detailed bibliography of references cited in this report is available on request.
