TIMSS 1994: Performance assessment in TIMSS: New Zealand results
Publication Details
The aim of this report is to present the tasks, to indicate the performance expectations for which measures were sought, to give New Zealand and mean international success rates, and to comment where this seems desirable. Direct comparisons between New Zealand success rates and those of selected countries are made for some items.
Author(s): Robert Garden
Date Published: November 1997
Executive Summary
Why Performance Assessment?
Results from large-scale surveys of achievement have sometimes been criticised because multiple-choice has been the only item-type used in the tests. Tests based on multiple-choice items allow wide coverage of curricula and give reliable measures of achievement, but some skills and procedures taught in schools are thought to be best measured by having students write their own answers to questions, or carry out tasks, so that they can demonstrate whether they have learned these skills and procedures and are able to apply them. Tests for the Third International Mathematics and Science Study (TIMSS) therefore included a range of assessment methods: multiple-choice items, free-response items (short-answer and extended-response), and performance assessment tasks.
What is Performance Assessment?
All achievement test items, including multiple-choice and free-response items, assess student performance. "Performance assessment" is the term most often used in the literature for assessment tasks in which students are required to carry out "hands-on" activities with equipment to show how well they are able to apply strategies and procedures to investigate and solve problems in practical settings. Other terms used in the literature include "alternative assessment", "practical assessment", and "authentic assessment". In TIMSS, each task consisted of a number of sub-tasks, each of which was assessed. These sub-tasks are referred to as items. They range from demonstrating the knowledge or skills needed to carry out the task to more complex outcomes such as describing, or critiquing, an experimental plan.
Who Took Part?
The TIMSS written tests in mathematics and science were administered in more than 40 countries, but inability to raise funding, and lack of interest in performance assessment in some centres, resulted in only about half of these countries administering the performance assessment tasks.
Countries
The following countries participated in the performance assessment component of TIMSS:
Population 1 | Population 2
---|---
Australia | Australia
Canada | Canada
Cyprus | Cyprus
Colombia | Czech Republic
Hong Kong | England
Iran, Islamic Rep. | Hong Kong
Israel | Iran, Islamic Rep.
New Zealand | Israel
Portugal | Netherlands
Slovenia | New Zealand
United States | Norway
 | Portugal
 | Romania
 | Scotland
 | Singapore
 | Slovenia
 | Spain
 | Sweden
 | Switzerland
 | United States
Students
Three populations took written tests, but the most senior of these, students in their final year of schooling, did not take part in the performance assessment component of TIMSS. The two other TIMSS populations were:
Population 1: Students in the two adjacent grades with the largest proportion of 9-year-olds at the time of testing (standards 2 and 3 in New Zealand); and
Population 2: Students in the two adjacent grades with the largest proportion of 13-year-olds at the time of testing (forms 2 and 3 in New Zealand).
Written tests were primarily targeted at the upper class level in each of these populations, and it was to samples of students drawn from these levels, i.e. standard 3 and form 3, that the performance tasks were administered. Names for these levels vary across countries, but in this report they are referred to as standard 3 and form 3 levels respectively.
Both teachers and students in the participating countries showed great interest in the exercise. Students in many places were reported to have been enthusiastic about taking part, and most were said by administrators to have enjoyed attempting the tasks, even if they were not always successful. However, it has to be recognised that this might not have been the case had TIMSS been a high-stakes assessment for the students.
This Report
The aim of this report is to present the tasks, to indicate the performance expectations for which measures were sought, to give New Zealand and mean international success rates, and to comment where this seems desirable. Direct comparisons between New Zealand success rates and those of selected countries are made for some items. Aggregate scores for some tasks vary slightly (of the order of 1 or 2 percent) from those quoted in earlier national reports because of weighting or, in the case of international means, different methods of calculation. In a very few cases the differences are greater because, when all the data had been received, TIMSS management at the international centre judged it desirable to collapse some codes.
International Reports
Readers interested in more detailed descriptions of the development of the Performance Assessment component of TIMSS, in technical aspects of the data analysis, and in comparative data for all participating countries will find them in the TIMSS Technical Report (Martin & Kelly, 1996) and the international report of the Performance Assessment (Harmon et al., 1997).
The Challenges
Performance assessment tasks were included in TIMSS for two reasons. First, there was a desire to measure achievement in as many mathematics and science curriculum objectives as feasible, in order to increase the validity of the assessment. Second, studies of performance assessment in the past decade have given rise to questions about its feasibility for use in large-scale surveys, and whether inclusion of performance assessment tasks provides useful information not provided by traditional written tests. TIMSS provided an opportunity to investigate these, and other, research questions.
In assessing aspects of student achievement using performance assessment, the cost of what is seen as enhanced validity is lower reliability than is usually attained with traditional pencil and paper tests (Moss, 1992), unless large numbers of tasks are used (Shavelson et al., 1993). This can be accepted so long as reliability does not fall so low that all validity is lost. If different raters cannot agree on whether or not a student has completed a task, or an item within a task, successfully, or if individual raters give inconsistent ratings, the measures cannot be valid for any purpose.
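To illustrate the rater-agreement concern, the following is a minimal sketch, using invented codes rather than TIMSS data, of how agreement between two raters coding the same set of responses might be summarised (percent agreement and Cohen's kappa):

```python
# Hypothetical example: two raters' codes for the same six responses (invented data).
from collections import Counter

rater_a = ["correct", "partial", "incorrect", "correct", "partial", "correct"]
rater_b = ["correct", "partial", "partial", "correct", "incorrect", "correct"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # percent agreement

# Agreement expected by chance, based on each rater's marginal code frequencies.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / n ** 2

kappa = (observed - expected) / (1 - expected)  # Cohen's kappa
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

Low agreement or low kappa across raters would signal exactly the loss of reliability, and hence of validity, described above.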
The challenge for TIMSS was therefore to produce tasks which would give measures of achievement of curricular objectives which experts from the participating countries would agree were valid for this purpose, and which were sufficiently reliable to allow comparisons between country means and, in some cases, between groups within countries. In essence each task had to be testing the same things in the same way, and under comparable conditions, whether in New Zealand, Hong Kong, Norway, Iran, or any other participating country.
In addition, because a student's achievement as measured by performance tasks tends to be very dependent on the particular tasks (Linn & Burton, 1994), it was necessary to have as many students attempt as many tasks as possible. On the other hand, the financial cost per student of performance testing is high and the maximum time available to administer the tasks was 90 minutes.
Cost limited the study in several ways. Of the countries taking a full part in TIMSS at the form 3 level, only 21 secured funding to administer the performance assessment component of TIMSS and, of these countries, only 10 also did so at the standard 3 level. Cost was one factor that ruled out tasks requiring more elaborate equipment and, in some countries, travel costs meant that remote schools could not be included in the performance assessment sub-sample.
The Solutions
Standardisation of tasks and procedures across countries, and across settings within countries, was accomplished through the following actions:
- trialling of about twice as many tasks as were needed, so that those selected could be limited to tasks requiring equipment and materials that were widely available or easily replicated;
- analysis of trial data and information supplied by national research coordinators from each country to identify and reject tasks or sub-tasks affected by differing geographic, climatic, or cultural conditions;
- development of a manual for test administrators, setting out in detail the procedures to be followed in preparing for and administering the performance assessment, including specifications for equipment and materials to be used (TIMSS, 1994a);
- provision of a manual, with exemplars, detailing how coding was to be carried out (TIMSS, 1994b & 1995);
- provision of training in administration of the assessment, and training in coding the student data.
The need to maximise the number of tasks and students, while keeping costs within reasonable limits, gave rise to a design which involved:
- a form of multiple-matrix sampling in which each student attempted either three or four of the 12 tasks in the main survey;
- rotation of tasks amongst students by a scheme which simultaneously keeps error within acceptable bounds and provides data in a form which allows key analyses addressing research questions to be carried out;
- data based on student responses being provided in written form, rather than on observation by trained observers.
Selecting the Tasks
Tasks were collected from several of the TIMSS national centres, and from research and evaluation agencies. Of these tasks, 22 were selected for trialling. Nineteen countries trialled the tasks, under standardised conditions, with samples of students. Following the trials, committees of subject-matter specialists, performance assessment administrators, and national research coordinators in each country reviewed and evaluated the tasks. Reports from these committees and data from the field trials were used by the TIMSS Performance Assessment Committee in selecting the 12 tasks required for the main survey.
Trial tasks, or sub-tasks (items), were rejected if they had proved too difficult for students, had received low quality ratings from subject-matter experts, or if problems in administration had been encountered. Some were rejected because students could not complete them in time, because standardisation of some equipment was difficult, or because differing climatic conditions (such as humidity) affected the materials or equipment differently in various geographic regions.
From the remaining tasks, 12 were selected. This set included some which were judged to need 30 minutes for completion, and some to need 15 minutes. The investigations, and problems to be solved, balanced science and mathematics content, and represented a range of topic, skill, and procedure areas. Several complete tasks, and a number of sub-tasks, were identical for both populations tested (standard 3 and form 3 in New Zealand).
The Design
One of the research issues to be addressed was the question of whether information about student achievement collected by means of performance assessment differed from that collected by traditional pencil and paper tests. The sample of students selected to participate in each country was therefore a sub-sample of the students who had completed the written TIMSS tests and questionnaires a few days earlier. This will allow each student's performance assessment data to be associated with the written test data, as well as the student, teacher, and school background data collected for each student.
In selecting the national sub-samples for performance assessment, national centres were permitted to exclude schools which had fewer than nine students in the target class, and schools which were so remote that it would have been too expensive to send a trained administrator to them. Such exclusions were to be kept to a minimum and, for most countries, the potential for bias (so far as national representativeness of the samples is concerned) was considered to be offset by maintenance of a high quality of project management and by the various quality control measures in place.
In New Zealand, the exclusion rate at standard 3 level was high (27%), partly because of the high proportion of very small schools and partly because of the remoteness factor. The potential for bias in achievement measures from this source is obvious, but a comparison between the written test means for rural and urban standard 3 students revealed no significant difference, so it is a reasonable assumption that bias in the performance measures, if it exists at all, is very small (see the table below).
Class Level | Student | Mathematics Mean % | Science Mean %
---|---|---|---
Form 3 | Rural | 52 | 57
Form 3 | Urban | 54 | 58
Standard 3 | Rural | 55 | 62
Standard 3 | Urban | 53 | 60
A direct comparison of the TIMSS written test mean achievement of the performance assessment sub-samples with that of the students not selected for performance assessment (see the table below) indicates that no significant bias in overall mean achievement occurred, and it is very unlikely that achievement distributions for the performance assessment sub-samples were skewed with respect to the respective populations.
Class Level | Sub-Sample | Maths Mean % | Science Mean %
---|---|---|---
Standard 3 | Performance Assessment Girls | 54 | 60
Standard 3 | Non-Performance Assessment Girls | 54 | 62
Standard 3 | Performance Assessment Boys | 54 | 61
Standard 3 | Non-Performance Assessment Boys | 52 | 59
Form 3 | Performance Assessment Girls | 53 | 52
Form 3 | Non-Performance Assessment Girls | 51 | 50
Form 3 | Performance Assessment Boys | 52 | 53
Form 3 | Non-Performance Assessment Boys | 53 | 54
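As an illustration only, a check of this kind can be approximated with a two-sample comparison of group means. The standard deviations and group sizes in the sketch below are invented, and the actual TIMSS analyses used the study's own weighting and sampling-error procedures:

```python
# Hypothetical sketch of a two-sample z test on group means. The standard
# deviations and group sizes below are invented; the actual TIMSS analyses
# used the study's own weighting and sampling-error procedures.
import math

def mean_diff_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent group means."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

# e.g. standard 3 maths: performance assessment boys (54%) vs non-performance
# assessment boys (52%), with invented spreads and group sizes
z = mean_diff_z(54, 20, 300, 52, 20, 1800)
print(f"z = {z:.2f}")  # |z| < 1.96 is consistent with no significant difference at the 5% level
```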
The time available for testing (90 minutes) allowed each student to complete three 30-minute tasks, or two 30-minute tasks and two 15-minute tasks. Tasks were grouped into nine 90-minute sets, and for administration each of these sets was arranged at a "station". Clusters of nine students took part simultaneously and changed stations at 30-minute intervals, so that each student attempted three or four tasks. Stations to be visited by each student were pre-allocated at national level to ensure that each task was attempted by approximately equal numbers of randomly selected students, and so that the allocation of stations to students was also random.
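A minimal sketch of a rotation scheme with these properties (nine stations, three 30-minute spells, one student per station per spell) is given below. The task sets and allocation rules actually used in TIMSS were specified centrally, so this is an illustration only:

```python
# Hypothetical illustration of rotating a cluster of nine students around nine
# stations over three 30-minute spells, so that each student visits three
# stations and no two students share a station in the same spell. The actual
# TIMSS allocation scheme was specified centrally; this is an illustration only.
import random

N_STATIONS = 9
N_SPELLS = 3

def allocate_cluster(seed=None):
    rng = random.Random(seed)
    start = list(range(N_STATIONS))
    rng.shuffle(start)  # random starting station for each of the nine students
    # Each spell, every student moves on to the next station (cyclically),
    # so every station is occupied by exactly one student in each spell.
    return [[(start[s] + spell) % N_STATIONS for spell in range(N_SPELLS)]
            for s in range(N_STATIONS)]

for student, stations in enumerate(allocate_cluster(seed=1), start=1):
    print(f"student {student}: stations {stations}")
```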
Six hundred and thirteen students from 50 standard 3 classes in New Zealand schools, and 824 form 3 students from 49 classes in New Zealand schools, participated in both the performance assessment and written test components of TIMSS. This meant that approximately 205 standard 3 students and approximately 270 form 3 students attempted each task at the respective levels. This was a greater number than in most countries because, where possible, two clusters of nine students per class were taken in New Zealand, whereas other national centres commonly selected only one cluster of nine per class. Only Canada had larger samples.
Administration
Ideally, each task needed to be identical from school to school and from country to country. Similarly, interactions between students attempting the tasks and the test administrator needed to conform to the same criteria for all students. Equipment for the tasks was therefore prepared centrally in each country according to strict specifications, and the detailed manuals for test administrators (TIMSS, 1994a, 1994b, & 1994c) were distributed from the international study centre. Representatives from each participating country were trained in administering the assessment, and they in turn conducted training sessions in their own countries.
In New Zealand, performance assessment was directed by Robyn Caygill from the Educational Assessment Research Unit at the University of Otago. Twenty-one teachers received two days' training in administering the tasks and were released from their teaching positions for two weeks. Each was allocated schools in which to carry out the assessment.
Besides ensuring that the equipment and accompanying instructions for students were in good order and laid out at the correct stations, administrators checked that students were at their correct stations at the beginning of each 30-minute spell, saw to it that test conditions were maintained, and collected student work. Before the assessment began they showed students how to use the stopwatches provided, and made sure that students understood how to read the rulers and thermometers provided. Ability to use this equipment was essential for completing certain tasks in which use of the equipment was not the performance being measured. Such instruction was not given where use of equipment was one of the outcomes being measured, nor was instruction or assistance permitted with other procedures or for student questions relating to the tasks. The only concession in this respect was that administrators could read task instructions to any standard 3 students unable to do so for themselves.