CRITERIA OF A GOOD TEST IN ENGLISH
VALIDITY
A
test is said to be valid if it measures accurately what it is intended
to measure. This seems simple enough. When closely examined, however,
the concept of validity reveals a number of aspects, each of which
deserves our attention.
Content validity
A test is said to have content validity if
its content constitutes a representative sample of the language skills,
structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what the relevant structures are will depend, of course, upon the purpose of the test. We wouldn’t expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures, etc. that it is meant to cover. Such a specification should be made at a
very early stage in test construction. It is not to be expected that
everything in the specification will always appear in the test; there
may simply be too many things for all of them to appear in a single
test.
What
is the importance of content validity? First, the greater a test’s
content validity, the more likely it is to be an accurate measure of
what it is supposed to measure. Secondly,
a test that lacks content validity is likely to have a harmful backwash effect: areas which are not tested are likely to become areas ignored in teaching and learning. Too often the content of tests is determined by what is easiest to test rather than by what is most important to test. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these.
The effectiveness of a content validity strategy (in which experts judge how well the test content matches its specification) can be enhanced by making sure that the experts are truly experts in the appropriate field and that they have adequate and appropriate tools, in the form of rating scales, so that their judgments can be sound and focused. However,
testers should never rest on their laurels. Once they have established
that a test has adequate content validity, they must immediately explore
other kinds of validity of the test in terms related to the specific
performances of the types of students for whom the test was designed in
the first place.
Criterion-related validity / Empirical validity
There
are essentially two kinds of criterion-related validity: concurrent
validity and predictive validity. Concurrent validity is established
when the test and the criterion are administered at about the same time.
To exemplify this kind of validation in achievement testing, let us
consider a situation where course objectives call for an oral component
as part of the final achievement test. The objectives may list a large
number of ‘functions’ which students are expected to perform orally, and to test all of them might take 45 minutes for each student. This could well be impractical. In such a case a shorter oral test might be constructed for general use, and its concurrent validity established by administering both the short test and the full 45-minute test to a sample of students and seeing how closely the two sets of scores agree.
The
second kind of criterion-related validity is predictive validity. This
concerns the degree to which a test can predict candidates’ future
performance. An example would be how well a proficiency test could
predict a student’s ability to cope with a graduate course at a British
University. The criterion measure here might be an assessment of the
student’s English as perceived by his or her supervisor at the
university, or it could be the outcome of the course (pass/fail, etc.).
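The level of agreement between test scores and the criterion measure is conventionally expressed as a correlation, often called a validity coefficient. The following is a minimal sketch of how such a coefficient might be computed; the test scores and supervisor ratings in it are invented purely for illustration, not taken from any real study.

```python
# Minimal sketch: a validity coefficient as the Pearson correlation between
# test scores and a criterion measure. All score values below are invented.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Proficiency test scores for ten candidates (hypothetical).
test_scores = [55, 62, 70, 48, 81, 66, 74, 59, 90, 68]
# Supervisors' later ratings of the same candidates' English, 1-9 scale (hypothetical).
supervisor_ratings = [4, 5, 6, 3, 8, 5, 7, 4, 9, 6]

print(f"Validity coefficient: {pearson_r(test_scores, supervisor_ratings):.2f}")
```

The closer the coefficient is to 1.0, the more closely the test predicts (or agrees with) the criterion; a value near zero would suggest the test is telling us little about the performance we actually care about.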
Construct validity
A
test, part of test, or a testing technique is said to have construct
validity if it can be demonstrated that it measures just the ability
which it is supposed to measure. The word ‘construct’ refers to
underlying ability (or trait) which is hypothesized in a theory of
language ability. One might hypothesize, for example, that the ability
to read involves a number of sub-abilities, such as the ability to guess
the meaning of unknown words from the context in which they are met. It
would be a matter of empirical research to establish whether or not such
a distinct ability existed and could be measured. If we attempted to
measure that ability in a particular test, then that part of the test
would have construct validity only if we were able to demonstrate that we were indeed measuring just that ability.
Construct
validity is the most important form of validity because it asks the
fundamental validity question: What is this test really measuring? We have
seen that all variables derive from constructs and that constructs are
nonobservable traits, such as intelligence, anxiety, and honesty,
“invented” to explain behavior. Constructs underlie the variables that
researchers measure. You cannot see a construct, you can only observe
its effect. “Why does the person act this way and that person a
different way? Because one is intelligent and one is not – or one is
dishonest and the other is not.” We cannot prove that constructs exist,
just as we cannot perform brain surgery on a person to “see” his or her
intelligence, anxiety, or honesty.
Face validity
A
test is said to have face validity if it looks as if it measures what
it is supposed to measure, for example, a test which pretended to
measure pronunciation ability but which did not require
the candidate to speak (and there have been such tests) might be thought to
lack face validity. This would be true even if the test’s construct and
criterion-related validity could be demonstrated. Face validity is
hardly a scientific concept, yet it is very important. A test which does
not have face validity may not be accepted by candidates, teachers,
education authorities or employers. It may simply not be used; and if it
is used, the candidates’ reaction to it may mean that they do not
perform on it in a way that truly reflects their ability.
The use of validity
What
use is the reader to make of the notion of validity? First, every
effort should be made in constructing tests to ensure content validity.
Where possible, the tests should be validated empirically against some
criterion. Particularly where it is intended to use indirect testing,
reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used.
RELIABILITY
Reliability
is a necessary characteristic of any good test: for it to be valid at
all, a test must first be reliable as a measuring instrument. If a test is administered to the same candidates on different occasions (with no language practice work taking place between these occasions), then, to the extent that it produces differing results, it is not reliable.
Reliability measured in this way is commonly referred to as test/re-test
reliability to distinguish it from mark/re-mark reliability. In short,
in order to be reliable, a test must be consistent in its measurements.
Factors affecting the reliability of a test are:
- the extent of the sample of material selected for testing: whereas validity is concerned chiefly with the content of the sample, reliability is concerned with its size. The larger the sample (i.e. the more tasks the testees have to perform), the greater the probability that the test as a whole is reliable; hence the favoring of objective tests, which allow for a wide field to be covered.
- the administration of the test: is the same test administered to different groups under different conditions or at different times? Clearly, this is an important factor in deciding reliability, especially in tests of oral production and listening comprehension.
One
method of measuring the reliability of a test is to re-administer the
same test after a lapse of time. It is assumed that all candidates have
been treated in the same way in the interval – that they have either all
been taught or that none of them have.
Another
means of estimating the reliability of a test is by administering
parallel forms of the test to the same group. This assumes that two
similar versions of a particular test can be constructed; such tests
must be identical in the nature of their sampling, difficulty, length,
rubrics, etc.
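Both the test/re-test and the parallel-forms approaches come down to the same calculation: correlate the two sets of scores obtained from the same candidates and treat the result as a reliability coefficient. A minimal sketch, with invented scores (statistics.correlation requires Python 3.10 or later):

```python
# Minimal sketch: test/re-test (or parallel-forms) reliability as the correlation
# between two administrations to the same candidates. All scores are invented.
from statistics import correlation  # available from Python 3.10

first_administration  = [34, 41, 28, 45, 37, 30, 48, 39]
second_administration = [36, 40, 27, 47, 35, 31, 46, 41]

reliability = correlation(first_administration, second_administration)
print(f"Reliability coefficient: {reliability:.2f}")  # a value near 1.0 means consistent results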
How to make tests more reliable
As
we have seen, there are two components of test reliability: the
performance of candidates from occasion to occasion, and the reliability
of the scoring.
Take enough samples of behavior. Other things being equal, the more items that you have on a test, the more reliable that test will be. This seems intuitively right. While it is important to make a test long enough to achieve satisfactory reliability, it should not be made so long that the candidates become so bored or tired that the behavior they exhibit becomes unrepresentative of their ability. At the same time, it may often be necessary to resist pressure to make a test shorter than is appropriate. The usual argument for shortening a test is that it is not practical.
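The trade-off described here between test length and reliability can be made concrete with the Spearman-Brown prophecy formula, a standard psychometric relation (not mentioned in the passage itself) that estimates how reliability changes when a test is lengthened or shortened by a factor k, assuming the added items behave like the existing ones. A minimal sketch:

```python
# Minimal sketch of the Spearman-Brown prophecy formula: predicted reliability
# when a test is made k times as long. The example figures are invented.

def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability of a test lengthened (k > 1) or shortened (k < 1) by factor k."""
    return k * reliability / (1 + (k - 1) * reliability)

# A hypothetical 20-item test with reliability 0.70:
print(round(spearman_brown(0.70, 2.0), 2))  # doubled to 40 items -> 0.82
print(round(spearman_brown(0.70, 0.5), 2))  # halved to 10 items  -> 0.54
```

The figures illustrate the point made above: lengthening a test raises reliability, but with diminishing returns, while shortening it can cost more reliability than expected.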
Do not allow candidates too much freedom. In
some kinds of language test there is a tendency to offer candidates a
choice of questions and then to allow them a great deal of freedom in
the way that they answer the ones that they have chosen. Such a
procedure is likely to have a
depressing effect on the reliability of the test. The more freedom that is given, the greater is likely to be the difference between the performances.
Write unambiguous items. It is essential that candidates should not be presented with items whose meaning is not clear or to which there is an acceptable answer which the test writer has not anticipated.
Provide clear and explicit instructions. This
applies both to written and oral instructions. If it is possible for
candidates to misinterpret what they are asked to do, then on some
occasions some of them certainly will. Test writers should not rely on
the students’ powers of telepathy to elicit the desired behavior.
Ensure that tests are well laid out and perfectly legible. Too
often, institutional tests are badly typed (or handwritten), have too
much text in too small a space, and are poorly reproduced. As a result,
students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on the unwanted tasks will lower the reliability of a test.
Candidates should be familiar with format and testing techniques. If
any aspect of a test is unfamiliar to candidates, they are likely to
perform less well than they would do otherwise (on subsequently taking a
parallel version, for example). For this reason, every effort must be
made to ensure that all candidates have the opportunity to learn just
what will be required of them.
Provide uniform and non-distracting conditions of administration. The
greater the differences between one administration of a test and
another, the greater the differences one can expect between a
candidate’s performance on two occasions. Great care should be taken to ensure uniformity.
Use items that permit scoring which is as objective as possible. This may appear to be a recommendation to use multiple choice items, which permit completely
objective scoring. An alternative to multiple choice is the open-ended item which has a
unique, possibly one-word, correct response which the candidates produce
themselves. This too should ensure objective scoring, but in fact
problems with such matters as spelling which makes a candidate’s meaning
unclear often make demands on the scorer’s judgment. The longer the
required response, the greater the difficulties of this kind.
Make comparisons between candidates as direct as possible. This
reinforces the suggestion already made that candidates should not be
given a choice of items and that they should be limited in the way that
they are allowed to respond. Scoring the compositions all on one topic
will be more reliable than if the candidates are allowed to choose from
six topics, as has been the case in some well-known tests. The scoring
should be all the more reliable if the compositions are guided (see the earlier advice not to allow candidates too much freedom).
Provide a detailed scoring key. This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points.
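As a rough illustration of what such a key might look like in practice (the items, acceptable answers, and point values below are all invented), a simple machine-readable version could be sketched as follows:

```python
# Minimal sketch of a detailed scoring key with partial credit.
# Item names, acceptable answers, and point values are invented for illustration.

SCORING_KEY = {
    "item_1": {"went": 2},                                              # one acceptable answer
    "item_2": {"has been living": 2, "has lived": 2, "is living": 1},   # partial credit allowed
    "item_3": {"in spite of": 2, "despite": 2},
}

def score_response(item, response):
    """Return the points assigned to a response; unanticipated answers score 0."""
    acceptable = SCORING_KEY.get(item, {})
    return acceptable.get(response.strip().lower(), 0)

candidate_answers = {"item_1": "went", "item_2": "is living", "item_3": "because of"}
total = sum(score_response(item, ans) for item, ans in candidate_answers.items())
print(f"Total: {total}")  # 2 + 1 + 0 = 3
```

The more fully the key anticipates acceptable and partially acceptable answers, the less is left to each scorer's individual judgment.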
Train scorers. This is especially important where scoring is most subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score accurately compositions from past administrations. After each administration,
patterns of scoring should be analyzed. Individuals whose scoring
deviates markedly and inconsistently from the norm should not be used
again.
Identify candidates by number, not name. Scorers
inevitably have expectations of candidates that they know. Except in
purely objective testing, this will affect the way that they score.
Studies have shown that even where the candidates are unknown to the
scorers, the name on a script (or a photograph) will make a significant
difference to the scores given.
For example, a scorer may be influenced by the gender or nationality of
a name into making predictions which can affect the score given. The
identification of candidates only by number will reduce such effects.
Employ multiple, independent scoring. As a general rule, and certainly where testing is subjective, all scripts should be scored by at least two independent
scorers. Neither scorer should know how the other has scored a test
paper. Scores should be recorded on separate score sheets and passed to a
third, senior, colleague, who compares the two sets of scores and
investigates discrepancies.
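A minimal sketch of how such double marking might be handled is given below; the candidate numbers, scores, discrepancy threshold, and the rule for combining agreeing scores are all invented choices for illustration:

```python
# Minimal sketch of double marking: two scorers mark the same scripts independently,
# and large discrepancies are referred to a third, senior colleague.
# Candidate numbers, scores, and the threshold are invented.

scorer_a = {101: 14, 102: 11, 103: 18, 104: 9}
scorer_b = {101: 15, 102: 16, 103: 17, 104: 10}

DISCREPANCY_THRESHOLD = 3  # points of difference that trigger a review

for candidate in sorted(scorer_a):
    a, b = scorer_a[candidate], scorer_b[candidate]
    if abs(a - b) > DISCREPANCY_THRESHOLD:
        print(f"Candidate {candidate}: scores {a} and {b} differ - refer to senior colleague")
    else:
        final = (a + b) / 2  # one possible way of combining agreeing scores
        print(f"Candidate {candidate}: final score {final}")
```

Keeping the two sets of marks on separate sheets, as the text recommends, is what makes the comparison genuinely independent.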
ADMINISTRATION
A test must be practicable; in other words, it must be fairly straightforward to administer. It is only too easy to become so absorbed in the actual construction of the test items that the most obvious practical considerations concerning the test are overlooked. The length of time available for the administration of the test is frequently misjudged even by experienced test writers, especially when the complete test consists of a number of sub-tests. In such cases sufficient time may not be allowed for the administration of the test; a try-out of the test on a small but representative group of testees is the best safeguard against such misjudgements.
Another practical consideration concerns the answer sheets and the stationery used. Many tests require the testees to enter their answers on the actual question paper (e.g. circling the letter of the correct option), thereby unfortunately reducing the speed of the scoring and preventing the question paper from being used a second time. In some tests the candidates are presented with a separate answer sheet, but too often insufficient thought has been given to possible errors arising from the (mental) transfer of answers from the question paper to the answer sheet itself.
A
final point concerns the presentation of the test paper itself. Where
possible, it should be printed or typewritten and appear neat, tidy and
aesthetically pleasing. Nothing is worse and more disconcerting to the
testee than an untidy test paper, full of misspellings, omissions and
corrections.