Although not ideal for all purposes, multiple-choice exams are the only practical and
cost-effective means for testing en masse. They have the advantages of
objectivity, repeatability, and automated grading. The same exam may be
given anywhere on the planet without concern for bias. They are, therefore,
the standard means for examining certification candidates.
In the past, the creation of multiple-choice exams was fairly hit or miss: just make up a
bunch of questions and hope they test what you want to test. Today, it's done
using a highly structured methodology, with feedback and statistical
validation. A multidisciplinary team creates the exams, consisting of statisticians,
psychometricians (who deal with the generic issues surrounding exams), and Subject
Matter Experts (SMEs). In the case of the OpenVMS exams, SMEs were
drawn from both inside and outside HP, with specialists from
Services, Engineering, Education, Presales, and external partners.
Stating the
obvious - the most important aspect of creating an exam is to know exactly what
it is you want to examine. The first step in the process is to produce a
"Competency Model", which is a list of competencies
and skills that you expect the candidate to have mastered. This model is tree
structured, starting with general areas and branching down to specifics.
Once the competency model is
complete, the next step is to weight the branches according to importance. For
example, two branches in an OpenVMS competency model might be "Queuing" and
"Security". Security might be deemed more important and, therefore, given a
higher weighting. This leads to an exam blueprint, which is essentially
the competency model with percentages attached to each branch and distributed
down to the leaves. The blueprint also determines the size of the exam - that
is, the duration and the number of questions presented to the candidate.
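As a rough illustration of how the blueprint falls out of the weightings (the branch names, weights, and exam length below are invented, not taken from a real OpenVMS blueprint), the arithmetic amounts to normalizing the weights into percentages and scaling them to the planned number of questions:

```python
# Hypothetical competency weights; the branch names and numbers are invented
# for illustration, not taken from a real OpenVMS blueprint.
weights = {"Queuing": 2, "Security": 3, "Startup/Shutdown": 2, "Clusters": 3}

exam_length = 60  # planned number of questions on one form

total = sum(weights.values())
blueprint = {
    branch: {
        "percent": 100 * w / total,
        "questions": round(exam_length * w / total),
    }
    for branch, w in weights.items()
}

for branch, alloc in blueprint.items():
    print(f"{branch:20s} {alloc['percent']:5.1f}%  {alloc['questions']:3d} questions")
```

Note that simple rounding may leave the per-branch counts one or two questions short of (or over) the planned exam length; any real blueprint would reconcile that by hand.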
From the exam blueprint, the SMEs
can start writing questions (known as items in exam jargon). Each item
addresses a specific competency objective from the blueprint, and the number
created for each competency branch is determined from the weightings in the
blueprint. The typical target is for two versions of the exam (or forms
in exam jargon) so that a candidate taking the exam a second time receives a
different set of questions. Because many items are expected to be "lost" (for a
variety of reasons, covered below), it's usually necessary to create at least
three times the final number of items required for one form. Items must be
distributed according to the blueprint weightings, but note that in the final
exam there may not be enough questions to cover every competency. The exam is
really a sample, rather than comprehensive coverage of all competencies.
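Continuing the hypothetical blueprint sketch above, the pool-sizing rule of thumb is easy to put in numbers; the three-times factor comes from the text, everything else is invented:

```python
# Continuing the hypothetical blueprint example: size the beta item pool at
# roughly three times the items needed for one form, spread per the blueprint.
POOL_FACTOR = 3

pool_targets = {
    branch: POOL_FACTOR * alloc["questions"]
    for branch, alloc in blueprint.items()
}
print(pool_targets)                          # items to write per competency branch
print(sum(pool_targets.values()), "items in the beta pool")
```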
Once the item pool is complete, it
is offered for beta testing. Volunteer candidates are invited to take the exam to
see how it performs. There are three parts to the beta test: first, a
demographic survey (to determine the candidates' self-assessed level of skill in the product);
second, answering all items in the complete item pool; and third, comments and
feedback.
At the completion of the beta test, the results are analyzed. Candidates are divided
into groups according to their results and demographics. We expect those who
report more experience in the product to receive better marks than those with
little or no experience.
Individual items are examined statistically to see how they performed across the beta
group. The simplest statistical measure is called p, the percentage of
candidates who answered the item correctly.
The second measure is called the point-biserial. It's a kind of correlation
coefficient. Items that tend to be answered correctly by the high-scoring
candidates and incorrectly by low-scoring candidates have higher positive
values. These items discriminate "good" from "bad" candidates. Those with a
flat distribution don't discriminate between the groups and have low or zero
point-biserial values. Sometimes an item may have a negative value, indicating
that the low-scoring candidates answered correctly while the high scorers did
not.
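The first two measures are straightforward to compute from a matrix of beta responses. A minimal sketch (the response data is invented, and the point-biserial here is the uncorrected item-total correlation; real analyses may exclude the item from the total):

```python
import numpy as np

# Rows = candidates, columns = items; 1 = correct, 0 = incorrect (invented data).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

totals = responses.sum(axis=1)       # each candidate's raw score
p = responses.mean(axis=0)           # proportion answering each item correctly

# Point-biserial: Pearson correlation between the 0/1 item column and the total.
point_biserial = np.array([
    np.corrcoef(responses[:, i], totals)[0, 1]
    for i in range(responses.shape[1])
])

for i, (pi, rpb) in enumerate(zip(p, point_biserial)):
    print(f"item {i}: p = {pi:.2f}, point-biserial = {rpb:+.2f}")
```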
The third measure is called r. It measures a confidence interval for the
distribution of the item answers.
Items that fall outside threshold values for these three measures are dropped from the
pool. For example, items with p < 25% are considered "too hard" (or perhaps
the expected answer is incorrect) and those with p > 90% are considered "too
easy". Similarly, items with a low or negative point-biserial, or a low r, are
rejected. Any remaining items over the target requirements are then
considered: better-performing items are retained, subject to maintaining the
target weightings from the exam blueprint.
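Continuing the previous sketch, the screening step reduces to a filter. The p bounds come from the text; the point-biserial floor is an invented placeholder, and the r check is omitted:

```python
# Keep only items within the thresholds described above (continuing the
# previous sketch).  The 0.15 point-biserial floor is a made-up placeholder.
P_MIN, P_MAX, RPB_MIN = 0.25, 0.90, 0.15

surviving = [
    i for i in range(len(p))
    if P_MIN <= p[i] <= P_MAX and point_biserial[i] >= RPB_MIN
]
print("items kept for the final pool:", surviving)
```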
For the OpenVMS exams, most items passed the validity tests, so the rejection rate was
surprisingly low. As a result, numerous items that were well within acceptable
performance levels had to be rejected simply because they were surplus to
requirements. Many are included in the Exam Preparation Guides (EPGs) as sample/practice
items. EPGs are available from the HP certification web site:
http://www.hp.com/certification/
Once the final item pool has been selected, statisticians distribute items among the
exam forms, so that they can show statistically that a given candidate is
expected to obtain the same mark for each exam form.
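The article doesn't spell out the assembly algorithm; as a purely illustrative way to picture the goal, a greedy split can keep each form's average difficulty close (item IDs and p values below are invented, and real form assembly also balances blueprint coverage and other statistics):

```python
# Illustrative only: greedily split items into two forms so each form's mean
# p (difficulty) stays close.  Item IDs and p values are invented.
items = [("Q1", 0.55), ("Q2", 0.80), ("Q3", 0.35),
         ("Q4", 0.60), ("Q5", 0.45), ("Q6", 0.70)]

forms = {"A": [], "B": []}
# Easiest (highest p) first; give each item to the form with the lower p total.
for item_id, difficulty in sorted(items, key=lambda it: it[1], reverse=True):
    target = min(forms, key=lambda f: sum(d for _, d in forms[f]))
    forms[target].append((item_id, difficulty))

for name, assigned in forms.items():
    mean_p = sum(d for _, d in assigned) / len(assigned)
    print(name, [i for i, _ in assigned], f"mean p = {mean_p:.2f}")
```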
The exam team then examines the results, looking in particular at the candidates in the
middle of the sample. Using both the results and the demographics of the
candidates, a pass mark is selected.
The exam is then released.
More jargon! An item consists of the question (known as the stem) and some number of
potential answers (choices), one of which is correct; the others are
distracters. In practice, generating the stem and the correct answer is fairly
easy. Finding good distracters is the hard part.
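In code terms an item is just a stem, a key, and its distracters; a minimal, purely hypothetical representation (the sample question below is invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Item:
    """One exam item: a stem, the correct choice (key), and the distracters."""
    stem: str
    key: str
    distracters: list[str]

# Invented sample, not drawn from a real exam:
sample = Item(
    stem="Which DCL command submits a batch job to a queue?",
    key="SUBMIT",
    distracters=["PRINT", "RUN", "SPAWN"],
)
```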
The science of psychometrics gives us some clues to help generate good items. Here are some
examples of rejected items that (negatively) demonstrate some of the factors in
creating good exam items.
Some other psychometric guidelines include:
- Avoiding negatives in the stem, for example, "Which command is not..."
- Avoiding culture-specific terminology (slang words, jargon, references to holidays, Latin
abbreviations such as i.e., via, sic, status quo, bona fide, and et al., and culture-specific geography)
- Avoiding acronyms and abbreviations
Other item formats that have proven to perform poorly are:
- True/false questions
- Choices "all of the above" and "none of the above"
- Any choice that includes references to other choices like "A and C"
The item-writing team includes a psychometrician who ensures
that all items conform to the writing standards. The team also includes members from
around the world, which helps keep items as culture-neutral as
possible (and the beta analysis includes checks to detect any
cultural bias in the results).
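Some of those guidelines are mechanical enough to lint for automatically. A hypothetical sketch (not HP's actual tooling; real reviews are done by people):

```python
import re

def review_item(stem: str, choices: list[str]) -> list[str]:
    """Flag a few mechanical violations of the guidelines above (hypothetical)."""
    problems = []
    if re.search(r"\bnot\b", stem, re.IGNORECASE):
        problems.append("negative wording in the stem")
    for choice in choices:
        if choice.lower() in ("all of the above", "none of the above"):
            problems.append(f"banned choice: {choice!r}")
        if re.search(r"\b[A-D] and [A-D]\b", choice):
            problems.append(f"choice references other choices: {choice!r}")
    return problems

print(review_item("Which command is not valid?",
                  ["SUBMIT", "All of the above", "A and C"]))
```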