This vignette uses measures of agreement, in particular the Fleiss-Cohen kappa and the Goodman-Kruskal lambda, to explore the adequacy of a proposed cognitively diagnostic assessment.
This assessment is based on an example found in [@Mislevy1995]. It is a test of language arts in which there are four constructs being measured: Reading, Writing, Speaking and Listening. Each variable can take on the possible values Advanced, Intermediate or Novice.
There are four kinds of tasks: a form of the test has 5 Reading, 5 Listening, 3 Writing and 3 Speaking tasks. Therefore, the Q matrix looks like:
Task | Reading | Writing | Speaking | Listening |
---|---|---|---|---|
R1 | 1 | 0 | 0 | 0 |
R2 | 1 | 0 | 0 | 0 |
R3 | 1 | 0 | 0 | 0 |
R4 | 1 | 0 | 0 | 0 |
R5 | 1 | 0 | 0 | 0 |
W6 | 1 | 1 | 0 | 0 |
W7 | 1 | 1 | 0 | 0 |
W8 | 1 | 1 | 0 | 0 |
L9 | 0 | 0 | 0 | 1 |
L10 | 0 | 0 | 0 | 1 |
L11 | 0 | 0 | 0 | 1 |
L12 | 0 | 0 | 0 | 1 |
L13 | 0 | 0 | 0 | 1 |
S14 | 1 | 0 | 1 | 1 |
S15 | 1 | 0 | 1 | 1 |
S16 | 1 | 0 | 1 | 1 |
Assuming that the parameters of the item models are known, it is straightforward to simulate data from the assessment. This kind of simulation provides information about the adequacy of the data collection for classifying the students.
The simulation procedure is as follows:

1. Generate random proficiency profiles using the proficiency model.
2. Generate random item responses using the item (evidence) models.
3. Score the assessment. There are two variations:
   - Modal (MAP) scores: the score assigns a category to each of the four constructs. (These are the columns named “mode.Reading” and similar.)
   - Expected (probability) scores: the score assigns a probability of being Novice, Intermediate or Advanced to each individual. (These are the columns named “Reading.Novice”, “Reading.Intermediate”, and “Reading.Advanced”.)
The simulation study itself can be found in `vignette("SimulationStudies", package="RNetica")`. The results are saved in this package in the data set `language16`.
The simulation data has columns containing the “true” value for the four proficiencies (“Reading”, “Writing”, “Speaking”, and “Listening”) and values for the estimates (“mode.Reading”, “mode.Writing”, &c.). The cross-tabulation of a true value and its corresponding estimate is known as a confusion matrix. This is a matrix $\mathbf{A}$, where $a_{km}$ is the count of the number of simulated cases for which the first variable (simulated truth) is $k$ and the second variable (MAP estimate) is $m$. This can be built in R using the `table()` function.[^2]
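For instance, the confusion matrix for Reading can be built directly from the simulated and MAP columns. A minimal sketch, assuming the `language16` data has been loaded (`lvls` and `cm.Reading` are just illustrative names):

```r
## Order the categories from lowest to highest before tabulating.
lvls <- c("Novice", "Intermediate", "Advanced")
cm.Reading <- table(Simulated = factor(language16$Reading, lvls),
                    Estimated = factor(language16$mode.Reading, lvls))
cm.Reading
```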
Confusion matrix for Reading (MAP scores):

Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 216 | 46 | 0 |
Intermediate | 37 | 398 | 45 |
Advanced | 0 | 41 | 217 |
Confusion matrix for Writing (MAP scores):

Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 134 | 98 | 5 |
Intermediate | 43 | 375 | 91 |
Advanced | 14 | 146 | 94 |
Confusion matrix for Speaking (MAP scores):

Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 130 | 91 | 1 |
Intermediate | 65 | 390 | 78 |
Advanced | 2 | 76 | 167 |
Confusion matrix for Listening (MAP scores):

Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 232 | 14 | 0 |
Intermediate | 29 | 437 | 46 |
Advanced | 0 | 71 | 171 |
The Bayes net scoring engine (like many scoring models) expresses its uncertainty about the abilities of the students by estimating the probability that the student is in each category. These are often called the marginal probabilities because they are one margin of the joint distribution over all four variables. The table below shows the “true” (simulated) reading ability and estimated marginal probabilities for the first five simulees.
```r
library(kableExtra)  # provides kbl() and kable_classic()
kbl(language16[1:5, c("Reading", "Reading.Novice",
                      "Reading.Intermediate", "Reading.Advanced")],
    caption = "Reading data from first five simulees.",
    digits = 3) |>
  kable_classic()
```
Reading | Reading.Novice | Reading.Intermediate | Reading.Advanced |
---|---|---|---|
Intermediate | 0.565 | 0.435 | 0.000 |
Advanced | 0.000 | 0.069 | 0.931 |
Novice | 0.954 | 0.046 | 0.000 |
Novice | 0.703 | 0.297 | 0.000 |
Advanced | 0.000 | 0.521 | 0.479 |
The expected value of the confusion matrix, $\overline{\mathbf{A}}$, is calculated as follows: let $X_i$ be the value of the first (simulated truth) variable for the $i$th simulee, and let $p_{im} = P(\hat{X}_i = m | e)$ be the estimated probability that the estimate for the $i$th simulee is $m$. Then $\overline{a}_{km} = \sum_{i: X_i=k} p_{im}$.
In the running example, the first row (Novice) of $\overline{\mathbf{A}}$ is the sum of the probability rows for all simulees whose true score is Novice. The second row is the sum of the Intermediate rows, and the third row is the sum of the Advanced rows.
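As a minimal by-hand sketch of these sums for the first five simulees (reusing `lvls` from above; `pmat` is an illustrative name, and the row/column orientation of the result depends on the convention used):

```r
## Collect the three marginal probability columns for Reading.
pmat <- as.matrix(language16[1:5, paste("Reading", lvls, sep = ".")])
## Sum the probability rows within each simulated-truth group
## (rows of the result are indexed by simulated truth).
rowsum(pmat, group = factor(language16$Reading[1:5], lvls))
```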
The function `expTable()` does this work. Note that it expects the marginal probabilities to be in a number of columns named *var*.*state*, where *var* is the name of the variable and *state* is the name of the state. If the data uses a different naming convention, this can be expressed with the argument `pvecregex`, a regular expression in which the special symbols `<var>` and `<state>` are substituted with the variable name and the state name.
The table below shows the expected matrix from the first five rows of the reading data.
Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 1.658 | 0.565 | 0.000 |
Intermediate | 0.342 | 0.435 | 0.589 |
Advanced | 0.000 | 0.000 | 1.411 |
What follows are the expected confusion matrices for all four proficiency variables.
Expected confusion matrix for Reading:

Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 206.702 | 51.083 | 0.168 |
Intermediate | 55.238 | 372.955 | 58.196 |
Advanced | 0.060 | 55.962 | 199.636 |
Expected confusion matrix for Writing:

Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 125.296 | 85.976 | 27.997 |
Intermediate | 86.891 | 284.228 | 134.581 |
Advanced | 24.812 | 138.796 | 91.421 |
Expected confusion matrix for Speaking:

Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 123.018 | 113.348 | 10.582 |
Intermediate | 91.180 | 317.257 | 91.247 |
Advanced | 7.802 | 102.395 | 143.171 |
Expected confusion matrix for Listening:

Simulated / Estimated | Novice | Intermediate | Advanced |
---|---|---|---|
Novice | 208.522 | 34.682 | 0.200 |
Intermediate | 36.774 | 385.996 | 79.082 |
Advanced | 0.704 | 91.321 | 162.718 |
The sum of the diagonal of the confusion matrix, $\sum_k a_{kk}$, gives a count of how many cases are exact agreements (in this case, between the simulation and the estimation). Let $N = \sum_k \sum_m a_{km}$; then the agreement rate is $\sum_k a_{kk}/N$. For the reading data using the MAP scores, this is 831 out of 1000, so over 80% agreement. The function `accuracy()` calculates the agreement rate.
Variable | MAP | EAP |
---|---|---|
Reading | 0.831 | 0.779 |
Writing | 0.603 | 0.501 |
Speaking | 0.687 | 0.583 |
Listening | 0.840 | 0.757 |
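For example, the agreement rate for the Reading MAP table built above is just the diagonal sum over the total; a minimal sketch:

```r
## Raw agreement for Reading: 831 exact agreements out of 1000 simulees.
sum(diag(cm.Reading)) / sum(cm.Reading)
## accuracy() performs the same calculation.
accuracy(cm.Reading)
```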
Raw agreement can be easy to achieve if there is not much variability in the population. For example, if 80% of the target population were Intermediate, a classifier that simply classified each respondent as Intermediate would achieve 80% accuracy, and one that guessed Intermediate randomly 80% of the time would achieve at least 64% accuracy. For that reason, two adjusted agreement rates, lambda and kappa, are often used.
If the test were not available, the only strategy would be a one-size-fits-all one, assuming that all subjects are at the same level of the variable. The best such strategy is to treat all subjects as if they were in the modal (most likely) category. The marginal distribution for the variable is found from the row sums of $\mathbf{A}$ (or $\overline{\mathbf{A}}$), $a_{k+} = \sum_m a_{km}$, and the best that can be done with the one-size-fits-all strategy is $\max_k a_{k+}$.
Goodman and Kruskal (1952) suggest that the agreement rate be adjusted by subtracting out the best that can be done by mapping everybody to a single value. So they propose
$$\lambda = \frac{\sum_{k} a_{kk} - \max_{k} a_{k+}}{N - \max_{k} a_{k+}} \ .$$
This ranges from -1 to 1, with 0 representing doing no better than always guessing the most probable category and 1 indicating perfect agreement. (Negative values indicate doing worse than just guessing the mode.)
The function `gkLambda()` will do this calculation. Here are the values for the language test using both the MAP and expected (EAP) confusion matrices.
Variable | MAP | EAP |
---|---|---|
Reading | 0.675 | 0.570 |
Writing | 0.191 | -0.010 |
Speaking | 0.330 | 0.167 |
Listening | 0.672 | 0.513 |
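To make the adjustment concrete, here is the lambda calculation by hand for the Reading MAP table from above (a sketch; `gkLambda(cm.Reading)` should give the same answer):

```r
N <- sum(cm.Reading)
## The best one-size-fits-all strategy gets max(rowSums) cases right.
best.guess <- max(rowSums(cm.Reading))
(sum(diag(cm.Reading)) - best.guess) / (N - best.guess)  # about 0.675
```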
Jacob Cohen (see Fleiss, Levin & Paik, 2003) took a different approach, which treats the two assignments of cases to categories more symmetrically. The idea is that these are two raters, and the goal is to judge the extent of their agreement. The baseline here is to imagine two raters who assign categories randomly with probabilities $a_{k+}/N$ and $a_{+k}/N$. Then the expected number of random agreements is $\sum_k a_{k+}a_{+k}/N$. So the agreement measure adjusted for random agreement is:
$$ \kappa = \frac{\sum_{k} a_{kk} - \sum_{k}a_{k+}a_{+k}/N}{N-\sum_{k}a_{k+}a_{+k}/N} \ .$$
Again, this runs from -1 to 1. The function `fcKappa()` calculates Cohen’s kappa.
Variable | MAP | EAP |
---|---|---|
Reading | 0.733 | 0.651 |
Writing | 0.329 | 0.197 |
Speaking | 0.478 | 0.325 |
Listening | 0.740 | 0.609 |
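The by-hand version for the Reading MAP table (a sketch; `fcKappa(cm.Reading)` should agree):

```r
## Expected chance agreements from the product of the margins.
chance <- sum(rowSums(cm.Reading) * colSums(cm.Reading)) / sum(cm.Reading)
(sum(diag(cm.Reading)) - chance) / (sum(cm.Reading) - chance)  # about 0.733
```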
Kappa, lambda and raw agreement all assume that any misclassification is equally bad. Fleiss suggests adding weights, where $0 \le w_{km} \le 1$ is the desirability of classifying a subject who is $k$ as $m$. In this case, the weighted agreement is $\sum_k \sum_m w_{km} a_{km}/N$. The weighted versions of lambda and kappa are given by:
$$\lambda = \frac{\sum_{k}\sum_{m} w_{km} a_{km} - \max_{m} \sum_{k} w_{km} a_{k+}}{N - \max_{m} \sum_{k} w_{km} a_{k+}} \ ;$$
$$ \kappa = \frac{\sum_{k}\sum_{m} w_{km} a_{km} - \sum_{k}\sum_{m} w_{km} a_{k+}a_{+m}/N}{N - \sum_{k}\sum_{m} w_{km} a_{k+}a_{+m}/N} \ .$$
There are three commonly used cases: identity weights (“None”), “Linear” weights, and “Quadratic” weights; a sketch of the latter two follows below. Both linear and quadratic weights have penalties that increase with the number of categories of difference, so an off-by-one error has a lower penalty than an off-by-two error.
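For ordered categories, one common convention takes $w_{km} = 1 - (|k - m|/(K-1))^q$, with $q = 1$ for linear and $q = 2$ for quadratic weights. A sketch for $K = 3$ categories (check the package documentation for the exact weights it uses):

```r
## Agreement weights for K = 3 ordered categories: 1 on the
## diagonal, decreasing with distance from it.
K <- 3
d <- abs(outer(1:K, 1:K, "-"))
1 - d / (K - 1)        # linear weights
1 - (d / (K - 1))^2    # quadratic weights
```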
The `accuracy()`, `gkLambda()` and `fcKappa()` functions all have a `weights` argument, where no weights (“None”, the default), “Linear” or “Quadratic” weights can be selected, and a `w` argument, where a custom weight matrix can be entered.
```r
## cm is a named list of the four MAP confusion matrices,
## one per proficiency variable.
wacc.tab <- data.frame(
  None = sapply(cm, accuracy, weights = "None"),
  Linear = sapply(cm, accuracy, weights = "Linear"),
  Quadratic = sapply(cm, accuracy, weights = "Quadratic"))
wlambda.tab <- data.frame(
  None = sapply(cm, gkLambda, weights = "None"),
  Linear = sapply(cm, gkLambda, weights = "Linear"),
  Quadratic = sapply(cm, gkLambda, weights = "Quadratic"))
wkappa.tab <- data.frame(
  None = sapply(cm, fcKappa, weights = "None"),
  Linear = sapply(cm, fcKappa, weights = "Linear"),
  Quadratic = sapply(cm, fcKappa, weights = "Quadratic"))
```
Weighted agreement rates (accuracy):

Variable | None | Linear | Quadratic |
---|---|---|---|
Reading | 0.831 | 0.915 | 0.958 |
Writing | 0.603 | 0.792 | 0.886 |
Speaking | 0.687 | 0.842 | 0.919 |
Listening | 0.840 | 0.920 | 0.960 |
Weighted Goodman-Kruskal lambda:

Variable | None | Linear | Quadratic |
---|---|---|---|
Reading | 0.675 | 0.675 | 0.675 |
Writing | 0.191 | 0.153 | 0.075 |
Speaking | 0.330 | 0.323 | 0.310 |
Listening | 0.672 | 0.672 | 0.672 |
Weighted Fleiss-Cohen kappa:

Variable | None | Linear | Quadratic |
---|---|---|---|
Reading | 0.733 | 0.780 | 0.837 |
Writing | 0.329 | 0.393 | 0.479 |
Speaking | 0.478 | 0.550 | 0.645 |
Listening | 0.740 | 0.782 | 0.834 |