Measures of Agreement

library(CPTtools)
library(kableExtra)

This vignette uses measures of agreement, in particular the Fleiss-Cohen kappa and the Goodman-Kruskal lambda, to explore the adequacy of a proposed cognitively diagnostic assessment.

Assessment Design

This assessment is based on an example found in [@Mislevy1995]. It is a test of language arts in which there are four constructs being measured: Reading, Writing, Speaking, and Listening. Each variable can take on the possible values Advanced, Intermediate, or Novice.

There are four kinds of tasks:

Reading
The subject reads a short segment of text and then answers a selected-response question.
Writing
An integrated Reading/Writing task in which the subject provides a short written response based on a reading passage.
Listening
The subject listens to a prompt followed by a multiple-choice question in which the options (and the instructions) are also spoken.
Speaking
The subject responds verbally (the response is recorded) after listening to an audio stimulus. The instructions are written.

Form Design

There are 5 Reading, 5 Listening, 3 Writing and 3 Speaking questions on a form of the test. Therefore, the Q Matrix looks like:

Q-matrix for 16 item test
     Reading Writing Speaking Listening
R1         1       0        0         0
R2         1       0        0         0
R3         1       0        0         0
R4         1       0        0         0
R5         1       0        0         0
W6         1       1        0         0
W7         1       1        0         0
W8         1       1        0         0
L9         0       0        0         1
L10        0       0        0         1
L11        0       0        0         1
L12        0       0        0         1
L13        0       0        0         1
S14        1       0        1         1
S15        1       0        1         1
S16        1       0        1         1

Simulation Experiment

Assuming that the parameters of the item models are known, it is straightforward to simulate data from the assessment. This kind of simulation provides information about the adequacy of the data collection design for classifying the students.

The simulation procedure is as follows:

  1. Generate random proficiency profiles using the proficiency model.

  2. Generate random item responses using the item (evidence) models.

  3. Score the assessment. There are two variations:

     a. Modal (MAP) scores: the score assigns a category to each of the four constructs. (These are the columns named “mode.Reading” and similar.)

     b. Expected (probability) scores: the score assigns a probability for each of the three levels to each individual. (These are the columns named “Reading.Novice”, “Reading.Intermediate”, and “Reading.Advanced”.)

The simulation study itself can be found in vignette("SimulationStudies", package="RNetica").
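For intuition, the sketch below runs a toy version of these three steps in base R for a single proficiency and five dichotomous items. The population proportions and the probabilities of a correct response are invented for illustration; they are not the Bayes net models used in the actual study, and the number-right cut score merely stands in for MAP scoring.

set.seed(42)
levels3 <- c("Novice", "Intermediate", "Advanced")
## Step 1: random proficiency profiles from an assumed marginal distribution.
theta <- factor(sample(levels3, 10, replace = TRUE,
                       prob = c(0.25, 0.50, 0.25)),
                levels = levels3)
## Step 2: random item responses; assumed probability correct rises with level.
pright <- c(Novice = 0.3, Intermediate = 0.6,
            Advanced = 0.85)[as.character(theta)]
resp <- sapply(pright, function(p) rbinom(5, 1, p))  # 5 items per simulee
## Step 3: classify by number right (a crude stand-in for MAP scoring).
est <- cut(colSums(resp), breaks = c(-1, 1.5, 3.5, 5), labels = levels3)
table(Simulated = theta, Estimated = est)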

The results are saved in this package in the data set language16.

data(language16)

Building the Confusion Matrix

The simulation data has columns containing the “true” (simulated) value for each of the four proficiencies (“Reading”, “Writing”, “Speaking”, and “Listening”) and columns containing the estimates (“mode.Reading”, “mode.Writing”, &c.). The cross-tabulation of a true value and its corresponding estimate is known as a confusion matrix. This is a matrix $\mathbf{A}$, where $a_{km}$ is the count of the number of simulated cases for which the first variable (simulated truth) takes value $k$ and the second variable (MAP estimate) takes value $m$. It can be built in R using the table() function.[^2]

cm <- list()
cm$Reading <- table(language16[,c("Reading","mode.Reading")])
Reading confusion matrix for 16 item test.
               Estimated
Simulated      Novice Intermediate Advanced
  Novice          216           46        0
  Intermediate     37          398       45
  Advanced          0           41      217
cm$Writing <- table(language16[,c("Writing","mode.Writing")])
Writing confusion matrix for 16 item test.
               Estimated
Simulated      Novice Intermediate Advanced
  Novice          134           98        5
  Intermediate     43          375       91
  Advanced         14          146       94
cm$Speaking <- table(language16[,c("Speaking","mode.Speaking")])
Speaking confusion matrix for 16 item test.
               Estimated
Simulated      Novice Intermediate Advanced
  Novice          130           91        1
  Intermediate     65          390       78
  Advanced          2           76      167
cm$Listening <- table(language16[,c("Listening","mode.Listening")])
Listening confusion matrix for 16 item test.
               Estimated
Simulated      Novice Intermediate Advanced
  Novice          232           14        0
  Intermediate     29          437       46
  Advanced          0           71      171

The Expected Confusion Matrix

The Bayes net scoring engine (like many scoring models) expresses its uncertainty about the abilities of the students by estimating the probability that the student is in each category. These are often called the marginal probabilities because they are one margin of the joint distribution over all four variables. The table below shows the “true” (simulated) reading ability and estimated marginal probabilities for the first five simulees.

kbl(language16[1:5,c("Reading","Reading.Novice",
                     "Reading.Intermediate","Reading.Advanced")],
     caption="Reading data from first five simulees.",
     digits=3) |>
  kable_classic()
Reading data from first five simulees.
Reading      Reading.Novice Reading.Intermediate Reading.Advanced
Intermediate          0.565                0.435            0.000
Advanced              0.000                0.069            0.931
Novice                0.954                0.046            0.000
Novice                0.703                0.297            0.000
Advanced              0.000                0.521            0.479

The expected value of the confusion matrix, $\overline{\mathbf{A}}$, is calculated as follows: let $X_i$ be the value of the first (simulated truth) variable for the $i$th simulee, and let $p_{im} = P(\hat{X}_i = m \mid e)$ be the estimated probability that the $i$th simulee is in category $m$. Then $\overline{a}_{km} = \sum_{i: X_i=k} p_{im}$.

In the running example, the Novice row of $\overline{\mathbf{A}}$ is the sum of the probability rows for which the true score is Novice; the Intermediate row is the sum of the Intermediate rows, and the Advanced row the sum of the Advanced rows.
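The definition can be checked directly in base R; this sketch assumes the marginal-probability column names shown above.

## Sum the probability rows within each simulated (true) Reading category;
## the rows of the result are the true categories, as in the definition.
probs <- language16[1:5, c("Reading.Novice", "Reading.Intermediate",
                           "Reading.Advanced")]
rowsum(probs, language16$Reading[1:5])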

The function expTable() does this work. Note that it expects the marginal probabilities to be in a collection of columns named var.state, where var is the name of the variable and state is the name of the state. If the data use a different naming convention, it can be described with the argument pvecregex, a regular expression in which the special symbols <var> and <state> are substituted with the variable name and the state name, respectively.
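For example, if the probability columns instead followed a hypothetical naming convention such as “P.Reading.Novice”, the pattern might be written as below. This is an illustration of the <var> and <state> placeholders, not a tested call; the exact pattern needed depends on the data.

## Hypothetical column naming "P.var.state" (not the convention in language16):
# expTable(simdata, "Reading", "Reading",
#          pvecregex = "P\\.<var>\\.<state>")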

The table below shows the expected matrix built from the first five rows of the reading data. Note that in the printed output the simulated truth labels the columns and the estimated categories label the rows; that is, the printed table is the transpose of $\overline{\mathbf{A}}$ as defined above.

reading5 <- expTable(language16[1:5,],"Reading","Reading")
Expected reading confusion matrix using the first five simulees.
               Simulated
Estimated      Novice Intermediate Advanced
  Novice        1.658        0.565    0.000
  Intermediate  0.342        0.435    0.589
  Advanced      0.000        0.000    1.411

What follows are the expected confusion matrices for all four proficiency variables.

em <- list()
em$Reading <- expTable(language16,"Reading","Reading")
Reading expected confusion matrix for 16 item test.
                Simulated
Estimated       Novice Intermediate Advanced
  Novice       206.702       51.083    0.168
  Intermediate  55.238      372.955   58.196
  Advanced       0.060       55.962  199.636
em$Writing <- expTable(language16,"Writing","Writing")
Writing expected confusion matrix for 16 item test.
                Simulated
Estimated       Novice Intermediate Advanced
  Novice       125.296       85.976   27.997
  Intermediate  86.891      284.228  134.581
  Advanced      24.812      138.796   91.421
em$Speaking <- expTable(language16,"Speaking","Speaking")
Speaking expected confusion matrix for 16 item test.
                Simulated
Estimated       Novice Intermediate Advanced
  Novice       123.018      113.348   10.582
  Intermediate  91.180      317.257   91.247
  Advanced       7.802      102.395  143.171
em$Listening <- expTable(language16,"Listening","Listening")
Listening expected confusion matrix for 16 item test.
                Simulated
Estimated       Novice Intermediate Advanced
  Novice       208.522       34.682    0.200
  Intermediate  36.774      385.996   79.082
  Advanced       0.704       91.321  162.718

Measures of Agreement

The sum of the diagonal of the confusion matrix, $\sum_k a_{kk}$, gives a count of how many cases are exact agreements (in this case, between the simulation and the estimation). Let $N = \sum_{km} a_{km}$; then the agreement rate is $\sum_k a_{kk}/N$. For the reading data using the MAP scores, this is 831 out of 1000, so over 80% agreement. The function accuracy() calculates the agreement rate.
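The rate for reading can be checked by hand from the confusion matrix built above: the exact agreements lie on the diagonal, and $N$ is the grand total.

## Agreement rate: diagonal count over total count.
sum(diag(cm$Reading)) / sum(cm$Reading)  # 831/1000 = 0.831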

acc.tab <- data.frame(MAP=sapply(cm,accuracy),
                      EAP=sapply(em,accuracy))
Agreement for MAP (modal classification) and EAP (expected confusion matrix).
            MAP   EAP
Reading   0.831 0.779
Writing   0.603 0.501
Speaking  0.687 0.583
Listening 0.840 0.757

Raw agreement can be easy to achieve if there is not much variability in the population. For example, if 80% of the target population was intermediate, a classifier that simply classified each respondent as intermediate would achieve 80% accuracy, and one that guessed intermediate randomly 80% of the time would achieve at least 64% accuracy. For that reason, two adjusted agreement rates, lambda and kappa, are often used.

Goodman and Kruskal Lambda

If the test were not available, the only strategy would be a one-size-fits-all one: assume that all subjects are at the same level of the variable. The best such strategy is to treat all subjects as if they were in the modal (most likely) category. The marginal distribution for the variable is found from the row sums of $\mathbf{A}$ (or $\overline{\mathbf{A}}$), $a_{k+} = \sum_m a_{km}$, and the best that can be done with the one-size-fits-all strategy is $\max_k a_{k+}$.

Goodman and Kruskal (1952) suggest that the agreement rate be adjusted by subtracting out the best that can be done by mapping everybody to a single value. So they propose

$$\lambda = \frac{\sum_{k} a_{kk} - \max_{k} a_{k+}}{N - \max_{k} a_{k+}} \ .$$

This ranges from -1 to 1, with 0 representing doing no better than guessing the most probable category and 1 indicating perfect agreement. (Negative values indicate doing worse than just guessing the mode.)
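For instance, lambda for the reading MAP table can be computed directly from the formula; since the rows of cm$Reading are the simulated truth, rowSums() gives the margins $a_{k+}$.

A <- cm$Reading
## Baseline: classify everyone into the modal true category.
(sum(diag(A)) - max(rowSums(A))) / (sum(A) - max(rowSums(A)))  # 0.675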

The function gkLambda() will do this calculation. Here are the values for the language test using both the MAP and expected agreements.

lambda.tab <- data.frame(MAP=sapply(cm,gkLambda),
                         EAP=sapply(em,gkLambda))
Lambda for MAP (modal classification) and EAP (expected confusion matrix).
            MAP    EAP
Reading   0.675  0.570
Writing   0.191 -0.010
Speaking  0.330  0.167
Listening 0.672  0.513

Fleiss-Cohen Kappa

Jacob Cohen (Fleiss, Levin & Paik, 2003) took a different approach, which treats the two assignments of cases to categories more symmetrically. The idea is that these are two raters, and the goal is to judge the extent of their agreement. The baseline here is to imagine two raters who assign categories at random, one with probabilities $a_{k+}/N$ and the other with probabilities $a_{+k}/N$. Then the expected number of chance agreements is $\sum_k a_{k+}a_{+k}/N$, so the agreement measure adjusted for chance agreement is:

$$ \kappa = \frac{\sum_{k} a_{kk} - \sum_{k}a_{k+}a_{+k}/N}{N-\sum_{k}a_{k+}a_{+k}/N} \ .$$

Again this runs from -1 to 1. The function fcKappa() calculates Cohen’s kappa.
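As a check, kappa for the reading MAP table can be computed directly from the margins.

A <- cm$Reading
## Expected chance agreements from the products of the row and column margins.
chance <- sum(rowSums(A) * colSums(A)) / sum(A)
(sum(diag(A)) - chance) / (sum(A) - chance)  # approximately 0.733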

kappa.tab <- data.frame(MAP=sapply(cm,fcKappa),
                         EAP=sapply(em,fcKappa))
Kappa for MAP (modal classification) and EAP (expected confusion matrix).
            MAP   EAP
Reading   0.733 0.651
Writing   0.329 0.197
Speaking  0.478 0.325
Listening 0.740 0.609

Weighted Versions

Kappa, lambda, and raw agreement all assume that any misclassification is equally bad. Fleiss suggests adding weights, where $1 \geq w_{km} \geq 0$ is the desirability of classifying a subject whose true category is $k$ into category $m$. In this case, the weighted agreement is $\sum_{km} w_{km} a_{km}/N$. The weighted versions of lambda and kappa are given by:

$$ \lambda = \frac{\sum_{k}\sum_{m} w_{km} a_{km} - \max_m \sum_k w_{km} a_{k+}}{N - \max_m \sum_k w_{km} a_{k+}} \ ;$$

$$ \kappa = \frac{\sum_k \sum_m w_{km} a_{km} - \sum_k \sum_m w_{km} a_{k+}a_{+m}/N}{N - \sum_k \sum_m w_{km} a_{k+}a_{+m}/N} \ .$$

There are three commonly used cases:

None
$w_{km} = 1$ if $k = m$, and 0 otherwise.
Linear
$w_{km} = 1 - |k - m|/(K - 1)$
Quadratic
$w_{km} = 1 - (k - m)^2/(K - 1)^2$

Both linear and quadratic weights have penalties that increase with the number of categories of difference, so an off-by-one misclassification incurs a smaller penalty than an off-by-two misclassification. (Here $K$ is the number of categories, three in this example.)
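For example, the linear weight matrix for $K = 3$ can be built with outer(), and the weighted agreement for reading recovered by hand; the result matches the “Linear” entry for reading in the accuracy table below.

K <- 3
## Linear weights: 1 on the diagonal, 1/2 for off-by-one, 0 for off-by-two.
w <- outer(1:K, 1:K, function(k, m) 1 - abs(k - m) / (K - 1))
sum(w * cm$Reading) / sum(cm$Reading)  # 0.9155, displayed as 0.915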

The accuracy(), gkLambda(), and fcKappa() functions all have a weights argument, where no weights (“None”, the default), “Linear”, or “Quadratic” weights can be selected, and a w argument where a custom weight matrix can be entered.

wacc.tab <- data.frame(
  None = sapply(cm,accuracy,weights="None"),
  Linear = sapply(cm,accuracy,weights="Linear"),
  Quadratic = sapply(cm,accuracy,weights="Quadratic"))
wlambda.tab <- data.frame(
  None = sapply(cm,gkLambda,weights="None"),
  Linear = sapply(cm,gkLambda,weights="Linear"),
  Quadratic = sapply(cm,gkLambda,weights="Quadratic"))
wkappa.tab <- data.frame(
  None = sapply(cm,fcKappa,weights="None"),
  Linear = sapply(cm,fcKappa,weights="Linear"),
  Quadratic = sapply(cm,fcKappa,weights="Quadratic")
)
Weighted and unweighted Accuracy
           None Linear Quadratic
Reading   0.831  0.915     0.958
Writing   0.603  0.792     0.886
Speaking  0.687  0.842     0.919
Listening 0.840  0.920     0.960

Weighted and unweighted Lambda
           None Linear Quadratic
Reading   0.675  0.675     0.675
Writing   0.191  0.153     0.075
Speaking  0.330  0.323     0.310
Listening 0.672  0.672     0.672

Weighted and unweighted Kappa
           None Linear Quadratic
Reading   0.733  0.780     0.837
Writing   0.329  0.393     0.479
Speaking  0.478  0.550     0.645
Listening 0.740  0.782     0.834

Exercise

The data set language24 has a simulation from a longer version of the test, with 24 items, 6 of each type.

Calculate the kappas and lambdas and compare to the shorter test.

data("language24")