Measures of Agreement

library(CPTtools)
library(kableExtra)

This vignette uses measures of agreement, in particular the Fleiss-Cohen kappa and the Goodman-Kruskal lambda, to explore the adequacy of a proposed cognitively diagnostic assessment.

Assessment Design

This assessment is based on an example found in [@Mislevy1995]. It is a test of language arts in which there are four constructs being measured: Reading, Writing, Speaking, and Listening. Each variable can take on the possible values Advanced, Intermediate, or Novice.

There are four kinds of tasks:

Reading
The subject reads a short segment of text and then answers a selected-response question.
Writing
An integrated Reading/Writing task in which the subject provides a short written response based on a reading passage.
Listening
The subject listens to a prompt followed by a multiple-choice question in which the options (and the instructions) are also spoken.
Speaking
The subject responds verbally (the response is recorded) after listening to an audio stimulus. The instructions are written.

Form Design

There are 5 Reading, 5 Listening, 3 Writing and 3 Speaking questions on a form of the test. Therefore, the Q Matrix looks like:

Q-matrix for 16 item test
     Reading Writing Speaking Listening
R1         1       0        0         0
R2         1       0        0         0
R3         1       0        0         0
R4         1       0        0         0
R5         1       0        0         0
W6         1       1        0         0
W7         1       1        0         0
W8         1       1        0         0
L9         0       0        0         1
L10        0       0        0         1
L11        0       0        0         1
L12        0       0        0         1
L13        0       0        0         1
S14        1       0        1         1
S15        1       0        1         1
S16        1       0        1         1

Simulation Experiment

Assuming that the parameters of the item models are known, it is straightforward to simulate data from the assessment. This kind of simulation provides information about the adequacy of the data collection design for classifying the students.

The simulation procedure is as follows:

  1. Generate random proficiency profiles using the proficiency model.

  2. Generate random item responses using the item (evidence) models.

  3. Score the assessment. There are two variations:

     a. Modal (MAP) scores: the score assigns a category to each of the four constructs. (These are the columns named “mode.Reading” and similar.)

     b. Expected (probability) scores: the score assigns a probability for each of the three levels to each individual. (These are the columns named “Reading.Novice”, “Reading.Intermediate”, and “Reading.Advanced”.)

The simulation study itself can be found in vignette("SimulationStudies", package="RNetica").
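For intuition, the sketch below runs a toy version of these three steps in base R for a single proficiency and five dichotomous items. The population proportions and the probabilities of a correct response are invented for illustration; they are not the Bayes net models used in the actual study, and the number-right cut score merely stands in for MAP scoring.

set.seed(42)
levels3 <- c("Novice", "Intermediate", "Advanced")
## Step 1: random proficiency profiles from an assumed marginal distribution.
theta <- factor(sample(levels3, 10, replace = TRUE,
                       prob = c(0.25, 0.50, 0.25)),
                levels = levels3)
## Step 2: random item responses; assumed probability correct rises with level.
pright <- c(Novice = 0.3, Intermediate = 0.6,
            Advanced = 0.85)[as.character(theta)]
resp <- sapply(pright, function(p) rbinom(5, 1, p))  # 5 items per simulee
## Step 3: classify by number right (a crude stand-in for MAP scoring).
est <- cut(colSums(resp), breaks = c(-1, 1.5, 3.5, 5), labels = levels3)
table(Simulated = theta, Estimated = est)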

The results are saved in this package in the data set language16.

data(language16)

Building the Confusion Matrix

The simulation data has columns containing the “true” (simulated) value for each of the four proficiencies (“Reading”, “Writing”, “Speaking”, and “Listening”) and columns containing the estimates (“mode.Reading”, “mode.Writing”, &c.). The cross-tabulation of a true value and its corresponding estimate is known as a confusion matrix. This is a matrix $\mathbf{A}$, where $a_{km}$ is the count of the number of simulated cases for which the first variable (simulated truth) takes value $k$ and the second variable (MAP estimate) takes value $m$. It can be built in R using the table() function.[^2]

cm <- list()
cm$Reading <- table(language16[,c("Reading","mode.Reading")])
Reading confusion matrix for 16 item test.
               Estimated
Simulated      Novice Intermediate Advanced
  Novice          216           46        0
  Intermediate     37          398       45
  Advanced          0           41      217
cm$Writing <- table(language16[,c("Writing","mode.Writing")])
Writing confusion matrix for 16 item test.
               Estimated
Simulated      Novice Intermediate Advanced
  Novice          134           98        5
  Intermediate     43          375       91
  Advanced         14          146       94
cm$Speaking <- table(language16[,c("Speaking","mode.Speaking")])
Speaking confusion matrix for 16 item test.
               Estimated
Simulated      Novice Intermediate Advanced
  Novice          130           91        1
  Intermediate     65          390       78
  Advanced          2           76      167
cm$Listening <- table(language16[,c("Listening","mode.Listening")])
Listening confusion matrix for 16 item test.
               Estimated
Simulated      Novice Intermediate Advanced
  Novice          232           14        0
  Intermediate     29          437       46
  Advanced          0           71      171

The Expected Confusion Matrix

The Bayes net scoring engine (like many scoring models) expresses its uncertainty about the abilities of the students by estimating the probability that the student is in each category. These are often called the marginal probabilities because they are one margin of the joint distribution over all four variables. The table below shows the “true” (simulated) reading ability and estimated marginal probabilities for the first five simulees.

kbl(language16[1:5,c("Reading","Reading.Novice",
                     "Reading.Intermediate","Reading.Advanced")],
     caption="Reading data from first five simulees.",
     digits=3) |>
  kable_classic()
Reading data from first five simulees.
Reading      Reading.Novice Reading.Intermediate Reading.Advanced
Intermediate          0.565                0.435            0.000
Advanced              0.000                0.069            0.931
Novice                0.954                0.046            0.000
Novice                0.703                0.297            0.000
Advanced              0.000                0.521            0.479

The expected value of the confusion matrix, $\overline{\mathbf{A}}$, is calculated as follows: let $X_i$ be the value of the first (simulated truth) variable for the $i$th simulee, and let $p_{im} = P(\hat{X}_i = m \mid e)$ be the estimated probability that the $i$th simulee is in category $m$. Then $\overline{a}_{km} = \sum_{i: X_i=k} p_{im}$.

In the running example, the Novice row of $\overline{\mathbf{A}}$ is the sum of the probability rows for which the true score is Novice; the Intermediate row is the sum of the Intermediate rows, and the Advanced row the sum of the Advanced rows.
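The definition can be checked directly in base R; this sketch assumes the marginal-probability column names shown above.

## Sum the probability rows within each simulated (true) Reading category;
## the rows of the result are the true categories, as in the definition.
probs <- language16[1:5, c("Reading.Novice", "Reading.Intermediate",
                           "Reading.Advanced")]
rowsum(probs, language16$Reading[1:5])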

The function expTable() does this work. Note that it expects the marginal probabilities to be in a collection of columns named var.state, where var is the name of the variable and state is the name of the state. If the data use a different naming convention, it can be described with the argument pvecregex, a regular expression in which the special symbols <var> and <state> are substituted with the variable name and the state name, respectively.
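For example, if the probability columns instead followed a hypothetical naming convention such as “P.Reading.Novice”, the pattern might be written as below. This is an illustration of the <var> and <state> placeholders, not a tested call; the exact pattern needed depends on the data.

## Hypothetical column naming "P.var.state" (not the convention in language16):
# expTable(simdata, "Reading", "Reading",
#          pvecregex = "P\\.<var>\\.<state>")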

The table below shows the expected matrix built from the first five rows of the reading data. Note that in the printed output the simulated truth labels the columns and the estimated categories label the rows; that is, the printed table is the transpose of $\overline{\mathbf{A}}$ as defined above.

reading5 <- expTable(language16[1:5,],"Reading","Reading")
Expected reading confusion matrix using the first five simulees.
               Simulated
Estimated      Novice Intermediate Advanced
  Novice        1.658        0.565    0.000
  Intermediate  0.342        0.435    0.589
  Advanced      0.000        0.000    1.411

What follows are the expected confusion matrices for all four proficiency variables.

em <- list()
em$Reading <- expTable(language16,"Reading","Reading")
Reading expected confusion matrix for 16 item test.
                Simulated
Estimated       Novice Intermediate Advanced
  Novice       206.702       51.083    0.168
  Intermediate  55.238      372.955   58.196
  Advanced       0.060       55.962  199.636
em$Writing <- expTable(language16,"Writing","Writing")
Writing expected confusion matrix for 16 item test.
                Simulated
Estimated       Novice Intermediate Advanced
  Novice       125.296       85.976   27.997
  Intermediate  86.891      284.228  134.581
  Advanced      24.812      138.796   91.421
em$Speaking <- expTable(language16,"Speaking","Speaking")
Speaking expected confusion matrix for 16 item test.
                Simulated
Estimated       Novice Intermediate Advanced
  Novice       123.018      113.348   10.582
  Intermediate  91.180      317.257   91.247
  Advanced       7.802      102.395  143.171
em$Listening <- expTable(language16,"Listening","Listening")
Listening expected confusion matrix for 16 item test.
                Simulated
Estimated       Novice Intermediate Advanced
  Novice       208.522       34.682    0.200
  Intermediate  36.774      385.996   79.082
  Advanced       0.704       91.321  162.718

Measures of Agreement

The sum of the diagonal of the confusion matrix, $\sum_k a_{kk}$, gives a count of how many cases are exact agreements (in this case, between the simulation and the estimation). Let $N = \sum_{km} a_{km}$; then the agreement rate is $\sum_k a_{kk}/N$. For the reading data using the MAP scores, this is 831 out of 1000, so over 80% agreement. The function accuracy() calculates the agreement rate.
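The rate for reading can be checked by hand from the confusion matrix built above: the exact agreements lie on the diagonal, and $N$ is the grand total.

## Agreement rate: diagonal count over total count.
sum(diag(cm$Reading)) / sum(cm$Reading)  # 831/1000 = 0.831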

acc.tab <- data.frame(MAP=sapply(cm,accuracy),
                      EAP=sapply(em,accuracy))
Agreement for MAP (modal classification) and EAP (expected confusion matrix).
            MAP   EAP
Reading   0.831 0.779
Writing   0.603 0.501
Speaking  0.687 0.583
Listening 0.840 0.757

Raw agreement can be easy to achieve if there is not much variability in the population. For example, if 80% of the target population was intermediate, a classifier that simply classified each respondent as intermediate would achieve 80% accuracy, and one that guessed intermediate randomly 80% of the time would achieve at least 64% accuracy. For that reason, two adjusted agreement rates, lambda and kappa, are often used.

Goodman and Kruskal Lambda

If the test were not available, the only strategy would be a one-size-fits-all one: assume that all subjects are at the same level of the variable. The best such strategy is to treat all subjects as if they were in the modal (most likely) category. The marginal distribution for the variable is found from the row sums of $\mathbf{A}$ (or $\overline{\mathbf{A}}$), $a_{k+} = \sum_m a_{km}$, and the best that can be done with the one-size-fits-all strategy is $\max_k a_{k+}$.

Goodman and Kruskal (1952) suggest that the agreement rate be adjusted by subtracting out the best that can be done by mapping everybody to a single value. So they propose

$$\lambda = \frac{\sum_{k} a_{kk} - \max_{k} a_{k+}}{N - \max_{k} a_{k+}} \ .$$

This ranges from -1 to 1, with 0 representing doing no better than guessing the most probable category and 1 indicating perfect agreement. (Negative values indicate doing worse than just guessing the mode.)
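For instance, lambda for the reading MAP table can be computed directly from the formula; since the rows of cm$Reading are the simulated truth, rowSums() gives the margins $a_{k+}$.

A <- cm$Reading
## Baseline: classify everyone into the modal true category.
(sum(diag(A)) - max(rowSums(A))) / (sum(A) - max(rowSums(A)))  # 0.675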

The function gkLambda() will do this calculation. Here are the values for the language test using both the MAP and expected agreements.

lambda.tab <- data.frame(MAP=sapply(cm,gkLambda),
                         EAP=sapply(em,gkLambda))
Lambda for MAP (modal classification) and EAP (expected confusion matrix).
            MAP    EAP
Reading   0.675  0.570
Writing   0.191 -0.010
Speaking  0.330  0.167
Listening 0.672  0.513

Fleiss-Cohen Kappa

Jacob Cohen (Fleiss, Levin & Paik, 2003) took a different approach, which treats the two assignments of cases to categories more symmetrically. The idea is that these are two raters, and the goal is to judge the extent of their agreement. The baseline here is to imagine two raters who assign categories at random, one with probabilities $a_{k+}/N$ and the other with probabilities $a_{+k}/N$. Then the expected number of chance agreements is $\sum_k a_{k+}a_{+k}/N$, so the agreement measure adjusted for chance agreement is:

$$ \kappa = \frac{\sum_{k} a_{kk} - \sum_{k}a_{k+}a_{+k}/N}{N-\sum_{k}a_{k+}a_{+k}/N} \ .$$

Again this runs from -1 to 1. The function fcKappa() calculates Cohen’s kappa.
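As a check, kappa for the reading MAP table can be computed directly from the margins.

A <- cm$Reading
## Expected chance agreements from the products of the row and column margins.
chance <- sum(rowSums(A) * colSums(A)) / sum(A)
(sum(diag(A)) - chance) / (sum(A) - chance)  # approximately 0.733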

kappa.tab <- data.frame(MAP=sapply(cm,fcKappa),
                         EAP=sapply(em,fcKappa))
Kappa for MAP (modal classification) and EAP (expected confusion matrix).
            MAP   EAP
Reading   0.733 0.651
Writing   0.329 0.197
Speaking  0.478 0.325
Listening 0.740 0.609

Weighted Versions

Kappa, lambda, and raw agreement all assume that any misclassification is equally bad. Fleiss suggests adding weights, where $1 \geq w_{km} \geq 0$ is the desirability of classifying a subject whose true category is $k$ into category $m$. In this case, the weighted agreement is $\sum_{km} w_{km} a_{km}/N$. The weighted versions of lambda and kappa are given by:

$$ \lambda = \frac{\sum_{k}\sum_{m} w_{km} a_{km} - \max_m \sum_k w_{km} a_{k+}}{N - \max_m \sum_k w_{km} a_{k+}} \ ;$$

$$ \kappa = \frac{\sum_k \sum_m w_{km} a_{km} - \sum_k \sum_m w_{km} a_{k+}a_{+m}/N}{N - \sum_k \sum_m w_{km} a_{k+}a_{+m}/N} \ .$$

There are three commonly used cases:

None
$w_{km} = 1$ if $k = m$, and 0 otherwise.
Linear
$w_{km} = 1 - |k - m|/(K - 1)$
Quadratic
$w_{km} = 1 - (k - m)^2/(K - 1)^2$

Both linear and quadratic weights have penalties that increase with the number of categories of difference, so an off-by-one misclassification incurs a smaller penalty than an off-by-two misclassification. (Here $K$ is the number of categories, three in this example.)
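For example, the linear weight matrix for $K = 3$ can be built with outer(), and the weighted agreement for reading recovered by hand; the result matches the “Linear” entry for reading in the accuracy table below.

K <- 3
## Linear weights: 1 on the diagonal, 1/2 for off-by-one, 0 for off-by-two.
w <- outer(1:K, 1:K, function(k, m) 1 - abs(k - m) / (K - 1))
sum(w * cm$Reading) / sum(cm$Reading)  # 0.9155, displayed as 0.915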

The accuracy(), gkLambda(), and fcKappa() functions all have a weights argument, where no weights (“None”, the default), “Linear”, or “Quadratic” weights can be selected, and a w argument where a custom weight matrix can be entered.

wacc.tab <- data.frame(
  None = sapply(cm,accuracy,weights="None"),
  Linear = sapply(cm,accuracy,weights="Linear"),
  Quadratic = sapply(cm,accuracy,weights="Quadratic"))
wlambda.tab <- data.frame(
  None = sapply(cm,gkLambda,weights="None"),
  Linear = sapply(cm,gkLambda,weights="Linear"),
  Quadratic = sapply(cm,gkLambda,weights="Quadratic"))
wkappa.tab <- data.frame(
  None = sapply(cm,fcKappa,weights="None"),
  Linear = sapply(cm,fcKappa,weights="Linear"),
  Quadratic = sapply(cm,fcKappa,weights="Quadratic")
)
Weighted and unweighted Accuracy
           None Linear Quadratic
Reading   0.831  0.915     0.958
Writing   0.603  0.792     0.886
Speaking  0.687  0.842     0.919
Listening 0.840  0.920     0.960

Weighted and unweighted Lambda
           None Linear Quadratic
Reading   0.675  0.675     0.675
Writing   0.191  0.153     0.075
Speaking  0.330  0.323     0.310
Listening 0.672  0.672     0.672

Weighted and unweighted Kappa
           None Linear Quadratic
Reading   0.733  0.780     0.837
Writing   0.329  0.393     0.479
Speaking  0.478  0.550     0.645
Listening 0.740  0.782     0.834

Exercise

The data set language24 has a simulation from a longer version of the test, with 24 items, 6 of each type.

Calculate the kappas and lambdas and compare to the shorter test.

data("language24")