In educational settings, conditional probability tables (CPTs) are generally monotonic: the more of a skill that the student possesses, the more likely the student is to do well when posed with a task that uses that skill. The CPTtools package contains a framework for building conditional probability tables for use in Bayes net models that satisfy the monotonicity constraint. These are generally called DiBello models after a suggestion by Lou DiBello (Almond et al., 2001; Almond et al., 2015).
DiBello’s idea was that each observable outcome variable in an educational setting corresponded to a direction in latent space, which he called the effective theta (in item response theory, IRT, θ is commonly used to represent the ability being measured). He proposed mapping each configuration of the parent variables (in educational settings, often representing configurations of skills the student is thought to possess) to a point in this effective theta dimension. Then standard models from IRT (e.g., the graded response and generalized partial credit models) could be used to calculate the conditional probabilities to put in the table.
The general procedure has three steps:
1. Map the state of the parent variables to points on a real number line. In particular, let θ̃km be the number associated with State m for Parent k. Let θ̃i′ be the vector of such values that corresponds to one row of the table.
2. Combine the parent effective thetas using a combination function, Zjs(θ̃i′). This yields an effective theta for each cell of the conditional probability table.
3. Apply a link function, g(⋅), to go from the effective thetas to the conditional probabilities.
As R is a functional language, the combination function and link function are passed as arguments to the key functions: calcDPCTable or calcDPCFrame. These can be passed by name, or the actual function object can be passed. The CPTtools package supplies the most commonly used combination and link functions, but others are possible.
Each step is described in more detail below.
In item response theory (IRT), the scale of the latent dimension, θ is not identified. Commonly, to identify the scale, psychometricians will assume that θ has a unit normal distribution in the target population. Thus a person who is a 0 on the theta scale is at the median for the population and a person who is at 1 is better than 5/6 of the people in the target population on the target skill.
Almond et al. (2015) suggest using equally spaced quantiles of the normal distribution for the effective thetas. The function effectiveThetas() does this. It takes a single argument, the number of states, and returns a vector of effective thetas.
round(effectiveThetas(2),3)
#> [1] 0.674 -0.674
round(effectiveThetas(3),3)
#> [1] 0.967 0.000 -0.967
round(effectiveThetas(4),3)
#> [1] 1.150 0.319 -0.319 -1.150
round(effectiveThetas(5),3)
#> [1] 1.282 0.524 0.000 -0.524 -1.282
The effective values are passed to the calcDPCTable() function via the tvals argument. This should be a list of vectors of effective thetas, one vector for each parent variable. The default value simply applies the effectiveThetas() function to the number of states of each parent variable.
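Written out explicitly, the default is equivalent to applying effectiveThetas() to each parent’s list of states, for example:
skills <- list(S1=c("High","Medium","Low"), S2=c("Master","Non-master"))
lapply(skills, function(sl) effectiveThetas(length(sl)))  ## one vector of effective thetas per parent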
The function eThetaFrame(), although designed to test/illustrate combination functions, is useful for understanding effective thetas. Here the Compensatory combination function is set up to combine the parent values with equal weights.
skill1 <- c("High","Medium","Low")
skill2 <- c("Master","Non-master")
eThetaFrame(list(S1=skill1,S2=skill2), log(c(S1=1,S2=1)), 0,"Compensatory")
#> S1 S2 S1.theta S2.theta Effective.theta
#> 1 High Master 0.9674216 0.6744898 1.1610066
#> 2 Medium Master 0.0000000 0.6744898 0.4769363
#> 3 Low Master -0.9674216 0.6744898 -0.2071341
#> 4 High Non-master 0.9674216 -0.6744898 0.2071341
#> 5 Medium Non-master 0.0000000 -0.6744898 -0.4769363
#> 6 Low Non-master -0.9674216 -0.6744898 -1.1610066
The last column gives the effective theta for the row. This is the number that corresponds to the ability of a person who has the skills marked in the row to complete the target task. For example, a person who is “High” on Skill 1 and has “Mastered” Skill 2 is 1.16 standard deviations above the average ability to achieve a good outcome on the specified observable.
Lou DiBello suggested that different observables could be summarized in different ways using a combination function (or structure function, or rule), Zj(⋅). Note that this can take different functional forms for different variables, with pedagogical experts free to choose a structure function depending on how they thought a student would approach a task.
The originally proposed structure functions were:

Compensatory – Having more of one skill compensates for having less of another, so performance will be related to a (weighted) average of the skills.

Conjunctive – All of the skills are generally necessary, so performance will be related to the weakest skill.

Disjunctive – The skills represent alternative solution paths, and students will choose their strongest skill. So performance will be related to the strongest skill.

Inhibitor – One skill is necessary, but only at a minimal level. Once that threshold is met, the other skills determine performance. For example, a mathematical word problem requires sufficient reading skill to understand the prompt, but beyond that, additional reading skill is not relevant.
The general signature of a structure function is Z(theta, alphas, beta), where theta is a matrix of effective values (see above), alphas is a (collection of) slope parameter(s), and beta is a (collection of) difficulty (negative intercept) parameter(s). The output should be a vector of effective theta values corresponding to the rows of theta. See the function eThetaFrame() for examples. The Compensatory() structure function is basically the linear predictor of a generalized linear model, so it is the basis for understanding the other combination functions. Conjunctive() and Disjunctive() are variants on the idea. OffsetConjunctive() and OffsetDisjunctive() are improvements which use different sets of alphas and betas.
Let $\tilde{\theta_k}$ be the effective theta associated with the kth parent variable for a particular individual (or row in the CPT), and let αk be the discrimination parameter associated with the kth parent variable, and let β be a difficulty parameter. Then the combined effective theta is given as
$$ \frac{1}{\sqrt{K}} \sum_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ .$$ This is essentially a generalized linear model (in the case of a binary outcome, a logistic regression). The $1/\sqrt{K}$ term is a variance stabilization term: it ensures that the variance of the linear predictor is related to the average of the discriminations instead of growing as the number of parent variables increases.
This is the basic combination function, and probably the easiest to explain, as intuition from regression works pretty well here. A discrimination value of 1 corresponds to average importance; higher values mean that the skill is more important and lower values mean that the skill is less important. For educational models, it is customary to restrict the discriminations to be positive (this identifies the direction of the latent scale), although negative discriminations might make sense if the skill variables represent attitudes or other psychological traits or states. For that reason, log discrimination parameters are often used instead of discriminations. On the log scale, a log discrimination of 0 corresponds to average importance.
Note that the difficulty is the negative of the intercept. Its value is related to the probability that a person who is average on all of the input skills has of answering the question. This is roughly on an inverse normal scale, so a difficulty of 0 corresponds to a 50-50 chance of solving the problem (or obtaining that level).
The psychological intuition is that the parent variables represent skills which complement, and can to a certain degree substitute for, each other in solving the problem. For example, consider a physics problem which can be solved either by working through the force vectors and Newton’s laws of motion, or by writing down the energy equations and solving them. Students who were comfortable with both techniques would have an even better chance of success because they could solve the problem with one technique and use the other to check their work.
The function eThetaFrame() is useful for inspecting/testing the combination function. The example below shows a typical use of the Compensatory combination function. Note that in each case, the combined value is a weighted average of the two inputs.
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Compensatory")
#> S1 S2 S1.theta S2.theta Effective.theta
#> 1 High High 0.9674216 0.9674216 3.5058171
#> 2 Medium High 0.0000000 0.9674216 1.1181769
#> 3 Low High -0.9674216 0.9674216 -1.2694632
#> 4 High Medium 0.9674216 0.0000000 2.0576401
#> 5 Medium Medium 0.0000000 0.0000000 -0.3300000
#> 6 Low Medium -0.9674216 0.0000000 -2.7176401
#> 7 High Low 0.9674216 -0.9674216 0.6094632
#> 8 Medium Low 0.0000000 -0.9674216 -1.7781769
#> 9 Low Low -0.9674216 -0.9674216 -4.1658171
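As a check on the formula, the first row of this table can be reproduced by hand. Note that the second argument to eThetaFrame() appears to be on the log-discrimination scale (the earlier example applied log() explicitly), so it is exponentiated here:
alphas <- exp(c(S1=1.25, S2=0.75))      ## discriminations on the natural scale
theta <- c(S1=0.9674216, S2=0.9674216)  ## effective thetas for the High/High row
sum(alphas*theta)/sqrt(2) - 0.33        ## approximately 3.5058, matching row 1 above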
The term difficulty used for the negative intercept parameter has a slightly different meaning from the lay definition of difficulty. In the lay definition, a task is difficult if a typical member of the population (θ = 0) has a low probability of success. The difficulty parameter here determines the ability level (along the effective theta dimension) at which the probability of success is 50-50. Thus, it is really describing the demand for the skill (combination). The lay difficulty is determined by a combination of the difficulty and discrimination parameters.
To get the conjunctive and disjunctive models, replace the sum in the equation above with a minimum or maximum. Thus the Conjunctive() function is: $$ \min_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ ,$$ and the Disjunctive() function is: $$ \max_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ .$$ The variance stabilization term is dropped, as the min and max functions will not increase the variance as the number of parents increases.
The psychological justification is that all skills are necessary in the conjunctive model, so the weakest skill drives performance. The disjunctive model corresponds to alternative solution paths: if students know which of their skills is strongest, then that skill should dominate performance.
Again, the function eThetaFrame() is used to illustrate the combination functions. The examples below show typical uses of the conjunctive and disjunctive functions. Note that in each case, the combined value is a weighted minimum or maximum of the two inputs.
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Conjunctive")
#> S1 S2 S1.theta S2.theta Effective.theta
#> 1 High High 0.9674216 0.9674216 1.718031
#> 2 Medium High 0.0000000 0.9674216 -0.330000
#> 3 Low High -0.9674216 0.9674216 -3.706633
#> 4 High Medium 0.9674216 0.0000000 -0.330000
#> 5 Medium Medium 0.0000000 0.0000000 -0.330000
#> 6 Low Medium -0.9674216 0.0000000 -3.706633
#> 7 High Low 0.9674216 -0.9674216 -2.378031
#> 8 Medium Low 0.0000000 -0.9674216 -2.378031
#> 9 Low Low -0.9674216 -0.9674216 -3.706633
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Disjunctive")
#> S1 S2 S1.theta S2.theta Effective.theta
#> 1 High High 0.9674216 0.9674216 3.046633
#> 2 Medium High 0.0000000 0.9674216 1.718031
#> 3 Low High -0.9674216 0.9674216 1.718031
#> 4 High Medium 0.9674216 0.0000000 3.046633
#> 5 Medium Medium 0.0000000 0.0000000 -0.330000
#> 6 Low Medium -0.9674216 0.0000000 -0.330000
#> 7 High Low 0.9674216 -0.9674216 3.046633
#> 8 Medium Low 0.0000000 -0.9674216 -0.330000
#> 9 Low Low -0.9674216 -0.9674216 -2.378031
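A similar hand check works for the conjunctive and disjunctive tables; for the (Medium, High) row only the weakest (or strongest) weighted skill matters:
alphas <- exp(c(S1=1.25, S2=0.75))  ## again exponentiating the log discriminations
theta <- c(S1=0, S2=0.9674216)      ## Medium on S1, High on S2
min(alphas*theta) - 0.33            ## approximately -0.330, Conjunctive row 2
max(alphas*theta) - 0.33            ## approximately 1.718, Disjunctive row 2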
The interpretation of the discrimination parameters in the conjunctive model is not realistic. Consider a mathematical word problem and a model with two skills: mathematical manipulation and mathematical language. Typically, the demands on the two will be different; for example, the demand on mathematical language might be minimal, while the demand on mathematical manipulation might be moderate. Thus, it seems natural to have two different difficulty parameters.
The Offset Conjunctive and Offset Disjunctive models use one difficulty parameter for each parent variable. To reduce the overall number of parameters, only a single common discrimination parameter is used. This parameterization is much more natural because the discrimination parameter is often related to construct-irrelevant sources of variability which affect all skills equally.
The new equations are: $$ \alpha \min_{k=1}^K (\tilde{\theta_k} - \beta_k) \ ,$$ for OffsetConjunctive(), and $$ \alpha \max_{k=1}^K (\tilde{\theta_k} - \beta_k) \ ,$$ for OffsetDisjunctive(). Note that the signatures of the OffsetConjunctive() and Conjunctive() functions are the same, but the former expects beta to be a vector and alphas a scalar, while the reverse is true for the latter.
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), 1.0, c(S1=0.25,S2=-0.25),
"OffsetConjunctive")
#> S1 S2 S1.theta S2.theta Effective.theta
#> 1 High High 0.9674216 0.9674216 1.9501540
#> 2 Medium High 0.0000000 0.9674216 -0.6795705
#> 3 Low High -0.9674216 0.9674216 -3.3092949
#> 4 High Medium 0.9674216 0.0000000 0.6795705
#> 5 Medium Medium 0.0000000 0.0000000 -0.6795705
#> 6 Low Medium -0.9674216 0.0000000 -3.3092949
#> 7 High Low 0.9674216 -0.9674216 -1.9501540
#> 8 Medium Low 0.0000000 -0.9674216 -1.9501540
#> 9 Low Low -0.9674216 -0.9674216 -3.3092949
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), 1.0, c(S1=0.25,S2=-0.25),
"OffsetDisjunctive")
#> S1 S2 S1.theta S2.theta Effective.theta
#> 1 High High 0.9674216 0.9674216 3.3092949
#> 2 Medium High 0.0000000 0.9674216 3.3092949
#> 3 Low High -0.9674216 0.9674216 3.3092949
#> 4 High Medium 0.9674216 0.0000000 1.9501540
#> 5 Medium Medium 0.0000000 0.0000000 0.6795705
#> 6 Low Medium -0.9674216 0.0000000 0.6795705
#> 7 High Low 0.9674216 -0.9674216 1.9501540
#> 8 Medium Low 0.0000000 -0.9674216 -0.6795705
#> 9 Low Low -0.9674216 -0.9674216 -1.9501540
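A hand check of the (High, High) row shows how the offset parameterization works: the per-parent difficulties are subtracted before the minimum (or maximum) is taken, and the single discrimination (again supplied on the log scale) is applied afterwards:
alpha <- exp(1.0)                       ## the single common discrimination
theta <- c(S1=0.9674216, S2=0.9674216)  ## effective thetas for the High/High row
beta <- c(S1=0.25, S2=-0.25)            ## one difficulty per parent
alpha * min(theta - beta)               ## approximately 1.950, OffsetConjunctive row 1
alpha * max(theta - beta)               ## approximately 3.309, OffsetDisjunctive row 1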
The Almond et al. (2001) paper (see also Almond et al., 2015) also included a special asymmetric combination function called the inhibitor. Once again, consider a mathematical word problem written in English. Here knowledge of English is an inhibitor skill: a certain minimal amount of English is needed to understand the goals of the question. Once that threshold is met, the other (mathematical) skills determine the probability of success. If the English language comprehension threshold is not met, then the probability of success will be low (guessing).
This can be expressed mathematically as:
$$ \begin{cases} \beta_0 & \mbox{if } \tilde{\theta_1} < \tilde{\theta_1}^* \\ \alpha_2 \tilde{\theta_2} - \beta_2 & \mbox{if } \tilde{\theta_1} \ge \tilde{\theta_1}^* \end{cases}\ .$$
No Inhibitor() function was included in CPTtools because of the difficulty in generalizing this formula. First, the threshold parameter, $\tilde{\theta_1}^*$, doesn’t fit naturally into either alphas or beta, so the signature of the function does not match. Second, the Inhibitor model does not generalize when there are more than two parent variables: another combination rule would be needed to collapse the remaining dimensions onto a single dimension.
This is a good place to remark on the extensibility of the combination functions in the Discrete Partial Credit framework. The various functions aligned with the framework (e.g., eThetaFrame, calcDPCFrame, and mapDPC) accept a function (or a character value giving the name of a function) which does the combination. This function should have three formal parameters (a sketch of a custom rule appears after the list):

theta — This is a matrix of effective theta values produced by expand.grid. For example, thetas <- expand.grid(list(S1=seq(1,-1), S2 = seq(1,-1))).

alphas — This is a vector of discrimination parameters. As several functions work with log(alphas), these should all be strictly positive.

beta — This is a vector of difficulty parameters.
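As an illustration of this extensibility, the sketch below defines a hypothetical rule with the required signature (it simply re-implements the compensatory formula from above) and passes the function object itself, rather than a name, to eThetaFrame():
## Hypothetical user-written rule with the expected signature; it simply
## re-implements the compensatory formula given earlier.
myCompensatory <- function (theta, alphas, beta) {
  theta <- as.matrix(theta)   ## one row per parent configuration
  as.vector((theta %*% alphas)/sqrt(length(alphas)) - beta)
}
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), log(c(S1=1,S2=1)), 0, myCompensatory)
## should reproduce eThetaFrame(list(S1=skill,S2=skill), log(c(S1=1,S2=1)), 0, "Compensatory")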
Generally, the theta parameter is generated internally by the CPTtools functions, while the alphas (or log(alphas)) and beta are passed in by the user. The Peanut package, in particular, allows associating lnAlphas and betas with a node in a graph. The alphas and beta generally have one of two shapes:

Compensatory-shape—There is one alpha for each parent and a single beta.

Offset-shape—There is one beta for each parent and a single alpha.
The function isOffsetRule() can check whether a named rule is Offset-shape or Compensatory-shape. There is an internal list of offset rules which can be inspected with getOffsetRule() and manipulated with setOffsetRule().
Currently, CPTtools supports the following rules:

Compensatory-shape: Compensatory, Conjunctive, Disjunctive

Offset-shape: OffsetConjunctive, OffsetDisjunctive
In DiBello’s models, the effective theta for an item represents the ability of an examinee to solve the particular problem posed in the task. This is a value that runs from negative to positive infinity, with higher values indicating a more successful outcome. The next step is to map these effective thetas onto probabilities of success. Following generalized linear model usage, the functions that perform this mapping are called link functions.
DiBello’s original idea was to press models from IRT into service for this step. The first one implemented was Samejima’s graded response model. Although this model worked well for observables, it did not work so well for intermediate proficiency variables. This inspired a new normal link function which worked more like a regression model. The graded response model also has certain restrictions; in particular, all transitions must have the same discrimination. The partial credit link function was introduced to relax that restriction, and it enabled the use of more combination rules, including different combination rules for each transition.
If the child variable has only two states, then both the graded response and generalized partial credit models collapse into the 2-parameter logistic (2PL) model. This is a common model from item response theory (IRT), which states that the probability that Examinee i gets Item j correct is:
$$ P(X_{ij}|\tilde{\theta_{i}}) = \frac{\exp(D\alpha_j(\tilde{\theta_i}-\beta_j))}{1 + \exp(D\alpha_j(\tilde{\theta_i}-\beta_j))} .$$
The constant D = 1.7 is chosen so that the logistic function and the normal ogive curve are nearly identical. This allows θi to be interpreted as a standard normal value, with θ = 0 as the population median and θ = 1 representing an individual one standard deviation above the median. The following example shows the curve.
inv.logit <- function (z) {1/(1+exp(-1.7*z))}
a <- 1 ## Discrimination
b <- 0 ## Difficulty
curve(inv.logit(a*(x-b)),xlim=c(-3,3),ylim=c(0,1),
main=paste("2 Parameter Logistic: a=",round(a,2),
" b=",round(b,2)),
xlab="Ability (theta)", ylab="Probability of success.")
Note that the difficulty parameter is on the same scale as the ability parameter and represents the ability at which examinees will have a 50-50 chance of success. The discrimination describes how quickly the probability rises with increasing ability, and is often related to how many non-focal knowledge, skills and abilities are required to solve the problem.
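For example, increasing the difficulty shifts the 50-50 point to the right along the ability scale, while increasing the discrimination makes the curve steeper (hypothetical values, reusing the inv.logit() helper defined above):
a <- 1.5 ## Higher discrimination: steeper curve
b <- 1 ## Higher difficulty: 50-50 point at theta = 1
curve(inv.logit(a*(x-b)),xlim=c(-3,3),ylim=c(0,1),
      main=paste("2 Parameter Logistic: a=",round(a,2),
                 " b=",round(b,2)),
      xlab="Ability (theta)", ylab="Probability of success.")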
Note that the model can be rewritten as $P(X_{ij}|\tilde{\theta_i}) = 1/(1+\exp(-D Z_j(\tilde{\theta_i})))$. Here Zj(⋅) is the combination function, which has the difficulty and discrimination parameters built into it. This more cleanly separates the link function from the combination rules.
The graded response model is a generalization of the 2PL model for ordered categorical data introduced by Samejima (1969). Let the possible values for the observable Xij be $\{0, 1, \ldots, K\}$. Each of the events Xij ≥ k is modeled with a logistic curve: $$ \Pr(X_{ij} \ge k | \tilde{\theta_{i}}) = 1/(1+\exp(-D\cdot Z_{jk}(\tilde{\theta_{i}}))) ,$$ for k = 1, …, K. The probability that Xij = k can then be found by differencing adjacent curves.
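Putting the pieces together, the CPT itself is produced by calcDPCTable(). The sketch below takes the argument names from the calcDPCTable() documentation; with a two-state child, the graded response link reduces to the 2PL pictured earlier:
skill1 <- c("High","Medium","Low")
skill2 <- c("Master","Non-master")
correctL <- c("Correct","Incorrect")
cpt <- calcDPCTable(list(S1=skill1,S2=skill2), correctL,
                    lnAlphas=log(c(S1=1,S2=1)),  ## log discriminations
                    betas=0,                     ## difficulty for the single transition
                    rules="Compensatory", link="gradedResponse")
round(cpt,3)  ## one row per parent configuration; each row sums to 1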
Almond, R.G., Mislevy, R.J., Steinberg, L.S., Yan, D. and Williamson, D.M. (2015). Bayesian Networks in Educational Assessment. Springer. Chapter 8.
Almond, R.G., DiBello, L., Jenkins, F., Mislevy, R.J., Senturk, D., Steinberg, L.S. and Yan, D. (2001). Models for Conditional Probability Tables in Educational Assessment. In Jaakkola and Richardson (eds.), Artificial Intelligence and Statistics 2001, 137–143. Morgan Kaufmann.
Muraki, E. (1992). A Generalized Partial Credit Model: Application of an EM Algorithm. Applied Psychological Measurement, 16, 159–176. DOI: 10.1177/014662169201600206
Samejima, F. (1969) Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34, (No. 4, Part 2).