---
title: "Discrete Partial Credit Models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Discrete Partial Credit Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(CPTtools)
```

# DiBello Framework

In educational settings, conditional probability tables (CPTs) are generally monotonic: the more of a skill the student possesses, the more likely the student is to do well when posed with a task that uses that skill. The `CPTtools` package contains a framework for building conditional probability tables for use in Bayes net models that satisfy the monotonicity constraint. These are generally called _DiBello models_ after a suggestion by Lou DiBello (Almond et al., 2001; Almond et al., 2015).

DiBello's idea was that each observable outcome variable in an educational setting corresponds to a direction in latent space, which he called the _effective theta_ (in item response theory, IRT, $\theta$ is commonly used to represent the ability being measured). Each configuration of the parent variables (in educational settings, often representing configurations of skills the student is thought to possess) is mapped to a point on this effective theta dimension. Then standard models from IRT (e.g., the graded response and generalized partial credit models) can be used to calculate the conditional probabilities to put in the table.

The general procedure has three steps:

1. Map the states of the parent variables to points on a real number line. In particular, let $\tilde\theta_{km}$ be the number associated with State $m$ for Parent $k$. Let $\tilde\theta_{i'}$ be the vector of such values that corresponds to one row of the table.
2. Combine the parent effective thetas using a combination function, $Z_{js}(\tilde\theta_{i'})$. This yields an _effective theta_ for each cell of the conditional probability table.
3. Apply a link function, $g(\cdot)$, to go from the effective thetas to the conditional probabilities.

As R is a functional language, the combination function and link function are passed as arguments to the key functions, `calcDPCTable()` and `calcDPCFrame()`. These can be passed by name, or the actual function can be passed. The `CPTtools` package supplies the most commonly used combination and link functions, but others are possible. Each step is described in more detail below.

# Effective Thetas

In item response theory (IRT), the scale of the latent dimension, $\theta$, is not identified. Commonly, to identify the scale, psychometricians assume that $\theta$ has a unit normal distribution in the target population. Thus a person who is at 0 on the theta scale is at the median for the population, and a person who is at 1 is better than about 5/6 of the people in the target population on the target skill.

Almond et al. (2015) suggest using equally spaced quantiles of the normal distribution for the effective thetas. The function `effectiveThetas()` does this. It takes a single argument, the number of states, and returns a vector of effective thetas.

```{r effectiveTheta, echo=TRUE}
round(effectiveThetas(2),3)
round(effectiveThetas(3),3)
round(effectiveThetas(4),3)
round(effectiveThetas(5),3)
```

The effective values are passed to the `calcDPCTable()` function via the `tvals` argument. This should be a list of vectors of effective thetas, one vector for each parent variable.
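For example, a `tvals` list for two parent variables (here with three and two states) could be built by hand as follows; the parent names `S1` and `S2` are purely illustrative.

```{r tvalsExample, echo=TRUE}
## Hand-built list of effective thetas: one vector per parent variable,
## with one value per state (three states for S1, two for S2).
tvals <- list(S1 = effectiveThetas(3),
              S2 = effectiveThetas(2))
tvals
```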
The default value simply applies the `effectiveThetas()` function to the number of states of each parent variable.

The function `eThetaFrame()`, although designed to test and illustrate combination functions, is useful for understanding effective thetas. Here the `Compensatory` combination function is set up to take the average of the parent values.

```{r eThetaFrame}
skill1 <- c("High","Medium","Low")
skill2 <- c("Master","Non-master")
eThetaFrame(list(S1=skill1,S2=skill2), log(c(S1=1,S2=1)), 0, "Compensatory")
```

The last column gives the _effective theta_ for the row. This is the number that corresponds to the ability of a person who has the skills marked in the row to complete the target task. For example, a person who is "High" on Skill 1 and has "Mastered" Skill 2 is 1.16 standard deviations above the average ability to achieve a good outcome on the specified observable.

# Combination Functions

Lou DiBello suggested that different observables could be summarized in different ways using a _combination function_ (or structure function, or rule), $Z_j(\cdot)$. Different functional forms can be used for different variables, with pedagogical experts free to choose a structure function depending on how they think a student would approach a task. The originally proposed structure functions were:

* `Compensatory` -- Having more of one skill compensates for having less of another, so performance will be related to a (weighted) average of the skills.
* `Conjunctive` -- All of the skills are generally necessary, so performance will be related to the weakest skill.
* `Disjunctive` -- The skills each represent an alternative solution path, and students will choose the strongest skill. So performance will be related to the strongest skill.
* `Inhibitor` -- One skill is necessary, but only at a minimal level. Once that threshold is met, the other skills determine performance. For example, a mathematical word problem requires sufficient reading skill to understand the prompt, but after that, additional reading skill is not relevant.

The general signature of a structure function is `Z(theta,alphas,beta)`, where `theta` is a matrix of effective values (see above), `alphas` is a (collection of) slope parameter(s), and `beta` is a (collection of) difficulty (negative intercept) parameter(s). The output should be a vector of effective theta values corresponding to the rows of `theta`. See the function `eThetaFrame()` for examples.

The `Compensatory()` structure function is basically the linear predictor of a generalized linear model, so it is the basis for understanding the other combination functions. `Conjunctive()` and `Disjunctive()` are variants on the idea. `OffsetConjunctive()` and `OffsetDisjunctive()` are improvements which use a different arrangement of alphas and betas.

## Compensatory

Let $\tilde{\theta_k}$ be the effective theta associated with the $k$th parent variable for a particular individual (or row in the CPT), let $\alpha_k$ be the discrimination parameter associated with the $k$th parent variable, and let $\beta$ be a difficulty parameter. Then the combined effective theta is given as
$$ \frac{1}{\sqrt{K}} \sum_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ .$$
This is essentially a generalized linear model (in the case of a binary outcome, a logistic regression). The $1/\sqrt{K}$ term is a variance stabilization term: it ensures that the variance of the linear predictor is related to the average of the discriminations instead of growing as the number of parent variables increases.
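To see the formula in action outside of `eThetaFrame()`, the `Compensatory()` function can be called directly using the `Z(theta, alphas, beta)` signature described above. The parent names and parameter values below are purely illustrative.

```{r CompensatoryDirect, echo=TRUE}
## Effective thetas for every configuration of two hypothetical parents.
thetas <- expand.grid(list(S1 = effectiveThetas(3), S2 = effectiveThetas(2)))
## Sum of the alpha-weighted thetas, divided by sqrt(K) = sqrt(2),
## minus the difficulty (here 0).
round(Compensatory(thetas, c(S1 = 1, S2 = 1), 0), 3)
```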
`Compensatory()` is the basic combination function, and probably the easiest to explain, as intuition from regression works pretty well here. A discrimination value of 1 corresponds to average importance; higher values mean that the skill is more important, and lower values mean that the skill is less important. For educational models, it is customary to restrict the discriminations to be positive (this identifies the direction of the latent scale), although negative discriminations might make sense if the skill variables represent attitudes or other psychological traits or states. For that reason, log discrimination parameters are often used instead of discriminations. On the log scale, a log discrimination of 0 corresponds to average importance.

Note that the difficulty is the negative of the intercept. Its value is related to the probability that a person who is average on all of the input skills has of answering the question. This is roughly on an inverse normal scale, so a 0 corresponds to a 50-50 chance of solving the problem (or obtaining that level).

The psychological intuition is that the parent variables represent skills which complement, and can to a certain degree substitute for, each other in solving the problem. For example, consider a physics problem which can be solved either by working through the force vectors and Newton's laws of motion, or by writing down the energy equations and solving them. Students who are comfortable with both techniques would have an even better chance of success, because they could solve the problem with one technique and use the other to check their work.

The function `eThetaFrame()` is useful for inspecting and testing the combination function. The example below shows a typical use of the combination function. Note that in each row, the combined value is a weighted average of the two inputs.

```{r Compensatory, echo=TRUE}
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Compensatory")
```

The term _difficulty_ used for the negative intercept parameter has a slightly different meaning from the lay definition of difficulty. In the lay definition, a task is difficult if a typical member of the population ($\theta=0$) has a low probability of success. The difficulty parameter determines the ability level (along the effective theta dimension) where the probability of success is 50/50. Thus, it is really determining the _demand_ for the skill (combination). The lay difficulty is determined by a combination of the difficulty and the discrimination.

## Conjunctive, Disjunctive

To get the conjunctive and disjunctive models, replace the sum in the equation above with a minimum or maximum. Thus the `Conjunctive()` function is:
$$ \min_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ ,$$
and the `Disjunctive()` function is:
$$ \max_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ .$$
The variance stabilization term is dropped, as the min and max functions do not increase the variance as the number of parents increases.

The psychological justification is that all skills are necessary in the _conjunctive_ model, so the weakest skill drives performance. The _disjunctive_ model corresponds to alternate solution paths: if students know which of their skills is strongest, then that skill should dominate the performance.

Again, the function `eThetaFrame()` is used to illustrate the combination functions. The examples below show typical uses of the conjunctive and disjunctive functions.
Note that in each case, the combined value is a weighted min or max of the two inputs.

```{r Conjunctive, echo=TRUE}
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Conjunctive")
```

```{r Disjunctive, echo=TRUE}
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Disjunctive")
```

## OffsetConjunctive, OffsetDisjunctive

The interpretation of the discrimination parameters in the conjunctive model is not realistic. Consider a mathematical word problem and a model with two skills: mathematical manipulation and mathematical language. Typically, the demands on the two will be different; for example, the demand on mathematical language might be minimal, while the demand on mathematical manipulation might be moderate. Thus, it seems natural to have two different difficulty parameters. The _Offset Conjunctive_ and _Offset Disjunctive_ models use one difficulty parameter for each parent variable. To reduce the overall number of parameters, only a single common discrimination parameter is used. This parameterization is more natural because the discrimination parameter is often related to construct-irrelevant sources of variability which affect all skills equally. The new equations are:
$$ \alpha \min_{k=1}^K (\tilde{\theta_k} - \beta_k) \ ,$$
for `OffsetConjunctive()` and
$$ \alpha \max_{k=1}^K (\tilde{\theta_k} - \beta_k) \ ,$$
for `OffsetDisjunctive()`. Note that the signatures of the `OffsetConjunctive()` and `Conjunctive()` functions are the same, but the former expects `beta` to be a vector and `alphas` a scalar, while the reverse is true for the latter.

```{r OffsetConjunctive, echo=TRUE}
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), 1.0, c(S1=0.25,S2=-0.25), "OffsetConjunctive")
```

```{r OffsetDisjunctive, echo=TRUE}
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), 1.0, c(S1=0.25,S2=-0.25), "OffsetDisjunctive")
```

## Inhibitor

The Almond et al. (2001) paper (see also Almond et al., 2015) also included a special asymmetric combination function called the _inhibitor_. Once again consider a mathematical word problem written in English. Here knowledge of English is an inhibitor skill: a certain minimal amount of English is needed to understand the goals of the question. Once that threshold is met, the other (mathematical) skills determine the probability of success. If the English language comprehension threshold is not met, then the probability of success will be low (guessing). This can be expressed mathematically as:
$$ \begin{cases} \beta_0 & \mbox{if } \tilde{\theta_1} < \tilde{\theta_1}^* \\ \alpha_2 \tilde{\theta_2} - \beta_2 & \mbox{if } \tilde{\theta_1} \ge \tilde{\theta_1}^* \\ \end{cases}\ .$$
No `Inhibitor()` function was included in `CPTtools` because of the difficulty in generalizing this formula. First, the threshold parameter, $\tilde{\theta_1}^*$, does not fit naturally into either `alphas` or `beta`, so the signature of the function does not match. Second, the Inhibitor model does not generalize when there are more than two parent variables: another combination rule would be needed to collapse the remaining dimensions onto a single dimension.

This is a good place to remark on the extensibility of the combination functions in the Discrete Partial Credit framework.
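For example, users who want an inhibitor-style rule anyway can write one as an ordinary R function. The sketch below is hypothetical and not part of `CPTtools`: it handles exactly two parents, follows the `Z(theta, alphas, beta)` signature described above, and, because the threshold $\tilde{\theta_1}^*$ and the guessing level $\beta_0$ have no natural home in `alphas` or `beta`, fixes them as extra arguments with default values.

```{r InhibitorSketch, echo=TRUE}
## Hypothetical two-parent inhibitor rule (not part of CPTtools).
## Column 1 of theta is the inhibitor skill; column 2 drives performance
## once the threshold is met.
Inhibitor2 <- function(theta, alphas, beta, threshold = 0, floorTheta = -2) {
  theta <- as.matrix(theta)
  ifelse(theta[, 1] < threshold,
         floorTheta,                        # beta_0: low (guessing) level
         alphas[2] * theta[, 2] - beta)     # alpha_2 * theta_2 - beta_2
}
## English is the inhibitor skill; Math drives performance above the threshold.
thetas <- expand.grid(list(English = effectiveThetas(2), Math = effectiveThetas(3)))
data.frame(thetas, Effective.theta = Inhibitor2(thetas, c(1, 1), 0))
```

Subject to the parameter conventions described next, a rule like this could be passed to `eThetaFrame()` or `calcDPCTable()` in place of the built-in rules.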
The various functions aligned with the framework (e.g., `eThetaFrame`, `calcDPCFrame`, and `mapDPC`) accept a function (or a character value giving the name of a function) which does the combination. This function should have three formal parameters:

* `theta` --- This is a matrix of effective theta values produced by `expand.grid`. For example, `thetas <- expand.grid(list(S1=seq(1,-1), S2 = seq(1,-1)))`.
* `alphas` --- This is a vector of discrimination parameters. As several functions work with `log(alphas)`, these should all be strictly positive.
* `beta` --- This is a vector of difficulty parameters.

Generally, the `theta` parameter is generated internally by the `CPTtools` functions, while the `alphas` (or `log(alphas)`) and `beta` are passed in by the user. The `Peanut` package, in particular, allows associating `lnAlphas` and `betas` with a node in a graph. The `alphas` and `beta` generally have one of two shapes:

* Compensatory-shape --- There is one `alpha` for each parent and a single `beta`.
* Offset-shape --- There is one `beta` for each parent and a single `alpha`.

The function `isOffsetRule()` checks whether a named rule is Offset-shape or Compensatory-shape. There is an internal list of offset rules which can be inspected with `getOffsetRule()` and manipulated with `setOffsetRule()`. Currently, `CPTtools` supports the following rules:

* Compensatory-shape: `Compensatory`, `Conjunctive`, `Disjunctive`
* Offset-shape: `OffsetConjunctive`, `OffsetDisjunctive`

# Link Functions

In DiBello's models, the _effective theta_ for an item represents the ability of an examinee to solve the particular problem posed in the task. This is a value that runs from negative to positive infinity, with higher values indicating a more successful outcome. The next step is to map these onto probabilities of success. Following generalized linear modeling usage, these mappings are called _link functions_.

DiBello's original idea was to press models from IRT into service for this step. The first one implemented was Samejima's graded response model. This model worked well for observables, but not so well for intermediate proficiency variables; that experience inspired a new normal link function which works more like a regression model. The graded response model also has certain restrictions; in particular, all transitions must have the same discrimination. The partial credit link function was introduced to relax that restriction, and it enables the use of more combination rules, including different combination rules for each transition.

## 2PL

If the child variable has only two states, then both the graded response and generalized partial credit models collapse into the 2-parameter logistic (2PL) model. This is a common model from item response theory (IRT), which states that the probability that Examinee $i$ gets Item $j$ correct is:
$$ P(X_{ij}|\tilde{\theta_{i}}) = \frac{\exp(D\alpha_j(\tilde{\theta_i}-\beta_j))}{1 + \exp(D\alpha_j(\tilde{\theta_i}-\beta_j))} .$$
The constant $D=1.7$ is chosen so that the logistic function and the normal ogive curve are nearly identical. This allows $\theta_i$ to be interpreted as a standard normal value, with $\theta=0$ as the population median and $\theta=1$ representing an individual one standard deviation above the median. The following example shows the curve.
```{r IRT, echo=TRUE}
inv.logit <- function (z) {1/(1+exp(-1.7*z))}
a <- 1 ## Discrimination
b <- 0 ## Difficulty
curve(inv.logit(a*(x-b)), xlim=c(-3,3), ylim=c(0,1),
      main=paste("2 Parameter Logistic: a=",round(a,2), " b=",round(b,2)),
      xlab="Ability (theta)", ylab="Probability of success.")
```

Note that the difficulty parameter is on the same scale as the ability parameter and represents the ability at which examinees will have a 50-50 chance of success. The discrimination describes how quickly the probability rises with increasing ability, and is often related to how many non-focal knowledge, skills and abilities are required to solve the problem.

Note that the model can be rewritten as $P(X_{ij}|\tilde{\theta_{i}}) = 1/(1+\exp(-D\cdot Z_j(\tilde{\theta_i})))$. Here $Z_j(\cdot)$ is the combination function, which has the difficulty and discrimination parameters built into it. This more cleanly separates the link function from the combination rules.

## Graded Response

The graded response model is a generalization of the 2PL model for ordered categorical data introduced by Samejima (1969). Let the possible values for the observable $X_{ij}$ be $\{0, 1, \ldots, K\}$. Each of the events $X_{ij} \ge k$ is modeled with a logistic curve:
$$ \Pr(X_{ij} \ge k | \tilde{\theta_{i}}) = 1/(1+\exp(-D\cdot Z_{jk}(\tilde{\theta_{i}}))) ,$$
for $k=1, \ldots, K$. The probability that $X_{ij}=k$ can be found by differencing adjacent curves: $\Pr(X_{ij} = k | \tilde{\theta_{i}}) = \Pr(X_{ij} \ge k | \tilde{\theta_{i}}) - \Pr(X_{ij} \ge k+1 | \tilde{\theta_{i}})$, with $\Pr(X_{ij} \ge 0 | \tilde{\theta_{i}}) = 1$ and $\Pr(X_{ij} \ge K+1 | \tilde{\theta_{i}}) = 0$.

## Generalized Partial Credit

### Multiple Combination Rules

## Normal Offset

# CPT Construction Functions

## DPC

## Earlier Graded Response Functions

## Other Models

# Peanut Framework

# References

## Works Cited

Almond, R.G., Mislevy, R.J., Steinberg, L.S., Yan, D. and Williamson, D.M. (2015). _Bayesian Networks in Educational Assessment._ Springer. Chapter 8.

Almond, R.G., DiBello, L., Jenkins, F., Mislevy, R.J., Senturk, D., Steinberg, L.S. and Yan, D. (2001). Models for Conditional Probability Tables in Educational Assessment. In Jaakkola and Richardson (Eds.), _Artificial Intelligence and Statistics 2001_, Morgan Kaufmann, 137–143.

Muraki, E. (1992). A Generalized Partial Credit Model: Application of an EM Algorithm. _Applied Psychological Measurement_, **16**, 159–176. DOI: 10.1177/014662169201600206

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. _Psychometrika Monograph No. 17_, **34** (No. 4, Part 2).

## List of Symbols

* $i$ -- index for an individual in the sample.
* $i'$ -- index for a configuration of the parent variables.
* $j$ -- index for an observable outcome (child) variable.
* $k$ -- index for a parent variable.
* $s$ -- index for a state of the child (outcome) variable.
* $m$ -- index for a state of a parent variable.
* $\tilde\theta_{km}$ -- effective theta for State $m$ of a single parent variable, Parent $k$.
* $\tilde\theta_{ji'}$ -- (vector-valued) effective thetas for a configuration of the parent variables of Observable Outcome $j$.
* $Z_{js}(\tilde\theta_{i'})$ -- combination function for State $s$ of Observable Outcome $j$.
* $g(\cdot)$ -- link function for converting effective thetas into conditional probabilities.

## List of functions
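The functions discussed in this vignette are:

* `effectiveThetas()` -- returns equally spaced normal quantiles to use as the effective thetas for a variable with a given number of states.
* `eThetaFrame()` -- tabulates the combined effective theta for each configuration of the parent variables; useful for testing and illustrating combination rules.
* `Compensatory()`, `Conjunctive()`, `Disjunctive()` -- Compensatory-shape combination rules (one `alpha` per parent, a single `beta`).
* `OffsetConjunctive()`, `OffsetDisjunctive()` -- Offset-shape combination rules (one `beta` per parent, a single `alpha`).
* `isOffsetRule()`, `getOffsetRule()`, `setOffsetRule()` -- query and manipulate the list of rules treated as Offset-shape.
* `calcDPCTable()`, `calcDPCFrame()` -- build a conditional probability table from the parent states, combination rules, and a link function.
* `mapDPC()` -- fits the parameters of a discrete partial credit model to data.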