---
title: "Conditional Probability Frames and Arrays"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Conditional Probability Frames and Arrays}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r loadLibs}
library(CPTtools)
```

# Conditional Probabilities over Discrete Variables

In a discrete Bayesian network, each node, $Y$, has an associated
conditional probability table (_CPT_).  Let $X_1, \ldots, X_K$ be the
parent nodes, each of which has $|X_k|$ states.  The conditional
probability distribution $\Pr(Y|X_1=x_1,\ldots,X_K=x_K)$ is a discrete
distribution over the $|Y|=M$ states of $Y$.  Note that there are
$|X_1|\times\cdots\times|X_K|=S$ possible configurations of the parent
variables, so that the conditional probability distribution is actually
a set of $S$ probability distributions.  If they are stacked into an
$S\times M$ matrix, this is the CPT.  If there are no parents,
the unconditional probability table consists of a single row ($S=1$).
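
As a rough illustration of this layout, the following base-R sketch
builds a CPT for a child with $M=2$ states and two parents with 2 and
3 states, so that $S=6$.  The variable names and the probabilities are
made up for the example and are not part of `CPTtools`.

```{r cptSketch}
## Hypothetical parent and child state labels (illustration only).
exParents <- list(A = c("a1", "a2"), B = c("b1", "b2", "b3"))
exChild <- c("Right", "Wrong")

## One row per configuration of the parent states.
exConfigs <- expand.grid(exParents)
nConfig <- nrow(exConfigs)    # S = |A| * |B|
nState <- length(exChild)     # M

## Made-up probabilities of the first child state, one per configuration.
pRight <- c(.9, .7, .8, .5, .6, .2)
exCPT <- cbind(pRight, 1 - pRight)
colnames(exCPT) <- exChild

dim(exCPT)                    # S x M
rowSums(exCPT)                # each row is a probability distribution
```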

The `CPTtools` package offers two ways of representing conditional
probability distributions (as well as a number of tools for
manipulating them):

* `"CPF"` (_Conditional Probability Frame_).  This is an R
  `data.frame` whose first $K$ columns are factor variables corresponding
  to the parents, and hence define the condition, and whose last $M$
  columns are numeric variables corresponding to 
  the states of the child variable $Y$.
  
* `"CPA"` (_Conditional Probability Array_).  This is a $K+1$
  dimensional array where the first $K$ dimensions correspond to the
  parent variables and the last dimension corresponds to the child.
  
Note that the contents of the `CPF` and `CPA` are not constrained to
be probability distributions (i.e., each row need not sum to one).  In
particular, contingency tables (tables of counts of cases occurring in
various configurations) are the natural conjugate of the CPT and are
often used as data.  These can also be stored in the `CPF` and `CPA`
classes.

## CPF

The class `"CPF"` is a subclass of `data.frame`; usually, `CPF`
objects have class `c("CPF","data.frame")`.  This means that
operations which operate on data frames should do something sensible
with `CPF`s.  Note that the columns representing the parent variable
states must be of class `factor`, which might require an explicit
call to `factor()` or `as.factor()` if the values are character.  The
function `as.CPF()` coerces an object to be a `CPF` 
and `is.CPF()` tests whether or not it is a CPF.

```{r CPF}
# Note: since R 4.0.0, data.frame() no longer converts strings to
# factors by default, so the factor() calls are required.
arf <- data.frame(A=factor(rep(c("a1","a2"),each=3)),
                  B=factor(rep(c("b1","b2","b3"),2)),
                  C.c1=1:6, C.c2=7:12, C.c3=13:18, C.c4=19:24)
arf <- as.CPF(arf)
arf
```

Note that by convention, the names of the columns for the parent
variables are the names of the parent variables, and the names of the
numeric columns have the format "_childname_`.`_statename_".
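
A small base-R sketch of this naming convention follows.  The child
name `C` and its states are hypothetical, and the parsing step assumes
that the child name itself contains no `.` character.

```{r namingSketch}
## Hypothetical child variable and states (illustration only).
childName <- "C"
childStates <- c("c1", "c2", "c3", "c4")

## Column names in the "childname.statename" format.
numCols <- paste(childName, childStates, sep = ".")
numCols

## Recover the child name and the state names from the column names
## (assumes that the child name contains no ".").
unique(sub("\\..*$", "", numCols))
sub("^[^.]*\\.", "", numCols)
```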

### Graphics

The `CPTtools` package supplies a method for the `lattice::barchart()`
generic function for `CPF`s (`barchart.CPF()`).  Each conditional
probability distribution is represented by a separate bar using color
intensity to indicate the states.  Here are some examples.

First, set up some information about the variables, in particular, the
lists of states for each variable.

```{r setup}

## Set up variables
skill1l <- c("High","Medium","Low") 
skill2l <- c("High","Medium","Low","LowerYet") 
correctL <- c("Correct","Incorrect") 
pcreditL <- c("Full","Partial","None")
gradeL <- c("A","B","C","D","E") 

```

Next, generate some bar charts illustrating the method.  We are using
the function `CPTtools::calcDPCFrame()` to build the `CPF` objects.
(This function is more fully described in the vignette
`DPCModels.Rmd`).

```{r NoParents, fig.cap="Unconditional probability table"}
cpfTheta <- calcDPCFrame(list(),skill1l,numeric(),0,rule="Compensatory",
                         link="normalLink",linkScale=.5)
     
barchart.CPF(cpfTheta)

```

```{r Binary, fig.cap="Binary child variable"}
cptComp <- calcDPCFrame(list(S2=skill2l,S1=skill1l),correctL,
                        lnAlphas=log(c(1.2,.8)), betas=0,
                        rule="Compensatory")
barchart.CPF(cptComp,layout=c(3,1))
```


```{r PartialCredit, fig.cap="Ordered categorical child variable"}
cptPC1 <- calcDPCFrame(list(S1=skill1l,S2=skill2l),pcreditL,
                       lnAlphas=log(1),
                       betas=list(full=c(S1=0,S2=999),partial=c(S1=999,S2=0)),
                       rule="OffsetDisjunctive")
barchart.CPF(cptPC1,baseCol="slateblue")
```


## CPA

The class `"CPA"` is a subclass of `array`; usually, `CPA`
objects have class `c("CPA","array")`.  This means that
operations which operate on arrays should do something sensible
with `CPA`s.  All of the entries in the CPA are numeric; the names of
the parent variables and the state labels are given in the
`dimnames()` of the array.  The function `as.CPA()` coerces an object
to be a `CPA` and `is.CPA()` tests whether or not it is a CPA. 

```{r CPA}
arr <- array(1:24,c(2,3,4),
             dimnames=list(A=c("a1","a2"),B=c("b1","b2","b3"),
                           C=c("c1","c2","c3","c4")))
arr <- as.CPA(arr)
arr

```

Note that `as.CPF()` and `as.CPA()` can be used to freely convert
between the two formats:

```{r conversion}
cat("The dimensions of this CPA are ",
    paste(dim(as.CPA(arf)),collapse=" x "),
    ".\n")
print(as.CPF(arr))

```
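
The relationship between the two layouts can be sketched in base R.
This is only an illustration of the reshaping idea, not the code used
by `as.CPF()` and `as.CPA()`; in particular, the row order produced by
the package may differ from the order `expand.grid()` gives here.  All
object names are made up for the example.

```{r reshapeSketch}
## A 2 x 3 x 4 array with named dimensions (illustration only).
exArr <- array(1:24, c(2, 3, 4),
               dimnames = list(A = c("a1", "a2"),
                               B = c("b1", "b2", "b3"),
                               C = c("c1", "c2", "c3", "c4")))
dn <- dimnames(exArr)
nPar <- length(dn) - 1              # number of parent dimensions

## Array to frame: unfold the parent dimensions into rows.
exMat <- matrix(exArr, ncol = dim(exArr)[nPar + 1])
colnames(exMat) <- paste(names(dn)[nPar + 1], dn[[nPar + 1]], sep = ".")
exDf <- data.frame(expand.grid(dn[1:nPar]), exMat)
exDf

## Frame to array: fold the numeric columns back up.
exBack <- array(as.matrix(exDf[, -(1:nPar)]), dim(exArr), dimnames = dn)
identical(exBack, exArr)
```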

## Accessing data and metadata

A properly labeled `CPF` contains metadata about the names of the
parent and child variables.  The functions `getTableParents()` and
`getTableStates()` get some of that metadata.  The function
`numericPart()` strips out the metadata and just leaves the remaining
numeric values; `factorPart()` strips out the numeric values and
leaves the parent state configurations.

```{r getStates}
getTableStates(arf)
getTableParents(arf)
numericPart(arf)
factorPart(arf)

```
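
For intuition, the split between the parent columns and the numeric
columns can be approximated in base R by testing which columns are
factors.  This sketch uses a made-up data frame and only mimics the
idea; the classes and attributes returned by the `CPTtools` accessors
may differ.

```{r partsSketch}
## A small data frame in the CPF layout (illustration only).
exFrame <- data.frame(A = factor(c("a1", "a1", "a2", "a2")),
                      B = factor(c("b1", "b2", "b1", "b2")),
                      C.c1 = c(.9, .6, .5, .1),
                      C.c2 = c(.1, .4, .5, .9))

isParent <- sapply(exFrame, is.factor)

names(exFrame)[isParent]                        # parent variable names
exFrame[, isParent, drop = FALSE]               # parent configurations
as.matrix(exFrame[, !isParent, drop = FALSE])   # numeric values
```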

# Hyperdirichlet distribution

`CPF`s and `CPA`s can be used to store three kinds of mathematical
objects:

* _Conditional Probability Tables_.  In this case the rows of the
  table (last dimension of the array) should sum to one, and all
  values should be non-negative.  (Each row is therefore a point in
  the unit simplex.)
  
* _Count Data_ (aka _Contingency Table_).  Each cell represents a
  configuration of parent and child values and the (non-negative)
  entries in the cells represent counts of the number of times this
  combination was observed.  
  
* _Hyperdirichlet Parameters_.  Each row of the table is the
  parameters of a Dirichlet distribution.  Although strictly speaking,
  all parameters of the Dirichlet distribution should be positive,
  zeros are allowed -- they just indicate that the corresponding
  probability is 0.

A conditional probability table corresponds to a conditional
multinomial distribution, whose data form a contingency table.  Each
row of the CPT gives the probabilities for the categories in the
corresponding row of the contingency table.  The sum of each row of
the contingency table will depend on how often that configuration of
parent states occurs in the data.

The Dirichlet distribution is the natural conjugate of the multinomial
distribution. The hyperdirichlet distribution is a series of
independent Dirichlet distributions, one for each row of the
contingency table.  The name comes from Spiegelhalter and Lauritzen
(2000), where they use it to refer to an entire Bayesian network in
which every CPT is parameterized in this way.  In `CPTtools`, the term
is used for any CPT parameterized in this way, and the package
deliberately allows some CPTs to have the hyperdirichlet
distributions, and others to use parametric models (see
DPCmodels.Rmd).

Because of the conjugacy, there are two important relationships with
hyperdirichlet models.  First, the expected conditional probability
table can be found from the hyperdirichlet parameters by dividing each
row by its sum (normalizing the table).  Second, if the prior
distribution parameters are given in a `CPF` and the data (contingency
table) is given in a `CPF` as well, then the posterior parameters will
be the `CPF` produced by adding the numeric parts of the `CPF`s.
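
Both relationships can be sketched with plain matrices in base R.  The
prior parameters and counts below are made up, and the matrices stand
in for the numeric parts of two `CPF`s with the same rows and columns.

```{r conjugacySketch}
## Made-up Dirichlet prior parameters: one row per parent configuration,
## one column per child state (illustration only).
exPrior <- matrix(c(2, 1,
                    1, 1,
                    1, 2), nrow = 3, byrow = TRUE,
                  dimnames = list(c("High", "Medium", "Low"),
                                  c("Correct", "Incorrect")))
## Made-up counts observed for the same configurations.
exCounts <- matrix(c(8, 2,
                     5, 5,
                     1, 9), nrow = 3, byrow = TRUE,
                   dimnames = dimnames(exPrior))

## Conjugate update: posterior parameters are prior plus data.
exPost <- exPrior + exCounts
exPost

## Expected CPT: divide each row by its sum.
exExpected <- exPost / rowSums(exPost)
exExpected
```

Dividing a matrix by a vector of row sums works here because R
recycles the vector down the columns, so every entry in row $i$ is
divided by the $i$th row sum.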


## Scaling and Normalization


If the rows of a `CPF` represent points in the probability simplex, the
values in each row should be non-negative and sum to 1.  Often it is convenient to
force a set of numbers into a probability simplex by simply dividing
by the sum.  The `normalize()` generic function attempts to do this.
It operates on `CPF`s and `CPA`s in a way consistent with their
conditional probability distribution.  It will also operate on a
generic array or matrix (normalizing the last dimension) or
`data.frame` (normalizing rows but ignoring non-numeric columns).

```{r normalization}
normalize(arf)
```

Dividing each row by its sum is one way we can rescale a table.
However, there are other reasons we might want to multiply each row by
a constant as well.  One way to store contingency table data is to
store a probability vector in each row in the `CPF` and a separate
vector of weights to represent the sample size of each row.  The
function `rescaleTable()` rescales the table by the specified factor.
`normalizeTable()` rescales the table by the row sums, and hence is
equivalent to `normalize.CPF()`.


```{r rescaling}
arf1 <- data.frame(A=factor(rep(c("a1","a2"),each=3)),
                   B=factor(rep(c("b1","b2","b3"),2)),
                   C.c1=rep(1,6), C.c2=rep(1,6), C.c3=rep(1,6),
                   C.c4=rep(1,6))
arf1

rescaleTable(arf1,1:6)

normalizeTable(arf1)

```
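
The idea of storing a probability vector per row together with a
separate weight can be sketched with `sweep()` in base R.  The counts
are made up, and this is an illustration of the arithmetic rather than
the implementation of `rescaleTable()` or `normalizeTable()`.

```{r sweepSketch}
## Made-up counts: rows are parent configurations (illustration only).
exTab <- matrix(c(3, 1,
                  2, 2,
                  0, 5), nrow = 3, byrow = TRUE)

## Split the counts into a probability vector per row plus a weight.
exWeights <- rowSums(exTab)
exProbs <- sweep(exTab, 1, exWeights, "/")
exProbs

## Multiplying each row by its weight recovers the counts.
exRecovered <- sweep(exProbs, 1, exWeights, "*")
exRecovered
```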

## Generating Contingency Tables

As mentioned previously, a contingency table is the natural conjugate
of the hyperdirichlet distribution.  The function `dataTable()` can be
used to construct contingency tables from data.

```{r dataTable}
## State names
skill1l <- c("High","Medium","Low") 
skill3l <- c("High","Better","Medium","Worse","Low") 
correctL <- c("Correct","Incorrect") 

## Read data from file
x <- read.csv(system.file("testFiles", "randomPinned100.csv",
                          package="CPTtools"),
              header=FALSE, as.is=TRUE,
              col.names = c("Skill1", "Skill2", "Skill3",
                            "Comp.Correct", "Comp.Grade",
                            "Conj.Correct", "Conj.Grade",
                            "Cor.Correct", "Cor.Grade",
                            "Dis.Correct", "Dis.Grade",
                            "Inhib.Correct", "Inhib.Grade"
                            ))
## Force variables to be ordered categories
x[,"Skill1"] <- ordered(x[,"Skill1"],skill1l)
x[,"Skill3"] <- ordered(x[,"Skill3"],skill3l)
x[,"Comp.Correct"] <- ordered(x[,"Comp.Correct"],correctL)


tab <- dataTable(x, c("Skill1","Skill3"),"Comp.Correct",correctL)

## Tab is just the numeric part, so use expand.grid to generate
## labels.
data.frame(expand.grid(list(Skill1=skill1l,Skill3=skill3l)),tab)

```
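
As a point of comparison, base R's `table()` performs a similar
cross-tabulation of raw observations.  The data below are made up, and
the sketch is only an analogy; it does not show how `dataTable()` is
implemented.

```{r tableSketch}
## Made-up observations (illustration only).
exObs <- data.frame(
  Skill = factor(c("High", "High", "Low", "Low", "Low", "Medium"),
                 levels = c("High", "Medium", "Low")),
  Item = factor(c("Correct", "Correct", "Incorrect", "Correct",
                  "Incorrect", "Incorrect"),
                levels = c("Correct", "Incorrect")))

## Cross-tabulate the parent against the child; setting the factor levels
## explicitly keeps configurations with zero counts in the table.
exCross <- table(exObs$Skill, exObs$Item)
exCross
```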

## Acknowledgements

Work on RNetica, CPTtools and Peanut has been sponsored in part by the
following grants:

* Bill & Melinda Gates Foundation grant "Games as Learning/Assessment:
Stealth Assessment" (#0PP1035331, Val Shute, PI)

* National Science Foundation grant "DIP:
Game-based Assessment and Support of STEM-related Competencies"
(#1628937, Val Shute, PI).

* National Science Foundation grant "Mathematical Learning via
Architectural Design and Modeling Using E-Rebuild." (#1720533,
Fengfeng Ke, PI)

* Institute of Education Sciences grant "Exploring Adaptive
  Cognitive and Affective Learning Support for Next-Generation STEM
  Learning Games", (R305A170376,Russell Almond, PI)