--- title: "Conditional Probability Frames and Arrays" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Conditional Probability Frames and Arrays} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r loadLibs} library(CPTtools) ``` # Conditional Probabilities over Discrete variables. In a discrete Bayesian network, each node, $Y$, has an assoicated conditional probability table (_CPT_). Let $X_1, \ldots, X_K$ be the parent nodes, each of which has $|X_k|$ states. The conditional probablity distribution $\Pr(Y|X_1=x_1,\ldots,X_k=X_K)$ os a distrete distribution over the $|Y|=M$ states of $Y$. Note that there are $|X_1|\times\cdots\times|Y_K|=S$ possible configurations of the parent variables, so that the conditional proability distribution is actually a set of $S$ probability distributions. If they are stacked into a matrix, $S\times M$ matrix, this is the CPT. If there are no parents, the unconditional probability table consists of a single row ($S=1$). The `CPTtools` package offers two ways of representing conditional probability distributions (as well as a number of tools for manipulating them.) * `"CPF"` (_Conditional Probability Frame_). This is an R `data.frame` whose first $K$ columns are factor variables corresponding to the parents, and hence define the condition, and whose last $M$ columns are numeric variables corresponding to the states of the child variable $Y$. * `"CPA"` (_Conditional Proability Array_). This is a $K+1$ dimensional array where the first $K$ dimensions correspond to the parent variables and the last dimension the child Note that the contents of the `CPF` and `CPA` are not constrained to be probability distributions (i.e., each row need not sum to one). In particular, contigency tables, tables of counts of cases occuring in various configurations, are natuaral the natural conjugate the the CPT, and are often used as data. These can also be stored in the `CPF` and `CPA` classes. ## CPF The class `"CPF"` is a subclass of `data.frame`; usually, `CPF` objects have class `c("CPF","data.frame")`. This means that operations which operate on data frames should do something sensible with `CPF`s. Note that the columns representing the parent variable states must be of class `factor()`, which might require an explicit call to `factor()` or `as.factor()`, if the values are character. The function `as.CPF()` coerces an object to be a `CPF` and `is.CPF()` tests whether or not it is a CPF. ```{r CPF} # Note: in R 4.0, the factor() call is required. arf <- data.frame(A=factor(rep(c("a1","a2"),each=3)), B=factor(rep(c("b1","b2","b3"),2)), C.c1=1:6, C.c2=7:12, C.c3=13:18, C.c4=19:24) arf <- as.CPF(arf) arf ``` Note that by convention, the names of the columns for the parent variables are the names of the parent variables, and the names of the numeric columns have the format "_childname_`.`_statename_". ### Graphics The `CPTtools` package supplies a method for the `lattice::barchart()` generic function for `CPF`s (`barchart.CPF()`). Each conditional probability distribution is represented by a separate bar using color intensity to indicate the states. Here are some examples. First, set up some information about the variables, in particular, the lists of states for each varaible. ```{r setup} ## Set up variables skill1l <- c("High","Medium","Low") skill2l <- c("High","Medium","Low","LowerYet") correctL <- c("Correct","Incorrect") pcreditL <- c("Full","Partial","None") gradeL <- c("A","B","C","D","E") ``` Next, generate some bar charts illustrating the method. We are using the function `CPTtools::calcDPCFrame()` to build the `CPF` objects. (This function is more fully described in the vignette `DPCModels.Rmd`). ```{r NoParents, fig.caption="Unconditional proability table"} cpfTheta <- calcDPCFrame(list(),skill1l,numeric(),0,rule="Compensatory", link="normalLink",linkScale=.5) barchart.CPF(cpfTheta) ``` ```{r Binary, fig.caption="Binary child variable"} cptComp <- calcDPCFrame(list(S2=skill2l,S1=skill1l),correctL, lnAlphas=log(c(1.2,.8)), betas=0, rule="Compensatory") barchart.CPF(cptComp,layout=c(3,1)) ``` ```{r PartialCredit, fig.caption="Ordered categorial child variable"} cptPC1 <- calcDPCFrame(list(S1=skill1l,S2=skill2l),pcreditL, lnAlphas=log(1), betas=list(full=c(S1=0,S2=999),partial=c(S2=999,S2=0)), rule="OffsetDisjunctive") barchart.CPF(cptPC1,baseCol="slateblue") ``` ## CPA The class `"CPA"` is a subclass of `array`; usually, `CPA` objects have class `c("CPA","data.frame")`. This means that operations which operate on arrays should do something sensible with `CPA`s. All of the entries in the CPA are numeric, the names of the parent variables and the state labels are given in the `dimnames()` of the array. The function `as.CPA()` coerces an object to be a `CPA` and `is.CPA()` tests whether or not it is a CPA. ```{r CPA} arr <- array(1:24,c(2,3,4), dimnames=list(A=c("a1","a2"),B=c("b1","b2","b3"), C=c("c1","c2","c3","c4"))) arr <- as.CPA(arr) arr ``` Note that `as.CPF()` and `as.CPA()` can be used to freely convert between the two formats: ```{r conversion} cat("The dimensions of this CPA are ", paste(dim(as.CPA(arf)),collapse=" x "), ".\n") print(as.CPF(arr)) ``` ## Accessing data and metadata A properly labeled `CPF` contains metadata about the names of the parents and child variables. The functions `getTableParents()` and `getTableStates()` get some of that metadata. The function `numericPart()` strips out the metadata and just leaves the remaining numeric values; `factorPart()` strips out the numeric values and leaves the parent state configurations. ```{r getStates} getTableStates(arf) getTableParents(arf) numericPart(arf) factorPart(arf) ``` # Hyperdirichlet distribution `CPF`s and `CPA`s can be used to store three kinds of mathematical objects: * _Conditional Probability Tables_. In this case the rows of the table (last dimension of the array) should sum to one, and all values should be non-negative. (Each row is therefore a value over the unit simplex.) * _Count Data_ (aka _Contigency Table_). Each cell represents a configuration of parent and child values and the (non-negative) entries in the cells represent counts of the number of times this combination was observed. * _Hyperdirechlet Parameters_. Each row of the table is the parameters of a Dirchlet distribution. Although strictly speaking, all parameters of the Dirichlet distribution should be positive, zeros are allowed -- they just indicate that the corresponding probability is 0. A conditional probability table corresponds to a conditional multinomial distribution, or contigency table. Each row of the CPT gives the probability for the categories in the corresponding row of the contigency table. The sum of each row will depend on how often that configuration of parent states occurs in the data. The Dirichlet distribution is the natural conjugate of the multinomial distribution. The hyperdirichlet distrubtion is a series of independent Dirichlet distributions, one for each row of the contigency table. The name comes from Speigelhalter and Lauritzen (2000), where they use it to refer to an entire Bayesian network in which every CPT is parameterized in this way. In `CPTtools`, the term is used for any CPT parameterized in this way, and the package deliberately allows some CPTs to have the hyperdirichlet distributions, and others to use parametric models (see DPCmodels.Rmd). Because of the conjugacy, there are two important relationships with hyperdirichlet models. First, the expected conditional probability table can be found from the hyperdirichlet parameters by dividing each row by its sum (normalizing the table). Second, if the prior distribution parameters are given in a `CPF` and the data (contigency table) is given in a `CPF` as well, then the posterior parameters will be the `CPF` produced by adding the numeric parts of the `CPF`s. ## Scaling and Normalization If the rows of a `CPF` represent a probability simplex, they should all be non-negative and sum to 1. Often it is convenient to force a set of numbers into a probability simplex by simply dividing by the sum. The `normalize()` generic function attempts to do this. It operates on `CPF`s and `CPA`s in a way consistent with their conditional probability distribution. It will also operate on a generic array or matrix (normalizing the last dimension) or `data.frame` (normalizing rows but ignoring non-numeric columns). ```{r normalization} normalize(arf) ``` Dividing each row by its sum is one way we can rescale a table. However, there are other reasons we might want to multiply each row by a constant as well. One way to store contingency table data is to store a probablity vector in each row in the `CPF` and a separate vector of weights to represent the sample size of each row. The function `rescaleTable()` rescales the table by the specified factor. `normalizeTable()` rescales the table by the row sums, and hence is equivalent to `normalize.CPF()`. ```{r rescaling} arf1 <- data.frame(A=factor(rep(c("a1","a2"),each=3)), B=factor(rep(c("b1","b2","b3"),2)), C.c1=rep(1,6), C.c2=rep(1,6), C.c3=rep(1,6), C.c4=rep(1,6)) arf1 rescaleTable(arf1,1:6) normalizeTable(arf1) ``` ## Generating Contingency Tables As mentioned previously, a contingency table is the natural conjugate of the hyperdirichlet distribution. The function `dataTable()` can be used to construct contigency tables from data. ```{r dataTable} ## State names skill1l <- c("High","Medium","Low") skill3l <- c("High","Better","Medium","Worse","Low") correctL <- c("Correct","Incorrect") ## Read data from file x <- read.csv(system.file("testFiles", "randomPinned100.csv", package="CPTtools"), header=FALSE, as.is=TRUE, col.names = c("Skill1", "Skill2", "Skill3", "Comp.Correct", "Comp.Grade", "Conj.Correct", "Conj.Grade", "Cor.Correct", "Cor.Grade", "Dis.Correct", "Dis.Grade", "Inhib.Correct", "Inhib.Grade" )) ## Force variables to be ordered categories x[,"Skill1"] <- ordered(x[,"Skill1"],skill1l) x[,"Skill3"] <- ordered(x[,"Skill3"],skill3l) x[,"Comp.Correct"] <- ordered(x[,"Comp.Correct"],correctL) tab <- dataTable(x, c("Skill1","Skill3"),"Comp.Correct",correctL) ## Tab is just the numeric part, so use expand.grid to generate ## labels. data.frame(expand.grid(list(Skill1=skill1l,Skill3=skill3l)),tab) ``` ## Acknowledgements Work on RNetica, CPTtools and Peanut has been sponsored in part by the following grants: * Bill & Melinda Gates Foundation grant "Games as Learning/Assessment: Stealth Assessment" (#0PP1035331, Val Shute, PI) * National Science Foundation grant "DIP: Game-based Assessment and Support of STEM-related Competencies" (#1628937, Val Shute, PI). * National Science Foundation grant "Mathematical Learning via Architectual Design and Modeling Using E-Rebuild." (#1720533, Fengfeng Ke, PI) * Intitute of Educational Statistics grant "Exploring Adaptive Cognitive and Affective Learning Support for Next-Generation STEM Learning Games", (R305A170376,Russell Almond, PI)