Conditional Probability Frames and Arrays

library(CPTtools)

Conditional Probabilities over Discrete variables.

In a discrete Bayesian network, each node, Y, has an assoicated conditional probability table (CPT). Let X1, …, XK be the parent nodes, each of which has |Xk| states. The conditional probablity distribution Pr (Y|X1 = x1, …, Xk = XK) os a distrete distribution over the |Y| = M states of Y. Note that there are |X1| × ⋯ × |YK| = S possible configurations of the parent variables, so that the conditional proability distribution is actually a set of S probability distributions. If they are stacked into a matrix, S × M matrix, this is the CPT. If there are no parents, the unconditional probability table consists of a single row (S = 1).

The CPTtools package offers two ways of representing conditional probability distributions (as well as a number of tools for manipulating them.)

  • "CPF" (Conditional Probability Frame). This is an R data.frame whose first K columns are factor variables corresponding to the parents, and hence define the condition, and whose last M columns are numeric variables corresponding to the states of the child variable Y.

  • "CPA" (Conditional Proability Array). This is a K + 1 dimensional array where the first K dimensions correspond to the parent variables and the last dimension the child

Note that the contents of the CPF and CPA are not constrained to be probability distributions (i.e., each row need not sum to one). In particular, contigency tables, tables of counts of cases occuring in various configurations, are natuaral the natural conjugate the the CPT, and are often used as data. These can also be stored in the CPF and CPA classes.

CPF

The class "CPF" is a subclass of data.frame; usually, CPF objects have class c("CPF","data.frame"). This means that operations which operate on data frames should do something sensible with CPFs. Note that the columns representing the parent variable states must be of class factor(), which might require an explicit call to factor() or as.factor(), if the values are character. The function as.CPF() coerces an object to be a CPF and is.CPF() tests whether or not it is a CPF.

# Note:  in R 4.0, the factor() call is required.
arf <- data.frame(A=factor(rep(c("a1","a2"),each=3)),
                  B=factor(rep(c("b1","b2","b3"),2)),
                  C.c1=1:6, C.c2=7:12, C.c3=13:18, C.c4=19:24)
arf <- as.CPF(arf)
arf
#>    A  B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1    1    7   13   19
#> 2 a1 b2    2    8   14   20
#> 3 a1 b3    3    9   15   21
#> 4 a2 b1    4   10   16   22
#> 5 a2 b2    5   11   17   23
#> 6 a2 b3    6   12   18   24

Note that by convention, the names of the columns for the parent variables are the names of the parent variables, and the names of the numeric columns have the format “childname.statename”.

Graphics

The CPTtools package supplies a method for the lattice::barchart() generic function for CPFs (barchart.CPF()). Each conditional probability distribution is represented by a separate bar using color intensity to indicate the states. Here are some examples.

First, set up some information about the variables, in particular, the lists of states for each varaible.


## Set up variables
skill1l <- c("High","Medium","Low") 
skill2l <- c("High","Medium","Low","LowerYet") 
correctL <- c("Correct","Incorrect") 
pcreditL <- c("Full","Partial","None")
gradeL <- c("A","B","C","D","E") 

Next, generate some bar charts illustrating the method. We are using the function CPTtools::calcDPCFrame() to build the CPF objects. (This function is more fully described in the vignette DPCModels.Rmd).

cpfTheta <- calcDPCFrame(list(),skill1l,numeric(),0,rule="Compensatory",
                         link="normalLink",linkScale=.5)
     
barchart.CPF(cpfTheta)

cptComp <- calcDPCFrame(list(S2=skill2l,S1=skill1l),correctL,
                        lnAlphas=log(c(1.2,.8)), betas=0,
                           rule="Compensatory")
barchart.CPF(cptComp,layout=c(3,1))

cptPC1 <- calcDPCFrame(list(S1=skill1l,S2=skill2l),pcreditL,
                       lnAlphas=log(1),
                       betas=list(full=c(S1=0,S2=999),partial=c(S2=999,S2=0)),
                       rule="OffsetDisjunctive")
barchart.CPF(cptPC1,baseCol="slateblue")

CPA

The class "CPA" is a subclass of array; usually, CPA objects have class c("CPA","data.frame"). This means that operations which operate on arrays should do something sensible with CPAs. All of the entries in the CPA are numeric, the names of the parent variables and the state labels are given in the dimnames() of the array. The function as.CPA() coerces an object to be a CPA and is.CPA() tests whether or not it is a CPA.

arr <- array(1:24,c(2,3,4),
             dimnames=list(A=c("a1","a2"),B=c("b1","b2","b3"),
                           C=c("c1","c2","c3","c4")))
arr <- as.CPA(arr)
arr
#> , , C = c1
#> 
#>     B
#> A    b1 b2 b3
#>   a1  1  3  5
#>   a2  2  4  6
#> 
#> , , C = c2
#> 
#>     B
#> A    b1 b2 b3
#>   a1  7  9 11
#>   a2  8 10 12
#> 
#> , , C = c3
#> 
#>     B
#> A    b1 b2 b3
#>   a1 13 15 17
#>   a2 14 16 18
#> 
#> , , C = c4
#> 
#>     B
#> A    b1 b2 b3
#>   a1 19 21 23
#>   a2 20 22 24
#> 
#> attr(,"class")
#> [1] "CPA"   "array"

Note that as.CPF() and as.CPA() can be used to freely convert between the two formats:

cat("The dimensions of this CPA are ",
    paste(dim(as.CPA(arf)),collapse=" x "),
    ".\n")
#> The dimensions of this CPA are  2 x 3 x 4 .
print(as.CPF(arr))
#>    A  B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1    1    7   13   19
#> 2 a2 b1    2    8   14   20
#> 3 a1 b2    3    9   15   21
#> 4 a2 b2    4   10   16   22
#> 5 a1 b3    5   11   17   23
#> 6 a2 b3    6   12   18   24

Accessing data and metadata

A properly labeled CPF contains metadata about the names of the parents and child variables. The functions getTableParents() and getTableStates() get some of that metadata. The function numericPart() strips out the metadata and just leaves the remaining numeric values; factorPart() strips out the numeric values and leaves the parent state configurations.

getTableStates(arf)
#> [1] "C.c1" "C.c2" "C.c3" "C.c4"
getTableParents(arf)
#> [1] "A" "B"
numericPart(arf)
#>      C.c1 C.c2 C.c3 C.c4
#> [1,]    1    7   13   19
#> [2,]    2    8   14   20
#> [3,]    3    9   15   21
#> [4,]    4   10   16   22
#> [5,]    5   11   17   23
#> [6,]    6   12   18   24
factorPart(arf)
#>    A  B
#> 1 a1 b1
#> 2 a1 b2
#> 3 a1 b3
#> 4 a2 b1
#> 5 a2 b2
#> 6 a2 b3

Hyperdirichlet distribution

CPFs and CPAs can be used to store three kinds of mathematical objects:

  • Conditional Probability Tables. In this case the rows of the table (last dimension of the array) should sum to one, and all values should be non-negative. (Each row is therefore a value over the unit simplex.)

  • Count Data (aka Contigency Table). Each cell represents a configuration of parent and child values and the (non-negative) entries in the cells represent counts of the number of times this combination was observed.

  • Hyperdirechlet Parameters. Each row of the table is the parameters of a Dirchlet distribution. Although strictly speaking, all parameters of the Dirichlet distribution should be positive, zeros are allowed – they just indicate that the corresponding probability is 0.

A conditional probability table corresponds to a conditional multinomial distribution, or contigency table. Each row of the CPT gives the probability for the categories in the corresponding row of the contigency table. The sum of each row will depend on how often that configuration of parent states occurs in the data.

The Dirichlet distribution is the natural conjugate of the multinomial distribution. The hyperdirichlet distrubtion is a series of independent Dirichlet distributions, one for each row of the contigency table. The name comes from Speigelhalter and Lauritzen (2000), where they use it to refer to an entire Bayesian network in which every CPT is parameterized in this way. In CPTtools, the term is used for any CPT parameterized in this way, and the package deliberately allows some CPTs to have the hyperdirichlet distributions, and others to use parametric models (see DPCmodels.Rmd).

Because of the conjugacy, there are two important relationships with hyperdirichlet models. First, the expected conditional probability table can be found from the hyperdirichlet parameters by dividing each row by its sum (normalizing the table). Second, if the prior distribution parameters are given in a CPF and the data (contigency table) is given in a CPF as well, then the posterior parameters will be the CPF produced by adding the numeric parts of the CPFs.

Scaling and Normalization

If the rows of a CPF represent a probability simplex, they should all be non-negative and sum to 1. Often it is convenient to force a set of numbers into a probability simplex by simply dividing by the sum. The normalize() generic function attempts to do this. It operates on CPFs and CPAs in a way consistent with their conditional probability distribution. It will also operate on a generic array or matrix (normalizing the last dimension) or data.frame (normalizing rows but ignoring non-numeric columns).

normalize(arf)
#>    A  B       C.c1      C.c2      C.c3      C.c4
#> 1 a1 b1 0.02500000 0.1750000 0.3250000 0.4750000
#> 2 a1 b2 0.04545455 0.1818182 0.3181818 0.4545455
#> 3 a1 b3 0.06250000 0.1875000 0.3125000 0.4375000
#> 4 a2 b1 0.07692308 0.1923077 0.3076923 0.4230769
#> 5 a2 b2 0.08928571 0.1964286 0.3035714 0.4107143
#> 6 a2 b3 0.10000000 0.2000000 0.3000000 0.4000000

Dividing each row by its sum is one way we can rescale a table. However, there are other reasons we might want to multiply each row by a constant as well. One way to store contingency table data is to store a probablity vector in each row in the CPF and a separate vector of weights to represent the sample size of each row. The function rescaleTable() rescales the table by the specified factor. normalizeTable() rescales the table by the row sums, and hence is equivalent to normalize.CPF().

arf1 <- data.frame(A=factor(rep(c("a1","a2"),each=3)),
                  B=factor(rep(c("b1","b2","b3"),2)),
                  C.c1=rep(1,6), C.c2=rep(1,6), C.c3=rep(1,6),
                  C.c4=rep(1,6))
arf1
#>    A  B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1    1    1    1    1
#> 2 a1 b2    1    1    1    1
#> 3 a1 b3    1    1    1    1
#> 4 a2 b1    1    1    1    1
#> 5 a2 b2    1    1    1    1
#> 6 a2 b3    1    1    1    1

rescaleTable(arf1,1:6)
#>    A  B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1    1    1    1    1
#> 2 a1 b2    2    2    2    2
#> 3 a1 b3    3    3    3    3
#> 4 a2 b1    4    4    4    4
#> 5 a2 b2    5    5    5    5
#> 6 a2 b3    6    6    6    6

normalizeTable(arf1)
#>    A  B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1 0.25 0.25 0.25 0.25
#> 2 a1 b2 0.25 0.25 0.25 0.25
#> 3 a1 b3 0.25 0.25 0.25 0.25
#> 4 a2 b1 0.25 0.25 0.25 0.25
#> 5 a2 b2 0.25 0.25 0.25 0.25
#> 6 a2 b3 0.25 0.25 0.25 0.25

Generating Contingency Tables

As mentioned previously, a contingency table is the natural conjugate of the hyperdirichlet distribution. The function dataTable() can be used to construct contigency tables from data.

## State names
skill1l <- c("High","Medium","Low") 
skill3l <- c("High","Better","Medium","Worse","Low") 
correctL <- c("Correct","Incorrect") 

## Read data from file
x <- read.csv(system.file("testFiles", "randomPinned100.csv",
                          package="CPTtools"),
            header=FALSE, as.is=TRUE,
            col.names = c("Skill1", "Skill2", "Skill3",
                          "Comp.Correct", "Comp.Grade",
                          "Conj.Correct", "Conj.Grade",
                          "Cor.Correct", "Cor.Grade",
                          "Dis.Correct", "Dis.Grade",
                          "Inhib.Correct", "Inhib.Grade"
                          ))
## Force variables to be ordered categories
x[,"Skill1"] <- ordered(x[,"Skill1"],skill1l)
x[,"Skill3"] <- ordered(x[,"Skill3"],skill3l)
x[,"Comp.Correct"] <- ordered(x[,"Comp.Correct"],correctL)


tab <- dataTable(x, c("Skill1","Skill3"),"Comp.Correct",correctL)

## Tab is just the numeric part, so use expand.grid to generate
## labels.
data.frame(expand.grid(list(Skill1=skill1l,Skill3=skill3l)),tab)
#>    Skill1 Skill3 Correct Incorrect
#> 1    High   High      21         0
#> 2  Medium   High       6         2
#> 3     Low   High       0         0
#> 4    High Better       1         0
#> 5  Medium Better       9         1
#> 6     Low Better       0         0
#> 7    High Medium       0         0
#> 8  Medium Medium       9         0
#> 9     Low Medium       0         0
#> 10   High  Worse       0         0
#> 11 Medium  Worse       8         1
#> 12    Low  Worse       1         0
#> 13   High    Low       0         0
#> 14 Medium    Low       7         4
#> 15    Low    Low      15        15

Acknowledgements

Work on RNetica, CPTtools and Peanut has been sponsored in part by the following grants:

  • Bill & Melinda Gates Foundation grant “Games as Learning/Assessment: Stealth Assessment” (#0PP1035331, Val Shute, PI)

  • National Science Foundation grant “DIP: Game-based Assessment and Support of STEM-related Competencies” (#1628937, Val Shute, PI).

  • National Science Foundation grant “Mathematical Learning via Architectual Design and Modeling Using E-Rebuild.” (#1720533, Fengfeng Ke, PI)

  • Intitute of Educational Statistics grant “Exploring Adaptive Cognitive and Affective Learning Support for Next-Generation STEM Learning Games”, (R305A170376,Russell Almond, PI)