In a discrete Bayesian network, each node Y has an associated conditional probability table (CPT). Let X1, …, XK be the parent nodes, each of which has |Xk| states. The conditional probability distribution Pr(Y | X1 = x1, …, XK = xK) is a discrete distribution over the |Y| = M states of Y. Note that there are |X1| × ⋯ × |XK| = S possible configurations of the parent variables, so the conditional probability distribution is actually a set of S probability distributions. Stacking these into an S × M matrix gives the CPT. If there are no parents, the unconditional probability table consists of a single row (S = 1).
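For concreteness, here is a minimal sketch (with invented probabilities) of a CPT for a child with M = 2 states and two parents with 2 and 3 states respectively, so that S = 6:

cpt <- matrix(c(.9,.1, .7,.3, .5,.5, .6,.4, .4,.6, .2,.8),
              nrow=6, ncol=2, byrow=TRUE,
              dimnames=list(NULL, c("Yes","No")))
rowSums(cpt)   ## each conditional distribution sums to one
#> [1] 1 1 1 1 1 1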
The CPTtools package offers two ways of representing conditional probability distributions, as well as a number of tools for manipulating them.
"CPF"
(Conditional Probability Frame). This
is an R data.frame
whose first K columns are factor variables
corresponding to the parents, and hence define the condition, and whose
last M columns are numeric
variables corresponding to the states of the child variable Y.
"CPA"
(Conditional Proability Array). This
is a K + 1 dimensional array
where the first K dimensions
correspond to the parent variables and the last dimension the
child
Note that the contents of the CPF and CPA are not constrained to be probability distributions (i.e., each row need not sum to one). In particular, contingency tables, tables of counts of cases occurring in various configurations, are the natural conjugate of the CPT, and are often used as data. These can also be stored in the CPF and CPA classes.
The class "CPF"
is a subclass of
data.frame
; usually, CPF
objects have class
c("CPF","data.frame")
. This means that operations which
operate on data frames should do something sensible with
CPF
s. Note that the columns representing the parent
variable states must be of class factor()
, which might
require an explicit call to factor()
or
as.factor()
, if the values are character. The function
as.CPF()
coerces an object to be a CPF
and
is.CPF()
tests whether or not it is a CPF.
## Note: in R 4.0 and later, the explicit factor() calls are required,
## as stringsAsFactors now defaults to FALSE.
arf <- data.frame(A=factor(rep(c("a1","a2"),each=3)),
                  B=factor(rep(c("b1","b2","b3"),2)),
                  C.c1=1:6, C.c2=7:12, C.c3=13:18, C.c4=19:24)
arf <- as.CPF(arf)
arf
#> A B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1 1 7 13 19
#> 2 a1 b2 2 8 14 20
#> 3 a1 b3 3 9 15 21
#> 4 a2 b1 4 10 16 22
#> 5 a2 b2 5 11 17 23
#> 6 a2 b3 6 12 18 24
Note that by convention, the names of the columns for the parent variables are the names of the parent variables, and the names of the numeric columns have the format “childname.statename”.
The CPTtools package supplies a method for the lattice::barchart() generic function for CPFs (barchart.CPF()). Each conditional probability distribution is represented by a separate bar, with color intensity indicating the states. Here are some examples.
First, set up some information about the variables, in particular the lists of states for each variable.
## Set up variables
skill1l <- c("High","Medium","Low")
skill2l <- c("High","Medium","Low","LowerYet")
correctL <- c("Correct","Incorrect")
pcreditL <- c("Full","Partial","None")
gradeL <- c("A","B","C","D","E")
Next, generate some bar charts illustrating the method. We are using the function CPTtools::calcDPCFrame() to build the CPF objects. (This function is more fully described in the vignette DPCModels.Rmd.)
cpfTheta <- calcDPCFrame(list(),skill1l,numeric(),0,rule="Compensatory",
                         link="normalLink",linkScale=.5)
barchart.CPF(cpfTheta)
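The method works for any CPF; as a further sketch, the count table arf built above can be normalized (using normalize(), described below) and plotted the same way:

barchart.CPF(normalize(arf))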
The class "CPA"
is a subclass of array
;
usually, CPA
objects have class
c("CPA","data.frame")
. This means that operations which
operate on arrays should do something sensible with CPA
s.
All of the entries in the CPA are numeric; the names of the parent variables and the state labels are given in the dimnames() of the array. The function as.CPA() coerces an object to be a CPA, and is.CPA() tests whether or not it is a CPA.
arr <- array(1:24,c(2,3,4),
             dimnames=list(A=c("a1","a2"),B=c("b1","b2","b3"),
                           C=c("c1","c2","c3","c4")))
arr <- as.CPA(arr)
arr
#> , , C = c1
#>
#> B
#> A b1 b2 b3
#> a1 1 3 5
#> a2 2 4 6
#>
#> , , C = c2
#>
#> B
#> A b1 b2 b3
#> a1 7 9 11
#> a2 8 10 12
#>
#> , , C = c3
#>
#> B
#> A b1 b2 b3
#> a1 13 15 17
#> a2 14 16 18
#>
#> , , C = c4
#>
#> B
#> A b1 b2 b3
#> a1 19 21 23
#> a2 20 22 24
#>
#> attr(,"class")
#> [1] "CPA" "array"
Note that as.CPF() and as.CPA() can be used to freely convert between the two formats:
cat("The dimensions of this CPA are ",
paste(dim(as.CPA(arf)),collapse=" x "),
".\n")
#> The dimensions of this CPA are 2 x 3 x 4 .
print(as.CPF(arr))
#> A B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1 1 7 13 19
#> 2 a2 b1 2 8 14 20
#> 3 a1 b2 3 9 15 21
#> 4 a2 b2 4 10 16 22
#> 5 a1 b3 5 11 17 23
#> 6 a2 b3 6 12 18 24
A properly labeled CPF contains metadata about the names of the parent and child variables. The functions getTableParents() and getTableStates() get some of that metadata. The function numericPart() strips out the metadata and just leaves the remaining numeric values; factorPart() strips out the numeric values and leaves the parent state configurations.
getTableStates(arf)
#> [1] "C.c1" "C.c2" "C.c3" "C.c4"
getTableParents(arf)
#> [1] "A" "B"
numericPart(arf)
#> C.c1 C.c2 C.c3 C.c4
#> [1,] 1 7 13 19
#> [2,] 2 8 14 20
#> [3,] 3 9 15 21
#> [4,] 4 10 16 22
#> [5,] 5 11 17 23
#> [6,] 6 12 18 24
factorPart(arf)
#> A B
#> 1 a1 b1
#> 2 a1 b2
#> 3 a1 b3
#> 4 a2 b1
#> 5 a2 b2
#> 6 a2 b3
CPFs and CPAs can be used to store three kinds of mathematical objects:
Conditional Probability Tables. In this case the rows of the table (last dimension of the array) should sum to one, and all values should be non-negative. (Each row is therefore a point in the unit simplex.)
Count Data (aka Contingency Tables). Each cell represents a configuration of parent and child values, and the (non-negative) entries in the cells represent counts of the number of times this combination was observed.
Hyperdirichlet Parameters. Each row of the table gives the parameters of a Dirichlet distribution. Although strictly speaking all parameters of the Dirichlet distribution should be positive, zeros are allowed; they just indicate that the corresponding probability is 0.
A conditional probability table corresponds to a conditional multinomial distribution, or contingency table. Each row of the CPT gives the probabilities for the categories in the corresponding row of the contingency table. The sum of each row of the contingency table will depend on how often that configuration of parent states occurs in the data.
The Dirichlet distribution is the natural conjugate of the multinomial distribution. The hyperdirichlet distribution is a series of independent Dirichlet distributions, one for each row of the contingency table. The name comes from Spiegelhalter and Lauritzen (2000), where they use it to refer to an entire Bayesian network in which every CPT is parameterized in this way. In CPTtools, the term is used for any CPT parameterized in this way, and the package deliberately allows some CPTs to have hyperdirichlet distributions while others use parametric models (see DPCModels.Rmd).
Because of the conjugacy, there are two important relationships for hyperdirichlet models. First, the expected conditional probability table can be found from the hyperdirichlet parameters by dividing each row by its sum (normalizing the table). Second, if the prior distribution parameters are given in a CPF and the data (contingency table) are given in a CPF as well, then the posterior parameters will be the CPF produced by adding the numeric parts of the CPFs.
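As a minimal sketch of the second relationship, let the count table arf from above play the role of the data, and take an invented flat prior of one pseudo-count per cell:

prior <- arf
prior[,3:6] <- 1        ## flat prior: one pseudo-count per cell
posterior <- prior
## Posterior parameters: add the numeric parts of prior and data.
posterior[,3:6] <- numericPart(prior) + numericPart(arf)
normalize(posterior)    ## expected conditional probability table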
If the rows of a CPF represent probability distributions, they should all be non-negative and sum to 1. Often it is convenient to force a set of numbers into a probability simplex by simply dividing by the sum. The normalize() generic function attempts to do this. It operates on CPFs and CPAs in a way consistent with their interpretation as conditional probability distributions. It will also operate on a generic array or matrix (normalizing over the last dimension) or a data.frame (normalizing rows but ignoring non-numeric columns).
normalize(arf)
#> A B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1 0.02500000 0.1750000 0.3250000 0.4750000
#> 2 a1 b2 0.04545455 0.1818182 0.3181818 0.4545455
#> 3 a1 b3 0.06250000 0.1875000 0.3125000 0.4375000
#> 4 a2 b1 0.07692308 0.1923077 0.3076923 0.4230769
#> 5 a2 b2 0.08928571 0.1964286 0.3035714 0.4107143
#> 6 a2 b3 0.10000000 0.2000000 0.3000000 0.4000000
Dividing each row by its sum is one way to rescale a table, but there are other reasons to multiply each row by a constant. For example, one way to store contingency table data is to store a probability vector in each row of the CPF along with a separate vector of weights representing the sample size of each row. The function rescaleTable() rescales the table by the specified factor, while normalizeTable() rescales the table by the row sums, and hence is equivalent to normalize.CPF().
arf1 <- data.frame(A=factor(rep(c("a1","a2"),each=3)),
                   B=factor(rep(c("b1","b2","b3"),2)),
                   C.c1=rep(1,6), C.c2=rep(1,6), C.c3=rep(1,6),
                   C.c4=rep(1,6))
arf1
#> A B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1 1 1 1 1
#> 2 a1 b2 1 1 1 1
#> 3 a1 b3 1 1 1 1
#> 4 a2 b1 1 1 1 1
#> 5 a2 b2 1 1 1 1
#> 6 a2 b3 1 1 1 1
rescaleTable(arf1,1:6)
#> A B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1 1 1 1 1
#> 2 a1 b2 2 2 2 2
#> 3 a1 b3 3 3 3 3
#> 4 a2 b1 4 4 4 4
#> 5 a2 b2 5 5 5 5
#> 6 a2 b3 6 6 6 6
normalizeTable(arf1)
#> A B C.c1 C.c2 C.c3 C.c4
#> 1 a1 b1 0.25 0.25 0.25 0.25
#> 2 a1 b2 0.25 0.25 0.25 0.25
#> 3 a1 b3 0.25 0.25 0.25 0.25
#> 4 a2 b1 0.25 0.25 0.25 0.25
#> 5 a2 b2 0.25 0.25 0.25 0.25
#> 6 a2 b3 0.25 0.25 0.25 0.25
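As a sketch of the probability-plus-weights storage scheme described above, the counts in arf can be split into a row-probability table and a vector of row sample sizes, and then recombined with rescaleTable():

sizes <- rowSums(numericPart(arf))   ## sample size for each row
probs <- normalizeTable(arf)         ## probability vector in each row
rescaleTable(probs, sizes)           ## recovers the original counts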
As mentioned previously, a contingency table is the natural conjugate of the hyperdirichlet distribution. The function dataTable() can be used to construct contingency tables from data.
## State names
skill1l <- c("High","Medium","Low")
skill3l <- c("High","Better","Medium","Worse","Low")
correctL <- c("Correct","Incorrect")
## Read data from file
x <- read.csv(system.file("testFiles", "randomPinned100.csv",
                          package="CPTtools"),
              header=FALSE, as.is=TRUE,
              col.names = c("Skill1", "Skill2", "Skill3",
                            "Comp.Correct", "Comp.Grade",
                            "Conj.Correct", "Conj.Grade",
                            "Cor.Correct", "Cor.Grade",
                            "Dis.Correct", "Dis.Grade",
                            "Inhib.Correct", "Inhib.Grade"))
## Force variables to be ordered categories
x[,"Skill1"] <- ordered(x[,"Skill1"],skill1l)
x[,"Skill3"] <- ordered(x[,"Skill3"],skill3l)
x[,"Comp.Correct"] <- ordered(x[,"Comp.Correct"],correctL)
tab <- dataTable(x, c("Skill1","Skill3"),"Comp.Correct",correctL)
## Tab is just the numeric part, so use expand.grid to generate
## labels.
data.frame(expand.grid(list(Skill1=skill1l,Skill3=skill3l)),tab)
#> Skill1 Skill3 Correct Incorrect
#> 1 High High 21 0
#> 2 Medium High 6 2
#> 3 Low High 0 0
#> 4 High Better 1 0
#> 5 Medium Better 9 1
#> 6 Low Better 0 0
#> 7 High Medium 0 0
#> 8 Medium Medium 9 0
#> 9 Low Medium 0 0
#> 10 High Worse 0 0
#> 11 Medium Worse 8 1
#> 12 Low Worse 1 0
#> 13 High Low 0 0
#> 14 Medium Low 7 4
#> 15 Low Low 15 15
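Putting the pieces together, here is a sketch of a hyperdirichlet update using this contingency table, assuming an invented flat prior of one pseudo-count per cell (which also keeps the rows with no observations well defined):

priorTab <- matrix(1, nrow(tab), ncol(tab), dimnames=dimnames(tab))
normalize(priorTab + tab)   ## expected CPT given prior and data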
Work on RNetica, CPTtools and Peanut has been sponsored in part by the following grants:
Bill & Melinda Gates Foundation grant “Games as Learning/Assessment: Stealth Assessment” (#0PP1035331, Val Shute, PI)
National Science Foundation grant “DIP: Game-based Assessment and Support of STEM-related Competencies” (#1628937, Val Shute, PI)
National Science Foundation grant “Mathematical Learning via Architectural Design and Modeling Using E-Rebuild” (#1720533, Fengfeng Ke, PI)
Institute of Education Sciences grant “Exploring Adaptive Cognitive and Affective Learning Support for Next-Generation STEM Learning Games” (R305A170376, Russell Almond, PI)