SIT743 Multivariate and Categorical Data Analysis Assignment-2 Total Marks = 100, Weighting - 40% Due date: 25th September 2019 by 11.30 PM...

1 answer below »
Please use the SID=1 which is required in the assignment.


SIT743 Multivariate and Categorical Data Analysis Assignment-2 Total Marks = 100, Weighting - 40% Due date: 25th September 2019 by 11.30 PM --------------------------------------------------------------------------------------------------------------- INSTRUCTIONS: • For this assignment, you need to submit the following TWO files. 1. A written document (A single pdf only) covering all of the items described in the questions. All answers to the questions must be written in this document, i.e, not in the other files (code files) that you will be submitting. All the relevant results (outputs, figures) obtained by executing your R code must be included in this document. For questions that involve mathematical formulas, you may write the answers manually (hand written answers), scan it to pdf and combine with your answer document. Submit a combined single pdf of your answer document. 2. A separate “.R” file or ‘.txt’ file containing your code (R-code script) that you implemented to produce the results. Name the file as “name-StudentID-Ass2- Code.R" (where `name' is replaced with your name - you can use your surname or first name, and StudentID with your student ID). • All the documents and files should be submitted (uploaded) via SIT 743 Clouddeakin Assignment Dropbox by the due date and time. • Zip files are NOT accepted. All two files should be uploaded separately to the CloudDeakin. • E-mail or manual submissions are NOT allowed. Photos of the document are NOT allowed. ================================================================= Assignment tasks Q1) [32 Marks] A survey has been conducted in Melbourne to study the travel mode choice (M) behavior of people. The list of factors that influence the travel mode choice, along with their possible values, is provided below. A Bayesian network that has been created based on the survey results is shown below, which represents the relationship between these various factors (variables). J (Occupation) ∈{Student, Employee, Individual, Others} A (Age) ∈{<18, 18-35,="" 36-55,="">55} S (Salary – monthly in dollars) ∈ {<2000, 2000-6000,="" 6000-10000,="">10000} V (owning a private car) ∈ {Yes, No} Page 1 of 7 W (Trip purpose) ∈ {Commute to work, other} D (Trip distance in km) ∈ {<1, 1-3,="" 3-6,="">6} P (Trip time period) ∈ {Peak hour, Off-peak hour} U (Trip duration in mins) ∈ {<30, 30-60,="">60} M (Travel mode choice) ∈ {Walking, Bicycle, Public transport, Car} 1.1) Write down the joint distribution ??(??, W, S, A, P, V, D, U, M) for the above network. 1.2) Find the minimum number of parameters required to fully specify the distribution according to the above network. 1.3) How many parameters are required, at a minimum, if there are no independencies among the variables is assumed? Compare with the result of the above question (Q1.2) and comment. 1.4) d-separation method can be used to find two sets of independent or conditionally independent variables in a Bayesian network. For each of the statements given below from (a) to (c), perform the following: • List all the possible paths from the first (set of) node/s to the second (set of) node/s. • State if each of those paths is blocking or non-blocking with reasons. Page 2 of 7 • Hence, mention if the statement is true or false. a) ?? ⊥ V | ∅ (W is marginally independent of V) b) ?? ⊥ M | {D, W} (A is conditionally independent of M given {D, W}) c) {??,??} ⊥ D | V 1.5) Write a R-Program to produce the above Bayesian network, and perform the d-separation tests for all of the above cases mentioned in Q1.4 (a) to (c). Show the plot of the network you obtained and the output (of d-separation test) from your program. 1.6) Show the step by step process to perform variable elimination to compute ??(?? | ?? = ???????? − ????????, ?? = ??????, ?? = ???????? ????????, ?? = < ).="" use="" the="" following="" variable="" ordering="" for="" the="" elimination="" process:="" j,="" w,="" a,="" u.="" [marks="" 2+5+3+11+3+8="32]" q2)="" [16="" marks]="" implementing="" a="" bayesian="" network="" in="" r="" and="" performing="" inference="" a="" belief="" network="" models="" the="" relation="" between="" the="" variables="" oil;="" inf;="" eh;="" bp;="" rt,="" which="" stand="" for="" the="" price="" of="" oil,="" inflation="" rate,="" economy="" health,="" british="" petroleum="" stock="" price,="" and="" retailer="" stock="" price="" respectively.="" each="" variable="" takes="" different="" states="" as="" given="" below.="" (????????????????="" ℎ????????ℎ)="" ∈="" {??????,ℎ????ℎ}="" (??????????="" )="" ∈="" {??????,ℎ????ℎ}="" (??????????????????="" )="" ∈="" {??????,ℎ????ℎ}="" (????????????ℎ="" )="" ∈="" {??????,="" (????),??????????="" (????),ℎ????ℎ}="" (????????????????="" )="" ∈="" {??????,ℎ????ℎ}="" the="" belief="" network="" that="" models="" these="" variables="" has="" (probability)="" tables="" as="" shown="" below.="" page="" 3="" of="" 7="" 2.1)="" use="" the="" below="" libraries="" in="" r="" to="" create="" this="" belief="" network="" in="" r="" along="" with="" the="" probability="" values="" as="" shown="" in="" the="" above="" table.="" you="" may="" use="" the="" following="" libraries="" for="" this:="" source("https://bioconductor.org/bioclite.r")="" bioclite("rbgl")="" library(rbgl)="" library(grbase)="" library(grain)="" bioclite("rgraphviz")="" #define="" the="" appropriate="" network="" and="" use="" the="" “compilecpt()”function="" to="" compile="" list="" of="" conditional="" probability="" tables="" and="" create="" the="" network.="" a)="" show="" the="" obtained="" belief="" network="" for="" this="" distribution="" b)="" show="" the="" probability="" tables="" obtained="" from="" the="" r="" output,="" (and="" verify="" with="" the="" above="" table).="" 2.2)="" use="" r="" program="" to="" compute="" the="" following="" probabilities:="" a)="" given="" that="" the="" oil="" price="" is="" low="" and="" the="" retailer="" stock="" price="" is="" low,="" what="" is="" the="" most="" possible="" state="" of="" the="" british="" petroleum="" stock="" price?="" b)="" given="" that="" the="" inflation="" rate="" is="" low,="" what="" is="" the="" probability="" that="" retailer="" stock="" price="" is="" high?="" c)="" find="" the="" marginal="" distribution="" of="" .="" d)="" find="" the="" joint="" distribution="" of="" inflation,="" and="" british="" petroleum="" stock="" price.="" [marks:="" (3+5)="" +="" (2+2+2+2)="16]" page="" 4="" of="" 7="" q3)="" [16="" marks]="" consider="" five="" binary="" variables="" a,="" b,="" c,="" d,="" e.="" the="" directed="" acyclic="" graph="" (dag)="" shown="" below="" describes="" the="" relationship="" between="" these="" variables="" along="" with="" their="" conditional="" probability="" tables="" (cpt).="" 3.1)="" in="" the="" above="" network,="" state="" why="" a="" is="" independent="" of="" b,="" i.e.,="" a⊥b.="" 3.2)="" hence,="" find="" an="" expression="" (in="" a="" simplified="" form)="" for="" (??="??|??" =="" ,??="??)" in="" terms="" of="" .="" 3.3)="" the="" table="" shown="" below="" provides="" 20="" simulated="" data="" obtained="" for="" the="" above="" bayesian="" network.="" use="" this="" data="" to="" find="" the="" maximum="" likelihood="" estimates="" of="" ,="" ,="" and="" .="" page="" 5="" of="" 7="" 3.4)="" find="" the="" value="" of="" (??="??|??" =="" ,??="??)" using="" the="" values="" obtained="" for="" from="" the="" above="" question="" q3.3.="" [marks="" 2+="" 8="" +="" 4="" +="" 2="16]" q4)="" bayesian="" structure="" learning="" [30="" marks]="" for="" this="" question,="" you="" will="" be="" using="" a="" dataset,="" called="" “child”,="" which="" contains="" 20="" variables.="" this="" dataset="" provides="" information="" about="" diagnosing="" congenital="" heart="" disease="" in="" a="" new="" born="" "blue="" baby".="" the="" csv="" file="" (“child10k.csv”)="" containing="" the="" dataset="" can="" be="" downloaded="" from="" clouddeakin.="" use="" the="" following="" r="" code="" to="" load="" the="" child="" dataset:="" childdata=""><- read.csv(file="child10k.csv", header=true, sep=",") the true network structure of this dataset can be viewed (plot) using the following r code. use r programming, as appropriate, to answers the following questions. 4.1) use the child dataset to learn bayesian network structures using hill-climbing (hc) algorithm, utilizing two different scoring methods, namely bayesian information criterion score (bic score) and the bayesian dirichlet equivalent (bde score), for each of the following sample sizes of the data: a) 100 (first 100 data) b) 500 (first 500 data) c) 1000 (first 1000 data) d) 5000 (first 5000 data) library(bnlearn) #create and plot the network structure. modelstring = paste0("[birthasphyxia][disease|birthasphyxia][lvh|disease][ductflow|disease]", "[cardiacmixing|disease][lungparench|disease][lungflow|disease][sick|disease]", "[hypdistrib|ductflow:cardiacmixing][hypoxiaino2|cardiacmixing:lungparench]", "[co2|lungparench][chestxray|lungparench:lungflow][grunting|lungparench:sick]", "[lvhreport|lvh][age|disease:sick][lowerbodyo2|hypdistrib:hypoxiaino2]", "[ruqo2|hypoxiaino2][co2report|co2][xrayreport|chestxray][gruntingreport|grunting]") dag = model2network(modelstring) par(mfrow = c(1,1)) #source("https://bioconductor.org/bioclite.r") #bioclite("rgraphviz") graphviz.plot(dag) page 6 of 7 for each of the above cases, • provide the scores obtained for bic and bde, • plot the network structure obtained for the bic and bde scores. 4.2) based on the results obtained for the above question (q 4.1), discuss how the bic score compare with bde score for different sample sizes in terms of structure and score of the learned network. 4.3) a) find the bayesian network structures utilising the full dataset, and using both bic and bde scores. show the scores and the obtained networks. b) compare the networks obtained above (in q4.3.a) for each bic and bde scoring methods with the true network structure and comment. use the “compare()” function and “graphviz.compare()” function available in the “bnlearn” r package to perform the read.csv(file="CHILD10k.csv" ,="" header="TRUE," sep="," )="" the="" true="" network="" structure="" of="" this="" dataset="" can="" be="" viewed="" (plot)="" using="" the="" following="" r="" code.="" use="" r="" programming,="" as="" appropriate,="" to="" answers="" the="" following="" questions.="" 4.1)="" use="" the="" child="" dataset="" to="" learn="" bayesian="" network="" structures="" using="" hill-climbing="" (hc)="" algorithm,="" utilizing="" two="" different="" scoring="" methods,="" namely="" bayesian="" information="" criterion="" score="" (bic="" score)="" and="" the="" bayesian="" dirichlet="" equivalent="" (bde="" score),="" for="" each="" of="" the="" following="" sample="" sizes="" of="" the="" data:="" a)="" 100="" (first="" 100="" data)="" b)="" 500="" (first="" 500="" data)="" c)="" 1000="" (first="" 1000="" data)="" d)="" 5000="" (first="" 5000="" data)="" library(bnlearn)="" #create="" and="" plot="" the="" network="" structure.="" modelstring="paste0("[BirthAsphyxia][Disease|BirthAsphyxia][LVH|Disease][DuctFlow|Disease]"," "[cardiacmixing|disease][lungparench|disease][lungflow|disease][sick|disease]",="" "[hypdistrib|ductflow:cardiacmixing][hypoxiaino2|cardiacmixing:lungparench]",="" "[co2|lungparench][chestxray|lungparench:lungflow][grunting|lungparench:sick]",="" "[lvhreport|lvh][age|disease:sick][lowerbodyo2|hypdistrib:hypoxiaino2]",="" "[ruqo2|hypoxiaino2][co2report|co2][xrayreport|chestxray][gruntingreport|grunting]")="" dag="model2network(modelstring)" par(mfrow="c(1,1))" #source("https://bioconductor.org/bioclite.r")="" #bioclite("rgraphviz")="" graphviz.plot(dag)="" page="" 6="" of="" 7="" for="" each="" of="" the="" above="" cases,="" •="" provide="" the="" scores="" obtained="" for="" bic="" and="" bde,="" •="" plot="" the="" network="" structure="" obtained="" for="" the="" bic="" and="" bde="" scores.="" 4.2)="" based="" on="" the="" results="" obtained="" for="" the="" above="" question="" (q="" 4.1),="" discuss="" how="" the="" bic="" score="" compare="" with="" bde="" score="" for="" different="" sample="" sizes="" in="" terms="" of="" structure="" and="" score="" of="" the="" learned="" network.="" 4.3)="" a)="" find="" the="" bayesian="" network="" structures="" utilising="" the="" full="" dataset,="" and="" using="" both="" bic="" and="" bde="" scores.="" show="" the="" scores="" and="" the="" obtained="" networks.="" b)="" compare="" the="" networks="" obtained="" above="" (in="" q4.3.a)="" for="" each="" bic="" and="" bde="" scoring="" methods="" with="" the="" true="" network="" structure="" and="" comment.="" use="" the="" “compare()”="" function="" and="" “graphviz.compare()”="" function="" available="" in="" the="" “bnlearn”="" r="" package="" to="" perform="">
Answered Same DaySep 24, 2021SIT743Deakin University

Answer To: SIT743 Multivariate and Categorical Data Analysis Assignment-2 Total Marks = 100, Weighting - 40%...

Abr Writing answered on Sep 25 2021
164 Votes
assignment.R
library(bnlearn)
# source("https://bioconductor.org/biocLite.R")
# biocLite("RBGL")
library(RBGL)
library(gRbase)
library(gRain)
# biocLite("Rgraphviz")
library(Rgraphviz)
library(visNetwork)## Problem 1.5bn <- model2network("[J][W|J][P|W][S|J][A][V|A:S][D|V][U|P:D][M|U]")
plot(bn)### a)dsep(bn, "W", "V")### b)dsep(bn, "A", "M", c("D", "W"))### c)dsep(bn, "A", "D", "V")dsep(bn, "W", "D", "V")
### Problem 2.1.alh <- c("low", "high")
lh4 <- c("low", "LM", "UM", "high")
e <- cptable(~eh, values=c(30, 70),levels=lh)
o.e <- cptable(~oil|eh, values=c(30, 70, 60, 40), levels=lh)
b.e <- cptable(~bp|eh, values=c(20, 45, 30, 05, 20, 50, 20, 10), levels=lh4)
i.oe <- cptable(~inf|oil:eh, values=c(10, 90, 60, 40, 50, 50, 40, 60), levels=lh)
r.ie <- cptable(~rt|inf:eh, values=c(20, 80, 60, 40, 70, 30, 20, 80), levels=lh)
plist <- compileCPT(list(e, o.e, b.e, i.oe
, r.ie))
plistnet <- grain(plist)
net### Problem 2.1.bplist$eh
plist$oilplist$bpplist$infplist$rt
## Problem 2.2### Problem 2.2.a
querygrain(setEvidence(net,
evidence = list(
oil="low",
rt="low"
)),
nodes = c("bp"),
type = "marginal")
### Problem 2.2.b
querygrain(setEvidence(net,
evidence = list(
inf="low"
)),
nodes = c("rt"),
type = "marginal")
### Problem 2.2.c
querygrain(net, nodes = c("inf", "bp"), type="marginal")
### Problem 2.2.d
querygrain(net, nodes = c("inf", "bp"), type="joint")
# Problem 3## Problem 3.1
bn <- model2network("[A][B][C|A:B]")
plot(bn)dsep(bn, "A", "B")
## Problem 3.2## Problem 3.3
df <- data.frame(
a = c(1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1),
b = c(1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1),
c = c(1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1),
d = c(0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
e = c(0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1)
)
dfcat("Alpha = ",
sum(df$a == 0)/nrow(df),
"\n",
sep = "")
cat("Beta = ",
sum(df$c == 0 & df$e == 0)/sum(df$c == 0),
"\n",
sep = "")
cat("Gamma = ",
sum(df$c == 0 & df$d == 0)/sum(df$c == 0),
"\n",
sep = "")
cat("Theta = ",
sum(df$b == 0)/nrow(df),
sep = "")
## Problem 3.4
cat("P(E=1|A=1, B=0) = ",
0.2*sum(df$c == 0 & df$e == 0)/sum(df$c == 0)+0.56,
sep="")
# Problem 4
ChildData <- read.csv(file="CHILD10k.csv", header=TRUE, sep=",")#create and plot the network structure.
modelstring = paste0(
"[BirthAsphyxia][Disease|BirthAsphyxia][LVH|Disease][DuctFlow|Disease]",
"[CardiacMixing|Disease][LungParench|Disease][LungFlow|Disease][Sick|Disease]",
"[HypDistrib|DuctFlow:CardiacMixing][HypoxiaInO2|CardiacMixing:LungParench]",
"[CO2|LungParench][ChestXray|LungParench:LungFlow][Grunting|LungParench:Sick]",
"[LVHreport|LVH][Age|Disease:Sick][LowerBodyO2|HypDistrib:HypoxiaInO2]",
"[RUQO2|HypoxiaInO2][CO2Report|CO2][XrayReport|ChestXray][GruntingReport|Grunting]")
dag = model2network(modelstring)
par(mfrow = c(1,1))
graphviz.plot(dag)
## Problem 4.1
# plot network func
# using the visNetwork package to plot the network because it looks very nice.
plot.network <- function(structure, ht = "400px"){
nodes.uniq <- unique(c(structure$arcs[,1], structure$arcs[,2]))
nodes <- data.frame(id = nodes.uniq,
label = nodes.uniq,
color = "darkturquoise",
shadow = TRUE)

edges <- data.frame(from = structure$arcs[,1],
to = structure$arcs[,2],
arrows = "to",
smooth = TRUE,
shadow = TRUE,
color = "black")

return(visNetwork(nodes, edges, height = ht, width = "100%"))
}
### Problem 4.1.a
structure.a.1 <- hc(ChildData[1:100,], score = "bic")
plot.network(structure.a.1)structure.a.2 <- hc(ChildData[1:100,], score = "bde")
plot.network(structure.a.2)
### Problem 4.1.b
structure.b.1 <- hc(ChildData[1:500,], score = "bic")
plot.network(structure.b.1)structure.b.2 <- hc(ChildData[1:500,], score = "bde")
plot.network(structure.b.2)
### Problem 4.1.c
structure.c.1 <- hc(ChildData[1:1000,], score = "bic")
plot.network(structure.c.1)structure.c.2 <- hc(ChildData[1:1000,], score = "bde")
plot.network(structure.c.2)
### Problem 4.1.d
structure.d.1 <- hc(ChildData[1:5000,], score = "bic")
plot.network(structure.d.1)structure.d.2 <- hc(ChildData[1:5000,], score = "bde")
plot.network(structure.d.2)
## Problem 4.2
cat("BIC Score for first 100 data = ",
score(structure.a.1, ChildData, type = "bic"),
"\n",
sep="")
cat("BDE Score for first 100 data = ",
score(structure.a.2, ChildData, type = "bde"),
"\n",
sep="")
cat("BIC Score for first 500 data = ",
score(structure.b.1, ChildData, type = "bic"),
"\n",
sep="")
cat("BDE Score for first 500 data = ",
score(structure.b.2, ChildData, type = "bde"),
"\n",
sep="")
cat("BIC Score for first 1000 data = ",
score(structure.c.1, ChildData, type = "bic"),
"\n",
sep="")
cat("BDE Score for first 1000 data = ",
score(structure.c.2, ChildData, type = "bde"),
"\n",
sep="")
cat("BIC Score for first 5000 data = ",
score(structure.d.1, ChildData, type = "bic"),
"\n",
sep="")
cat("BDE Score for first 5000 data = ",
score(structure.d.2, ChildData, type = "bde"),
"\n",
sep="")
## Problem 4.3### Problem 4.3.a
structure.1 <- hc(ChildData, score = "bic")
plot.network(structure.1)
cat("BIC Score = ",
score(structure.1, ChildData, type = "bic"),
"\n",
sep="")structure.2 <- hc(ChildData, score = "bde")
plot.network(structure.2)
cat("BDE Score = ",
score(structure.2, ChildData, type = "bde"),
"\n",
sep="")
### Problem 4.3.b
graphviz.compare(structure.1, dag)graphviz.compare(structure.2, dag)
### Problem 4.3.c
bn.mod <- bn.fit(structure.1, data=ChildData)
bn.mod
### Problem 4.3.d
cpquery(bn.mod,
(Disease == "Lung"),
(CO2 == "High" & LungParench == "Abnormal"))
assignment.pdf
SIT743 Multivariate and Categorical Data Analysis
Assignment 2
25/09/2019
Problem 1
Problem 1.1
Using the Chain Rule of probability, we can write:
P (J,W, S,A, P, V,D,U,M) = P (M |U)P (U |P,D)P (U |P )P (P |W )P (W |J)
P (D|V )P (V |S,A)P (S|J)P (J)P (A)
Problem 1.2
The minimum number of parameters required to fully the distribution according to the above network is
evaluated below:
M = 3× (4− 1) = 9
U = 2× 4× (3− 1) = 16
P = 2× (2− 1) = 2
W = 4× (2− 1) = 4
J = (4− 1) = 3
D = 2× (4− 1) = 6
V = 4× 4× (2− 1) = 16
A = 4− 1 = 3
S = 4× (4− 1) = 12
Therefore, the minimum number of parameters required are:
Total parameters = 9 + 16 + 2 + 4 + 3 + 6 + 16 + 3 + 12 = 71
Problem 1.3
The total number of paramters required if there no independencies among the variables is assumed:
Total parameters = 4× 4× 4× 2× 2× 4× 2× 3× 4− 1 = 24575
We can see that the number of parameters required for defining the joint probability distribution is very low
in the directed bayesian network in comparison to case were we ignore all the independencies among the
variables.
1
Problem 1.4
a) W is marginally not independent of V as can be seen from the following path:
W ← J → S → V
Any change in V affects change in S and therefore J which will finally affects W. Therefore, W in
marginally not independent of V.
b) A is conditionally independent of M given {D, W}).Since, there is two paths between these two as
shown below:
M ← U ← D ← V ← A
And,
M ← U ← P ←W ← J → S → V ← A
c) A and W are conditionally independent of D given V. The path between W and D is shown below.
W ← J → S → V → D
And the path between A and D is shown below:
D ← V ← A
Therefore, in both the cases, V Is blocking the dependence between the two.
Problem 1.5
bn <- model2network("[J][W|J][P|W][S|J][A][V|A:S][D|V][U|P:D][M|U]")
plot(bn)
2
A
D
J
M P
S
U
V
W
a)
dsep(bn, "W", "V")
[1] FALSE
b)
dsep(bn, "A", "M", c("D", "W"))
[1] TRUE
c)
dsep(bn, "A", "D", "V")
[1] TRUE
dsep(bn, "W", "D", "V")
[1] TRUE
Problem 1.6
P (M |S = 3000− 6000, V = Y es, P = Peak hour,D <= 1) = P (U |P = Peak hour,D <= 1)P (M |U)
3
Problem 2
Problem 2.1
Problem 2.1.a
lh <- c("low", "high")
lh4 <- c("low", "LM", "UM", "high")
e <- cptable(~eh, values=c(30, 70),levels=lh)
o.e <- cptable(~oil|eh, values=c(30, 70, 60, 40), levels=lh)
b.e <- cptable(~bp|eh, values=c(20, 45, 30, 05, 20, 50, 20, 10), levels=lh4)
i.oe <- cptable(~inf|oil:eh, values=c(10, 90, 60, 40, 50, 50, 40, 60), levels=lh)
r.ie <- cptable(~rt|inf:eh, values=c(20, 80, 60, 40, 70, 30, 20, 80),...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here