2. Download the dataset DNASeqs.RData and load it into your R workspace. The data frame dna.seq has 11 rows and 1620 columns, where each row vector is the DNA sequence of some protein. Each sequence has length 1620, so each column represents one nucleotide. Run the following code to compute the distance matrix, dna.dist.

load("MorderGE.RData") #header(Morder) #type morder.kmeans hamming } load("DNASeqs.RData") n D for (j in 1:(i-1)){ D[i, j] } } dna.dist = as.dist(D)

Answer the following questions and attach your plots.

(i) Note that we defined the Hamming distance ourselves. For two DNA sequences, x and y, the Hamming distance is the number of positions where x and y differ. Why can't we use the Euclidean distance for our data?

(ii) Plot the UPGMA phylogenetic tree (i.e. average linkage) using dna.dist as the input distance matrix.

(iii) Which two proteins are clustered in the first step of building the UPGMA tree?

(iv) What is the maximum node height in the UPGMA tree?

(v) Plot the phylogenetic tree using dna.dist and complete linkage.

(vi) For the tree built using complete linkage, observe that the two proteins connected at the first step are the same as those for the UPGMA tree. Why?

(vii) The maximum node height for the tree built using complete linkage is obviously greater than that for the UPGMA tree. Why? (An intuitive argument would suffice.)

(viii) Draw the phylogenetic tree using neighbor joining. (Note: To use the function nj, you need to first install and load the package ape.)
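The distance code in the question has lost most of its assignments, so the following is only a minimal sketch of what such a computation might look like, not the original code. It assumes a Hamming function that counts mismatched positions, an outer loop over i from 2 to n, and a small made-up data frame toy.seq standing in for dna.seq (the real DNASeqs.RData is not used here). The names toy.seq, seqs, toy.dist, hc.avg and hc.cmp are illustrative.

```r
# Hypothetical stand-in for dna.seq: 4 sequences of length 8,
# one nucleotide per column (NOT the contents of DNASeqs.RData).
toy.seq <- data.frame(rbind(
  p1 = strsplit("ACGTACGT", "")[[1]],
  p2 = strsplit("ACGTACGA", "")[[1]],
  p3 = strsplit("ACCTTCGA", "")[[1]],
  p4 = strsplit("TGCTTCAC", "")[[1]]
), stringsAsFactors = FALSE)

# Hamming distance: number of positions where x and y differ.
hamming <- function(x, y) sum(x != y)

# Work on a character matrix so that each row is a plain vector.
seqs <- as.matrix(toy.seq)
n <- nrow(seqs)

# Fill the lower triangle of the distance matrix (assumed loop bounds).
D <- matrix(0, n, n, dimnames = list(rownames(seqs), rownames(seqs)))
for (i in 2:n) {
  for (j in 1:(i - 1)) {
    D[i, j] <- hamming(seqs[i, ], seqs[j, ])
  }
}
toy.dist <- as.dist(D)  # as.dist reads the lower triangle

# UPGMA corresponds to average linkage in hclust.
hc.avg <- hclust(toy.dist, method = "average")
# Complete linkage uses the largest pairwise distance between clusters.
hc.cmp <- hclust(toy.dist, method = "complete")

# First merge (negative entries are single observations) and root heights.
print(hc.avg$merge[1, ])
print(max(hc.avg$height))
print(max(hc.cmp$height))
# plot(hc.avg) and plot(hc.cmp) would draw the two dendrograms.
```

With the real data, the same calls would presumably be applied to dna.dist in place of toy.dist. The first row of the merge component shows which two observations are joined first, and the largest entry of the height component is the height of the root. Neighbor joining is left out of the sketch because it needs the ape package.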
Jan 16, 2022