
MA5810-Assessment 2

Weighting: 30%
Total marks: 100
Due date: Week 5 - Sunday

Overview

This assessment focuses on machine learning techniques covered during Weeks 2-5, with a primary focus on the topics of Weeks 3, 4, and 5. Wherever required, you must show evidence of your work using R code and output as part of your RMarkdown submission.

The purpose of the assignment is for you to:
• Demonstrate sound knowledge of the basic theory, principles and concepts that underpin data mining and exemplify the most common tasks and types of data mining problems.
• Apply classic supervised and/or unsupervised data mining methods to analyse and evaluate descriptive analytics tasks.

Submission

You will need to submit the following:
• A PDF file that clearly shows the assignment questions, the associated answers, and any relevant R outputs, analyses and discussions. The assignment must be presented in 12-point font on A4 pages using single line spacing.
• An R script file to reproduce your work. Please attach the code in an appendix.
• The task cover sheet.

The assignment should not exceed 12 A4 pages. Appendices do not form part of the page limit. You have up to three attempts to submit your assessment, and only the last submission will be graded.

A word on plagiarism: Plagiarism is the act of using another's words, works or ideas from any source as one's own. Plagiarism has no place in a University. Student work containing plagiarised material will be subject to formal university processes.

Question 1 - Total Marks 40

In Assessment 1 we used the Breast Cancer Wisconsin (Diagnostic) Data Set (wdbc.data). Thirty features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

A quick recall of the attributes:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.

Assignment tasks:

Import the data into your session.

1. Partition the data into 90% training samples, with the remainder as test samples. Fit a logistic regression model for Diagnosis against all numeric features to the training sample. Marks 6
2. From the summary of the fitted model, interpret the relationship between Diagnosis and the features Texture and Concavity. Report and discuss the confidence interval for this relationship. Show relevant R output. Marks 4
3. Return to the unpartitioned data. Use descriptive methods to investigate the correlation between the 30 numeric features in the BC data. Show relevant output. Marks 4
4. Suggest and implement an unsupervised learning method to derive secondary features that address inter-feature correlation. Show R code. Marks 6
5. Select a subset (filter(.)) of the secondary features obtained in 4).
• a. Justify your approach using the result(s) obtained in 4). Marks 4
• b. Partition the data containing the secondary features into training (90%) versus test samples. Use the data obtained in 5a) to fit a logistic regression model with Diagnosis as the response on this new training sample. Marks 6
• c. Use the same features to fit a quadratic discriminant analysis to Diagnosis. Marks 2
6. Implement both models on the test data, along with the logistic regression model with all features (as in Q 1). Marks 8
• a. Provide accuracy measures for each case.
• b. Discuss your findings.

Question 2 - Total Marks 20

(a) Discuss the following hierarchical clustering algorithms, including their similarities and difference(s): a) single linkage; b) complete linkage. Marks 6

(b) Please do not use R or any programming languages for this question. Please solve the problem manually.

Suppose that you have a dissimilarity matrix of 5 observations as follows:

$$
D = \begin{bmatrix}
0 & 0.2 & 0.45 & 0.7 & 0.8 \\
0.2 & 0 & 0.1 & 0.5 & 0.35 \\
0.45 & 0.1 & 0 & 0.55 & 0.6 \\
0.7 & 0.5 & 0.55 & 0 & 0.3 \\
0.8 & 0.35 & 0.6 & 0.3 & 0
\end{bmatrix}
$$

The D matrix implies that the dissimilarity between the first and the third observation is 0.45, and the dissimilarity between the second and the fourth observation is 0.5.

• Sketch the dendrogram using the complete linkage clustering approach. Explain and provide the detailed procedure of obtaining the height at which each fusion occurs, and the observations associated with each leaf in the dendrogram. Marks 7
• Sketch the dendrogram using the single linkage clustering approach. Explain and provide the detailed procedure of obtaining the height at which each fusion occurs, and the observations associated with each leaf in the dendrogram. Marks 7

Question 3 - Total Marks 40

Clustering is a common exploratory technique used in bioinformatics, where researchers aim to identify subgroups within diseases using gene expression. Imagine you are asked to analyse the gene expression dataset available in the leukemia dat.Rdata file. This data was originally generated by [Golub et al., Science, 1999] (https://science.sciencemag.org/content/sci/286/5439/531.full.pdf) and contains the expression level of 1867 selected genes from 72 patients with different types of leukemia.

The data in each column are summarized as follows:
• Column 1: patient id = a unique identifier for each patient (observation)
• Column 2: type = a factor variable with two subtypes of leukemia: acute lymphoblastic leukemia (ALL, n = 47) and acute myeloblastic leukemia (AML, n = 25)
• Columns 3-1869: gene expression data for 1867 genes, Gene 1, ..., Gene 1867

Assignment Tasks:

The researchers hypothesized that patient samples will cluster by subtype of leukemia based on gene expression. Your task is to use a clustering technique to address this scientific hypothesis and report your results back to the researcher.

(a) Select a clustering technique to apply. Justify your choice. Marks 5

(b) Implement your chosen clustering technique in R. Describe your implementation. (You need to provide details of all steps relating to the implementation of the clustering algorithm, such as data preparation, including any transformations performed on the data prior to clustering, training the model, and evaluating the performance of the model.) Marks 25

(c) Produce two or more plots to visualize your results. Describe your results as you would in your report to the researcher. Marks 10
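
For reference when reading Question 2, the two linkage rules named there differ only in how the dissimilarity between two clusters is derived from the pairwise entries of D. A standard statement is sketched below; the cluster symbols A and B and the function names are notation introduced here for illustration and do not appear in the assignment.

$$
d_{\text{complete}}(A, B) = \max_{i \in A,\; j \in B} D_{ij},
\qquad
d_{\text{single}}(A, B) = \min_{i \in A,\; j \in B} D_{ij}
$$

Under either rule, agglomerative clustering starts with every observation in its own cluster and repeatedly fuses the pair of clusters with the smallest inter-cluster dissimilarity; that dissimilarity is the height at which the fusion is drawn in the dendrogram.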
Answered 5 days after Apr 14, 2022

Robert answered on Apr 20 2022