Assessment 1: Naive Bayes Classifier and Discriminant Analysis

Issued: Sunday of Week 1
Due: 11:59 PM AEST, Sunday of Week 3
Weight: 30%
Maximum score: 50 marks

Overview

In this assessment you will insert R code and written discussions, with justifications, into this template file. The assessment implements and explores techniques mainly covered in Weeks 1 and 2. It is divided into three tasks: (1) comparison of classifiers; (2) application of a classifier; and (3) implementation of classifiers.

The purpose of the assignment is to enable you to:
1. Code and comment R scripts.
2. Implement sub-setting, Bayes classifiers and Discriminant Analysis in RStudio.
3. Compare classification algorithms.
4. Visually present predictions of classifiers in RStudio.

Learning outcomes

Related subject learning outcomes:
1. Evaluate, synthesise and apply classic supervised data mining methods for pattern classification.
2. Effectively integrate, execute and apply the studied concepts, algorithms, and techniques to real datasets using the computer language R and the software environment RStudio.
3. Communicate data concepts and methodologies of data science.

Background

Real-world application of classifiers may require that the predictors used for classification be physically measured; hence, the inclusion of unnecessary predictors may incur additional costs associated with sensors, instruments and computing. Some variables may even require human intervention and/or expensive laboratory analyses in order to be measured. It is therefore important that analysts use as few predictors as possible: the smallest set of predictors that is relevant to the classification task at hand, yet sufficient to provide satisfactory classification performance. Selecting predictors is an important task in data mining called feature selection.

Assessment submission

Your submission should include:
- A PDF/HTML output file that clearly shows the assignment questions, the associated answers, and any relevant R outputs, analyses and discussions.
- The R script (code) file as evidence.
- The task cover sheet.

The assignment should not exceed 8 A4 pages; appendices do not form part of the page limit. It must be presented in 12-point font on A4 pages using single line spacing. RMarkdown is not required for this assessment but is highly recommended. Upload all submission files in one go. You can upload the assessment up to 3 times; however, only the last submission is graded.

A word on plagiarism

Plagiarism is the act of using another's words, works or ideas from any source as one's own. Plagiarism has no place in a university. Student work containing plagiarised material will be subject to formal university processes, in line with the procedure described in the subject outline.

Assessment Task 1: Comparison of classifiers (10 marks)

In this task, compare the performance of the supervised learning algorithms Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA) and the Naive Bayes classifier using publicly available blood pressure data, provided in the HBblood.csv file in the Assessment 1 folder. The HBblood.csv dataset contains values of percent HbA1c (a measure of the amount of glucose and haemoglobin joined together in blood) and systolic blood pressure, SBP (in mmHg), for 1,200 clinically healthy female patients aged 60 to 70 years. Additionally, the ethnicity, Ethno, of each patient was recorded and categorised into three groups, A, B or C, for analysis.

1. Discuss and justify which of the supervised learning algorithms (LDA, QDA or the Naive Bayes classifier) you would choose for predicting the response Ethno using HbA1c and SBP as the feature variables. Provide any plots/images needed to support your discussion. Hint: base your answer on the empirical statistical properties of the data in relation to the model assumptions. One possible starting point is sketched below.
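The sketch below shows one way such an exploration might begin; it is not a model answer. It assumes HBblood.csv sits in the working directory with the columns HbA1c, SBP and Ethno described above, and the particular diagnostics chosen (scatter plot, Shapiro-Wilk tests, per-class covariance and correlation matrices) are one option among many.

```r
# One possible exploration of HBblood.csv to motivate the choice between
# LDA, QDA and naive Bayes. Assumes columns HbA1c, SBP and Ethno.
hb <- read.csv("HBblood.csv")
hb$Ethno <- factor(hb$Ethno)

# Scatter plot of the two features, coloured by ethnicity group
plot(hb$HbA1c, hb$SBP, col = hb$Ethno, pch = 19,
     xlab = "HbA1c (%)", ylab = "SBP (mmHg)")
legend("topright", legend = levels(hb$Ethno), col = 1:3, pch = 19)

# Within-class normality: both LDA and QDA assume Gaussian features
by(hb$HbA1c, hb$Ethno, shapiro.test)
by(hb$SBP,   hb$Ethno, shapiro.test)

# Per-class covariance matrices: roughly equal matrices favour LDA,
# clearly unequal ones favour QDA. Strong within-class correlation
# between HbA1c and SBP would strain naive Bayes' independence assumption.
by(hb[, c("HbA1c", "SBP")], hb$Ethno, cov)
by(hb[, c("HbA1c", "SBP")], hb$Ethno, cor)
```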
Assessment Task 2: Application of a classifier (20 marks)

The Nursery Data (nursery.csv) was derived from a hierarchical decision model originally developed to rank applications for nursery schools. It was used for several years during the 1980s, when there was excessive enrolment in these schools in Ljubljana, Slovenia, and the rejected applications frequently needed an objective explanation. The dataset given here contains 8 attributes (column names) and 1 decision variable on nearly 13,000 applications. The attached folder contains a description of the data. Please complete the following tasks.

1. Randomly split the dataset into a training subset and a test subset containing 80% and 20% of the data, respectively. Provide your R code with appropriate annotation/commentary. (6 marks)
2. With the training data, check whether the attributes "finance", "social" and "parents" have a statistically significant association with the response, "outcome". Hint: you can choose any statistical test from MA5820, for example. (1 mark)
3. Propose a classification methodology to classify a randomly selected nursery application, outcome (response), into the categories accept versus reject, using the eight features in this dataset. Give reasons for your choice based on your learning from Weeks 1 and 2. (3 marks)
4. Implement the classifier proposed in Question 3 on the training data subset you created in Question 1. Provide your R code with appropriate annotation/commentary. Fit a classifier with all 8 features. Using relevant R outputs, interpret and discuss the relationships between the predictors and the response variable. For the discussion/interpretation you can choose any two of the three attributes you investigated in Question 2. (6 marks)
5. Discuss the accuracy of the fitted model on the test data. Show relevant R code and output to support your discussion. Did the fitted model improve prediction compared to a model with no features? (4 marks)

A sketch covering the split, the association tests and one candidate classifier follows.
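As a rough illustration (again, not a model answer), the sketch below walks through the split, the chi-squared association tests and one candidate classifier, naive Bayes from the e1071 package. It assumes nursery.csv loads with factor columns named parents, finance and social and a response column named outcome, matching the task description; the seed value and the choice of naive Bayes are assumptions, not requirements.

```r
# Sketch for Task 2. Assumes nursery.csv has categorical attribute columns
# (including parents, finance and social) and a response column outcome.
library(e1071)  # naiveBayes()

nursery <- read.csv("nursery.csv", stringsAsFactors = TRUE)

# Q1: random 80/20 train/test split
set.seed(5810)                                # arbitrary seed, for reproducibility
idx   <- sample(nrow(nursery), size = round(0.8 * nrow(nursery)))
train <- nursery[idx, ]
test  <- nursery[-idx, ]

# Q2: chi-squared tests of association between each attribute and the response
for (v in c("finance", "social", "parents")) {
  print(chisq.test(table(train[[v]], train$outcome)))
}

# Q4: fit a naive Bayes classifier with all eight features
fit <- naiveBayes(outcome ~ ., data = train)
fit$tables$finance                            # conditional probabilities to interpret

# Q5: test accuracy versus a no-feature (majority-class) baseline
pred <- predict(fit, newdata = test)
mean(pred == test$outcome)                    # model accuracy on the test set
max(table(train$outcome)) / nrow(train)       # baseline accuracy
```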
Assessment Task 3: Implementation of classifiers (20 marks)

In this task, compare the performance of the supervised learning algorithms Linear Discriminant Analysis and the Naive Bayes classifier using the Breast Cancer Wisconsin (Diagnostic) Data Set (wdbc.data). Thirty features are computed from a digitised image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image.

1. Implementation. Parts of this question require you to revise Section 4 of ISLR, as well as the notion of the covariance matrix of a multivariate feature vector.
   a. For each tumour class (M and B), compute the generalised variance (g.v.) of the feature vectors (consisting of 30 features). Heuristically (no statistical test) compare the two g.v.s and comment on which type of discriminant analysis is more appropriate for this data. Hint: the generalised variance of a multivariate feature vector is the determinant of its covariance matrix; it is the multivariate analogue of the variance. (4 marks)
   b. Use a random 90% sub-sample of this data as your training sample. (1 mark)
   c. Implement your recommended discriminant analysis and the Naive Bayes classifier on the training sample to classify tissue samples into the classes M or B. Show the R code with annotation, and model summaries for both algorithms. (6 marks)
2. For each algorithm (NB and DA), show the true positive, false positive and accuracy rates for the training and test samples. (6 marks)
3. Based on Question 2, recommend the right choice of algorithm for this data. (3 marks)

A sketch of these steps follows the attribute list below.

Attribute information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
   a) radius (mean of distances from center to points on the perimeter)
   b) texture (standard deviation of gray-scale values)
   c) perimeter
   d) area
   e) smoothness (local variation in radius lengths)
   f) compactness (perimeter^2 / area - 1.0)
   g) concavity (severity of concave portions of the contour)
   h) concave points (number of concave portions of the contour)
   i) symmetry
   j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
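Below is a minimal sketch under stated assumptions: wdbc.data is the comma-separated UCI file with no header row, with column 1 the ID, column 2 the diagnosis and columns 3 to 32 the features; lda() comes from MASS and naiveBayes() from e1071. The use of lda() is a placeholder; if the generalised variances point towards QDA, qda() would replace it.

```r
# Sketch for Task 3. Assumes wdbc.data is comma-separated with no header:
# column 1 = ID, column 2 = diagnosis (M/B), columns 3-32 = 30 features.
library(MASS)   # lda(), qda()
library(e1071)  # naiveBayes()

wdbc <- read.csv("wdbc.data", header = FALSE)
names(wdbc)[1:2] <- c("id", "diagnosis")
wdbc$diagnosis <- factor(wdbc$diagnosis)

# Q1a: generalised variance = determinant of the class covariance matrix.
# With 30 features the determinant can be numerically tiny, so the
# log-determinant is also shown for a safer heuristic comparison.
X <- wdbc[, 3:32]
sapply(split(X, wdbc$diagnosis), function(x) det(cov(x)))
sapply(split(X, wdbc$diagnosis), function(x) determinant(cov(x))$modulus)

# Q1b: random 90% training sub-sample (ID column dropped)
set.seed(5810)                                # arbitrary seed, for reproducibility
idx   <- sample(nrow(wdbc), size = round(0.9 * nrow(wdbc)))
train <- wdbc[idx, -1]
test  <- wdbc[-idx, -1]

# Q1c: recommended discriminant analysis and naive Bayes on the training data
fit_da <- lda(diagnosis ~ ., data = train)    # substitute qda() if Q1a suggests it
fit_nb <- naiveBayes(diagnosis ~ ., data = train)

# Q2: confusion matrices on the test sample; true positive, false positive and
# accuracy rates follow from these counts (repeat with train for training rates)
table(pred = predict(fit_da, test)$class, truth = test$diagnosis)
table(pred = predict(fit_nb, test),       truth = test$diagnosis)
```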
Marking Criteria and Rubric: MA5810 Assessment 1

R code (20%)
- High Distinction: Code submitted. Code works correctly, meets the specifications, produces the correct results and displays them correctly. Code is exceptionally well organised and very easy to follow. Code is always very well commented, so the purpose of each block of code, and which question part it corresponds to, is readily understood. Variable names give the purpose of the variable.
- Distinction: Code submitted. Code works correctly, meets the specifications, and produces correct results but may not display all of them correctly. Code is clean, understandable and well organised, with just some minor errors. Code is well commented so that there is very little ambiguity about its purpose. One or two places could benefit from comments, or the code is overly commented. Variable names clearly describe the purpose of the variable.
- Credit: Code submitted. Code mostly works correctly, but functions incorrectly on some inputs. Minor details of the specification are violated. Code is fairly easy to read, although it contains at least one major issue that detracts from clarity. The comments leave some code blocks ambiguous as to their purpose. One or two places could benefit from comments, or the code is overly commented. Variable names do not describe the purpose of the variable.
- Pass: Code only provided in the answer document, but looks correct. Code often exhibits incorrect behaviour. Significant details of the specification are violated. Code contains more than one major issue that makes it difficult to read. The code is readable only by someone who already knows what it is supposed to be doing. Comments are not sufficient to see what the code is doing; a significant lack of comments makes it difficult to understand.
- Fail: Code not submitted or not provided in the answer document. Code produces incorrect results, does not compile, or significant errors occur. Code is poorly organised and very difficult to read. Code has no comments.

Methodology (40%)
- High Distinction: The methodology implemented is expertly documented and justified. The methodology implemented reflects a sophisticated and nuanced understanding of relevant concepts. All assumptions validated and communicated concisely. The methodology