Applied Machine Learning
Online exam, Thursday 7 January 2021 at 09:30 Irish Time
2020 Sample Exam Paper (Open Book)

Answer any 3 questions from 4. All questions carry equal marks.

Question 1

(A) The following table presents the Pearson coefficients from a data-set.
(i) Evaluate the table for potential multicollinear attributes. Explain the reasoning behind the choices you have made.
(ii) Evaluate the table for attribute selection. Explain the reasoning behind the potential attributes that you have selected, based on the Pearson coefficients.
(6 Marks)

(B) The histogram presented represents 398 cars surveyed for their fuel efficiency (miles per gallon, MPG).
(i) Evaluate the histogram for potential outliers or deemed missing data. Explain the reasoning behind the choices you have made (provide the steps and calculations you used to support your decisions).
(ii) Explain how you would deal with the evaluation findings from part (i).
(8 Marks)

Mean 23.3
Standard Deviation 7.05
Minimum 0
Maximum 43.6
Number of Instances 398

Question 1 contd

(C) The pre-examination of the class distribution is an important exercise before developing classification models.
(i) Discuss this statement, explaining why this is an appropriate pre-examination technique, and discuss the implications of not conducting this technique.
(ii) Provide examples of problem situations where this technique would be useful when examining the model's performance.
(6 Marks)

Question 2

(A) The terms type I error and type II error are often discussed when model performance is presented.
(i) Explain how you would evaluate a type I error and a type II error.
(ii) Given that the model is trying to identify patients with a life-threatening disease, discuss this problem situation concerning both types of errors, also explaining which you think is the more important error in this case and why.
(6 Marks)

(B) The following table contains the performance results (Accuracy, Sensitivity and Specificity) for classification models A and B.
Both models are trying to identify sports injuries before they happen.

Model A    Model B

Assumption: Model B is the most suitable to predict sports injuries before they happen.
(i) Explain why someone would make this incorrect assumption, using the values presented in the table above to aid your answer.
(ii) Explain your reason why Model A is the most suitable model for predicting sports injuries before they happen, using the values presented in the table above to aid your answer.
(6 Marks)

(C) Ten-fold Cross Validation is the most important Machine Learning model validation technique (the best technique to use).
(i) Explain what Ten-fold Cross Validation is.
(ii) Explain an alternative to Ten-fold Cross Validation. Compare and contrast the two techniques (10-fold Cross Validation and the alternative technique), giving examples of problem situations where each technique may be more suitable.
(8 Marks)

Question 3

(A) The k-value in the KNN classification algorithm can be selected using the elbow method.
(i) Explain why you would initially decide a k-value to be even or odd.
(ii) Explain how you would evaluate the most appropriate k-value for a KNN algorithm using the above figure and the elbow method.
(6 Marks)

(B) The naïve Bayes Machine Learning algorithm is often a high-performing classification algorithm.
(i) Explain why the Bayesian-based algorithm includes "naïve" in its title and how this may affect the model's performance.
(ii) Compare and contrast the naïve Bayes algorithm with two other Machine Learning algorithms.
(8 Marks)

(C) Semi-supervised learning is an approach that is sometimes required, combining both supervised learning and unsupervised learning.
(i) Describe a problem situation where semi-supervised learning is required.
(ii) Explain why this approach is needed for the answer in part (i), describing why supervised learning or unsupervised learning alone would be unable to address the problem situation mentioned.
(6 Marks)

Question 4

(A) The hyperparameters batch size and epochs are fundamental in the development of an Artificial Neural Network (ANN).
(i) Explain how you would evaluate and select a suitable batch size and number of epochs.
(6 Marks)

(B) Bootstrap is a statistical estimation technique where a statistical quantity like a mean is estimated from multiple random samples of your data (with replacement).
(i) Discuss this statement, explaining in your own words how the Bootstrap pre-processing technique works.
(ii) Provide an example of a problem situation where this technique should be considered, explaining why you think Bootstrap is suitable for this problem situation.
(6 Marks)

(C) Statistical testing is often used to compare the performance of two or more machine learning models.
(i) Compare and contrast any two methods of statistical testing for comparing two or more Machine Learning models.
When reporting a statistical test result, many people do not present the entire picture, calling into question the findings.
(ii) Discuss what are the most important parts of a statistical test to report, so that there is no ambiguity in the findings, explaining your reason for each part selected.
(8 Marks)

2019 Sample Exam Paper (Traditional in person)

Answer any 3 questions from 4. All questions carry equal marks.

Question 1

(A) The following table presents the Pearson coefficients from a data-set. What is this table often used for in data pre-processing? Identify one pair of attributes from the table and explain their values and what they mean in reference to data pre-processing.
(6 Marks)

(B) The histogram presented represents 398 cars surveyed for their fuel efficiency (miles per gallon, MPG). Describe this histogram. What values, if any, would you identify as concerns (consider for marking)? Finally, what steps (and the rationale for your choice) would you take for each identified concern?
(14 Marks)

Mean 23.3
Standard Deviation 8.05
Minimum 0
Maximum 46.6
Number of Instances 398

Question 1 contd

(C) Explain why a pre-examination of the class distribution is an important exercise prior to running classification models. Give an example of a concern that may arise if this exercise is not conducted.
(4 Marks)

Question 2

(A) Explain the terms type I error and type II error. Given that the model is trying to identify patients with a life-threatening disease, which is the most important measure to identify correctly and why?
(4 Marks)

(B) a and b, calculate: appropriate and why, considering that both models are trying to identify sports injuries before they happen

Model A
  a    b   <- classified as
 40   30   a = 0
 50  200   b = 1

Model B
  a    b   <- classified as
  5   65   a = 0
  2  248   b = 1

(10 Marks)

(C) Discuss in detail, with the aid of a diagram, the 10-fold cross validation technique.
(10 Marks)

Question 3

(A) Describe how the k-value in the KNN classification algorithm is selected. Use a diagram to aid your explanation.
(8 Marks)

(B) Discuss the naïve Bayes classification model; describe its strengths and weaknesses as a classification model, and specifically mention the type of output that the classifier returns in addition to the classification.
(10 Marks)

(C) Describe what semi-supervised learning is. Where would it typically be applied? Mention one technique that implements semi-supervised learning.
(6 Marks)

Question 4

(A) What do the parameters batch size and epochs represent in terms of an Artificial Neural Network? Describe (an overview of) how the backpropagation technique is used.
(6 Marks)

(B) What is the bootstrap technique in relation to pre-processing? Describe the technique, and explain why and when it is typically applied.
(4 Marks)

(C) What is the fundamental application of a Student t-test and an ANOVA test?
What issue arises when multiple Student t-tests are administered on the same data set, and what is the solution? What does a p-value mean? Discuss with a confidence interval/value of 95%.
(8 Marks)

(D) Root mean squared error is often used as a performance measure in regression models. Describe this metric. In addition, a Pearson correlation coefficient is often a prominent metric of a regression model. Explain this metric, using diagrams.
(6 Marks)
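For 2020 Question 1 (B), one possible supporting calculation is a z-score check on the reported minimum and maximum. The sketch below uses the summary statistics printed in that question. The three-standard-deviation cut-off and the function name `z_score` are assumptions chosen for illustration, not part of the paper, and this is not a model answer.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


MEAN, SD = 23.3, 7.05          # summary statistics from 2020 Question 1 (B)
MINIMUM, MAXIMUM = 0.0, 43.6
THRESHOLD = 3.0                # assumed cut-off (three-sigma rule of thumb)

for label, value in (("minimum", MINIMUM), ("maximum", MAXIMUM)):
    z = z_score(value, MEAN, SD)
    flagged = abs(z) > THRESHOLD
    print(f"{label}: value={value}, z={z:.2f}, flagged={flagged}")
```

With these figures the minimum of 0 lies about 3.3 standard deviations below the mean and is flagged, while the maximum of 43.6 lies about 2.9 above it and is not. An MPG of 0 is also implausible for a working car, so it may be a placeholder for a missing value rather than a genuine extreme reading.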
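The bootstrap statement in 2020 Question 4 (B) can be illustrated with a short sketch that estimates a mean from repeated resamples drawn with replacement. The MPG readings, the number of resamples, the seed and the function name `bootstrap_means` are hypothetical values chosen for illustration only.

```python
import random
import statistics


def bootstrap_means(data, n_resamples, seed=0):
    """Return the mean of each resample drawn from data with replacement."""
    rng = random.Random(seed)
    size = len(data)
    return [
        statistics.fmean(rng.choices(data, k=size))
        for _ in range(n_resamples)
    ]


# Hypothetical MPG readings, not taken from the exam data set.
sample = [18.0, 21.5, 23.0, 24.5, 26.0, 27.5, 30.0, 33.5, 15.0, 22.0]

means = sorted(bootstrap_means(sample, n_resamples=2000))
estimate = statistics.fmean(means)
# Percentile interval: the middle 95% of the resampled means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"sample mean={statistics.fmean(sample):.2f}")
print(f"bootstrap estimate={estimate:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

Each resample has the same size as the original sample, so some readings appear more than once and others not at all. The spread of the resampled means gives a rough picture of how uncertain the estimate is, which is useful when the data set is too small to split further.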
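As a worked illustration of 2019 Question 2 (B), the accuracy, sensitivity and specificity of the two confusion matrices can be sketched as follows. This is a minimal sketch, not a model answer. It assumes that the first matrix (40, 30 / 50, 200) belongs to Model A and the second (5, 65 / 2, 248) to Model B, that rows are actual classes and columns are predicted classes, and that class a is the positive (injury) class. The function name `summarise` is hypothetical.

```python
def summarise(tp, fn, fp, tn):
    """Return accuracy, sensitivity and specificity for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Assumption: rows are actual classes, columns are predicted classes,
# and class a is treated as the positive class.
model_a = summarise(tp=40, fn=30, fp=50, tn=200)
model_b = summarise(tp=5, fn=65, fp=2, tn=248)

for name, scores in (("Model A", model_a), ("Model B", model_b)):
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Under these assumptions Model B has the higher accuracy (about 0.791 against 0.750) but a sensitivity of only about 0.071, against 0.571 for Model A. Accuracy alone can therefore mislead when the class distribution is imbalanced.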