2020 Sample Exam Paper Open Book.pdf Answer any 3 questions from 4, all questions carry equal marks. Question 1 (A) The following table presents the Pearson coefficients from a data-set. (i) Evaluate...

Applied Machine Learning Online exam, Thursday 7/January/2021 at 09:30 Irish Time


2020 Sample Exam Paper Open Book.pdf
Answer any 3 questions from 4, all questions carry equal marks.

Question 1
(A) The following table presents the Pearson coefficients from a data-set.
(i) Evaluate the table for potential multicollinear attributes. Explain the reasoning behind the choices you have made.
(ii) Evaluate the table for attribute selection. Explain the reasoning behind the potential attributes that you have selected, based on the Pearson coefficients.
(6 Marks)
(B) The histogram presented represents 398 cars surveyed for their fuel efficiency (miles per gallon, MPG).
(i) Evaluate the histogram for potential outliers or deemed missing data. Explain the reasoning behind the choices you have made (provide the steps and calculations you used to support your decisions).
(ii) Explain how you would deal with the evaluation findings from part (i).
(8 Marks)
Mean 23.3, Standard Deviation 7.05, Minimum 0, Maximum 43.6, Number of Instances 398
(C) The pre-examination of the class distribution is an important exercise before developing classification models.
(i) Discuss this statement, explaining why this is an appropriate pre-examination technique, and discuss the implications of not conducting this technique.
(ii) Provide examples of problem situations where this technique would be useful when examining the model's performance.
(6 Marks)

Question 2
(A) The terms type I error and type II error are often discussed when model performance is presented.
(i) Explain how you would evaluate a type I error and a type II error.
(ii) Given that the model is trying to identify patients with a life-threatening disease, discuss this problem situation concerning both types of errors, also explaining which you think is the more important error in this case and why.
(6 Marks)
(B) The following table contains the performance results for classification models A and B (Accuracy, Sensitivity and Specificity), where both models are trying to identify sports injuries before they happen.
Table: Model A, Model B
Assumption: B is the most suitable to predict sports injuries before they happen.
(i) Explain why someone would make this incorrect assumption, using the values presented in the table above to aid your answer.
(ii) Explain your reason why Model A is the most suitable model for predicting sports injuries before they happen, using the values presented in the table above to aid your answer.
(6 Marks)
(C) Ten-fold Cross Validation Machine Learning model validation techniques (the best technique to use).
(i) Explain what is the most important
(ii) Explain an alternative to Ten-fold Cross Validation. Compare and contrast the two techniques (10-fold Cross Validation and the alternative technique), giving examples of problem situations where each technique may be more suitable.
(8 Marks)

Question 3
(A) The k-value in the KNN classification algorithm can be selected using the elbow method.
(i) Explain why you would initially decide a k-value to be even or odd.
(ii) Explain how you would evaluate the most appropriate k-value for a KNN algorithm using the above figure and the elbow method.
(6 Marks)
(B) The naïve Bayes Machine Learning Algorithm is often a high performing classification algorithm.
(i) Explain why the Bayesian based algorithm includes "naïve" in the title and how this may affect the model's performance.
(ii) Compare and contrast the naïve Bayes algorithm with two other Machine Learning Algorithms.
(8 Marks)
(C) Semi-supervised learning is an approach that is sometimes required, combining both supervised learning and unsupervised learning.
(i) Describe a problem situation where semi-supervised learning is required.
(ii) Explain why this approach is needed for the answer in part (i), describing why supervised learning or unsupervised learning alone would be unable to address the problem situation mentioned.
(6 Marks)

Question 4
(A) The hyperparameters batch size and epochs are fundamental in the development of an Artificial Neural Network (ANN).
(i) Explain how you would evaluate and select a suitable batch size and epochs.
(6 Marks)
(B) Bootstrap is a statistical estimation technique where a statistical quantity like a mean is estimated from multiple random samples of your data (with replacement).
(i) Discuss this statement, explaining in your own words how the Bootstrap pre-processing technique works.
(ii) Provide an example of a problem situation where this technique should be considered, explaining why you think Bootstrap is suitable for this problem situation.
(6 Marks)
(C) Statistical testing is often used to compare the performance of two or more machine learning models.
(i) Compare and contrast any two methods of statistical testing for comparing two or more Machine Learning models.
When reporting a statistical test result, many people do not present the entire picture, calling into question the findings.
(ii) Discuss what are the most important parts of a statistical test to report, so that there is no ambiguity in the findings, explaining your reason for each part selected.
(8 Marks)

SampleExam2019 Sample Exam Paper (Traditional in person).pdf
Answer any 3 questions from 4, all questions carry equal marks.

Question 1
(A) The following table presents the Pearson coefficients from a data-set. What is this table often used for in data pre-processing? Identify one pair of attributes from the table and explain their values and what they mean in reference to data pre-processing.
(6 Marks)
(B) The histogram presented represents 398 cars surveyed for their fuel efficiency (miles per gallon, MPG). Describe this histogram. What values, if any, would you identify as concerns (consider for marking)? Finally, what steps (and the rationale for your choice) would you take for each identified concern?
(14 Marks)
Mean 23.3, Standard Deviation 8.05, Minimum 0, Maximum 46.6, Number of Instances 398
(C) Explain why a pre-examination of the class distribution is an important exercise prior to running classification models. Give an example of a concern that may arise if this exercise is not conducted.
(4 Marks)

Question 2
(A) Explain what the terms type I error and type II error mean. Given that the model is trying to identify patients with a life-threatening disease, which is the most important measure to identify correctly and why?
(4 Marks)
(B) a and b, calculate: appropriate and why, considering that both models are trying to identify sports injuries before they happen
Model A
  a    b    <- classified as
  40   30   a = 0
  50   200  b = 1
Model B
  a    b    <- classified as
  5    65   a = 0
  2    248  b = 1
(10 Marks)
(C) Discuss in detail, with the aid of a diagram, the 10 fold cross validation technique.
(10 Marks)

Question 3
(A) Describe how the k-value in the KNN classification algorithm is selected. Use a diagram to aid your explanation.
(8 Marks)
(B) Discuss the naïve Bayes classification model; describe its strengths and weaknesses as a classification model, and specifically mention the type of output that the classifier returns in addition to the classification.
(10 Marks)
(C) Describe what is semi-supervised learning. Where would it be typically applied and mention one technique that implements semi-supervised learning.
(6 Marks)

Question 4
(A) What do the parameters batch size and epochs represent in terms of an artificial neural network? Describe (an overview of) how the backpropagation technique is used.
(6 Marks)
(B) What is the bootstrap technique in relation to pre-processing? Describe the technique, and typically why and when is it applied?
(4 Marks)
(C) What is the fundamental application of a t-test and an ANOVA test? What issue arises when multiple Student t-tests are administered on the same data set and what is the solution? What does a p-value mean? Discuss with a confidence interval/value of 95%.
(8 Marks)
(D) Root mean squared error is often used as a performance measure in regression models; describe this metric. In addition, a Pearson correlation coefficient is often a prominent metric of a regression model; explain this metric, using diagrams.
(6 Marks)
Answered 1 day after Jan 05, 2021

Swapnil answered on Jan 07 2021
QUESTION 1
A)
1)
Outliers are data points that differ markedly from the other data points; they are unusual values in the dataset. If a value lies more than a certain number of standard deviations from the mean, that data point is identified as an outlier. The specified number of standard deviations is called the threshold. This method can fail to detect outliers because the outliers themselves increase the standard deviation. The smaller the range or standard deviation, the lower the variability of the data. The range is useful, but the standard deviation is considered the more reliable measure for statistical analyses.
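The threshold rule, and the way a large outlier can hide itself by inflating the standard deviation, can be sketched as follows. This is a minimal illustration: the function name `zscore_outliers` and the sample values are hypothetical and are not taken from the exam data.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one extreme value (95.0).
sample = [21.0, 22.5, 23.0, 23.5, 24.0, 24.5, 25.0, 26.0, 95.0]

# The extreme value inflates the standard deviation so much that its own
# z-score stays below 3, so the usual threshold misses it.
flagged_at_3 = zscore_outliers(sample, threshold=3.0)
flagged_at_2 = zscore_outliers(sample, threshold=2.0)
print(flagged_at_3, flagged_at_2)
```

With only nine points the three-standard-deviation threshold flags nothing, while a threshold of two flags 95.0, which is why the threshold should be chosen with the sample size in mind.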
2)
Standard deviation measures the spread of a data distribution. The more spread out a data distribution is, the greater its standard deviation. The standard deviation cannot be negative. A standard deviation close to 0 indicates that the data points tend to be close to the mean; the further the data points are from the mean, the greater the standard deviation. For an approximately normal data set, the values within one standard deviation of the mean account for about 68% of the set, those within two standard deviations for about 95%, and those within three standard deviations for about 99.7%.
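The 68%, 95% and 99.7% figures can be checked from the normal distribution itself. The short sketch below uses the standard identity that the fraction within k standard deviations is erf(k / sqrt(2)); the helper name `fraction_within` is just illustrative.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.4f}")
```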
B)
1)
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations. For the MPG data in the question, the clearest candidate is the minimum of 0: no car has a fuel efficiency of 0 MPG, and the value lies more than three standard deviations below the mean of 23.3, so it is most likely a placeholder for missing data.
2) Outliers should be investigated carefully. They often contain valuable information about the process under investigation or about the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether similar values are likely to continue to appear. Only then should the outliers in the dataset be calculated and handled.
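As a rough sketch of the calculation, the z-scores of the minimum and maximum can be computed from the summary statistics given in the 2020 question (mean 23.3, standard deviation 7.05). The imputation step is only one possible way of handling the finding; the helper `impute_zeros_with_median` and the `readings` list are hypothetical, since the raw exam data is not available.

```python
from statistics import median

# Summary statistics quoted in the question (2020 paper).
MEAN, SD, MINIMUM, MAXIMUM = 23.3, 7.05, 0.0, 43.6

def z_score(value, mean=MEAN, sd=SD):
    """Number of standard deviations between `value` and the mean."""
    return (value - mean) / sd

z_min = z_score(MINIMUM)  # roughly -3.3: beyond a 3 SD threshold
z_max = z_score(MAXIMUM)  # roughly +2.9: inside a 3 SD threshold

def impute_zeros_with_median(values):
    """Treat 0 MPG as missing and replace it with the median of the valid readings."""
    valid = [v for v in values if v > 0]
    fill = median(valid)
    return [v if v > 0 else fill for v in values]

# Hypothetical readings, not the exam data.
readings = [18.0, 0.0, 24.0, 31.5, 0.0, 27.0]
cleaned = impute_zeros_with_median(readings)
print(round(z_min, 2), round(z_max, 2), cleaned)
```

The median is used here rather than the mean because it is less affected by the zeros and by any genuine extreme values; dropping the affected rows is an equally defensible choice when there are few of them.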
C)
1) The statement is true. Multicollinearity is a situation where two or more predictors are highly linearly related. The Pearson correlation coefficient is appropriate when the two variables of interest are measured on interval or ratio scales, while associations between ordinal or nominal variables should be assessed using alternative methods.
2) The following points support this:
· Redundancy: two predictors might be providing the same information about the response variable thereby leading to unreliable coefficients of the predictors (especially for linear models).
· The estimated effect of a predictor on the response variable will tend to be less precise and less reliable.
· An important predictor can appear unimportant when it has a collinear relationship with other predictors.
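One way to look for such redundant predictors is to compute the Pearson coefficient for every pair of attributes and flag the pairs whose absolute value exceeds a cutoff. The sketch below is illustrative only: the attribute names, the values, and the 0.9 cutoff are assumptions, not figures from the exam table.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def collinear_pairs(columns, cutoff=0.9):
    """Return (name_a, name_b, r) for every pair of columns with |r| >= cutoff."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= cutoff:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Hypothetical attributes: the first two rise together almost linearly.
columns = {
    "weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "displacement": [2.1, 3.9, 6.2, 8.0, 9.9],
    "model_year": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = collinear_pairs(columns)
print(pairs)
```

Only one attribute from each flagged pair would normally be kept, typically the one that correlates more strongly with the target.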
QUESTION 2
A)
1)
Accuracy is a statistical term that means what it intuitively suggests: the proportion of predictions that are correct. Sensitivity and specificity are harder to interpret. Each outcome in the table above is one of only two values, positive or negative. At a minimum, the model should predict the outcome better than random guessing.
2)
Use the confusion matrix for the overall calculations. After training model B, generate predictions to populate the confusion matrix, because it helps you predict how many clicks a title will receive in the Netflix system. Accuracy then summarizes how often the predictions in that matrix are correct for the next title. A perfectly accurate model would put every transaction into the correct cells of the matrix. Accuracy is the simple proportion of cases classified as true positives or true negatives.
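To make the confusion-matrix calculations concrete, the sketch below uses the two matrices printed in the 2019 sample paper quoted in the question. Two things are assumptions rather than facts from the source: that the first matrix belongs to Model A and the second to Model B, and that class `a` (the minority class, 70 cases) is the positive "injury" class.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and error counts from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "type_i_errors": fp,             # false positives
        "type_ii_errors": fn,            # false negatives
    }

# Rows are actual classes, columns are predicted classes (assumed layout).
model_a = metrics(tp=40, fn=30, fp=50, tn=200)
model_b = metrics(tp=5, fn=65, fp=2, tn=248)

for name, m in (("Model A", model_a), ("Model B", model_b)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Under these assumptions Model B has the higher accuracy (about 0.79 against 0.75) but a sensitivity of only about 0.07, meaning it misses 65 of the 70 positive cases, whereas Model A finds 40 of them.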
B)
1)
A gold standard is an accepted standard that people can look to as an accurate and reliable reference. In medicine, for example, researchers often refer to blood assay as a gold standard for checking patients’ medication adherence. As with many such standards, however, because it is expensive and time-consuming, researchers search for quicker and less expensive, but still consistent ways of achieving comparable results. They gauge the value of their methods by comparing them to those achieved using the so-called gold standard.
2)
You train your model on the training set and then validate it with the cross-validation set. Once your model reaches its highest accuracy on the cross-validation set, you evaluate it on the test set to get the "real" accuracy. Cross-validation is the most rigorous way of choosing hyperparameters, but it is time-consuming. One alternative method is simple holdout validation, which reduces the time complexity. Alternatively, you could use...
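The difference between the two splitting schemes can be sketched with index lists alone. This is a minimal illustration in plain Python, not a replacement for a library implementation; the function names, the 70/30 holdout ratio, and the fixed seed are assumptions, and 398 is simply the number of instances mentioned in the question.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; every index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

def holdout_indices(n, test_fraction=0.3, seed=0):
    """Return a single (train, test) split of the indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * (1 - test_fraction)))
    return idx[:cut], idx[cut:]

splits = list(k_fold_indices(398, k=10))
train, test = holdout_indices(398)
print(len(splits), len(train), len(test))
```

Ten-fold cross-validation trains ten models and uses every instance for testing once, which gives a more stable estimate on small data sets; holdout trains a single model, which is cheaper and usually adequate when data is plentiful.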