Assignment 1
ENN543, Data Analytics and Optimisation, Semester 2, 2019

This document sets out the four (4) questions you are to complete for ENN543 Assignment 1. The assignment is worth 25% of the overall subject grade. Weights for individual questions are indicated throughout the document. Students should submit their answers in a single separate document (either a PDF or Word document), and upload this to TurnItIn.

Further Instructions:

1. Data required for this assessment is available on Blackboard alongside this document in ENN543 Assessment 1 Data.zip. Please refer to individual questions regarding which data to use for which question.
2. Answers should be submitted via the TurnItIn submission system, linked to on Blackboard. In the event that TurnItIn is down, or you are unable to submit via TurnItIn, please email your responses to [email protected].
3. Matlab code or scripts (or equivalent materials for other languages) should be submitted as supplementary material (i.e. additional files) or appendices. Note that this material will not be directly marked (i.e. marks will not be assigned for code quality). Figures and outputs/results that are critical to question answers should be included in the main question response, and not only be present in the Matlab (or similar) output.
4. Students who require an extension should lodge their extension application with HiQ (see http://external-apps.qut.edu.au/studentservices/concession/). Please note that teaching staff (including the unit coordinator) cannot grant extensions.

Problem 1. Linear Regression (20%).

Prediction of residuary resistance of sailing yachts at the initial design stage is of great value for evaluating the ship’s performance and for estimating the required propulsive power. Essential inputs include the basic hull dimensions and the boat velocity. The Delft data set comprises 308 full-scale experiments, which were performed at the Delft Ship Hydromechanics Laboratory for that purpose.
The results of these experiments are in the file yacht.dat. These experiments include 22 different hull forms, derived from a parent form closely related to the “Standfast” designed by Frans Maas. The columns correspond to the following variables (in order):

• Residuary resistance per unit weight of displacement, adimensional;
• Longitudinal position of the center of buoyancy, adimensional;
• Prismatic coefficient, adimensional;
• Length-displacement ratio, adimensional;
• Beam-draught ratio, adimensional;
• Length-beam ratio, adimensional;
• Froude number, adimensional.

Using this data:

1. Using fitlm in MATLAB, fit a model to predict the resistance per unit weight of displacement as a function of the other variables. Discuss if this is a valid model.
2. Given the above model as a starting point, investigate how it can be improved. In this you should consider:
(a) The use of training and validation datasets. The data should be divided such that the split between these two sets is approximately 80% for training and 20% for validation.
(b) Are all variables important for the model?

Problem 2. Regularised Regression (20%).

Web pages collect large volumes of data on page views, page links, etc., to monitor readership. For commercial ventures, this can help inform publishing and layout decisions, as well as advertising. The BlogFeedback dataset contains data on blog readership, and can be used to predict page views in the next 24 hours based on past readership data. You have been supplied with two variants of this data:

1. Files named blogData noBow train.csv and blogData noBow test.csv contain features that capture the average readership information for the blog, and information for the specific post (see blogData Variables.txt for further information);
2. Files named blogData train.csv and blogData test.csv contain all the features of the noBow files alongside 200 bag-of-words features¹ that capture the blog post content.
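For orientation only, the idea behind a bag-of-words feature can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build the supplied files. The function name `bag_of_words`, the three-word vocabulary and the example sentence are all hypothetical, and the actual 200 words in the dataset are not listed in this document.

```python
from collections import Counter
import re


def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Words that never occur get a count of 0 (Counter returns 0 for missing keys).
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and post, made up for illustration.
vocabulary = ["data", "model", "blog"]
post = "This blog post is about data: clean data makes a better model."
features = bag_of_words(post, vocabulary)
print(features)
```

Each post becomes a fixed-length vector of word counts, which is why these features can sit alongside the readership features as extra columns.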
Note that the testing data contains examples from later times than the training data, simulating a real-world case where the model is trained on historic data to predict the future.

Using this data:

1. Fit a model using Linear regression, Ridge and LASSO regression on the noBow data. With these models consider the following:
(a) Determine the best value of λ to use in the Ridge model to obtain the best predictive model.
(b) Determine the best value of λ to use in the LASSO model to obtain the best predictive model.
2. Fit a model using Linear regression, Ridge and LASSO regression on the data containing the Bag-of-Words features. With these models consider the following:
(a) Determine the best value of λ to use in the Ridge model to obtain the best predictive model.
(b) Determine the best value of λ to use in the LASSO model to obtain the best predictive model.
3. Compare the performance of the two Linear, Ridge and LASSO models. You should consider factors such as the errors of the models, the R² and Adjusted R², and the model validity in your discussion. Which, if any, models are suitable for use? Justify your response.

¹ Bag-of-words features capture the number of instances of particular words in a document. An introduction to Bag-of-Words can be found at https://machinelearningmastery.com/gentle-introduction-bag-words-model/. Note however that an understanding of bag-of-words is not needed for this question or subject.

Problem 3. Clustering I (30%).

Understanding power use in the home is increasingly important as society strives to improve energy efficiency. The Household Power Consumption dataset captures energy use in a single home over a period of several years, and can be used to analyse usage patterns and detect periods of abnormal power use. You have been provided data covering a single year (2007) in household power consumption 2007.csv. The columns in this data correspond to the following variables (in order):

• date: Date in dd/mm/yyyy format.
• time: Time in hh:mm:ss format.
• global active power: Household global minute-averaged active power (in kilowatts).
• global reactive power: Household global minute-averaged reactive power (in kilowatts).
• voltage: Minute-averaged voltage (in volts).
• global intensity: Household global minute-averaged current intensity (in amperes).
• sub metering 1: Energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave.
• sub metering 2: Energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry, containing a washing-machine, a tumble-drier, a refrigerator and a light.
• sub metering 3: Energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

Using this data, you are to investigate if usage patterns can be identified in the data, and if abnormal behaviours can be detected. In particular you are to:

1. Cluster the data considering the three sub-meter readings only (sub metering 1, sub metering 2, sub metering 3), using the clustering method (and number of clusters) of your choice. Justify your selection of the clustering method and parameters (i.e. number of clusters) based on the requirements of this problem, the nature of the data, and the capabilities of the clustering method.
2. With the clustered data investigate:
(a) Are trends visible in the clustered data? For example, can changes in use be seen at different times of the year (i.e. summer vs winter), or from a weekday to a weekend?
(b) Can any abnormal usage be detected? If abnormalities can be found, show a visual comparison between the abnormal time period and a nearby (i.e. the previous or next day) normal time period. The method to select abnormal samples should classify approximately 1% of the data as abnormal.
For the purposes of this problem, a period of abnormal usage is a period of 2 hours (or more) where 50% or more of the samples are abnormal.

In completing this question you may also like to consider:

1. Is it reasonable (or practical) to learn the clusters on all the data?
2. Can the data be aggregated in any way to reduce the volume of data? Does such aggregation alter the findings?

Problem 4. Clustering II (30%).

Sensors such as accelerometers and gyroscopes are becoming increasingly common in wearable and mobile devices. From these signals, it is possible to detect different activities, and potentially even different people. You have been supplied with three files that capture wearables signal data as follows:

• wearables signal.csv contains 3,237 samples of 561-dimensional wireless sensor data;
• wearables activity.csv contains the ground truth activity being performed for each of the samples in wearables signal.csv. There are 6 activities (Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, Laying) in total.
• wearables subject.csv contains the ground truth subject ID for each of the samples in wearables signal.csv. There are 10 subjects in total.

Using this data, you are to investigate if the classes of activity and the users can be separated via clustering. In particular you are to:

1. Cluster the data using HAC with the aim of:
(a) Separating the data into the 6 activity classes. Using the provided ground truth, evaluate the accuracy of the clustering result.
(b) Separating the data into the 10 identity classes. Using the provided ground truth, evaluate the accuracy of the clustering result.
(c) Separating the data into 60 clusters such that each cluster corresponds to a particular individual performing a particular activity. Using the provided ground truth, evaluate the accuracy of the clustering result.
2. Repeat the three clustering tasks using DBSCAN, and compare the performance of the clustering results obtained using DBSCAN and HAC. Comment on any differences observed between the two methods, and which method is more suitable in this situation.
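The brief does not say how clustering accuracy should be measured. One common choice is purity: each cluster is assigned its most frequent ground-truth label, and the score is the fraction of samples that match. The sketch below is one possible reading of "accuracy", not the required method. The function name `cluster_purity` and the toy labels are hypothetical and are not taken from the wearables files.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples whose true label is the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("need at least one sample")
    # Count how often each ground-truth label occurs inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        counts[cluster][label] += 1
    # Credit each cluster with the size of its most frequent label.
    matched = sum(max(label_counts.values()) for label_counts in counts.values())
    return matched / len(cluster_ids)


# Toy example, made up for illustration: 8 samples in 3 clusters.
toy_clusters = [0, 0, 0, 1, 1, 1, 2, 2]
toy_activity = ["Walking", "Walking", "Sitting",
                "Sitting", "Sitting", "Standing",
                "Laying", "Laying"]
toy_purity = cluster_purity(toy_clusters, toy_activity)
print(toy_purity)
```

Purity rises as the number of clusters grows, so it is best compared between methods that use the same number of clusters. If a clustering output marks noise points with a special label, this function treats them as one more cluster, which should be kept in mind when comparing results.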
Aug 18, 2021