[Title of your report]

Introduction
The introduction should be able to be understood by a layperson and should include the purpose of the analysis. As a guideline, one paragraph will be sufficient. [Delete instruction text before submitting]
[Type your introduction here]

Motivation and Methodology
Describe the motivation for the analysis methods that you have used. This section must answer the questions what you did, why you did that and how you did it. As a guideline, maximum two paragraphs will be sufficient. [Delete instruction text before submitting]
[Type your description of methods here]

Results & Discussion
Summarise the main results of your analyses in questions 2 and 3. You may use subsections, tables etc. as you see fit. Present and discuss results in a clear and simple way:
• Present findings of statistical analyses in a logical sequence.
• Do not include code or dumps of R output. Results should either be incorporated into sentences or formatted appropriately to be neatly presented.
• Interpret your findings by discussing their practical significance.
• Discuss shortcomings, if any.
As a guideline, maximum three paragraphs will be sufficient. [Delete instruction text before submitting]
[Type your results and discussion here]

Recommendations & Conclusions
What do you conclude overall about the analysis? As a guideline, one paragraph will be sufficient. Do not introduce any new information in this section, and do not simply repeat statements made elsewhere in your report! [Delete instruction text before submitting]
[Type your recommendations and conclusions here]

MATH 1081 UO Mathematical Methods for Data Analytics 2
Assessment 2.1: Project Part A

Instructions:
• Structure of the assessment: This assessment is worth 25% of your final grade and is due no later than 12 pm on Monday, Week 7. This assessment consists of 3 main questions to answer and a report writing. Your submission will be marked out of 100.
• Use of R: This project is a guided case study. It is important that you follow any instructions or guidance in the questions, such as “Use R” where required. To get full marks, you must provide your R code wherever you use R to answer the questions. Upload your R script and include screenshots of the R code in your answer sheet.
• Save your work: Save your answer sheet as a pdf named “your student ID Assessment 2.1 MATH1081.pdf”.
• Show your work: Show all necessary steps so that the reader can follow your solution procedure.
• Submit your work: Create a folder with 1. your answer sheet, 2. your R script and 3. the final dataset you used for the analysis in “.csv” format. Name your folder with your student ID and upload it as a zip file.
• Acknowledgement of work: When submitting online, you acknowledge that the submitted assignment is your own work unless otherwise stated.
• Academic integrity: The University’s policy on academic misconduct will be strictly applied. Here are some tips to avoid academic misconduct:
  – Do not copy from any printed or electronic source or from any person.
  – Write your own solutions. You may discuss your work with others, but you must write up your solutions yourself. You are not allowed to use someone else’s written work when writing up your submission.
  – Do not give inappropriate help. Giving inappropriate help is just as serious as receiving it and will have the same consequences. Do not show your completed exercise to others. Dispose of drafts so that no one can access them.
  – Acknowledge help and joint work. If you receive any help from another source (for example, students, tutors, friends, internet), you must make a note of it on your submission.
• Late submission: Any late submission will attract a penalty of 5 of the available marks per day for five days. The cut-off time is 12 pm each day. After five days from the assessment due date, no submissions will be marked, and zero marks will be granted.

Assessment Task Overview

Photo by Luke van Zyl on Unsplash (https://unsplash.com/)

This assessment is based on the data in the Melbourne housing.csv file. It contains residential building data, including construction cost, sales prices, some project variables, and some economic variables corresponding to real estate in Melbourne, Australia. The objective is to understand, analyse and develop a model to predict the sales price (Price). A brief description of the variables is provided below.

Table 1: Data dictionary, Melbourne Housing.csv

Variable        Description
Suburb          Suburb
Address         Street address
Rooms           Number of rooms
Type            Type of housing: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential
Price           Actual sales price (local currency)
Method          S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed; N/A - price NA
SellerG         Real estate agent
Date            Date sold
Distance        Distance from CBD in kilometres
Regionname      General region (West, North West, North, North East, etc.)
Propertycount   Number of properties that exist in the suburb
Bedroom2        Scraped number of bedrooms (from a different source)
Bathroom        Number of bathrooms
Car             Number of car spots
Landsize        Land size in metres
BuildingArea    Building size in metres
YearBuilt       Year the house was built
CouncilArea     Governing council for the area
Lattitude       Self-explanatory
Longtitude      Self-explanatory

Assessment Task Details

You have to complete this assessment in two sections.
1. A list of questions to answer, comprising 70% of the total grade (70 marks). Write your answers clearly in a well-organised manner with accurate notation. Label the questions and sub-questions.
2. A report summarising your analysis in Section 1, comprising 30% of the total grade (30 marks). A guide for the project report is provided in learnonline.

Section 1: Questions

1. Data is not always clean or presented in a usable form. This dataset contains some unnecessary columns and some variables with incomplete entries. It may also contain errors, which you have to fix before you start analysing. You can do the data cleansing in R or Excel.
(a). Choose and filter a single house ‘Type’. Use this for the remainder of the assignment. Provide a dot-point summary of the corrections made to the dataset, and create a subset dataset with the continuous variables and ‘Postcode’. Hint: Use the na.omit function. For full marks, provide a screenshot of the first 30 row entries of the cleaned dataset in R. [7 marks]
(b). Find the covariance matrix and include it as a screenshot. Exclude the response variable, ‘Number of Rooms’ and ‘Postcode’ when finding the covariance matrix. [4 marks]
(c). Find the eigenvalues and eigenvectors of the covariance matrix in part (b). Provide the R output and code as your solution. [5 marks]
(d). Provide the diagonalized form of the covariance matrix in part (b), using the results in part (c). [4 marks]
[20 marks]

2. Conduct a Principal Component Analysis (PCA) and develop a model that predicts the response variable, the sales price of a property (Price). This is an open question and you need to write up your results. Answer the following guided questions to finish this task.
(e). Use R to compute the correlation matrix between the variables. Present a scatterplot of any two strongly correlated variables. For full marks, interpret the observed relationship between these variables, incorporating the correlation coefficient. [3 marks]
(f).
Split the subset of the Melbourne housing dataset into training and testing datasets, with 80% of the data in the training set. Hint: Use the sample_n function. [4 marks]
(g). Conduct PCA on the training dataset, and present your findings. Discuss the principal components and the explained variance. How many Principal Components (PCs) are you going to keep? [4 marks]
(h). Provide a visualization of the first two PCs with ggbiplot. [4 marks]
(i). Form a dataset from your training set in terms of the PC components and the objective variable. [3 marks]
(j). Use the lm function in R to develop a linear regression model to predict the response variable, sales price. Present your R output and the summary of the model. Are the coefficients significant? [5 marks]
(k). Is the model in part (j) a good model? Why? [2 marks]
(l). Run your model from part (j) on the testing dataset and compare the output to the original sales price in the testing dataset. Does the model tend to under- or overestimate the sales price? [3 marks]
(m). Run the model and predict the value of the sales price for postcode 3000. Comment on the prediction. [2 marks]
[30 marks]

3. Assume that the ‘project-based pricing strategy’ is used for pricing, and it has been determined that the following function C(x, y) is proportional to the sales price and economic cost.
(n). Run the gradient descent algorithm in R to find the minimum of the cost function C(x, y), and the values of x and y that produce it, starting from (x, y) = (121, 70) with a learning rate of 0.01 and a convergence threshold of 0.005.

\[ C(x, y) = 200 - (x - 100)\, y \exp\left( -\left( (x - 120)^2 + \left( \frac{y}{100} \right)^2 \right) \right) \]

Provide your R code, the output, the minimum of the cost function C(x, y), and the values of x and y that produce it as the answer. [12 marks]
(o). Assume that the variables x and y in the function C(x, y) are the variables Landsize (x) and BuildingArea (y). Predict the sales price from the values of Landsize (x) and BuildingArea (y) obtained in part (n), along with the averages of any other needed variables, using the model you developed in Question 2 part (j). [7 marks]
(p). The sales price of which postcode is closest to the sales price you obtained in part (o)? Hint: Use the which function in R to get the sales prices within ±x of the value obtained in part (o), where x is a user-defined margin. [3 marks]
[20 marks]

Section 2: Report

This is a written section to present your results in report form. It includes the following components:
• Introduction [5 marks]
• Motivation and Methodology [5 marks]
• Results and presentation of main results [10 marks]
• Discussion and Conclusions [10 marks]
[30 marks]
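
For reference on Question 3 part (n), the general shape of a gradient descent loop in R can be sketched as follows. This is only an illustration under stated assumptions: it minimises a toy quadratic f, not the assignment's C(x, y); the names f, grad_f and gradient_descent, the max_iter safeguard, and the choice to stop when the gradient norm falls below the threshold are all assumptions made for the example. Other stopping rules, such as the change in the function value between iterations, are equally common.

```r
# Toy objective (an assumption for illustration, NOT the assignment's C(x, y)).
# Its minimum is at (3, -1).
f <- function(p) (p[1] - 3)^2 + 2 * (p[2] + 1)^2

# Analytic gradient of the toy objective.
grad_f <- function(p) c(2 * (p[1] - 3), 4 * (p[2] + 1))

# Hypothetical helper: plain gradient descent with a fixed learning rate.
# Assumed stopping rule: gradient norm below `threshold`.
gradient_descent <- function(grad, start, learning_rate = 0.01,
                             threshold = 0.005, max_iter = 100000) {
  p <- start
  for (i in seq_len(max_iter)) {
    g <- grad(p)
    # Stop once the gradient is small enough.
    if (sqrt(sum(g^2)) < threshold) {
      return(list(par = p, iterations = i - 1, converged = TRUE))
    }
    # Step against the gradient.
    p <- p - learning_rate * g
  }
  # Safeguard: report non-convergence instead of looping forever.
  list(par = p, iterations = max_iter, converged = FALSE)
}

result <- gradient_descent(grad_f, start = c(0, 0))
print(result$par)      # estimated minimiser
print(f(result$par))   # objective value at the estimate
```

The same loop applies to any differentiable function once its gradient is supplied, either derived by hand or approximated numerically.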