R assignment, 2 parts

Question

CEIE 474/574 | Assignment 6

CEIE 474/574 Construction Computer Application and Informatics
Assignment 6 – Data Mining

Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. Data mining tasks generally fall into two groups: predictive analysis and descriptive analysis, also known as supervised learning and unsupervised learning. This assignment provides one predictive problem and one descriptive problem to give you insight into data mining applications in the construction industry. For the predictive problem, regression is used to predict construction equipment maintenance cost. For the descriptive problem, clustering is used to support product quality analysis and product quality management.

1 Problem 1 – Regression

1.1 Background

Accurate cost prediction helps construction professionals both secure reasonable profits and deliver projects within budget. Hour meters log the running time of equipment so that expensive machines or systems receive proper maintenance. This maintenance typically involves replacing, changing, or checking parts, belts, filters, oil, lubrication, or running condition in engines, motors, blowers, and fans, to name a few.

Equipment maintenance cost prediction is a significant element of overall construction cost prediction. In this assignment, you will practice preprocessing data, training linear regression models, selecting the model with the better prediction performance through K-fold cross-validation, and predicting equipment maintenance cost.

1.2 Data Description

A “.xlsx” file is provided for this assignment. Each row in the file represents a specific equipment maintenance occurrence.
Detailed explanations of the variables are listed in Table 1.

Table 1: Variables in the maintenance cost file (maintenance cost.xlsx)

| Variable name | Variable description |
|---|---|
| Unit ID | Equipment unit ID |
| Hour Meter Reading | Reading of the hour meter |
| Labor Cost | Labor cost for this specific maintenance |
| Parts Cost | Parts cost for this specific maintenance |
| Total Cost | Sum of labor cost and parts cost for this specific maintenance |

1.3 Maintenance Cost Prediction Steps

Step 1:

Real-world data is often incomplete and inconsistent, and is likely to contain many errors. Data preprocessing addresses such issues by transforming raw data into a clean and tidy format. In this step, you are asked to preprocess the data and define the independent variable (input) and the dependent variable (output), which will then be used to train the maintenance cost prediction model.

1) Delete rows whose “Total Cost” is zero.
2) For each equipment unit, sort the data in ascending order of “Hour Meter Reading.”
3) For each equipment unit, add a new column called “Cumulative Cost” by accumulating the “Total Cost” over the sorted data.

Questions:
(a) For each equipment unit, visualize the relationship between “Hour Meter Reading” and “Cumulative Cost”.
(b) Interpret these relationships. Discuss the usage and maintenance cost of these equipment units.

Step 2:

Different equipment units are generally in different operating phases, so it is difficult to make a comprehensive prediction from each unit's data separately. In this step, the data from all equipment units is combined to enrich the dataset, under the assumption that all units have exactly the same characteristics.

1) Combine the “Hour Meter Reading” and “Cumulative Cost” of all equipment units into one dataset, named “combined dataset”.
2) Sort the “combined dataset” in ascending order of “Hour Meter Reading.”

Questions:
(a) Visualize the relationship between “Hour Meter Reading” and “Cumulative Cost” in the “combined dataset”.
(b) Compare the relationship in the “combined dataset” to the relationships in Step 1 (b).

Step 3:

In statistics, linear regression is the simplest approach to modeling the relationship between a dependent variable (output) and one or more independent variables (inputs). In this maintenance cost prediction task, linear and quadratic relationships between “Hour Meter Reading” and “Cumulative Cost” are compared. The linear model can be expressed as:

$$Y = \beta_0 + \beta_1 X$$

The quadratic model can be expressed as:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2$$

where X represents “Hour Meter Reading” and Y represents “Cumulative Cost”.

In this step, you need to select the model with the better generalization ability using K-fold cross-validation, with K = 5.

Questions:
(a) Select a model with a better generalization ability based on the K-fold cross-validation approach.
(b) List the selection metrics used for both models.

Step 4:

You have now selected the model with the better generalization ability, which characterizes the relationship between “Hour Meter Reading” and “Cumulative Cost” more accurately. However, during cross-validation each model was trained on only part of the data (4 of the 5 folds), so the resulting parameters do not reflect the full dataset. In this step, the model selected in Step 3 (a) is retrained on the entire “combined dataset,” and its predictive performance is quantitatively evaluated using R² and MSE.

1) Train the selected model on the whole “combined dataset.”
2) Quantitatively evaluate the model performance using R² and MSE.
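The model selection in Step 3 and the refit in Step 4 could be sketched roughly as follows. This is only an illustration in base R on synthetic data: the data frame `combined`, its columns `hours` and `cum_cost`, and the helper `cv_mse` are hypothetical names, and cross-validated MSE is assumed as the selection metric.

```r
# Synthetic stand-in for the "combined dataset"; the real values come from
# "maintenance cost.xlsx" and are not reproduced here.
set.seed(1)
hours <- sort(runif(200, 0, 10000))
combined <- data.frame(
  hours = hours,
  cum_cost = 500 + 2 * hours + 3e-4 * hours^2 + rnorm(200, sd = 1500)
)

# K-fold cross-validated MSE for one model formula (hypothetical helper;
# it assumes the response column is called cum_cost).
cv_mse <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_mse <- sapply(seq_len(k), function(i) {
    fit <- lm(formula, data = data[folds != i, ])
    held_out <- data[folds == i, ]
    mean((held_out$cum_cost - predict(fit, newdata = held_out))^2)
  })
  mean(fold_mse)
}

set.seed(2)
mse_linear <- cv_mse(cum_cost ~ hours, combined)
set.seed(2)  # same seed, so both models are scored on the same folds
mse_quadratic <- cv_mse(cum_cost ~ hours + I(hours^2), combined)

# Keep the formula with the lower cross-validated MSE, then refit on all rows.
best_formula <- if (mse_quadratic < mse_linear) {
  cum_cost ~ hours + I(hours^2)
} else {
  cum_cost ~ hours
}
final_fit <- lm(best_formula, data = combined)

# In-sample R-squared and MSE of the refitted model.
r_squared <- summary(final_fit)$r.squared
mse_full <- mean(residuals(final_fit)^2)
```

Resetting the seed before each call keeps the fold assignment identical for both candidates, so the difference in MSE reflects the model form rather than a different split.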
Questions:
(a) Calculate and list the values of R² and Mean Squared Error (MSE) between the predictions and the real observations of “Cumulative Cost.”
(b) Evaluate the model’s predictive performance using the two metrics, R² and MSE.

Step 5:

Once the equipment maintenance cost prediction model is built, practitioners can better understand and control future equipment maintenance cost. In this step, you will predict the cumulative cost for specific hour meter readings.

Questions:
(a) Predict the “Cumulative Cost” when “Hour Meter Reading” is 4000 and 8000, respectively.

Bonus questions:
(b) Calculate the 95% confidence intervals for the predicted “Cumulative Cost” at “Hour Meter Reading” values of 4000 and 8000.
(c) Discuss the differences between the obtained confidence intervals and state which prediction is more reliable.

1.4 Marking Scheme

| Question | Mark | Report | R Script |
|---|---|---|---|
| Step 1 | 10 | Reasonable visualization of the relationships between “Cumulative Cost” and “Hour Meter Reading” (2); reasonable interpretation and explanation (3) | Correct presentation using R script (5) |
| Step 2 | 10 | Reasonable visualization of the relationship between “Cumulative Cost” and “Hour Meter Reading” in the “combined dataset” (2); reasonable comparison (3) | Correct presentation using R script (5) |
| Step 3 | 15 | Correct model selection (5); correct metrics (5) | Correct presentation using R script (5) |
| Step 4 | 15 | Correct MSE and R² (5); a reasonable interpretation of model predictability (5) | Correct presentation using R script (5) |
| Step 5 | 10 | Correct predictions of “Cumulative Cost” when “Hour Meter Reading” is 4000 and 8000 (5) | Correct presentation using R script (5) |
| Bonus | 10 | Correct 95% confidence intervals (4); a reasonable explanation of prediction differences (4) | Correct presentation using R script (2) |
| Total | 60+10 | | |

2 Problem 2 – Clustering

2.1 Background

In data mining, cluster analysis or clustering is the process of partitioning a set of objects so that objects in a cluster are more like one another than like the objects in other clusters. An advantage of clustering is that it can automatically lead to the discovery of previously unknown groups within data. Clustering by product quality performance therefore groups products with similar quality performance into one cluster, which can be used to improve product quality analysis and product quality management, especially when a vast number of product types is involved. In this assignment, you will practice clustering products based on their quality performance, selecting the best number of clusters, and visualizing the clustering result.

2.2 Data Description

A “.CSV” file is provided for this assignment, and explanations of the variables are listed in Table 2. The quality performance is defined by the repair rate (q), whose distribution is a beta distribution parameterized by alpha (α) and beta (β).

$$f(q) = \mathrm{Beta}(\alpha, \beta)$$

Table 2: Variables in the product quality performance file (product_quality.csv)

| Variable name | Variable description |
|---|---|
| Weld type ID | Weld type ID |
| Weld type | Weld type, which is composed of the pipe size, schedule, and material |
| alpha | The first shape parameter of the beta distribution |
| beta | The second shape parameter of the beta distribution |

2.3 Product Quality Performance Clustering Steps

Step 1:

In data mining, an object is typically represented by multiple features, which are normally used to train the model to mine hidden patterns; for example, a point in 3D space can be represented by its coordinates (x, y, z). The median is the value separating the higher half from the lower half of a population or probability distribution. In this step, the median of the distribution is selected as the only feature to represent the product quality performance.
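Because the median is the 0.5 quantile, the feature described above can be obtained from the shape parameters with base R's `qbeta`. The sketch below uses made-up parameter values; the data frame `quality` and its column names are hypothetical stand-ins for the contents of product_quality.csv.

```r
# Made-up shape parameters standing in for product_quality.csv.
quality <- data.frame(
  weld_type_id = 1:4,
  alpha = c(2, 5, 1, 8),
  beta = c(50, 40, 1, 8)
)

# The median of Beta(alpha, beta) is its 0.5 quantile.
quality$median_q <- qbeta(0.5, shape1 = quality$alpha, shape2 = quality$beta)

print(quality)
```

For symmetric cases such as Beta(1, 1) and Beta(8, 8) the median is 0.5, which gives a quick sanity check on the call.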
Questions:
(a) Calculate and list the median values of quality performance for all product types.

Step 2:

For the K-means clustering algorithm, determining the hyperparameter K is a common problem. The correct choice of K is often ambiguous, depending on the shape and scale of the distribution of points in the dataset and on the clustering resolution the user wants. In this step, you are asked to select the best number of clusters using the elbow method, which is then used for K-means clustering.

The error is defined as the distance between an object and the mean of the cluster it belongs to. In this step, the sum of squared errors (SSE) is used as the objective function value.

Questions:
(a) Visualize the relationship between the objective function value (SSE) and the cluster number K.
(b) Select the best cluster number based on the elbow method.

Step 3:

A good clustering of product quality performance can group products that have similar quality performance into one cluster
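
The elbow method from Step 2 could be sketched as below, using base R's `kmeans` and its `tot.withinss` value as the SSE. The feature values are synthetic, and the names `medians`, `feature`, `k_values`, and `sse` are hypothetical; the range of K from 1 to 8 is likewise an assumption.

```r
set.seed(42)
# Synthetic one-dimensional feature with three loose groups, standing in for
# the median repair rates of the product types.
medians <- c(
  rnorm(20, mean = 0.02, sd = 0.003),
  rnorm(20, mean = 0.06, sd = 0.004),
  rnorm(20, mean = 0.12, sd = 0.005)
)
feature <- matrix(medians, ncol = 1)

# SSE (total within-cluster sum of squares) for each candidate K.
k_values <- 1:8
sse <- sapply(k_values, function(k) {
  kmeans(feature, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the K after which SSE stops dropping sharply.
plot(k_values, sse, type = "b",
     xlab = "Number of clusters K",
     ylab = "SSE (total within-cluster sum of squares)")
```

Using several random starts (`nstart`) reduces the chance that a poor local optimum distorts the curve.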

Hemanth · Accepted Answer

# Removing all objects from the workspace (global environment)
rm(list = ls())
# ----- Maintenance Cost Prediction ------
# Loading required packages
library(dplyr)
library(caret)
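# Reading data set. The file name comes from Table 1; the read_excel() call and
# the zero-cost filter (Step 1) are assumed, as the original line is incomplete.
cost <- readxl::read_excel("maintenance cost.xlsx") %>%
  filter(`Total Cost` != 0) %>%
  arrange(`Unit ID`, `Hour Meter Reading`) %>%
  as.data.frame()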
# adding a new column called "Cumulative Cost" by accumulating the "Total Cost" based on the sorted data.
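
The step named in the last comment could be completed along these lines. This is a sketch on made-up rows, not the answer's own code: it uses base R's `ave` with `cumsum` rather than dplyr, and the toy values in `cost` are invented for illustration.

```r
# Toy rows, already sorted by unit and hour meter reading.
cost <- data.frame(
  `Unit ID` = c("A", "A", "A", "B", "B"),
  `Hour Meter Reading` = c(100, 250, 400, 80, 300),
  `Total Cost` = c(120, 60, 200, 90, 30),
  check.names = FALSE
)

# Running total of "Total Cost" within each equipment unit.
cost$`Cumulative Cost` <- ave(cost$`Total Cost`, cost$`Unit ID`, FUN = cumsum)

print(cost)
```

The running total restarts for each unit, so the result depends on the rows being sorted within each unit first.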
