CEIE 474/574 | Assignment 6 – Data Mining

R assignment, 2 parts


CEIE 474/574 | Assignment 6
CEIE 474/574 Construction Computer Application and Informatics
Assignment 6 – Data Mining

Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. Data mining tasks generally fall into two groups, predictive analysis and descriptive analysis, which are also known as supervised learning and unsupervised learning. This assignment provides one predictive problem and one descriptive problem to give you insight into data mining applications in the construction industry. For the predictive problem, regression is used to predict construction equipment maintenance cost. For the descriptive problem, clustering is used to enhance product quality analysis and product quality management.

1 Problem 1 – Regression

1.1 Background
Accurate cost prediction benefits professionals in the construction industry because it helps assure reasonable profits and helps ensure that projects are delivered within budget. Hour meters log the running time of equipment so that expensive machines or systems receive proper maintenance. This maintenance typically involves replacing, changing, or checking parts, belts, filters, oil, lubrication, or running condition in engines, motors, blowers, and fans, to name a few. Equipment maintenance cost prediction is a significant element of overall construction cost prediction. In this assignment, you will practice preprocessing data, training linear regression models, selecting the model with good prediction performance through K-fold cross-validation, and predicting equipment maintenance cost.

1.2 Data Description
A “.xlsx” file is provided for this assignment. Each row in the file represents a specific equipment maintenance occurrence.
Detailed explanations of variables are listed in Table 1.

Table 1: Variables in the maintenance cost file (maintenance cost.xlsx)

Variable name | Variable description
Unit ID | Equipment unit ID
Hour Meter Reading | Reading of hour meter
Labor Cost | Labor cost for this specific maintenance
Parts Cost | Parts cost for this specific maintenance
Total Cost | Sum of labor cost and parts cost for this specific maintenance

1.3 Maintenance Cost Prediction Steps

Step 1:
Real-world data is often incomplete and inconsistent, and is likely to contain many errors. Data preprocessing is a data mining technique that addresses such issues by transforming raw data into a clean and tidy format. In this step, you are asked to preprocess the data to define the independent variable (input) and the dependent variable (output), which will then be used to train the maintenance cost prediction model.
1) Delete rows whose “Total Cost” is zero.
2) For each equipment unit, sort the data in ascending order of “Hour Meter Reading.”
3) For each equipment unit, add a new column called “Cumulative Cost” by accumulating the “Total Cost” based on the sorted data.
Questions:
(a) For each equipment unit, visualize the relationship between “Hour Meter Reading” and “Cumulative Cost”.
(b) Interpret these relationships. Discuss the usage and maintenance cost of these equipment units.

Step 2:
Generally, different equipment units are in different operating phases. Therefore, it is difficult to make a comprehensive prediction when these equipment units are treated separately. In this step, all equipment data is combined to enrich the dataset, based on the assumption that all equipment units have exactly the same characteristics.
1) Combine all equipment “Hour Meter Reading” and “Cumulative Cost” into one dataset, named “combined dataset”.
2) Sort the “combined dataset” in ascending order of the column “Hour Meter Reading.”
Questions:
(a) Visualize the relationship between “Hour Meter Reading” and “Cumulative Cost” in the “combined dataset”.
(b) Compare the relationship in the “combined dataset” to the relationships in Step 1 (b).

Step 3:
In statistics, linear regression is the simplest approach to modeling the relationship between a dependent variable (output) and one or more independent variables (inputs). In this maintenance cost prediction task, linear and quadratic relationships between “Hour Meter Reading” and “Cumulative Cost” are compared. The linear relationship model can be expressed as:
$$Y = \beta_0 + \beta_1 X$$
The quadratic relationship model can be expressed as:
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2$$
where X represents “Hour Meter Reading” and Y represents “Cumulative Cost”. In this step, you need to select the model that has the better generalization ability through the K-fold cross-validation approach. Here, K is set to 5.
Questions:
(a) Select the model with the better generalization ability based on the K-fold cross-validation approach.
(b) List the selection metrics used for both models.

Step 4:
You have now selected the model with the better generalization ability, which characterizes the relationship between “Hour Meter Reading” and “Cumulative Cost” more accurately. However, in the previous cross-validation process, only part of the data (4 folds) is used for training at a time, so the model parameters obtained are not estimated from all available data. In this step, the model selected in Step 3 (a) will be trained using the entire “combined dataset.” The model’s predictability will be quantitatively evaluated using R² and MSE.
1) Train the selected model with the whole “combined dataset.”
2) Quantitatively evaluate the model performance using R² and MSE.
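The 5-fold comparison in Step 3 and the full-data refit in Step 4 can be sketched roughly as follows. This is a minimal illustration in base R, not the assignment's reference solution: the data are synthetic, and the names `combined`, `hours`, `cum_cost`, and `cv_mse` are hypothetical stand-ins for the “combined dataset” and its columns. Cross-validated MSE is used here as the selection metric, which is one reasonable choice among several.

```r
# Minimal sketch: 5-fold cross-validation comparing a linear and a quadratic fit.
# The data below are synthetic stand-ins (an assumption), not the assignment data.
set.seed(42)
n <- 200
combined <- data.frame(hours = sort(runif(n, 0, 10000)))
combined$cum_cost <- 500 + 2 * combined$hours + 3e-4 * combined$hours^2 +
  rnorm(n, sd = 1500)

# Assign each row to one of K folds at random
K <- 5
fold_id <- sample(rep(seq_len(K), length.out = n))

# Fit on K-1 folds, compute MSE on the held-out fold, and average over folds
cv_mse <- function(model_formula, data, fold_id) {
  fold_mse <- sapply(sort(unique(fold_id)), function(k) {
    train <- data[fold_id != k, ]
    test  <- data[fold_id == k, ]
    fit   <- lm(model_formula, data = train)
    mean((test$cum_cost - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
}

mse_linear    <- cv_mse(cum_cost ~ hours, combined, fold_id)
mse_quadratic <- cv_mse(cum_cost ~ hours + I(hours^2), combined, fold_id)

# Select the model with the lower cross-validated MSE
best_formula <- if (mse_quadratic < mse_linear) {
  cum_cost ~ hours + I(hours^2)
} else {
  cum_cost ~ hours
}

# Step 4: refit the selected model on all rows, then compute R^2 and MSE
final_fit <- lm(best_formula, data = combined)
pred      <- predict(final_fit)
mse_full  <- mean((combined$cum_cost - pred)^2)
r2_full   <- 1 - sum((combined$cum_cost - pred)^2) /
  sum((combined$cum_cost - mean(combined$cum_cost))^2)

print(c(cv_mse_linear = mse_linear, cv_mse_quadratic = mse_quadratic))
print(c(R2 = r2_full, MSE = mse_full))
```

Both models are scored on the same fold assignment so that the comparison is not affected by how the rows happen to be split.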
Questions:
(a) Calculate and list the values of R² and Mean Squared Error (MSE) between predictions and real observations of “Cumulative Cost.”
(b) Evaluate the model’s predictability using the two metrics, R² and MSE.

Step 5:
Once the equipment maintenance cost prediction model is built, practitioners can better understand and control future equipment maintenance cost. In this step, you will predict the cumulative cost for specific hour meter readings.
Questions:
(a) Predict the “Cumulative Cost” when “Hour Meter Reading” is 4000 and 8000, respectively.
Bonus questions:
(b) Calculate the 95% confidence intervals for the predicted “Cumulative Cost” given “Hour Meter Reading” values of 4000 and 8000.
(c) Discuss the differences between the obtained confidence intervals and state which prediction is more reliable.

1.4 Marking Scheme

Question | Mark | Report | R Script
Step 1 | 10 | Reasonable visualization of the relationships between “Cumulative Cost” and “Hour Meter Reading” (2); Reasonable interpretation and explanation (3) | Correct presentation using R script (5)
Step 2 | 10 | Reasonable visualization of the relationship between “Cumulative Cost” and “Hour Meter Reading” in the “combined dataset” (2); Reasonable comparison (3) | Correct presentation using R script (5)
Step 3 | 15 | Correct model selection (5); Correct metrics (5) | Correct presentation using R script (5)
Step 4 | 15 | Correct MSE and R² (5); A reasonable interpretation of model predictability (5) | Correct presentation using R script (5)
Step 5 | 10 | Correct predictions of “Cumulative Cost” when “Hour Meter Reading” is 4000 and 8000 (5) | Correct presentation using R script (5)
Bonus | 10 | Correct 95% confidence intervals (4); A reasonable explanation of prediction differences (4) | Correct presentation using R script (2)
Total | 60+10 | |

2 Problem 2 – Clustering

2.1 Background
In data mining, cluster analysis or clustering is the process of partitioning a set of objects in such a way that
objects in a cluster are more like one another than the objects in other clusters. An advantage of clustering is that it can automatically lead to the discovery of previously unknown groups within data. Product quality performance clustering therefore groups products that have similar quality performance into one cluster, which can be used to improve product quality analysis and product quality management, especially when a vast number of product types is involved. In this assignment, you will practice clustering products based on their quality performance, selecting the best cluster number, and visualizing the clustering result.

2.2 Data Description
A “.CSV” file is provided for this assignment, and explanations of variables are listed in Table 2. The quality performance is defined by the repair rate (q), whose distribution is a beta distribution parameterized by alpha (α) and beta (β):
$$f(q) = \mathrm{Beta}(\alpha, \beta)$$

Table 2: Variables in the product quality performance file (product_quality.csv)

Variable name | Variable description
Weld type ID | Weld type ID
Weld type | Weld type, which is composed of the pipe size, schedule, and material
alpha | The first shape parameter of the beta distribution
beta | The second shape parameter of the beta distribution

2.3 Product Quality Performance Clustering Steps

Step 1:
In data mining, an object is typically represented by multiple features, which are normally used to train the model to mine hidden patterns. For example, a point can be represented by a coordinate (x, y, z) in a 3D space. The median is the value separating the higher half from the lower half of a population or probability distribution. In this step, the median value of the distribution is selected as the only feature to represent the product quality performance.
Questions:
(a) Calculate and list the median values of quality performance for all product types.
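Because the median of a distribution is its 0.5 quantile, the median of each beta distribution can be obtained with base R's `qbeta`. The sketch below assumes columns named `alpha` and `beta` as in Table 2. The three rows and the `weld_type_id` and `median_q` names are made up for illustration and are not taken from product_quality.csv.

```r
# Sketch: the median of Beta(alpha, beta) is its 0.5 quantile.
# The rows below are made-up examples (an assumption), not the assignment data.
quality <- data.frame(
  weld_type_id = 1:3,
  alpha = c(2, 5, 1),
  beta  = c(8, 5, 3)
)

# qbeta is vectorized over the shape parameters, so one call handles all rows
quality$median_q <- qbeta(0.5, shape1 = quality$alpha, shape2 = quality$beta)
print(quality)
```

For a symmetric case such as alpha = beta = 5 the median is exactly 0.5, which is a quick sanity check on the result.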

Step 2:
For the K-means clustering algorithm, determining the hyperparameter K is a common problem. The correct choice of K is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a dataset and on the clustering resolution desired by the user. In this step, you are asked to select the best cluster number using the elbow method and then perform K-means clustering with it. The error is defined as the distance between the cluster mean and an object that belongs to that cluster. In this step, the sum of squared errors (SSE) is used as the objective function value.
Questions:
(a) Visualize the relationship between the objective function value (SSE) and the cluster number K.
(b) Select the best cluster number based on the elbow method.

Step 3:
A good clustering of product quality performance can group products that have similar quality performance into one cluster
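
The elbow method described in Step 2 can be sketched with base R's `kmeans`, whose `tot.withinss` component is the total within-cluster sum of squares (the SSE). This is an illustrative sketch under stated assumptions: `median_q` is simulated here instead of being computed from the assignment file, and the range of K from 1 to 8 is an arbitrary choice.

```r
# Sketch: SSE versus K for K-means on a single feature.
# median_q is simulated (an assumption) in place of the medians from Step 1.
set.seed(1)
median_q <- c(rnorm(30, 0.05, 0.01), rnorm(30, 0.20, 0.02), rnorm(30, 0.45, 0.03))

k_values <- 1:8
sse <- sapply(k_values, function(k) {
  # nstart > 1 reruns K-means from several random starts and keeps the best run
  kmeans(matrix(median_q, ncol = 1), centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the K after which adding clusters reduces SSE only slightly
plot(k_values, sse, type = "b",
     xlab = "Number of clusters K",
     ylab = "SSE (total within-cluster sum of squares)",
     main = "Elbow plot")
```

The elbow is read off the plot by eye, so the chosen K is a judgment call that should be justified in the report.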
Answered Same Day, Nov 15, 2021

Answer To: CEIE 474/574 | Assignment 6 – Data Mining

Hemanth answered on Nov 18 2021
# Removing all objects from the current workspace (global environment)
rm(list = ls())
# ----- Maintenance Cost Prediction ------
# Loading required packages
library(dplyr)
library(caret)
# Reading data set
cost <- readxl::read_xlsx('maintenance-cost.xlsx', sheet = 1)
# Showing first six records
head(cost)
# print structure of the data
str(cost)
## Step 1:
# Deleting rows whose "Total Cost" is zero.
cost <- cost[cost$`Total Cost`!=0,]
# For each equipment unit, sorting the data ascendingly based on "Hour Meter Reading."
cost <- cost %>%
  arrange(`Unit ID`, `Hour Meter Reading`) %>%
  as.data.frame()
# For each equipment unit, adding a new column called "Cumulative Cost" by accumulating the "Total Cost" based on the sorted data.
cost <- cost %>%
  group_by(`Unit ID`) %>%
  mutate(`Cumulative Cost` = cumsum(`Total Cost`)) %>%
  ungroup() %>%
  as.data.frame()
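# Counting the number of maintenance records per equipment unit
table(cost$`Unit ID`)
# Splitting the data into one data frame per equipment unit
unit164 <- cost[cost$`Unit ID` == 164, ]
unit165 <- cost[cost$`Unit ID` == 165, ]
unit925 <- cost[cost$`Unit ID` == 925, ]
unit967 <- cost[cost$`Unit ID` == 967, ]
unit1054 <- cost[cost$`Unit ID` == 1054, ]
unit1160 <- cost[cost$`Unit ID` == 1160, ]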
# (a) For each equipment unit, visualize the relationships between "Hour Meter Reading" and "Cumulative Cost".
# for equipment unit id 164
plot(unit164$`Hour Meter Reading`, unit164$`Cumulative Cost`,
     xlab = 'Hour Meter Reading',
     ylab = "Cumulative Cost",
     main = 'Unit Id 164')
# for equipment unit id 165
plot(unit165$`Hour Meter Reading`, unit165$`Cumulative Cost`,
     xlab = 'Hour Meter Reading',
     ylab = "Cumulative Cost",
     main = 'Unit Id 165')
# for equipment unit id 925
plot(unit925$`Hour Meter Reading`, unit925$`Cumulative Cost`,
     xlab = 'Hour Meter Reading',
     ylab = "Cumulative Cost",
     main = 'Unit Id 925')
# for equipment unit id 967
plot(unit967$`Hour Meter Reading`, unit967$`Cumulative Cost`,
     xlab = 'Hour Meter Reading',
     ylab = "Cumulative Cost",
     main = 'Unit Id 967')
# for equipment unit id 1054
plot(unit1054$`Hour Meter Reading`, unit1054$`Cumulative Cost`,
     xlab = 'Hour Meter Reading',
     ylab = "Cumulative Cost",
     main = 'Unit Id 1054')
# for equipment unit id 1160
plot(unit1160$`Hour Meter Reading`, unit1160$`Cumulative Cost`,
     xlab = 'Hour Meter Reading',
     ylab = "Cumulative Cost",
     main = 'Unit Id 1160')
# (b) Interpret these relationships. Discuss the usage and maintenance cost of these equipment units.
# Interpretation: The relationship plots show a positive relationship for each equipment unit.
# That is, as Hour Meter Reading increases, Cumulative Cost also...
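
The posted answer is cut off at this point. As a rough illustration of how Step 5 and the bonus questions could be approached (not the original author's solution), `predict()` on an `lm` fit accepts `interval = "confidence"` and `level = 0.95`. The data below are synthetic, and the names `hours`, `cum_cost`, `fit`, and `new_readings` are hypothetical. The quadratic form is assumed only for the sake of the example.

```r
# Sketch: point predictions and 95% confidence intervals from a fitted lm.
# Synthetic data (an assumption) stand in for the "combined dataset".
set.seed(7)
hours    <- runif(150, 0, 6000)
cum_cost <- 1000 + 3 * hours + 2e-4 * hours^2 + rnorm(150, sd = 1200)
fit      <- lm(cum_cost ~ hours + I(hours^2))

new_readings <- data.frame(hours = c(4000, 8000))

# interval = "confidence" gives the interval for the mean response;
# interval = "prediction" would give the wider interval for one new observation
ci <- predict(fit, newdata = new_readings, interval = "confidence", level = 0.95)

result <- cbind(new_readings, ci, width = ci[, "upr"] - ci[, "lwr"])
print(result)
```

In this synthetic example 8000 lies outside the simulated range of readings, so its interval is wider than the one at 4000. Whether the same holds for the assignment data depends on the range of “Hour Meter Reading” actually observed.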