EXPENSE CLAIM/REPORT 1 MA5810- CAPSTONE PROJECT Total marks: 100 Due date: Wednesday, Week 7 (9th of December), 11:59pm AEST OVERVIEW This assessment involves writing a report that summarises a data...

Capstone project. Not more than 12 pages


MA5810 - CAPSTONE PROJECT
Total marks: 100
Due date: Wednesday, Week 7 (9th of December), 11:59pm AEST

OVERVIEW
This assessment involves writing a report that summarises a data mining related investigation that you have conducted on data that you have collected yourself. The investigation must involve the main topics covered in the subject, most noticeably supervised learning and/or unsupervised learning using R/RStudio. The assessment builds upon the practical knowledge that you should have acquired through the previous two assignments, however neither the dataset nor the detailed steps to be carried out will be provided here, you have to make independent choices and decisions.

Submission
You will need to submit the following:
• A PDF file with R code in Appendix. Please submit everything in one PDF file. The assignment must be presented in 12 font on A4 pages using single line spacing. The assignment must follow the required report structure.
• References should be in APA format.
• R code to reproduce your work
• The task cover sheet.
The assignment should not exceed 12 A4 pages. Appendices do not form part of the page limit. You have up to three attempts to submit your assessment, and only the last submission will be marked.

A WORD ON PLAGIARISM AND SELF-PLAGIARISM
Plagiarism is the act of using another’s words, works or ideas from any source as one’s own. Plagiarism has no place in a University. Student work containing plagiarised material will be subject to formal university processes.
In case significant portions of your own previous work (e.g., a report for a related subject you did in this or any other university) is recycled in a way that it could be fully or partially graded twice (‘double-dipping’), this is considered self-plagiarism and will not be tolerated.

Assessment tasks
In this report, you need to demonstrate that: (a) you have grasped important concepts associated with this subject, most noticeably supervised and unsupervised learning; and (b) you can communicate your investigation in a formal written manner. Regarding (a), we expect that your investigation will include at least three machine learning algorithms from the following topics:
1. LDA, QDA and/or Naive Bayes classification
2. Logistic Regression classifiers and/or KNN for classification/regression
3. Principal Component Analysis (PCA)
4. Cluster Analysis
5. Association Rule Mining and Recommender Systems

Data
You will need to find your own data using good practices. Your dataset cannot be smaller than 1000 observations of five variables, except if the targeted data mining problem to be addressed relates to spatial-temporal data, in which case less than five dimensions could be allowed. Preferably, you should use a dataset relevant to your place of work. Do not use data from textbooks or from R packages. Do not use the same data that have been used in the subject (e.g. UCI repository). Do not use data for which data mining results and analyses can be found online. You can use public data, but the data should be appropriate for addressing a relevant data mining problem, and a solution to a similar problem for the same data should not be available.

Report structure
Please adhere to the strict report structure format. The report will not be assessed if it is not formatted appropriately. The report should have the following sections marked clearly:
• Title: In today’s busy world, it is very important to make the most of your title. Make the title ‘eye-catching’, informative and an accurate representation of the contents of the report.
• Abstract: The abstract provides a short sharp overview of the contents in the report and will be around 200-300 words. The abstract has five parts:
i. Introductory statement: background to the study, important issue(s) the report addresses. (approximately 1-2 sentences)
ii. Purpose of the report: state the objectives (1-2 sentences)
iii. Methodological approach: overview the data and methods (2-3 sentences)
iv. Findings or Achievements: list one or two of the main findings or achievements from your investigation (1-2 sentences)
v. Conclusions and Implications: what conclusions can be drawn from your investigation? How can the findings/achievements in your report deliver a benefit to people, things, systems or processes? (1-2 sentences).
• Introduction: The introduction sets the scene for the investigative efforts. It provides motivation for the work and relevant background information and references that will enable the reader to put in context the key objectives and achievements in your report. Address the important issues that have motivated your investigation. At the end of the introduction clearly state the objectives of the report. Do not put any results from your investigation in the introduction. Do not discuss details about the data and methods in this section. Do not discuss your conclusions or key findings in the introduction.
• Data: This section should provide details about how the data was obtained and what the data represent. You should include information such as (but not limited to)
i. What the source of the data is
ii. How the data was originally collected (e.g., from an experiment or observational study)
iii. The sample size
iv. The number and types of variables
v. Any known interventions or pre-processing that precede the ones described in your report
vi. Any other information that is relevant to the understanding and assessment of your work/report.
• Methods: This section should discuss in depth the data mining methods that were used to process and to analyse the data, as well as the software version used to generate the results and report. To cite R-Studio type RStudio.Version() from the command line. The methods should be appropriate to ensure that the objectives of the paper are met.
• Results and Discussion: This section presents and discusses the results. The discussion centres on the outputs from the data mining procedures that you have performed. For example, what are the main outcomes? Why are they useful and what for? How are they interesting and why?, and so on. In particular, how do the results align with the goals set in the introduction? What are the main achievements and their implications?
• Conclusions: Final remarks about the key achievements of the investigations and what makes them ‘interesting’ or ‘useful’, right now or for future work. Achievements or findings should be contrasted with the original objectives or hypotheses of the project. Make sure that you mention any limitations of your work here. Limit the conclusions to no more than two or three paragraphs.
• References: List the sources your investigation has drawn from. Note that all references should be referred to in the text.
• Appendices: Add R code and any supporting materials that might be useful to help assess your work.

RUBRIC TEMPLATE
Please adhere to the report structure requirements. The report will not be assessed if it is not formatted appropriately.

Dimension: R code and References (10%)
High distinction: Code submitted and attached to Appendix. Code works correctly, meets the specifications, produces the correct results and displays them correctly. Code is well organised and very easy to follow. Code always very well commented so the purpose of each block of code readily understood and what question part it corresponds to. Variable names give the purpose of the variable. All references have been listed, in the right format, and referred to in the appropriate places in the body of the text and listed at the end of the report. At least 4 references have been provided.
Pass: Code only provided in answer document but looks correct. Code often exhibits incorrect behaviour. Significant details of specification are violated. The code is readable only by someone who already knows what it is supposed to be doing. Comments not sufficient to see what the code is doing. Significant lack of comments makes it difficult to understand code. Some references have been listed and referred to in the appropriate places in the body of the text and listed at the end of the report. At least 2 references have been provided.
Fail: Code not submitted. Code not provided in answer document. Code produces incorrect results, does not compile, or significant errors occur. Code is poorly organised and very difficult to read. Code has no comments. No references.

Dimension: Abstract and Introduction (10%)
High distinction: Clearly addresses the five parts of the abstract so that the reader has a clear overview of the reports. Position and exceptions, if any, are clearly stated. Organisation of the argument is completely and clearly outlined and implemented.
Pass: Partially addresses the five parts of the abstract and or addresses all five parts but the writing is not clear in places. Position is clearly stated. Organisation of argument is clear in parts or only partially described and mostly implemented.
Fail: Does not provide an overview the report, or the writing is poor overall and mostly unclear. Position is vague. Organisation of argument is missing, vague or not consistently maintained.

Dimension: Data (10%)
High distinction: Data are suitable, the report explains how the data were obtained. Provides a detailed, accurate description of the data and data methods to be employed within the project. Exploratory data analysis and verification are detailed and provides critical insight with clear overt links to model developments. Data insights are concisely presented and visualised.
Pass: Data are suitable, the report explains how the data were obtained.
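The size requirement in the Data section (at least 1000 observations of five variables) and the instruction to report the software version can both be checked from R. The following is a minimal sketch and not part of the original brief: `check_dataset` and the simulated `demo_data` frame are hypothetical names used only for illustration. `RStudio.Version()` is available only inside an RStudio session, so the sketch also prints `R.version.string`, which works in any R session.

```r
# Hypothetical helper: checks a data frame against the brief's minimum size.
check_dataset <- function(df, min_rows = 1000, min_cols = 5) {
  stopifnot(is.data.frame(df))
  list(
    n_rows  = nrow(df),
    n_cols  = ncol(df),
    rows_ok = nrow(df) >= min_rows,
    cols_ok = ncol(df) >= min_cols
  )
}

# Simulated stand-in data (an assumption; replace with your own dataset).
set.seed(42)
demo_data <- data.frame(
  age      = sample(18:70, 1200, replace = TRUE),
  income   = round(rnorm(1200, mean = 60, sd = 15)),
  spending = sample(1:100, 1200, replace = TRUE),
  visits   = rpois(1200, lambda = 4),
  gender   = factor(sample(c("Female", "Male"), 1200, replace = TRUE))
)

size_check <- check_dataset(demo_data)
str(size_check)

# Software version for the Methods section.
# RStudio.Version() exists only when the code runs inside RStudio.
if (exists("RStudio.Version")) {
  print(RStudio.Version()$version)
}
print(R.version.string)
```

Running the same check on the real dataset before starting the analysis gives an early warning if it falls short of the minimum size.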
Answered 4 days after Aug 14, 2021

Neha answered on Aug 18 2021
89621 - R and report/code.r
customer_data=read.csv("/home/dataflair/Mall_Customers.csv")
str(customer_data)
names(customer_data)
head(customer_data)
summary(customer_data$Age)
sd(customer_data$Age)
summary(customer_data$Annual.Income..k..)
sd(customer_data$Annual.Income..k..)
summary(customer_data$Spending.Score..1.100.)
sd(customer_data$Spending.Score..1.100.)
a=table(customer_data$Gender)
barplot(a,main="Using BarPlot to display Gender Comparison",
ylab="Count",
xlab="Gender",
col=rainbow(2),
legend=rownames(a))
pct=round(a/sum(a)*100)
lbs=paste(c("Female","Male")," ",pct,"%",sep=" ")
library(plotrix)
pie3D(a,labels=lbs,
main="Pie Chart Depicting Ratio of Female and Male")
summary(customer_data$Age)
hist(customer_data$Age,
col="blue",
main="Histogram to Show Count of Age Class",
xlab="Age Class",
ylab="Frequency",
labels=TRUE)
boxplot(customer_data$Age,
col="#ff0066",
main="Boxplot for Descriptive Analysis of Age")
summary(customer_data$Annual.Income..k..)
hist(customer_data$Annual.Income..k..,
col="#660033",
main="Histogram for Annual Income",
xlab="Annual Income Class",
ylab="Frequency",
labels=TRUE)
plot(density(customer_data$Annual.Income..k..),
col="yellow",
main="Density Plot for Annual Income",
xlab="Annual Income Class",
ylab="Density")
polygon(density(customer_data$Annual.Income..k..),
col="#ccff66")
summary(customer_data$Spending.Score..1.100.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.75 50.00 50.20 73.00 99.00
boxplot(customer_data$Spending.Score..1.100.,
horizontal=TRUE,
col="#990000",
main="BoxPlot for Descriptive Analysis of Spending Score")
hist(customer_data$Spending.Score..1.100.,
main="Histogram for Spending Score",
xlab="Spending Score Class",
ylab="Frequency",
col="#6600cc",
labels=TRUE)
library(purrr)
set.seed(123)
# function to calculate total intra-cluster sum of square
iss <- function(k) {
kmeans(customer_data[,3:5],k,iter.max=100,nstart=100,algorithm="Lloyd" )$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
plot(k.values, iss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total intra-clusters sum of squares")
library(cluster)
library(gridExtra)
library(grid)
k2<-kmeans(customer_data[,3:5],2,iter.max=100,nstart=50,algorithm="Lloyd")
s2<-plot(silhouette(k2$cluster,dist(customer_data[,3:5],"euclidean")))
k3<-kmeans(customer_data[,3:5],3,iter.max=100,nstart=50,algorithm="Lloyd")
s3<-plot(silhouette(k3$cluster,dist(customer_data[,3:5],"euclidean")))
k4<-kmeans(customer_data[,3:5],4,iter.max=100,nstart=50,algorithm="Lloyd")
s4<-plot(silhouette(k4$cluster,dist(customer_data[,3:5],"euclidean")))
k5<-kmeans(customer_data[,3:5],5,iter.max=100,nstart=50,algorithm="Lloyd")
s5<-plot(silhouette(k5$cluster,dist(customer_data[,3:5],"euclidean")))
k6<-kmeans(customer_data[,3:5],6,iter.max=100,nstart=50,algorithm="Lloyd")
s6<-plot(silhouette(k6$cluster,dist(customer_data[,3:5],"euclidean")))
k7<-kmeans(customer_data[,3:5],7,iter.max=100,nstart=50,algorithm="Lloyd")
s7<-plot(silhouette(k7$cluster,dist(customer_data[,3:5],"euclidean")))
k8<-kmeans(customer_data[,3:5],8,iter.max=100,nstart=50,algorithm="Lloyd")
s8<-plot(silhouette(k8$cluster,dist(customer_data[,3:5],"euclidean")))
k9<-kmeans(customer_data[,3:5],9,iter.max=100,nstart=50,algorithm="Lloyd")
s9<-plot(silhouette(k9$cluster,dist(customer_data[,3:5],"euclidean")))
k10<-kmeans(customer_data[,3:5],10,iter.max=100,nstart=50,algorithm="Lloyd")
s10<-plot(silhouette(k10$cluster,dist(customer_data[,3:5],"euclidean")))
library(NbClust)
library(factoextra)
fviz_nbclust(customer_data[,3:5], kmeans, method = "silhouette")
set.seed(125)
stat_gap <- clusGap(customer_data[,3:5], FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
fviz_gap_stat(stat_gap)
k6<-kmeans(customer_data[,3:5],6,iter.max=100,nstart=50,algorithm="Lloyd")
k6
pcclust=prcomp(customer_data[,3:5],scale=FALSE) #principal component analysis
summary(pcclust)
pcclust$rotation[,1:2]
library(ggplot2) # loaded explicitly; factoextra also attaches it
set.seed(1)
ggplot(customer_data, aes(x =Annual.Income..k.., y = Spending.Score..1.100.)) +
geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4", "5","6"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")
ggplot(customer_data, aes(x =Spending.Score..1.100., y =Age)) +
geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4", "5","6"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")
kCols=function(vec){cols=rainbow (length (unique (vec)))
return (cols[as.numeric(as.factor(vec))])}
digCluster<-k6$cluster; dignm<-as.character(digCluster); # K-means clusters
plot(pcclust$x[,1:2], col =kCols(digCluster),pch =19,xlab ="PC1",ylab="PC2") # first two principal components, coloured by K-means cluster
legend("bottomleft",unique(dignm),fill=unique(kCols(digCluster)))
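The script above fits K-means separately for each k from 2 to 10 and plots each silhouette in turn. The same comparison can be sketched more compactly as a loop that records the average silhouette width per k. This is an illustrative sketch only: the `features` matrix is simulated as a stand-in for `customer_data[,3:5]`, `avg_silhouette` is a hypothetical helper, and `kmeans` is called with its default algorithm rather than `"Lloyd"`.

```r
library(cluster)  # silhouette(); 'cluster' ships with standard R installations

# Simulated stand-in for customer_data[, 3:5] (an assumption, not the real data):
# three well-separated groups in three numeric dimensions.
set.seed(123)
centres <- matrix(c(0, 0, 0,
                    10, 10, 0,
                    0, 10, 10), ncol = 3, byrow = TRUE)
features <- do.call(rbind, lapply(1:3, function(i) {
  matrix(rnorm(50 * 3, sd = 1), ncol = 3) +
    matrix(centres[i, ], nrow = 50, ncol = 3, byrow = TRUE)
}))

# Hypothetical helper: mean silhouette width of a k-means solution with k clusters.
avg_silhouette <- function(k, data, dists) {
  fit <- kmeans(data, centers = k, iter.max = 100, nstart = 50)
  sil <- silhouette(fit$cluster, dists)
  mean(sil[, "sil_width"])
}

dists <- dist(features, method = "euclidean")  # computed once, reused for every k
k_values <- 2:10
sil_widths <- vapply(k_values, avg_silhouette, numeric(1),
                     data = features, dists = dists)

# The k with the largest average silhouette width is the suggested choice.
best_k <- k_values[which.max(sil_widths)]
plot(k_values, sil_widths, type = "b", pch = 19,
     xlab = "Number of clusters K", ylab = "Average silhouette width")
```

Computing the distance matrix once avoids repeating the `dist()` call for every k, which is the main cost when the dataset grows towards the 1000-observation minimum in the brief.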
89621 - R and report/report.docx
Student Name
Student Number
Subject
Title
Contents
Abstract
Introduction
Data
Methods
K-means Algorithm
Determining Optimal Clusters
Results and discussion
Conclusions
References
Appendices
Abstract
This report describes a data science project that implements customer segmentation, one of the most important applications of unsupervised learning, using the R language. Customer segmentation groups customers so that a company can find its best customers. Companies that want to identify several segments of customers can use clustering techniques to target their potential user base. In this machine learning project we use the K-means algorithm. This algorithm is the most suitable algorithm for clustering the...
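As an illustration of the segmentation idea described in the abstract, the sketch below runs K-means on a small simulated customer table and profiles the resulting segments. It is a minimal sketch under stated assumptions: the `customers` data frame and its column names are invented for the example, k = 2 is used only because the simulated data contains two groups, and the standardisation step with `scale()` is an addition that does not appear in the submitted code.

```r
# Simulated customer table (an assumption; column names are illustrative only).
set.seed(2021)
n <- 300
customers <- data.frame(
  age      = round(runif(n, 18, 70)),
  income_k = round(c(rnorm(n / 2, 40, 8), rnorm(n / 2, 90, 8))),
  spending = round(c(rnorm(n / 2, 70, 8), rnorm(n / 2, 25, 8)))
)

# Standardise so that no single variable dominates the Euclidean distance.
features_scaled <- scale(customers)

# k = 2 is chosen here only because the simulated data has two groups.
segments <- kmeans(features_scaled, centers = 2, nstart = 25)

# Profile each segment in the original units.
segment_profile <- aggregate(customers,
                             by = list(segment = segments$cluster),
                             FUN = mean)
print(segment_profile)
print(table(segments$cluster))
```

Profiling the segments in the original units, rather than on the standardised scale, keeps the output interpretable for a business reader.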