Essential Mathematics for Data Scientists
Assignment 4: Principal Component Analysis
This assignment asks you to implement a principal component analysis (PCA) using a singular value decomposition (SVD) on a real dataset.
A real dataset
We now ask that you investigate the provided dataset, which comes from the State of the Tropics report (State of the Tropics (2017) Sustainable Infrastructure for the Tropics. James Cook University, Townsville, Australia, available at stateofthetropics.org). The data we have used from this report is available at the Tropical Data Hub (https://research.jcu.edu.au/researchdata/default/search?query=State+of+the+Tropics&sort-field=score&sort-order=desc).
We have consolidated much of the data for the year 2010 into a file, SotTCombined2010.xlsx. Analyse this data using PCA and produce the items listed in the task below. You will need to use MATLAB for the computation. Write up your analysis in a report, using Word, LaTeX, or any other program you prefer. You must submit your report as a PDF file.
The task
You are to implement the PCA using SVD, obtaining:
The principal component vectors
The proportion of variation explained by the principal components
The matrix of scores
A dimensionally reduced representation of the dataset
The residuals of the reduced representation
Any outliers
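As a rough illustration of how the first three items relate to the SVD, here is a minimal MATLAB sketch. It uses a small made-up matrix X rather than the assignment spreadsheet, and the variable names (Z, prop, scores) are illustrative choices, not names required by the assignment. It assumes MATLAB R2016b or later, so that the column-wise subtraction and division use implicit expansion.

```matlab
% Hypothetical 6-by-3 data matrix: rows are observations, columns are variables.
% These numbers are invented for illustration only.
X = [2.0  4.1  9.0;
     3.0  6.2  7.5;
     4.0  7.9  6.1;
     5.0 10.1  4.4;
     6.0 12.2  3.2;
     7.0 13.8  1.9];

% Centre each column on its mean and scale it to unit standard deviation.
Z = (X - mean(X, 1)) ./ std(X, 0, 1);

% Economy-size SVD, so that Z = U*S*V'.
[U, S, V] = svd(Z, 'econ');

s      = diag(S);              % singular values, in decreasing order
prop   = s.^2 / sum(s.^2);     % proportion of variation explained by each component
scores = U * S;                % matrix of scores (the same matrix as Z*V)

disp(V);        % each column of V is a principal component vector
disp(prop');    % proportions of variation; they sum to 1
```

The signs of the columns of V are not unique (a component vector and its negative describe the same direction), so it is the pattern of same-signed and opposite-signed entries within a column that indicates positive and negative relationships between variables.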
Your report must determine the relationships between the variables reported in the spreadsheet and the strength of those relationships. You should also identify the outlier countries, that is, those countries for which the relationships identified by the PCA do not hold.
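One possible way to look for such outliers is through the residuals of a reduced representation: observations that the retained components reconstruct poorly have large residuals. The sketch below uses invented data, an assumed choice of k = 1 retained component, and an assumed rule of thumb for the threshold (mean plus two standard deviations of the residual norms). None of these choices is prescribed by the assignment.

```matlab
% Hypothetical data: 8 observations of 3 variables (invented for illustration).
% The last row is constructed to break the pattern followed by the others.
X = [1  2.0  3.0;
     2  4.1  6.2;
     3  5.9  9.1;
     4  8.2 11.8;
     5  9.9 15.1;
     6 12.1 18.0;
     7 14.0 21.2;
     4 14.0  2.0];

% Centre and scale the columns, then take the economy-size SVD.
Z = (X - mean(X, 1)) ./ std(X, 0, 1);
[U, S, V] = svd(Z, 'econ');

k  = 1;                                        % assumed number of retained components
Zk = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';     % rank-k reduced representation of Z
R  = Z - Zk;                                   % residuals of the reduced representation

% Size of the residual for each observation (row-wise Euclidean norm).
resNorm = sqrt(sum(R.^2, 2));

% Rank observations from worst to best reconstructed.
[sortedNorm, order] = sort(resNorm, 'descend');
disp([order, sortedNorm]);

% Assumed rule of thumb for flagging outliers; other cutoffs are equally defensible.
threshold   = mean(resNorm) + 2 * std(resNorm);
outlierRows = find(resNorm > threshold);
disp(outlierRows);
```

Whatever rule is used, the flagged rows are only candidates; the interpretation of why a given country departs from the pattern belongs in the report.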
In analysing the data there are several things to keep in mind:
• The data contains many missing values. You should exclude any country that has missing values.
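A minimal sketch of excluding rows with missing values is given below. It uses a toy table in place of the spreadsheet, because the column layout of SotTCombined2010.xlsx is not described here; the column names Country, Var1 and Var2 are hypothetical. In practice the table would probably be loaded with readtable, and missing cells would then typically appear as NaN in numeric columns.

```matlab
% Toy table standing in for the spreadsheet contents (names and values are hypothetical).
Country = {'A'; 'B'; 'C'; 'D'};
Var1    = [1.2; NaN; 3.4; 4.5];
Var2    = [10;  20;  NaN; 40];
T = table(Country, Var1, Var2);

% Keep only the rows that have no missing entry in any column.
Tclean = rmmissing(T);

% Equivalent approach working on the numeric part alone.
X      = T{:, {'Var1', 'Var2'}};
keep   = ~any(isnan(X), 2);      % true for rows without any NaN
Xclean = X(keep, :);

disp(Tclean);
```

Keeping the logical vector keep is convenient, because the same vector can be used to recover the names of the countries that remain in the analysis.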
Assignment: Principal Component Analysis – Marking Scheme
The assignment is to be written in report format, as a series of paragraphs (not dot points), with the following elements embedded in it. For each element, 50% of the marks are for your description of it in the report; the remaining 50% are for the MATLAB code that correctly produces the quantities reported. Please submit both your report and your MATLAB code.
16 marks total:
Excluding countries with missing values (1 mark)
Creating, centring and scaling the matrix of raw data for PCA (3 marks: 1 for creating, 1 for centring, 1 for scaling)
Performing the SVD (1 mark)
Identifying relationships between the variables described by the first principal component vector (2 marks: 1 for identifying positive relationships, 1 for negative)
Identifying relationships between the variables described by the second principal component vector (1 mark)
Obtaining singular values and proportions of variation (2 marks: 1 for the singular values, 1 for the proportions)
Identifying proportions of variation (2 marks: 1 mark for identifying the proportions, 1 mark for identifying a cumulative total over the highest proportions, at a cutoff determined by the author of the report)
Calculating the matrix of scores (1 mark)
Obtaining a dimensionally reduced representation of the data set (1 mark)
Computing residuals (2 marks: 1 mark for residuals, 1 mark for interpretation)
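The cumulative total mentioned above can be illustrated with a short MATLAB sketch. The singular values below are invented, and the 90% cutoff is an assumption made purely for the example; the scheme leaves the cutoff to the author of the report.

```matlab
% Hypothetical singular values, in the decreasing order returned by svd.
s = [5.0; 3.0; 1.5; 0.5];

prop    = s.^2 / sum(s.^2);   % proportion of variation for each component
cumProp = cumsum(prop);       % cumulative total over the highest proportions

cutoff = 0.90;                               % assumed cutoff, chosen for illustration
k = find(cumProp >= cutoff, 1, 'first');     % smallest number of components reaching the cutoff

disp([prop, cumProp]);
disp(k);    % k is 2 for these values
```

The value of k found this way is the number of components to retain when forming the dimensionally reduced representation and its residuals.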