hw12, August 6, 2021

Asker's note: only Question 1e and Question 3i are needed. Answered by Karthi, Aug 07 2021.


[1]: # Initialize Otter
     import otter
     grader = otter.Notebook("hw12.ipynb")

1 Homework 12: Principal Component Analysis

In lecture we discussed how PCA can be used for dimensionality reduction. Specifically, given a high-dimensional dataset, PCA allows us to:

1. Understand the rank of the data. If k principal components capture almost all of the variance, then the data is roughly rank k.
2. Create 2D scatterplots of the data. Such plots are a rank-2 representation of our data, and allow us to visually identify clusters of similar observations.

A solid geometric understanding of PCA will help you understand why PCA is able to do these two things. In this homework, we'll build that geometric intuition and look at PCA on two datasets: one where PCA works poorly, and the other where it works pretty well.

1.1 Due Date

This assignment is due Monday, August 9th at 11:59 PM PDT.

Collaboration Policy: Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names in the cell below.

Collaborators: ...

1.2 Score Breakdown

Question       Points
Question 1a    1
Question 1b    1
Question 1c    1
Question 1d    1
Question 1e    1
Question 2a    2
Question 2b    1
Question 2c    1
Question 2d    3
Question 2e    2
Question 3a    1
Question 3b    1
Question 3c    1
Question 3d    2
Question 3e    2
Question 3f    2
Question 3g    1
Question 3h    2
Question 3i    2
Total          28

[2]: import pandas as pd
     import numpy as np
     import seaborn as sns
     import matplotlib.pyplot as plt
     %matplotlib inline

     import plotly.express as px

     # Note: If you're having problems with the 3d scatter plots, uncomment the
     # two lines below, and you should see a version number that is at least 4.1.1.
     # import plotly
     # plotly.__version__

1.3 Question 1: PCA on 3D Data

In Question 1, our goal is to see visually how PCA is simply the process of rotating the coordinate axes of our data.

The code below reads in a 3D dataset. We have named the DataFrame surfboard because the data resembles a surfboard when plotted in 3D space.

[3]: surfboard = pd.read_csv("data3d.csv")
     surfboard.head(5)

[3]:           x         y         z
     0  0.005605  2.298191  1.746604
     1 -1.093255  2.457522  0.170309
     2  0.060946  0.473669 -0.003543
     3 -1.761945  2.151108  3.132426
     4  1.950637 -0.194469 -2.101949

The cell below will allow you to view the data as a 3D scatterplot. Rotate the data around and zoom in and out using your trackpad or the controls at the top right of the figure.

You should see that the data is an ellipsoid that looks roughly like a surfboard or a hashbrown patty (https://www.google.com/search?q=hashbrown+patty&source=lnms&tbm=isch). That is, it is pretty long in one direction, pretty wide in another direction, and relatively thin along its third dimension. We can think of these as the "length", "width", and "thickness" of the surfboard data.

Observe that the surfboard is not aligned with the x/y/z axes.

If you get an error that your browser does not support WebGL, you may need to restart your kernel and/or browser.

[4]: fig = px.scatter_3d(surfboard, x='x', y='y', z='z',
                         range_x=[-10, 10], range_y=[-10, 10], range_z=[-10, 10])
     fig.show()

To give the figure a little more visual pop, the following cell does the same plot, but also assigns a pre-determined color value (that we've arbitrarily chosen) to each point. These colors do not mean anything important; they're simply there as a visual aid.

You might find it useful to use colorize_surfboard_data later in this assignment.
[5]: def colorize_surfboard_data(df):
         colors = pd.read_csv("surfboard_colors.csv", header=None).values
         df_copy = df.copy()
         df_copy.insert(loc=3, column="color", value=colors)
         return df_copy

     fig = px.scatter_3d(colorize_surfboard_data(surfboard), x='x', y='y', z='z',
                         range_x=[-10, 10], range_y=[-10, 10], range_z=[-10, 10],
                         color="color", color_continuous_scale='RdBu')
     fig.show()

1.4 Question 1a

Now that we've understood the data, let's work on understanding what PCA will do when applied to this data.

To properly perform PCA, we will first need to "center" the data so that the mean of each feature is 0.

Compute the columnwise mean of surfboard in the cell below, and store the result in surfboard_mean. You can choose to make surfboard_mean a NumPy array or a Series, whichever is more convenient for you. Regardless of what data type you use, surfboard_mean should have 3 means, 1 for each attribute, with the x coordinate first, then y, then z.

Then, subtract surfboard_mean from surfboard, and save the result in surfboard_centered. The order of the columns in surfboard_centered should be x, then y, then z.

[6]: surfboard_mean = np.mean(surfboard, axis=0)
     surfboard_centered = surfboard - surfboard_mean

[7]: grader.check("q1a")

[7]: q1a results: All test cases passed!

1.5 Question 1b

As you may recall from lecture, PCA is a specific application of the singular value decomposition (SVD) for matrices. If we have a data matrix $X$, we can decompose it into $U$, $\Sigma$, and $V^T$ such that $X = U \Sigma V^T$.

In the following cell, use the np.linalg.svd function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) to compute the SVD of surfboard_centered. Store the $U$, $\Sigma$, and $V^T$ matrices in u, s, and vt respectively. This is one line of simple code, exactly like what we saw in lecture.

Hint: Set the full_matrices argument of np.linalg.svd to False.

[8]: u, s, vt = np.linalg.svd(surfboard_centered, full_matrices=False)
     u, s, vt

[8]: (array([[-0.02551985, -0.02108339, -0.03408865],
             [-0.02103979, -0.0259219 ,  0.05432967],
             [-0.00283413, -0.00809889,  0.00204459],
             ...,
             [ 0.01536972, -0.00483066,  0.05673824],
             [-0.00917593,  0.0345672 ,  0.03491181],
             [-0.01701236,  0.02743128, -0.01966704]]),
      array([103.76854043,  40.38357469,  21.04757518]),
      array([[ 0.38544534, -0.67267377, -0.63161847],
             [-0.5457216 , -0.7181477 ,  0.43180066],
             [-0.74405633,  0.17825229, -0.64389929]]))

[9]: grader.check("q1b")

[9]: q1b results: All test cases passed!
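As a quick sanity check (an addition, not a cell from the original notebook), the three factors should multiply back to the centered data:

[ ]: # Added sanity check: U, Σ, and Vᵀ should reconstruct surfboard_centered.
     # Broadcasting u * s scales each column of U by its singular value, which is
     # equivalent to u @ np.diag(s) without building the diagonal matrix.
     np.allclose((u * s) @ vt, surfboard_centered)  # expected: True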
1.6 Question 1c: Total Variance

Let's now consider the relationship between the singular values s and the variance of our data. Recall that the total variance is the sum of the variances of each column of our data. Below, we provide code that computes the variance for each column of the data.

Note: The variances are the same for both surfboard_centered and surfboard, so we show only one to avoid redundancy.

[10]: np.var(surfboard, axis=0)

[10]: x    2.330704
      y    5.727527
      z    4.783513
      dtype: float64

The total variance of our dataset is given by the sum of these numbers.

[11]: total_variance_computed_from_data = sum(np.var(surfboard, axis=0))
      total_variance_computed_from_data

[11]: 12.841743509780109

As discussed in lecture, the total variance of the data is also equal to the sum of the squares of the singular values divided by the number of data points, that is:

$$\mathrm{Var}(X) = \frac{\sum_{i=1}^{d} \sigma_i^2}{N}$$

where $\sigma_i$ is the singular value corresponding to the $i$th principal component, $N$ is the total number of data points, and $\mathrm{Var}(X)$ is the total variance of the data.

In the cell below, compute the total variance using the formula above and store the result in the variable total_variance_computed_from_singular_values. Your result should be very close to total_variance_computed_from_data.

[12]: total_variance_computed_from_singular_values = np.sum(s**2) / surfboard.shape[0]
      total_variance_computed_from_singular_values

[12]: 12.841743509780104

[13]: grader.check("q1c")

[13]: q1c results: All test cases passed!

1.7 Question 1d: Explained Variance and Scree Plots

In the cell below, set variance_explained_by_1st_pc to the proportion of the total variance explained by the 1st principal component. Your answer should be a number between 0 and 1.

Note: This topic was discussed in this section of the PCA lecture slides: https://docs.google.com/presentation/d/1zpawVI7o2cYA_C_kSQLBjOMrFkSwMDk23JcedzrzttA/edit#slide=id.ge684cfc9d0_2_98

[14]: variance_explained_by_1st_pc = (s[0]**2 / surfboard.shape[0]) / total_variance_computed_from_data
      variance_explained_by_1st_pc

[14]: 0.8385084140449129

[15]: grader.check("q1d")

[15]: q1d results: All test cases passed!

We can also create a scree plot that shows the proportion of variance explained by each of our principal components, ordered from most to least. An example scree plot is given below. Note that the variance explained by the first principal component matches the value we calculated above for variance_explained_by_1st_pc.

Note: If you're wondering where len(surfboard_centered) went, it got canceled out when we divided the variance of a given PC by the total variance.

[16]: plt.plot([1, 2, 3], s**2 / sum(s**2));
      plt.xticks([1, 2, 3], [1, 2, 3]);
      plt.xlabel('PC #');
      plt.ylabel('Fraction of Variance Explained');
      plt.title('Fraction of Variance Explained by each Principal Component')

[16]: Text(0.5, 1.0, 'Fraction of Variance Explained by each Principal Component')

For this small toy problem, the scree plot is not particularly useful. We'll see why scree plots are useful in practice later in this homework.

1.8 Question 1e: V as a Rotation Matrix

In lecture, we saw that the first column of $XV$ contained the first principal component values for each observation, the second column of $XV$ contained the second principal component values for each observation, and so forth.

Let's give this matrix a name: $P = XV$ is sometimes known as the "principal component matrix".

Compute the $P$ matrix for the surfboard dataset and store it in the variable surfboard_pcs.
[17]: # P = XV. Doing the multiplication on the centered DataFrame keeps the
      # result a DataFrame, which the autograder's .loc-based check requires.
      surfboard_pcs = surfboard_centered @ vt.T

[18]: grader.check("q1e")

Note: the originally submitted answer, surfboard_pcs = u @ np.diag(s), computes the same values but returns a plain NumPy array, so the autograder check all(np.isclose(surfboard_pcs.loc[0], [-2.648, -0.851, -0.717], atol=1e-3)) failed with AttributeError: 'numpy.ndarray' object has no attribute 'loc'. Since $XV = U\Sigma$ (because $V$ is orthonormal, $V^T V = I$), multiplying the centered DataFrame by vt.T yields the same principal components while preserving the DataFrame type, so the check should now pass.
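For extra confidence in the fix (again, an added cell rather than part of the original notebook), the two routes to $P$ agree numerically; only their return types differ:

[ ]: # Added illustration: P = XV and P = UΣ give the same values.
     pcs_via_v = surfboard_centered @ vt.T  # XV; stays a DataFrame
     pcs_via_u = u @ np.diag(s)             # UΣ; a plain NumPy array
     np.allclose(pcs_via_v, pcs_via_u)      # expected: True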
1.9 Visualizing the Principal Component Matrix

In some sense, we can think of $P$ as an output of the PCA procedure. It is simply a rotation of the data such that the data will now appear "axis aligned". Specifically, for a 3D dataset, if we plot PC1, PC2, and PC3 along the x, y, and z axes of our plot, then the greatest amount of variation happens along the x-axis, the second greatest amount along the y-axis, and the smallest amount along the z-axis. To visualize this, plot the data projected onto the principal component space, and compare with your earlier 3D scatterplot of the original data (a sketch of the projection is given below).
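The original plotting cells are collapsed in the source printout, so the following is only a minimal matplotlib sketch of the top-down view, reusing the surfboard_pcs DataFrame from Question 1e:

[ ]: # Added sketch: view the surfboard "from above" by plotting only the first
     # two columns of the principal component matrix P.
     plt.scatter(surfboard_pcs.iloc[:, 0], surfboard_pcs.iloc[:, 1], s=5)
     plt.xlabel('PC1')
     plt.ylabel('PC2')
     plt.title('Surfboard data projected onto its first two PCs');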
Question 1 Summary

Above, we saw that the principal component matrix $P$ is simply the original data rotated in space so that it appears axis-aligned. Whenever we do a 2D scatter plot of only the first 2 columns of $P$, we are simply looking at the data from "above", i.e. so that the 3rd (or higher) PC is invisible to us.
Question 2

Using PCA, we can try to visualize student performance on ALL questions simultaneously. In the cell below, create a DataFrame called mid1_1st_2_pcs that has 992 rows and 2 columns, where the first column is named pc1 and represents the first principal component, and the second column is named pc2 and represents the second principal component.

Reminder: make sure to center your data first!
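The cells defining the exam dataset are hidden in the source printout, so the following is only a sketch of one plausible implementation; the mid1 DataFrame (992 students by per-question midterm scores) is an assumed name, not part of the original notebook.

[ ]: # Sketch under assumptions: `mid1` is a hypothetical 992-row DataFrame of
     # per-question midterm scores, one row per student.
     mid1_centered = mid1 - np.mean(mid1, axis=0)      # center each question
     u2, s2, vt2 = np.linalg.svd(mid1_centered, full_matrices=False)
     mid1_1st_2_pcs = pd.DataFrame(
         (mid1_centered @ vt2.T).values[:, :2],        # first two columns of XV
         columns=["pc1", "pc2"])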