11/18/22, 4:22 PM WGU Performance Assessment https://tasks.wgu.edu/student/ XXXXXXXXXX/course/ XXXXXXXXXX/task/2807/overview 1/10 NVM2 - NVM2 TASK 1: CLASSIFICATION ANALYSIS DATA MINING I -...

WGU PERFORMANCE ASSESSMENT


11/18/22, 4:22 PM WGU Performance Assessment https://tasks.wgu.edu/student/009463359/course/20900018/task/2807/overview

NVM2 - NVM2 TASK 1: CLASSIFICATION ANALYSIS
DATA MINING I - D209
PRFA - NVM2

COMPETENCIES
4030.06.1 : Classification Data Mining Models
The graduate applies observations to appropriate classes and categories using classification models.
4030.06.3 : Data Mining Model Performance
The graduate evaluates data mining model performance for precision, accuracy, and model comparison.

INTRODUCTION
In this task, you will act as an analyst and create a data mining report. In doing so, you must select one of the data dictionary and data set files to use for your report from the following link: Data Sets and Associated Data Dictionaries.
You should also refer to the data dictionary file for your chosen data set from the provided link. You will use Python or R to analyze the given data and create a data mining report in a word processor (e.g., Microsoft Word). Throughout the submission, you must visually represent each step of your work and the findings of your data analysis.
Note: All algorithms and visual representations used need to be captured either in tables or as screenshots added into the submitted document. A separate Microsoft Excel (.xls or .xlsx) document of the cleaned data should be submitted along with the written aspects of the data mining report.

REQUIREMENTS
Your submission must be your original work. No more than a combined total of 30% of the submission and no more than a 10% match to any one individual source can be directly quoted or closely paraphrased from sources, even if cited correctly. The originality report that is provided when you submit your task can be used as a guide.
https://lrps.wgu.edu/provision/227079957
You must use the rubric to direct the creation of your submission because it provides detailed criteria that will be used to evaluate your work. Each requirement below may be evaluated by more than one rubric aspect. The rubric aspect titles may contain hyperlinks to relevant portions of the course.
Tasks may not be submitted as cloud links, such as links to Google Docs, Google Slides, OneDrive, etc., unless specified in the task requirements. All other submissions must be file types that are uploaded and submitted as attachments (e.g., .csv, .docx, .pdf, .ppt).

Part I: Research Question
A. Describe the purpose of this data mining report by doing the following:
1. Propose one question relevant to a real-world organizational situation that you will answer using one of the following classification methods:
• k-nearest neighbor (KNN)
• Naive Bayes
2. Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.

Part II: Method Justification
B. Explain the reasons for your chosen classification method from part A1 by doing the following:
1. Explain how the classification method you chose analyzes the selected data set. Include expected outcomes.
2. Summarize one assumption of the chosen classification method.
3. List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.

Part III: Data Preparation
C. Perform data preparation for the chosen data set by doing the following:
1. Describe one data preprocessing goal relevant to the classification method from part A1.
2. Identify the initial data set variables that you will use to perform the analysis for the classification question from part A1, and classify each variable as continuous or categorical.
3. Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step.
4. Provide a copy of the cleaned data set.

Part IV: Analysis
D. Perform the data analysis and report on the results by doing the following:
1. Split the data into training and test data sets and provide the file(s).
2. Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.
3. Provide the code used to perform the classification analysis from part D2.

Part V: Data Summary and Implications
E. Summarize your data analysis by doing the following:
1. Explain the accuracy and the area under the curve (AUC) of your classification model.
2. Discuss the results and implications of your classification analysis.
3. Discuss one limitation of your data analysis.
4. Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in part E2.

Part VI: Demonstration
F. Provide a Panopto video recording that includes a demonstration of the functionality of the code used for the analysis and a summary of the programming environment.
Note: The audiovisual recording should feature you visibly presenting the material (i.e., not in voiceover or embedded video) and should simultaneously capture both you and your multimedia presentation.
Note: For instructions on how to access and use Panopto, use the "Panopto How-To Videos" web link provided below. To access Panopto's website, navigate to the web link titled "Panopto Access," and then choose to log in using the "WGU" option. If prompted, log in using your WGU student portal credentials, and then it will forward you to Panopto's website.
To submit your recording, upload it to the Panopto drop box titled "Data Mining I – NVM2." Once the recording has been uploaded and processed in Panopto's system, retrieve the URL of the recording from Panopto and copy and paste it into the Links option. Upload the remaining task requirements using the Attachments option.

G. Record the web sources used to acquire data or segments of third-party code to support the analysis. Ensure the web sources are reliable.
H. Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.
I. Demonstrate professional communication in the content and presentation of your submission.

File Restrictions
File name may contain only letters, numbers, spaces, and these symbols: ! - _ . * ' ( )
File size limit: 200 MB
File types allowed: doc, docx, rtf, xls, xlsx, ppt, pptx, odt, pdf, txt, qt, mov, mpg, avi, mp3, wav, mp4, wma, flv, asf, mpeg, wmv, m4v, svg, tif, tiff, jpeg, jpg, gif, png, zip, rar, tar, 7z

RUBRIC

A1: PROPOSAL OF QUESTION
NOT EVIDENT: The submission does not propose 1 question.
APPROACHING COMPETENCE: The submission proposes 1 question that is not relevant to a real-world organizational situation. Or the proposal does not include 1 of the given classification methods.
COMPETENT: The submission proposes 1 question that is relevant to a real-world organizational situation, and the proposal includes 1 of the given classification methods.

A2: DEFINED GOAL
NOT EVIDENT: The submission does not define 1 goal for data analysis.
APPROACHING COMPETENCE: The submission defines 1 goal for data analysis, but the goal is not reasonable, is not within the scope of the scenario, or is not represented in the available data.
COMPETENT: The submission defines 1 reasonable goal for data analysis that is within the scope of the scenario and is represented in the available data.

B1: EXPLANATION OF CLASSIFICATION METHOD
NOT EVIDENT: The submission does not explain how the chosen classification method analyzes the selected data set.
APPROACHING COMPETENCE: The submission does not logically explain how the chosen classification method analyzes the selected data set, or the explanation includes inaccurate expected outcomes.
COMPETENT: The submission logically explains how the chosen classification method analyzes the selected data set and includes accurate expected outcomes.

B2: SUMMARY OF METHOD ASSUMPTION
NOT EVIDENT: The submission does not summarize 1 assumption of the chosen classification method.
APPROACHING COMPETENCE: The submission inadequately summarizes 1 assumption of the chosen classification method.
COMPETENT: The submission adequately summarizes 1 assumption of the chosen classification method.

B3: PACKAGES OR LIBRARIES LIST
NOT EVIDENT: The submission does not list the packages or libraries chosen for Python or R.
APPROACHING COMPETENCE: The submission lists the packages or libraries chosen for Python or R but does not justify how 1 or more items on the list support the analysis.
COMPETENT: The submission lists the packages or libraries chosen for Python or R and justifies how each item on the list supports the analysis.

C1: DATA PREPROCESSING
NOT EVIDENT: The submission does not describe 1 data preprocessing goal.
APPROACHING COMPETENCE: The submission describes 1 data preprocessing goal, but it is not relevant to the classification method from part A1.
COMPETENT: The submission describes 1 data preprocessing goal that is relevant to the classification method from part A1.

C2: DATA SET VARIABLES
NOT EVIDENT: The submission does not identify any data set variables used to perform the analysis for the classification question from part A1 or does not classify the variables as continuous or categorical.
APPROACHING COMPETENCE: The submission identifies the data set variables used to perform the analysis for the classification question from part A1, but the submission inaccurately classifies 1 or more variables as continuous or categorical.
COMPETENT: The submission identifies the data set variables used to perform the analysis for the classification question from part A1, and the submission accurately classifies each variable as continuous or categorical.

C3: STEPS FOR ANALYSIS
NOT EVIDENT: The submission does not explain each step used to prepare the data for the analysis, or the submission does not identify the code segment for each step.
APPROACHING COMPETENCE: The submission inaccurately explains 1 or more steps used to prepare the data for analysis, or the submission identifies an inaccurate code segment for 1 or more steps.
COMPETENT: The submission accurately explains each step used to prepare the data for analysis, and the submission identifies an accurate code segment for each step.

C4: CLEANED DATA SET
NOT EVIDENT: The submission does not include a copy of the cleaned data set.
APPROACHING COMPETENCE: The submission includes a copy of the cleaned data set, but the data set is inaccurate.
COMPETENT: The submission includes an accurate copy of the cleaned data set.

D1: SPLITTING THE DATA
NOT EVIDENT: The submission does not provide the training and test data set file(s).
APPROACHING COMPETENCE: The submission provides training and test data sets, but the split is not reasonably proportioned.
COMPETENT: The submission provides reasonably proportioned training and test data sets.

D2: OUTPUT AND INTERMEDIATE CALCULATIONS
NOT EVIDENT: The submission does not describe the analysis technique used to analyze the data, or it does not include screenshots of the intermediate calculations performed.
APPROACHING COMPETENCE: The submission inaccurately describes the analysis technique used to appropriately analyze the data, or the submission includes screenshots of the intermediate calculations performed but they are inaccurate.
COMPETENT: The submission accurately describes the analysis technique used to appropriately analyze the data, and the submission includes accurate screenshots of the intermediate calculations performed.

D3: CODE EXECUTION
NOT EVIDENT: The submission does not provide the code used to perform the classification analysis from part D2.
APPROACHING COMPETENCE: The submission provides the code used to perform the classification analysis from part D2, but 1 or more errors are evident during the execution of the code.
COMPETENT: The submission provides the code used to perform the classification analysis from part D2 and the code executes without errors.

E1: ACCURACY AND AUC
NOT EVIDENT: The submission does not explain the accuracy or the AUC of the classification model.
APPROACHING COMPETENCE: The submission does not logically explain the accuracy or the AUC of the classification model.
COMPETENT: The submission logically explains both the accuracy and the AUC of the classification model.

E2: RESULTS AND IMPLICATIONS
NOT EVIDENT: The submission does not discuss both the results and implications of the classification analysis.
APPROACHING COMPETENCE: The submission discusses both the results and implications of the classification analysis, but the discussion is inadequate.
COMPETENT: The submission adequately discusses both the results and implications of the classification analysis.

E3: LIMITATION
NOT EVIDENT: The submission does not discuss 1 limitation of the data analysis.
APPROACHING COMPETENCE: The submission discusses 1 limitation of the data analysis but lacks adequate detail or is illogical.
COMPETENT: The submission logically discusses 1 limitation of the data analysis with adequate detail.

E4: COURSE OF ACTION
NOT EVIDENT: The submission does not recommend a course of action for the real-world organizational situation from part A1.
APPROACHING COMPETENCE: The submission does not recommend a reasonable course of action for the real-world organizational situation from part A1, or the course of action is not based on the results and implications discussed in part E2.
COMPETENT: The submission recommends a reasonable course of action for the real-world organizational situation from part A1 based on the results and implications discussed in part E2.

F: PANOPTO RECORDING
NOT EVIDENT: The submission does not provide a Panopto video recording.
APPROACHING COMPETENCE: The submission provides a Panopto video recording, but it does not include a demonstration of the functionality of the code used for the analysis or a summary of the programming environment or both.
COMPETENT: The submission provides a Panopto video recording that

G: SOURCES FOR THIRD-PARTY CODE
Answered 13 days After Nov 18, 2022


Aditi answered on Nov 22 2022
Task 1 - KNN Classification
Configure Notebook:
Import and configure packages. An imports.py file contains all of the code needed to import and configure the packages. There is a second helper file as well: several functions used throughout this notebook are defined in helpers.py.
from imports import *
%matplotlib inline
warnings.filterwarnings('ignore')
P:\code\wgu\py\Scripts\python.exe
python version: 3.9.7
pandas version: 1.3.0
numpy version: 1.19.5
scipy version: 1.7.1
sklearn version: 1.0.1
matplotlib version: 3.4.2
seaborn version: 0.11.2
graphviz version: 0.17
from helpers import *
getFilename version: 1.0
saveTable version: 1.0
describeData version: 1.0
createScatter version: 1.0
createBarplot version: 1.1
get_unique_numbers version: 1.0
createCorrelationMatrix version: 1.0
createStackedHistogram version: 1.0
plotDataset version: 1.0
Part I: Research Question
A. Describe the purpose of this data mining report by doing the following:
A1. Propose one question relevant to a real-world organizational situation that you will answer using one of the following classification methods: (a) k-nearest neighbor (KNN) or (b) Naive Bayes.
Primary question. A telecommunications company has asked about churn, which occurs when a customer decides to stop using its services. Given that the company holds data on customers who have and have not churned in the past, is it feasible to classify a new (or existing) customer based on their similarity to previous customers with comparable attributes? Two (2) attributes, MonthlyCharge and Tenure, from the company's database of 10,000 customers will be considered in this analysis. In addition, if a prediction can be made, the analysis will attempt to determine how accurate it is.
A2. Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.
Primary goal. The analysis will try to predict churn for a new customer with MonthlyCharge = $170.00 and Tenure = 1.0. The company's customer data supports this goal, since both attributes are recorded for all 10,000 customers and should provide sufficient information for the prediction. The analysis will use k-nearest neighbors (KNN) to classify the new customer based on the k nearest existing customers with similar attributes.
import pandas as pd
newCustomer = pd.DataFrame([{'Tenure': 1.0,
                             'MonthlyCharge': 170.0,
                             'zTenure': 0.0,
                             'zMonthlyCharge': 0.0}])
Part II: Method Justification
B. Explain the reasons for your chosen classification method from part A1 by doing the following:
B1. Explain how the classification method you chose analyzes the selected data set. Include expected outcomes.
Describe the method. KNN classification looks at the k nearest neighbors, the existing records whose feature values lie closest to the record being classified. It determines which class value occurs most often among those k neighbors and assigns that class as the prediction. Expected outcomes: I expect the results to show the predicted value of the target variable, along with the number of neighbors k and a summary of the model's accuracy.
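The majority-vote idea described above can be sketched in a few lines of NumPy. This is only an illustration, not the notebook's model: the function name knn_predict, the choice of Euclidean distance, and the six toy records are assumptions made for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # count the class labels among those neighbors and return the most common
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# toy, made-up standardized features (zMonthlyCharge, zTenure) and churn labels
X_train = np.array([[1.5, -1.2], [1.2, -1.0], [1.4, -0.9],
                    [-0.8, 1.1], [-1.0, 0.9], [-0.5, 1.3]])
y_train = np.array([True, True, True, False, False, False])

high_charge_new = knn_predict(X_train, y_train, np.array([1.3, -1.1]), k=3)
low_charge_old = knn_predict(X_train, y_train, np.array([-0.9, 1.0]), k=3)
print(high_charge_new, low_charge_old)
```

In practice a library classifier such as sklearn's KNeighborsClassifier would normally replace the hand-written function; the sketch only makes the voting step visible.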
B2. Summarize one assumption of the chosen classification method.
One assumption. A fundamental assumption of KNN is that similar items lie close to one another. The method therefore finds the customer records most similar to the new customer and assigns the class that occurs most often among those near neighbors.
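What counts as "close" depends on the distance measure and on the scale of each feature. The following sketch, which assumes Euclidean distance and uses made-up customers, shows the nearest neighbor changing once each feature is divided by an illustrative spread (about 43 for MonthlyCharge and 26 for Tenure, chosen to be near the standard deviations in the describeData output). Dividing by the spread gives the same distances as z-scoring, because subtracting the mean cancels out in a difference.

```python
import numpy as np

# made-up customers as (MonthlyCharge, Tenure)
new = np.array([170.0, 1.0])
a = np.array([172.0, 40.0])   # similar charge, very different tenure
b = np.array([230.0, 2.0])    # very different charge, similar tenure

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(((p - q) ** 2).sum()))

# distances on the raw scale
raw_a, raw_b = euclidean(new, a), euclidean(new, b)

# distances after rescaling each feature by an illustrative spread
scale = np.array([43.0, 26.0])
scaled_a = euclidean(new / scale, a / scale)
scaled_b = euclidean(new / scale, b / scale)

print('raw:    a =', round(raw_a, 2), ' b =', round(raw_b, 2))
print('scaled: a =', round(scaled_a, 2), ' b =', round(scaled_b, 2))
```

On the raw scale customer a is the nearer neighbor, while after rescaling customer b is nearer, which is one reason the features are standardized before KNN is applied.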
B3. List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.
All of the Python packages needed for this analysis were loaded at the very start of the notebook, where the packages and their version numbers are displayed. In addition to the usual Python tools (such as numpy, scipy, matplotlib, pandas, etc.), sklearn provides the main package needed to build and view the classification model. I also use two (2) .py files. Rather than putting all of that code into this notebook, the .py files let me reuse it across different notebooks.
All necessary packages are imported in imports.py, and helpers.py contains a number of useful functions that let me standardize my tables, figures, and other notebook components. Both .py files will be included with the notebook for your convenience.
Part III: Data Preparation
C. Perform data preparation for the chosen data set by doing the following:
C1. Describe one data preprocessing goal relevant to the classification method from part A1.
One data preprocessing goal. After the company data is imported into the Python environment, the raw numerical data should be normalized (z-scored) before KNN classification is applied to this problem. In addition, the company data will be split into two (2) subsets: a training dataset with 70% of the data and a testing (validation) dataset with the remaining 30%. KNN will use the training set to build the model, and the test set will be used to validate it. The main goal of data preparation is to establish these subsets so that the analysis is as easy and clear as possible for anyone to follow throughout the notebook. The planned data variables for this analysis are:
Raw Data.
y = target data (i.e. Churn (categorical))
X = feature data (i.e. MonthlyCharge and Tenure)
rawData = y.merge(X)
Clean Data.
y = target data (i.e. Churn (bool))
X = feature data (i.e. MonthlyCharge, Tenure, zMonthlyCharge, and zTenure)
cleanData = y.merge(X)
Training Data. 70% of the cleaned data.
X_train = created using train-test-split (i.e. zMonthlyCharge and zTenure)
y_train = created using train-test-split
trainData = y_train.merge(X_train)
Testing Data. The remaining 30% of the cleaned data.
X_test = created using train-test-split (i.e. zMonthlyCharge and zTenure)
y_test = created using train-test-split
testData = y_test.merge(X_test)
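The split planned above could look roughly like the following sketch. It uses sklearn's train_test_split on a made-up stand-in for the cleaned data; the 100 random rows, the random_state value, and the stratify option are assumptions for the example. It also joins target and features with pd.concat rather than merge, since y is a Series here; that is a substitution for illustration, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up stand-in for the cleaned data (the real set has 10,000 rows)
rng = np.random.default_rng(0)
cleanData = pd.DataFrame({
    'Churn': rng.random(100) < 0.3,
    'zMonthlyCharge': rng.normal(size=100),
    'zTenure': rng.normal(size=100),
})

X = cleanData[['zMonthlyCharge', 'zTenure']]
y = cleanData['Churn']

# 70% train / 30% test; stratify keeps the churn ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# rejoin target and features column-wise; the row indexes still line up
trainData = pd.concat([y_train, X_train], axis=1)
testData = pd.concat([y_test, X_test], axis=1)
print(trainData.shape, testData.shape)
```

Fixing random_state makes the split repeatable, which helps when the training and test files have to be submitted alongside the report.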
C2. Identify the initial data set variables that you will use to perform the analysis for the classification question from part A1, and classify each variable as continuous or categorical.
Identify the initial variables. For this analysis I will use two features, MonthlyCharge and Tenure, and one target, Churn. The data is read with pandas; the usecols option returns only the selected columns from the raw CSV data file.
MonthlyCharge (FEATURE): the monthly amount charged to the customer, representing an average for each individual customer.
Tenure (FEATURE): the length of time the customer has been a customer of the company.
Churn (TARGET): whether the customer has discontinued service within the past month (Yes, No).
TABLE 3-1.SELECTED RAW DATA.
Initial state of dataset before any manipulations.
raw = pd.read_csv('data/churn_clean.csv',
                  usecols=['Churn','Tenure','MonthlyCharge'])
saveTable(data=raw, title='RAW', sect='C2',
          course='D209', task='Task2', caption='3 1')
    
   Churn  Tenure  MonthlyCharge
0  No      6.796        172.456
1  Yes     1.157        242.633
2  No     15.754        159.948
3  No     17.087        119.957
shape: (10000, 3)
Table saved to: TABLES/D209_TASK2_C2_TAB_3_1_RAW.CSV
Summary. The company's raw customer data, read into the raw variable, consists of 10,000 customer records with three (3) variables each. Two (2) of the variables are continuous (numerical) data and will be used as features; the third is the binary target variable. For each feature, a standardized (z-scored) column was added in addition to the raw data.
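To see what the usecols option does without needing the real file, here is a small sketch that reads CSV text from memory. The two data rows reuse values from Table 3-1, while the Customer_id and City columns and their values are hypothetical extras added only to show columns being dropped.

```python
import io
import pandas as pd

# made-up CSV text standing in for the raw file; Customer_id and City are hypothetical
csv_text = """Customer_id,City,Churn,Tenure,MonthlyCharge
A1,Springfield,No,6.796,172.456
A2,Shelbyville,Yes,1.157,242.633
"""

# only the three named columns are loaded; the others are skipped while parsing
raw_demo = pd.read_csv(io.StringIO(csv_text),
                       usecols=['Churn', 'Tenure', 'MonthlyCharge'])
print(raw_demo.shape)
print(raw_demo.dtypes)
```

The resulting columns keep the order they have in the file, not the order given in the usecols list.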
C3. Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step.
Step 1: Read the selected company data. The pandas.read_csv() function, with the usecols=[] option, was used to read the relevant customer data (Churn, MonthlyCharge, and Tenure) into the Python environment. This was done earlier, in section C2 [9].
# start with a copy of raw data
clean = raw.copy()
Step 2: Convert the target data. Each row of the Churn variable initially held a Yes or No value, so this step used the pandas replace() method to convert the categorical data to boolean data. In Python, boolean data behaves as numeric data that is either 1 or 0 (int). This was done earlier, in section C2 [9].
Target Data ( y ). Convert categorical Churn to numeric boolean. Ref: (1) https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
target = 'Churn'
clean[target] = clean[target].replace({"No":False, "Yes":True})
clean[target] = clean[target].astype('bool')
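As an alternative sketch (not the notebook's code), the same conversion can be written with Series.map, which turns any unexpected label into NaN so that it can be caught before the cast. This matters because astype('bool') would otherwise turn a leftover value into True without complaint. The four sample labels are made up.

```python
import pandas as pd

# made-up sample of the Churn column
churn = pd.Series(['No', 'Yes', 'No', 'No'], name='Churn')

# map each label explicitly; any label missing from the dict becomes NaN
mapped = churn.map({'No': False, 'Yes': True})

# stop early if a label was not recognised, rather than casting it silently
assert mapped.notna().all(), 'unexpected Churn label found'

churn_bool = mapped.astype('bool')
print(churn_bool.tolist())
print(int(churn_bool.sum()), 'churned of', len(churn_bool))
```

Because booleans count as 1 and 0, summing the column gives the number of churned customers directly.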
Step 3: Describe the initial set of variables. For each variable, describe whether the data is numerical or categorical. I used a function I wrote to loop over the variables and list each one with a brief description. In addition, the pandas describe() method can be used to display descriptive statistics for numerical data. This was done above in sections C2 [10] and C2 [11].
features = ['MonthlyCharge','Tenure']
# add a z-scored (standardized) column for each feature
for c in features:
    clean['z'+c] = (clean[c] - clean[c].mean()) / clean[c].std()
describeData(data=clean)
1. Churn is boolean (BINARY): [False    True].
2. Tenure is numerical (CONTINUOUS) - type: float64. Min: 1.000    Max: 71.999    Std: 26.443
3. MonthlyCharge is numerical (CONTINUOUS) - type: float64. Min: 79.979    Max: 290.160    Std: 42.943
4. zMonthlyCharge is numerical (CONTINUOUS) - type: float64. Min: -2.157    Max: 2.737    Std: 1.000
5. zTenure is numerical (CONTINUOUS) - type:...
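The newCustomer record defined in part A2 carries placeholder z-values of 0.0. One way those could be filled in, assuming the new record is scaled with the same mean and standard deviation as the existing customers, is sketched below. The four training rows are made up, so the printed z-values are not the real ones.

```python
import pandas as pd

# made-up existing-customer values, not the real data
train = pd.DataFrame({'MonthlyCharge': [120.0, 160.0, 170.0, 240.0],
                      'Tenure': [2.0, 10.0, 30.0, 60.0]})
features = ['MonthlyCharge', 'Tenure']

# keep the existing-customer statistics so new rows land on the same scale
means = train[features].mean()
stds = train[features].std()   # pandas uses the sample std (ddof=1) by default

newCustomer = pd.DataFrame([{'Tenure': 1.0, 'MonthlyCharge': 170.0}])
for c in features:
    newCustomer['z' + c] = (newCustomer[c] - means[c]) / stds[c]
print(newCustomer.round(3))
```

Scaling the new record with its own mean and standard deviation would not work, since a single row has no spread; reusing the stored statistics keeps it comparable with the records the model was built on.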