Please follow the marking rubric and every instruction given in the assignment exactly.


Data Wrangling Assignment 2

Assessment type: Written report (PDF document) using R Markdown
Word limit: Maximum 25 pages

Purpose

The purpose of this final assignment is to put to work the tools and knowledge that you gain throughout this course. This provides you with multiple benefits.
· It will provide you with more experience using data preprocessing tools on real-life data sets.
· It helps you to self-direct your learning and interests to find unique and creative ways to wrangle your data.
· It starts to build your data analytics portfolio. Portfolios (or e-portfolios) are a great way to show potential employers what you are capable of.

Overview

This assignment requires you to find some open data and use the knowledge and skills gained during the course to preprocess the data. You will create a report using R Markdown to explain the steps you took to perform the data preprocessing tasks. You will also publish this report online (in RPubs), which will give you the opportunity to build your data analytics portfolio. The more clearly you demonstrate your skills, the more marks you will be awarded.

Assessment criteria and weighting

Please see the marking rubric for the assessment criteria and weighting.

Course Learning outcomes

This assessment is linked to the following course learning outcomes:
1. Accurately, logically and ethically combine data from multiple sources to make it suitable for statistical analysis and draw valid interpretations.
2. Articulate how data meets the best practice standards (e.g. tidy data principles).
3. Select, perform and justify data validation processes for raw datasets.
4. Use leading open source software (e.g. R) for reproducible, automated data processing.

Assignment Data Sources

Assignment 2 is open-ended; however, you are required to find suitable datasets that fulfil the minimum requirements given below.
All of the datasets that you use in this assignment must be open and ideally have a Creative Commons Licence. This will ensure you can share your work with anyone provided you make proper attribution. If you are not sure whether data is open, contact the provider, read the documentation, or post on the discussion board and I will investigate. Some open data sources are provided below, but I encourage you to find others:
· https://www.kaggle.com
· UCI Machine Learning Repository
· data.gov
· World Bank
· Amazon Web Services
· Google data sets
· YouTube video data sets
· Analytics Vidhya
· Quandl
· DrivenData
· http://www.abs.gov.au/
· https://www.data.vic.gov.au/
· http://www.bom.gov.au/
· https://relational.fit.cvut.cz

Minimum Requirements for the Data sets

Considering this is a data preprocessing class, I expect your data set to meet certain requirements so that you can demonstrate your knowledge of data preprocessing. The following are the minimum requirements for the data sets that I will look for:
1. At least two data sets should be merged to create your assignment data (for example, you can take crime statistics for the cities/states in Australia and merge this data set with the cities/states' per capita income data).
2. Your data set should include multiple data types (numerics, characters, factors, etc.).
3. Your data set should include variables suitable for data type conversions, so that you can apply the required conversions (e.g. character -> factor, character -> date, numeric -> factor).
4. Your data set should include at least one factor variable that needs to be labelled and/or ordered.
5. At least one of the data sets that you use should be untidy. You need to explain why the data set or data sets you used are untidy. Then you need to apply the required steps to reshape your data into a tidy format.
6. At least one variable needs to be created/mutated from the existing ones (e.g. the data may contain income and expense variables, and you may create a savings variable from them).
7. You are expected to scan all variables for missing values, special values and obvious errors (i.e. inconsistencies). If there are missing values, use any of the suitable techniques outlined in Module 5 to deal with them, and justify and document your approach properly. If there are no missing values in the data, scan all variables for any special values and obvious errors, use any of the suitable techniques outlined in Module 5 to deal with them, and justify and document your approach properly.
8. You are expected to scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques outlined in Module 6 to deal with them, and justify and document your approach properly.
9. You are expected to apply a data transformation to at least one of the variables. The purpose of this transformation should be one of the following: i) to change the scale for better understanding of the variable, ii) to convert a non-linear relation into a linear one, or iii) to decrease the skewness and bring the distribution closer to a normal distribution.
10. You are expected to use only readr, xlsx, readxl, foreign, gdata, rvest, dplyr, tidyr, deductive, deducorrect, editrules, validate, Hmisc, forecast, stringr, lubridate, car, outliers, MVN, infotheo, MASS, caret, MLR, ggplot2, knitr and base R functions for this section. You can also use your own functions. This will show the knowledge that you have accumulated throughout the semester in this course.

Optional things that you can do to preprocess data:
· You can subset your data by selecting variables and/or filtering in (or out) cases. Please don't forget to put an explanation in your report if you do so.
· Your data set can include date or string information or both. If this is the case, I expect you to apply date conversions for dates and string manipulations for strings as required.
· Depending on the knowledge you have gained in other courses (e.g. Applied Analytics and/or Machine Learning), you may apply data normalisation, feature selection and feature extraction. Note that this is an optional task, and you don't have to apply any of these techniques if you don't know the theory and the fundamentals.

Report Section Details

1. Report title and student details [Plain text]: You can add the title of your report and student details by updating the "title" and "author" entries at the top of the R Markdown Template.
2. Required packages [R code]: Provide the packages required to reproduce the report. Make sure you fulfil minimum requirement #10.
3. Executive Summary [Plain text]: In your own words, provide a brief summary of the preprocessing. Explain the steps that you have taken to preprocess your data. Write this section last, after you have performed all data preprocessing. (Word count: maximum 300 words.)
4. Data [Plain text & R code & Output]: Provide a clear description of the data sets, their sources, and the variables. In this section, you must also provide the R code with outputs (e.g. head of data sets) that you used to import/read/scrape the data set. You need to fulfil minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R code and outputs, you need to explain the steps that you have taken.
5. Understand [Plain text & R code & Output]: Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R code and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.
6. Tidy & Manipulate Data I [Plain text & R code & Output]: Explain why your data (or one of the data sets) does not conform to the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R code and outputs, explain everything that you do in this step.
7. Tidy & Manipulate Data II [Plain text & R code & Output]: Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R code and outputs, explain everything that you do in this step.
8. Scan I [Plain text & R code & Output]: Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil minimum requirement #7. In addition to the R code and outputs, explain your methodology (i.e. why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
9. Scan II [Plain text & R code & Output]: Scan the numeric data for outliers. In this step, you should fulfil minimum requirement #8. In addition to the R code and outputs, explain your methodology (i.e. why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
10. Transform [Plain text & R code & Output]: Apply an appropriate transformation to at least one of the variables. In this step, you should fulfil minimum requirement #9. In addition to the R code and outputs, explain everything that you do in this step.

NOTE: The order of the tasks may sometimes differ from the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge on. In such cases you may order the sections differently. Any further or optional preprocessing tasks can be added to the template using an additional section in the R Markdown file.
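As a rough illustration of how the report steps (tidy, merge, understand, scan, mutate, transform) might fit together, here is a minimal sketch in R using only dplyr, tidyr and base R from the permitted package list. Everything in it is an assumption made for illustration: the data frames crime_wide and income, their column names, and all values are made up and do not come from any real source. Mean imputation, the 1.5 x IQR rule and the log10 transformation are just one possible choice each, not necessarily the techniques covered in Modules 5 and 6.

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy data set: the years are spread across column headers,
# so a variable (year) is stored as column names instead of values.
crime_wide <- data.frame(
  state  = c("VIC", "NSW", "QLD", "WA"),
  `2019` = c(510, 880, 430, 260),
  `2020` = c(495, NA, 445, 2400),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical second data set sharing the key `state` (population in millions).
income <- data.frame(
  state       = c("VIC", "NSW", "QLD", "WA"),
  population  = c(6.6, 8.1, 5.1, 2.6),
  income_band = c("medium", "high", "low", "high"),
  stringsAsFactors = FALSE
)

# Tidy: one row per state-year observation.
crime_long <- crime_wide %>%
  pivot_longer(cols = -state, names_to = "year", values_to = "offences")

# Merge the two data sets on the common key.
merged <- left_join(crime_long, income, by = "state")

# Understand: convert types and order the factor levels.
merged <- merged %>%
  mutate(
    year        = as.integer(year),
    state       = factor(state),
    income_band = factor(income_band,
                         levels = c("low", "medium", "high"),
                         ordered = TRUE)
  )

# Scan I: count missing values per column, then impute with the state mean.
missing_before <- colSums(is.na(merged))
merged <- merged %>%
  group_by(state) %>%
  mutate(offences = ifelse(is.na(offences),
                           mean(offences, na.rm = TRUE),
                           offences)) %>%
  ungroup()

# Mutate: create a new variable from existing ones.
merged <- merged %>% mutate(offence_rate = offences / population)

# Scan II: flag outliers outside the 1.5 x IQR fences.
q      <- unname(quantile(merged$offence_rate, c(0.25, 0.75)))
fences <- q + c(-1.5, 1.5) * IQR(merged$offence_rate)
merged$is_outlier <- merged$offence_rate < fences[1] |
                     merged$offence_rate > fences[2]

# Transform: log10 to reduce right skew.
merged$log_rate <- log10(merged$offence_rate)

head(merged)
```

In a real report each of these steps would sit in its own section, with the output shown and the reasoning for each choice explained in plain text.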
Make sure your code is visible (within the margin of the page). Do not use View() to show your data; instead, show the first rows using head().

Academic integrity and plagiarism

Academic integrity is about honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge and ideas. You should take extreme care that you have:
· acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods
· provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from internet sites.
Sep 24, 2021