Practical Data Science with Python COSC 2670/2738 Assignment 1 Assessment Type Individual Due Date 23:59, the 15th of April, 2020 Marks 15 Introduction In this assignment, you will examine a data file...

Kindly find the attached supporting format as well, and reply with the best possible price.


Practical Data Science with Python
COSC 2670/2738
Assignment 1

Assessment Type: Individual
Due Date: 23:59, the 15th of April, 2020
Marks: 15

Introduction

In this assignment, you will examine a data file and carry out the first steps of the data science process, including the cleaning and exploring of data. You will need to develop and implement appropriate steps, in IPython, to load a data file into memory, clean, process, and analyse it. This assignment is intended to give you practical experience with the typical first steps of the data science process.

The “Practical Data Science” Canvas contains further announcements and a discussion board for this assignment. Please be sure to check these on a regular basis: it is your responsibility to stay informed with regards to any announcements or changes. Login through https://learninghub.rmit.edu.au.

Where to Develop Your Code

You are encouraged to develop and test your code in two environments: Jupyter Notebook on Lab PCs and Teaching Servers.

Jupyter Notebook on Lab PCs

On a Lab Computer, you can find Jupyter Notebook via: Start → All Programs → Anaconda3 (64-bit) → Jupyter Notebook

Then,
• Select New → Python 3
• The newly created ‘*.ipynb’ file is saved at the following location:
  – C:\Users\sXXXXXXX
  – where sXXXXXXX should be replaced with a string consisting of the letter “s” followed by your student number.

Teaching Servers

Three CSIT teaching servers are available for your use: (titan|saturn|jupiter).csit.rmit.edu.au. Details for how to access these servers are available in “Extra: Run Anaconda on RMIT Coreteaching Servers” under the Modules/Week2: Data Curation section of the course Canvas. You are encouraged to develop your code on these machines.
If you choose to develop your code elsewhere, it is your responsibility to ensure that your assignment submission can be successfully run using the version of IPython installed on Lab PCs or (titan|saturn|jupiter).csit.rmit.edu.au, as this is where your code will be run for marking purposes.

Important: You are required to make regular backups of all of your work. This is good practice, no matter where you are developing your assignment solutions.

Academic integrity and plagiarism (standard warning)

Academic integrity is about honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge and ideas. You should take extreme care that you have:
• Acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods
• Provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from Internet sites.

If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person without appropriate referencing, as if they were your own.

RMIT University treats plagiarism as a very serious offence constituting misconduct. Plagiarism covers a variety of inappropriate behaviours, including:
• Failure to properly document a source
• Copyright material from the internet or databases
• Collusion between students

For further information on our policies and procedures, please refer to the following: https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/academic-integrity. All submissions will be checked by Turnitin.

General Requirements

This section contains information about the general requirements that your assignment must meet.
Please read all requirements carefully before you start.
• You must do the analysis in IPython.
• Parts of this assignment will include a written report; this must be in PDF format.
• Please ensure that your submission follows the file naming rules specified in the tasks below. File names are case sensitive, i.e. if it is specified that the file name is gryphon, then that is exactly the file name you should submit; Gryphon, GRYPHON, griffin, and anything else but gryphon will be rejected.

Assessment details

Task 1: Data Preparation (5%)

Have a look at the file StarWars.csv, which is available under the Assignments -> Assignment 1 section of the course Canvas. This file contains data behind the story America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters)1. The author collected the data by running a poll through SurveyMonkey Audience, surveying 1,186 respondents. The description of the questions asked in the survey is given below.
• Have you seen any of the 6 films in the Star Wars franchise?
• Do you consider yourself to be a fan of the Star Wars film franchise?
• Which of the following Star Wars films have you seen? Please select all that apply. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
• Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
(Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
• Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. (Han Solo, Luke Skywalker, Princess Leia Organa, Anakin Skywalker, Obi Wan Kenobi, Emperor Palpatine, Darth Vader, Lando Calrissian, Boba Fett, C-3P0, R2-D2, Jar Jar Binks, Padme Amidala, Yoda)
• Which character shot first?
• Are you familiar with the Expanded Universe?
• Do you consider yourself to be a fan of the Expanded Universe?
• Do you consider yourself to be a fan of the Star Trek franchise?
• Gender
• Age
• Household Income
• Education
• Location (Census Region)

1 https://github.com/fivethirtyeight/data/tree/master/star-wars-survey

Being a careful data scientist, you know that it is vital to carefully check any available data before starting to analyse it. Your task is to prepare the provided data for analysis. You will start by loading the CSV data from the file (using appropriate pandas functions) and checking whether the loaded data is equivalent to the data in the source CSV file. Then, you need to clean the data by using the knowledge we taught in the lectures. You need to deal with all the potential issues/errors in the data appropriately.

Task 2: Data Exploration (5%)

Explore the provided data based on the following steps:
1. Explore the survey question: Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
(Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi), then analyse how people rate Star Wars movies.
2. Explore the relationships between columns. You need to choose 3 pairs of columns to focus on, and you need to generate 1 visualisation for each pair. Each pair of columns that you choose should address a plausible hypothesis for the data concerned.
3. Explore whether there are relationships between people’s demographics (Gender, Age, Household Income, Education, Location) and their attitude to Star Wars characters.

Note: each visualisation (graph) should be complete and informative in itself, and should be clear for readers to read and obtain information.

Task 3: Report (5%)

Write your report and save it in a file called report.pdf. It must be in PDF format and must be at most 6 pages (in single column format, including figures and references) with a font size between 10 and 12 points. Penalties will apply if the report does not satisfy the requirement. Moreover, the quality of the report will be considered, e.g. clarity, grammar mistakes, the flow of the presentation. Remember to clearly cite any sources (including books, research papers, course notes, etc.) that you referred to while designing aspects of your programs.
• Create a heading called “Data Preparation” in your report.
  – Provide a brief explanation of how you addressed the task. For the steps of dealing with the potential issues/errors, please create a sub-section for each type of error you dealt with (e.g. typos, extra whitespaces, sanity checks for impossible values, and missing values etc.), and also explain and justify how you dealt with each kind of error.
• Create a heading called “Data Exploration” in your report.
  – For each numbered step in Task 2 above, create a sub-section with corresponding numbering.

What to Submit, When, and How

The assignment is due at 23:59, the 15th of April, 2020. Assignments submitted after this time will be subject to standard late submission penalties.

You need to submit the following files:
• Notebook file containing your python commands for Task 1 and Task 2, ‘assignment1.ipynb’. Please use the provided solution template to organise your solutions: assignment1 TEMPLATE.ipynb
  – For the notebook files, please make sure to clean them and remove any unnecessary lines of code (cells). Follow these steps before submission:
    1. Main menu → Kernel → Restart & Run All
    2. Wait till you see the output displayed properly. You should see all the data printed and graphs displayed.
• Your report.pdf file: at most 6 pages (in single column format, including figures and references) with a font size between 10 and 12 points. Penalties will apply if the report does not satisfy the requirement.

They must be submitted as ONE single zip file, named as your student number (for example, 1234567.zip if your student ID is s1234567). The zip file must be submitted in Canvas: Assignments/Assignment 1. Please do NOT submit other unnecessary files.

A Marking Guidelines

Each of Data Preparation, Data Exploration and Report is marked out of a maximum of 5 marks.

5 marks
• Data Preparation: Data preparation is well designed, systematic and well explained. All potential errors/issues have been completely examined and properly treated.
• Data Exploration: Analysis is thorough and demonstrates understanding and critical analysis. Well-reasoned explorations are provided for all sub-tasks. All analysis, comparisons and conclusions are evidenced by data (e.g. in well-formatted figures and/or tables).
• Report: Very clear, well structured and accessible report; an undergraduate student can pick up the report and understand it with no difficulty.
4 marks
• Data Preparation: Data preparation is reasonably designed, systematic and explained. There is at least one obvious missing issue/error. Each examined error/issue has been completely checked and properly treated.
• Data Exploration: Analysis is thorough and demonstrates good understanding and critical
Answered Same Day | Apr 18, 2021 | COSC2670


Neha answered on Apr 19 2021
Data preparation
Pandas is Python's data manipulation and analysis library and one of the
cornerstones of the Python scientific programming stack. It can be used for many tasks,
including data preparation. Data preparation can be organised using the CRISP-DM
model. Another option is the KDD process, which involves selection, preprocessing and
transformation.
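As a minimal sketch of the preparation step, the snippet below loads CSV text with pandas, checks the row count against the source, and then handles extra whitespace, inconsistent capitalisation, typos and an impossible value. The sample data, the column names (`seen_any`, `gender`, `age`), the typo mappings and the list of valid age brackets are all hypothetical stand-ins; the real StarWars.csv has different headers and needs its own inspection.

```python
import io
import pandas as pd

# Hypothetical miniature sample standing in for StarWars.csv.
RAW_CSV = """RespondentID,seen_any,gender,age
1, Yes ,Male,18-29
2,yes,female ,30-44
3,Noo,F,500
4,No,Male,45-60
"""

# Read every column as text so pandas does not silently convert anything.
raw = pd.read_csv(io.StringIO(RAW_CSV), dtype=str)

# Basic equivalence check: one DataFrame row per data line in the source.
expected_rows = len(RAW_CSV.strip().splitlines()) - 1
rows_match = len(raw) == expected_rows


def clean_frame(df):
    """Return a cleaned copy; the mappings below are illustrative assumptions."""
    out = df.copy()
    # Extra whitespace: strip every text column.
    for col in out.columns:
        out[col] = out[col].str.strip()
    # Capitalisation and typos: normalise case, then map known bad spellings.
    out["seen_any"] = out["seen_any"].str.capitalize().replace({"Noo": "No"})
    out["gender"] = out["gender"].str.capitalize().replace(
        {"F": "Female", "M": "Male"}
    )
    # Sanity check: anything outside the assumed age brackets becomes missing.
    valid_ages = ["18-29", "30-44", "45-60", "> 60"]
    out["age"] = out["age"].where(out["age"].isin(valid_ages))
    return out


clean = clean_frame(raw)
print(rows_match)
print(clean)
```

Reading with `dtype=str` is a deliberate choice here: it keeps the raw spelling visible so that typos can be found with `value_counts()` before any conversion is applied.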
Exploratory data analysis
Exploratory data analysis is a step in any data analysis, data science or machine learning
project. It can be defined as the practice of using visual and quantitative methods to
understand a dataset without making prior assumptions about it. It is an important and
crucial step before machine learning or any statistical modelling.
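The quantitative side of this can be sketched with a few pandas summaries. The example below assumes the ranking columns have already been cleaned to numbers (1 = favourite) and that opinions have been reduced to two labels; the column names, labels and values are hypothetical and are not taken from the survey.

```python
import pandas as pd

# Hypothetical cleaned rankings: one column per film, 1 = favourite.
ranks = pd.DataFrame({
    "Episode IV": [1, 2, 1, 3],
    "Episode V": [2, 1, 2, 1],
    "Episode I": [3, 3, 3, 2],
})

# A lower mean rank means the film is preferred on average.
mean_rank = ranks.mean().sort_values()

# Share of respondents who ranked each film first.
first_place = (ranks == 1).mean()

# Hypothetical demographic column against a simplified opinion column.
opinions = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female"],
    "han_solo": ["Favorable", "Unfavorable", "Favorable", "Favorable"],
})

# Row-normalised contingency table: opinion shares within each gender.
table = pd.crosstab(opinions["gender"], opinions["han_solo"], normalize="index")

print(mean_rank)
print(first_place)
print(table)
```

Tables like these are natural inputs for the visual side, for example a bar chart of `mean_rank` or a stacked bar chart of `table`.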
Dealing with the missing values
Here are some common methods for dealing with the missing values present
in the dataset:
1) Drop instances and attributes
2) Impute the attribute mean, median and mode for all the missing...
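The two options listed above might look as follows in pandas. This is a sketch on a hypothetical three-column frame, not on the survey data; which option is appropriate depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing numeric and one missing categorical value.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rank_ep4": [1.0, np.nan, 3.0, 2.0],
    "gender": ["Male", None, "Female", "Male"],
})

# Option 1a: drop instances (rows) that contain any missing value.
rows_dropped = df.dropna()

# Option 1b: drop attributes (columns) that contain any missing value.
cols_dropped = df.dropna(axis="columns")

# Option 2: impute. Median for the numeric column, mode for the categorical one.
imputed = df.copy()
imputed["rank_ep4"] = imputed["rank_ep4"].fillna(imputed["rank_ep4"].median())
imputed["gender"] = imputed["gender"].fillna(imputed["gender"].mode()[0])

print(rows_dropped)
print(cols_dropped)
print(imputed)
```

Dropping is simple but discards information, while imputing keeps every row at the cost of introducing values that were never observed, so either choice should be justified in the report.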