data science 12/11/2020 Project Proposal...

Question

12/11/2020 Project Proposal https://psu.instructure.com/courses/2083772/assignments/12364997

Project Proposal

Due: Nov 24 by 9pm
Points: 50
Submitting: a text entry box or a file upload

This project may be done individually or in pairs. Pairs must be identified on the group project survey by no later than Nov. 13.

NOTE: If the assignment is being completed by a pair of students, late days may not be used for this assignment.

NOTE: Completion of this proposal requires substantial work and the collection of some data (we call it preliminary data when we are doing a proposal). Start this early. Plan. Do not wait until the last minute!

Project Objectives:
- Demonstrate knowledge of the different principles and features behind relational data models and NoSQL data models
- Gain practical experience handling data in relational and NoSQL data management tools
- Think about data in the context of human problems
- Communicate project methods and results to an audience

The first part of this project is about selecting a dataset, thinking about what kinds of interesting things could be done with that dataset and loading it into a relational database tool. The overarching goal of the project by the end of the semester will be to find some clear ways to compare a relational and non-relational (NoSQL) database.

Submit a proposal detailing the following:

1. Introduction, background - why the data is interesting (this can be brief, but don't just copy the intro text provided below)

2. Dataset you will use. Describe the data and what is in the dataset. Such things as attributes, how many entries are in the data, and limitations or special features of the dataset would be helpful here, especially in the context of the problem areas you have said you are interested in from your background section.

3. Preliminary Results: Create a schema and load your data file into an SQL database (MySQL, SQL Server, or SQLite are ones that we have readily available). Describe how you loaded the data and describe, briefly, any challenges you had. Show some screenshots and describe your schema and show some summary statistics and/or initial rows in your data tables. Run some sample queries on this data. How do they work?

4. Proposal: What metrics might you use to compare with another NoSQL data tool (are there particular types of queries)? At this point, what data tool might you be considering for the data? Describe any progress you have made toward loading this data with that data tool.

5. Collaboration report (if working in pairs): If this is a paired project, provide an update on the functioning of the team. Who was responsible for which components of the activities described in this report? It is not acceptable to say that both members contributed to all parts or to be general in the description; describe tasks as specifically as you can. Each individual should be responsible for specific tasks, and for full credit, there should be a mix of technical and non-technical tasks for both members.

Datasets

Below is a list of vetted datasets along with a brief description. If you can find additional information related to one of the topics below, you should feel free to use it, making sure to cite and link to the source of the data and information. You may also propose to use your own dataset. Please be sure to clear it with someone on the teaching team so that we can be sure it will be sufficient to carry out the rest of the components of this project.

Firearm permits and background checks
Dataset: https://github.com/BuzzFeedNews/nics-firearm-background-checks
The data in this repository comes from the FBI's National Instant Criminal Background Check System (https://www.fbi.gov/about-us/cjis/nics).
Mandated by the Brady Handgun Violence Prevention Act of 1993 and launched by the FBI on November 30, 1998, NICS is used by Federal Firearms Licensees (FFLs) to instantly determine whether a prospective buyer is eligible to buy firearms or explosives. Before ringing up the sale, cashiers call in a check to the FBI or to other designated agencies to ensure that each customer does not have a criminal record or isn't otherwise ineligible to make a purchase. More than 100 million such checks have been made in the last decade, leading to more than 700,000 denials. The FBI provides data on the number of firearm checks by month, state, and type, but as a PDF. The code in this GitHub repository downloads that PDF, parses it, and produces a spreadsheet/CSV of the data. The downloadable data currently covers November 1998 – April 2019.

Analyzing Crimes in Boston
Dataset: https://www.kaggle.com/ankkur13/boston-crime-data
This is a dataset containing records from a new crime incident report system in Boston. The dataset contains a reduced set of fields focused on gathering when and where an incident occurred and the type of incident.

World Happiness Index
Dataset: https://www.kaggle.com/unsdsn/world-happiness
The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012. This Kaggle link contains datasets from 2015, 2016, and 2017. The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. Read the context for this data that is presented on Kaggle carefully to best understand what is contained in this dataset.

Baseball Data
Dataset: http://www.seanlahman.com/baseball-archive/statistics/
https://www.baseball-reference.com/
These datasets cover 1871-2018 batting/pitching stats for baseball (covers both players and teams) with plenty of metadata. Possible areas of exploration could include player performance at different time points. The data is separated into different files with different themes, so there is lots of potential data to load, and you will need to look at the files and decide which topic area you want to focus on to be sure you get the right data.

Data from Neo4j Sandboxes
If you don't have Neo4j locally installed on your machine you can use a Neo4j Sandbox, which will give you an interactive experience with the graph database by creating a temporary instance of Neo4j in the cloud. There are several Sandboxes available containing a wide variety of datasets to choose from; they can be found from the Neo4j Sandbox site (https://neo4j.com/sandbox). Below are a couple of interesting ones.

Women's World Cup 2019
The 2019 Women's World Cup modeled as a Graph - players, teams, matches, and more.

Legis-Graph
US Congress modeled as a Graph - bills, votes, members, committees, and more.

If you have Neo4j installed, the raw data sources (CSV, JSON, XML) have been provided along with the import scripts in Cypher. You need to run the Cypher script using a command-line client like the cypher-shell.
Project Proposal Rubric

Criteria, Ratings, Pts

Required Components (10.0 pts): Proper details included in introduction, background, progress, future work, and collaboration report sections
- 10.0 pts, Full Marks: Contains sufficient detail in all required components
- 8.0 pts, Good: Missing or thin in one required component
- 7.0 pts, Developing: Missing or thin in 2-3 required components
- 4.0 pts, Poor: Missing 4-5 required components

Technical Work (12.0 pts): Appropriate and understandable technical work completed in database tools
- 12.0 pts, Full Marks: Technical work was completed in a logical way towards assigned goals and tasks; it is well documented; if challenges are impeding progress, the progress to date is well documented and roadblocks are clearly identified so that they can be solved
- 10.0 pts, Good: Technical work was completed in a logical way and is reasonably well documented; if challenges are impeding progress, the progress to date is well documented but roadblocks that are impeding progress may not be identified
- 8.0 pts, Developing: Technical work is progressing in a manner less than would be expected or has gone in a direction that is not logical with respect to assigned tasks
- 6.0 pts, Poor: Missing 4-5 required components

Accuracy (13.0 pts)
- 13.0 pts, Full Marks: Discussion of SQL and NoSQL databases is accurate and covers major principles and points of difference thoroughly. Discussion of the data and scientific problem are accurate and any claims are properly referenced to appropriate sources. Overall, there are no more than 1-2 minor inaccuracies in these discussions.
- 10.0 pts, Good: The paper contains 1-2 major inaccuracies or omissions or >2 minor inaccuracies.
- 8.0 pts, Developing: The paper contains 2-4 significant inaccuracies or omissions
- 6.0 pts, Poor: The paper contains >4 significant inaccuracies or omissions

Problem (5.0 pts)
- 5.0 pts, Full Marks: The comparison task that has been set up between the SQL and NoSQL database is clear and the paper articulates how the comparison will be conducted. If it can be quantified or more subjective, this will be noted. Passes the test of repeatability: could I repeat your comparison by reading the description in your paper?
- 4.0 pts, Good: Statement of the comparison is clear but may not pass the repeatability test because no clear quantifiable or subjective metrics are stated.
- 3.0 pts, Developing: No comparisons between data tools are proposed.

Readability (5.0 pts)
- 5.0 pts, Full Marks: The paper is logically and coherently organized and flows well. The writing is clear and easy to read and appropriate for the audience.
- 4.0 pts, Good: The paper is organized logically but is disjointed in some places. The writing is generally clear but sometimes clunky in word choice or sentence structure.
- 3.0 pts, Poor: The paper is very difficult to follow and understand

Creativity (5.0 pts): This criterion is linked to a learning outcome
- 5.0 pts, Full Marks: The paper is beautifully written with striking, thoughtful word choices and phrasing. Figures, if applicable, are well presented and add to the understanding of concepts through visualization. The next steps go beyond the obvious and demonstrate insight.
- 4.0 pts, Good: The paper is well written and contains some original thoughts on the data or tools. The next steps are reasonable but may be obvious.
- 3.0 pts, Poor: The next steps are not logical or relevant. There is no insight into the data.

Total Points: 50.0
12/11/2020 Project Progress Report https://psu.instructure.com/courses/2083772/assignments/12364999

Project Progress Report

Due: Dec 3 by 9pm
Points: 100
Submitting: a text entry box or a file upload

NOTE: If the assignment is being completed by a pair

Neha · Accepted Answer

Introduction: This report examines how a SQL database tool can be used with a dataset of crime incidents in Boston. The dataset contains records from Boston's crime incident report system, with a reduced set of fields focused on when and where each incident occurred and on its type. The project is also a way to learn the features of SQL and NoSQL database tools and to understand which type of tool suits this dataset better. The proposal was to import the dataset into MySQL, select one NoSQL database tool, and compare the two. For this project I created a schema in MySQL and imported the whole dataset into it. Once the data was imported, it was easier to run queries on it. I queried the data to find the locations that occur most often, that is, the places where the most incidents took place. Because the dataset covers crime incidents in Boston, analysing the locations and times at which most incidents occur could help in deciding what actions to take to reduce crime.
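The load-and-query step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the schema or queries actually used: it swaps in SQLite (through Python's built-in sqlite3 module) for MySQL so that it runs without a server, the column names are assumed to resemble those in the Kaggle CSV, and the five sample rows are invented purely for demonstration.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for the Kaggle CSV; the real file
# has many more rows and columns, and its column names may differ.
SAMPLE_CSV = """INCIDENT_NUMBER,OFFENSE_CODE_GROUP,DISTRICT,OCCURRED_ON_DATE,HOUR,STREET
I0001,Larceny,D4,2018-06-01 13:05:00,13,WASHINGTON ST
I0002,Vandalism,B2,2018-06-01 22:40:00,22,BLUE HILL AVE
I0003,Larceny,D4,2018-06-02 13:30:00,13,WASHINGTON ST
I0004,Towed,A1,2018-06-02 09:15:00,9,BOYLSTON ST
I0005,Larceny,B2,2018-06-03 22:10:00,22,WASHINGTON ST
"""


def load_incidents(conn, csv_file):
    """Create the (assumed) schema and load every CSV row into it."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS incidents (
               incident_number    TEXT,
               offense_code_group TEXT,
               district           TEXT,
               occurred_on_date   TEXT,
               hour               INTEGER,
               street             TEXT
           )"""
    )
    rows = [
        (r["INCIDENT_NUMBER"], r["OFFENSE_CODE_GROUP"], r["DISTRICT"],
         r["OCCURRED_ON_DATE"], int(r["HOUR"]), r["STREET"])
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO incidents VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def top_streets(conn, limit=3):
    """Return the streets with the most incidents, most frequent first."""
    return conn.execute(
        "SELECT street, COUNT(*) AS n FROM incidents "
        "GROUP BY street ORDER BY n DESC, street LIMIT ?",
        (limit,),
    ).fetchall()


def incidents_by_hour(conn):
    """Return the incident count for each hour of the day present in the data."""
    return conn.execute(
        "SELECT hour, COUNT(*) FROM incidents GROUP BY hour ORDER BY hour"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
loaded = load_incidents(conn, io.StringIO(SAMPLE_CSV))
print(loaded, "rows loaded")
print(top_streets(conn))
print(incidents_by_hour(conn))
```

To try this against the real data, the io.StringIO object could be replaced with the opened CSV file, adjusting the column names to match its header. The same GROUP BY queries should carry over to MySQL with little or no change, which would make them reasonable candidates for the SQL side of the SQL versus NoSQL comparison.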
