Questions 2 to 4: Data Transformation (Zhichao Jiang, 2020-10-01)

Question

questions 2 to 4

Data Transformation
Zhichao Jiang
2020-10-01

Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the form you need. Often you will need to create new variables or summaries, or you may just want to rename the variables or reorder the observations to make the data a little easier to work with.

1 Import data

1.1 Working directory

R associates itself with a folder (i.e. directory) on your computer. To see which one, run getwd() at the console.
	This folder is known as your “working directory”
	When you save files, R will save them here
	When you load files, R will look for them here

2 Data transformation

What geoms should be used for this graph?

We will learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
	Pick observations by their values (filter()).
	Reorder the rows (arrange()).
	Pick variables by their names (select()).
	Create new variables with functions of existing variables (mutate()).
	Collapse many values down to a single summary (summarize()).

These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:
	The first argument is a data frame (tibble).
	The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
	The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
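The way the verbs chain together can be sketched on a small example. This is only an illustration: the `toy` tibble and its values are made up (shaped loosely like babynames), and the intermediate names (`kept`, `ordered`, and so on) are hypothetical. Each step takes a data frame as its first argument and returns a new one, which is what allows one result to feed the next call.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not taken from the babynames package
toy <- tibble(
  year = c(1880, 1880, 1881, 1881, 1880),
  sex  = c("F", "M", "F", "M", "F"),
  name = c("Mary", "John", "Mary", "John", "Anna"),
  n    = c(7065L, 9655L, 6919L, 8769L, 2604L)
)

# filter(): keep only the rows where n is above 3000
kept <- filter(toy, n > 3000)

# arrange(): reorder the rows, largest n first
ordered <- arrange(kept, desc(n))

# select(): keep only the named columns
slim <- select(ordered, name, year, n)

# mutate(): add a new column computed from an existing one
with_k <- mutate(slim, n_thousands = n / 1000)

# group_by() + summarize(): collapse each name to a single total
by_name <- group_by(with_k, name)
totals <- summarize(by_name, total = sum(n))

print(totals)
```

None of the calls modifies `toy`; every verb returns a fresh data frame, so each intermediate result can be inspected on its own.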
2.1 select()

select(babynames, name, prop)
## # A tibble: 1,924,665 x 2
##    name        prop
##    <chr>      <dbl>
##  1 Mary      0.0724
##  2 Anna      0.0267
##  3 Emma      0.0205
##  4 Elizabeth 0.0199
##  5 Minnie    0.0179
##  6 Margaret  0.0162
##  7 Ida       0.0151
##  8 Alice     0.0145
##  9 Bertha    0.0135
## 10 Sarah     0.0132
## # … with 1,924,655 more rows

2.2 Select helpers

	use : to select a range of columns

select(babynames, name:prop)
## # A tibble: 1,924,665 x 3
##    name          n   prop
##    <chr>     <int>  <dbl>
##  1 Mary       7065 0.0724
##  2 Anna       2604 0.0267
##  3 Emma       2003 0.0205
##  4 Elizabeth  1939 0.0199
##  5 Minnie     1746 0.0179
##  6 Margaret   1578 0.0162
##  7 Ida        1472 0.0151
##  8 Alice      1414 0.0145
##  9 Bertha     1320 0.0135
## 10 Sarah      1288 0.0132
## # … with 1,924,655 more rows

	use - to select every column except the ones named

select(babynames, -c(name, prop))
## # A tibble: 1,924,665 x 3
##     year sex       n
##    <dbl> <chr> <int>
##  1  1880 F      7065
##  2  1880 F      2604
##  3  1880 F      2003
##  4  1880 F      1939
##  5  1880 F      1746
##  6  1880 F      1578
##  7  1880 F      1472
##  8  1880 F      1414
##  9  1880 F      1320
## 10  1880 F      1288
## # … with 1,924,655 more rows

	use starts_with() to select columns whose names start with a given string

select(babynames, starts_with("n"))
## # A tibble: 1,924,665 x 2
##    name          n
##    <chr>     <int>
##  1 Mary       7065
##  2 Anna       2604
##  3 Emma       2003
##  4 Elizabeth  1939
##  5 Minnie     1746
##  6 Margaret   1578
##  7 Ida        1472
##  8 Alice      1414
##  9 Bertha     1320
## 10 Sarah      1288
## # … with 1,924,655 more rows

	use ends_with() to select columns whose names end with a given string

select(babynames, ends_with("e"))
## # A tibble: 1,924,665 x 1
##    name
##    <chr>
##  1 Mary
##  2 Anna
##  3 Emma
##  4 Elizabeth
##  5 Minnie
##  6 Margaret
##  7 Ida
##  8 Alice
##  9 Bertha
## 10 Sarah
## # … with 1,924,655 more rows

	use contains() to select columns whose names contain a given string

select(babynames, contains("e"))
## # A tibble: 1,924,665 x 3
##     year sex   name
##    <dbl> <chr> <chr>
##  1  1880 F     Mary
##  2  1880 F     Anna
##  3  1880 F     Emma
##  4  1880 F     Elizabeth
##  5  1880 F     Minnie
##  6  1880 F     Margaret
##  7  1880 F     Ida
##  8  1880 F     Alice
##  9  1880 F     Bertha
## 10  1880 F     Sarah
## # … with 1,924,655 more rows

	use num_range() to select columns named in prefix-plus-number style (for example x1 to x5; babynames has none, so nothing is selected)

select(babynames, num_range("x", 1:5))
## # A tibble: 1,924,665 x 0

2.3 $ and select()

$ extracts a column's contents as a vector. select() extracts a column's contents as a tibble.

select(babynames, n)
babynames$n

2.3.1 Your turn

Which of these is NOT a way to select the name and n columns together?

select(babynames, -c(year, sex, prop))
select(babynames, name:n)
select(babynames, starts_with("n"))
select(babynames, ends_with("n"))

2.4 filter()

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

filter(babynames, name == "Garret")
## # A tibble: 110 x 5
##     year sex   name       n      prop
##    <dbl> <chr> <chr>  <int>     <dbl>
##  1  1881 M     Garret     6 0.0000554
##  2  1883 M     Garret     5 0.0000445
##  3  1895 M     Garret     5 0.0000395
##  4  1900 M     Garret     5 0.0000308
##  5  1912 M     Garret     5 0.0000111
##  6  1913 M     Garret    10 0.0000186
##  7  1914 M     Garret     9 0.0000132
##  8  1915 M     Garret    15 0.0000170
##  9  1916 M     Garret    16 0.0000173
## 10  1917 M     Garret     7 0.0000073
## # … with 100 more rows

2.4.1 Missing values

One important feature of R that can make comparison tricky is missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
NA > 5
## [1] NA
NA + 10
## [1] NA
NA == NA
## [1] NA
NA | FALSE
## [1] NA
NA & FALSE
## [1] FALSE
NA*0
## [1] NA
Inf*0
## [1] NaN

If you want to determine whether a value is missing, use is.na().

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly.

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

2.4.2 Your turn

	Use filter, babynames, and the logical operators to find:
	All of the rows where prop is greater than or equal to 0.08
	All of the children named “Sea”

2.4.3 Boolean operators

filter(babynames, name == "Garrett", year == 1880)
## # A tibble: 1 x 5
##    year sex   name        n     prop
##   <dbl> <chr> <chr>   <int>    <dbl>
## 1  1880 M     Garrett    13 0.000110
filter(babynames, name == "Garrett" & year == 1880)
## # A tibble: 1 x 5
##    year sex   name        n     prop
##   <dbl> <chr> <chr>   <int>    <dbl>
## 1  1880 M     Garrett    13 0
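The filtering rules above can be checked on a small example. This is a minimal sketch, assuming a made-up tibble called `kids` (its rows are hypothetical, not babynames data). It shows that comma-separated conditions in filter() select the same rows as joining them with `&`, and that rows whose condition evaluates to NA are dropped unless is.na() asks for them explicitly.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with missing values in both name and n
kids <- tibble(
  name = c("Garrett", "Garrett", "Sea", NA),
  year = c(1880, 1881, 1880, 1880),
  n    = c(13L, 7L, NA, 5L)
)

# Two conditions separated by a comma must both be TRUE...
both_comma <- filter(kids, name == "Garrett", year == 1880)

# ...which selects the same rows as combining them with &
both_and <- filter(kids, name == "Garrett" & year == 1880)

# n > 6 is NA for the row with a missing n, so that row is dropped
big <- filter(kids, n > 6)

# Asking for missing values explicitly keeps that row
big_or_missing <- filter(kids, is.na(n) | n > 6)

# is.na() flags the missing entries one by one
missing_n <- is.na(kids$n)

print(both_comma)
print(big_or_missing)
```

The row with a missing name never matches `name == "Garrett"`, because that comparison yields NA rather than TRUE.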

Pooja · Accepted Answer

# Load the packages used in the answer
library(tidyr)
library(dplyr)
library(ggplot2)
library(nycflights13)
library(openair)

# Print the flights tibble from nycflights13
nycflights13::flights
# summary(flights)
# View(flights)

# 2a
table1
