questions 2 to 4


Data Transformation
Zhichao Jiang
2020-10-01

Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the form you need. Often you will need to create new variables or summaries, or you may just want to rename variables or reorder observations to make the data easier to work with.

1 Import data

1.1 Working directory

R associates itself with a folder (i.e. directory) on your computer. To see which one, run getwd() at the console.
- This folder is known as your “working directory”.
- When you save files, R will save them here.
- When you load files, R will look for them here.

2 Data transformation

What geoms should be used for this graph?

We will learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
- Pick observations by their values (filter()).
- Reorder the rows (arrange()).
- Pick variables by their names (select()).
- Create new variables with functions of existing variables (mutate()).
- Collapse many values down to a single summary (summarize()).

These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:
- The first argument is a data frame (tibble).
- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
- The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
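As a rough illustration of how the verbs chain together, the sketch below groups a tiny data frame, summarizes each group, and reorders the result. The `toy` tibble and its values are made up for this example and are not part of babynames; the `%>%` pipe shown at the end is one common way to write the same chain, not something the handout prescribes.

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration only)
toy <- tibble(
  year = c(2000, 2000, 2001, 2001),
  name = c("Ann", "Bob", "Ann", "Bob"),
  n    = c(10L, 20L, 30L, 5L)
)

# Each verb takes a data frame as its first argument and returns a new
# data frame, so the output of one step feeds the next.
by_year <- group_by(toy, year)                  # change the scope to per-year groups
totals  <- summarize(by_year, total = sum(n))   # one row per year
result  <- arrange(totals, desc(total))         # largest total first

# The same steps written as a single chain with the pipe operator
piped <- toy %>%
  group_by(year) %>%
  summarize(total = sum(n)) %>%
  arrange(desc(total))

result
```

Because every step returns a data frame, intermediate objects such as `by_year` are optional; they are kept here only to make each step visible.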
2.1 select()

select(babynames, name, prop)
## # A tibble: 1,924,665 x 2
##    name        prop
##    <chr>      <dbl>
##  1 Mary      0.0724
##  2 Anna      0.0267
##  3 Emma      0.0205
##  4 Elizabeth 0.0199
##  5 Minnie    0.0179
##  6 Margaret  0.0162
##  7 Ida       0.0151
##  8 Alice     0.0145
##  9 Bertha    0.0135
## 10 Sarah     0.0132
## # … with 1,924,655 more rows

2.2 Select helpers

Use : to select a range of columns.

select(babynames, name:prop)
## # A tibble: 1,924,665 x 3
##    name          n   prop
##    <chr>     <int>  <dbl>
##  1 Mary       7065 0.0724
##  2 Anna       2604 0.0267
##  3 Emma       2003 0.0205
##  4 Elizabeth  1939 0.0199
##  5 Minnie     1746 0.0179
##  6 Margaret   1578 0.0162
##  7 Ida        1472 0.0151
##  8 Alice      1414 0.0145
##  9 Bertha     1320 0.0135
## 10 Sarah      1288 0.0132
## # … with 1,924,655 more rows

Use - to select every column except the ones named.

select(babynames, -c(name, prop))
## # A tibble: 1,924,665 x 3
##     year sex       n
##    <dbl> <chr> <int>
##  1  1880 F      7065
##  2  1880 F      2604
##  3  1880 F      2003
##  4  1880 F      1939
##  5  1880 F      1746
##  6  1880 F      1578
##  7  1880 F      1472
##  8  1880 F      1414
##  9  1880 F      1320
## 10  1880 F      1288
## # … with 1,924,655 more rows

Use starts_with() to select columns whose names start with a given string.

select(babynames, starts_with("n"))
## # A tibble: 1,924,665 x 2
##    name          n
##    <chr>     <int>
##  1 Mary       7065
##  2 Anna       2604
##  3 Emma       2003
##  4 Elizabeth  1939
##  5 Minnie     1746
##  6 Margaret   1578
##  7 Ida        1472
##  8 Alice      1414
##  9 Bertha     1320
## 10 Sarah      1288
## # … with 1,924,655 more rows

Use ends_with() to select columns whose names end with a given string.

select(babynames, ends_with("e"))
## # A tibble: 1,924,665 x 1
##    name
##    <chr>
##  1 Mary
##  2 Anna
##  3 Emma
##  4 Elizabeth
##  5 Minnie
##  6 Margaret
##  7 Ida
##  8 Alice
##  9 Bertha
## 10 Sarah
## # … with 1,924,655 more rows

Use contains() to select columns whose names contain a given string.

select(babynames, contains("e"))
## # A tibble: 1,924,665 x 3
##     year sex   name
##    <dbl> <chr> <chr>
##  1  1880 F     Mary
##  2  1880 F     Anna
##  3  1880 F     Emma
##  4  1880 F     Elizabeth
##  5  1880 F     Minnie
##  6  1880 F     Margaret
##  7  1880 F     Ida
##  8  1880 F     Alice
##  9  1880 F     Bertha
## 10  1880 F     Sarah
## # … with 1,924,655 more rows

Use num_range() to select columns named in prefix-plus-number style, such as x1, x2, x3.
select(babynames, num_range("x", 1:5))
## # A tibble: 1,924,665 x 0

babynames has no columns named x1 to x5, so the result has zero columns.

2.3 $ and select()

$ extracts column contents as a vector. select() extracts column contents as a tibble.

select(babynames, n)
babynames$n

2.3.1 Your turn

Which of these is NOT a way to select the name and n columns together?

select(babynames, -c(year, sex, prop))
select(babynames, name:n)
select(babynames, starts_with("n"))
select(babynames, ends_with("n"))

2.4 filter()

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

filter(babynames, name == "Garret")
## # A tibble: 110 x 5
##     year sex   name       n      prop
##    <dbl> <chr> <chr>  <int>     <dbl>
##  1  1881 M     Garret     6 0.0000554
##  2  1883 M     Garret     5 0.0000445
##  3  1895 M     Garret     5 0.0000395
##  4  1900 M     Garret     5 0.0000308
##  5  1912 M     Garret     5 0.0000111
##  6  1913 M     Garret    10 0.0000186
##  7  1914 M     Garret     9 0.0000132
##  8  1915 M     Garret    15 0.0000170
##  9  1916 M     Garret    16 0.0000173
## 10  1917 M     Garret     7 0.0000073
## # … with 100 more rows

2.4.1 Missing values

One important feature of R that can make comparison tricky is missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

NA > 5
## [1] NA
NA + 10
## [1] NA
NA == NA
## [1] NA
NA | FALSE
## [1] NA
NA & FALSE
## [1] FALSE
NA*0
## [1] NA
Inf*0
## [1] NaN

If you want to determine whether a value is missing, use is.na().

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly.
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

2.4.2 Your turn

Use filter(), babynames, and the logical operators to find:
- All of the rows where prop is greater than or equal to 0.08
- All of the children named “Sea”

2.4.3 Boolean operators

filter(babynames, name == "Garrett", year == 1880)
## # A tibble: 1 x 5
##    year sex   name        n     prop
##   <dbl> <chr> <chr>   <int>    <dbl>
## 1  1880 M     Garrett    13 0.000110
filter(babynames, name == "Garrett" & year == 1880)
## # A tibble: 1 x 5
##    year sex   name        n prop
##
## 1  1880 M     Garrett    13 0
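The two calls in 2.4.3 return the same row because filter() combines comma-separated conditions with a logical AND. A minimal sketch of how the comma, &, | and ! behave, using a small made-up tibble (the `toy` data below is an assumption for illustration, not an extract of babynames):

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration only)
toy <- tibble(
  name = c("Garrett", "Garrett", "Sea", "Mary"),
  year = c(1880, 1881, 1880, 1881)
)

# Comma-separated conditions are combined with AND ...
both_comma <- filter(toy, name == "Garrett", year == 1880)
# ... so this explicit & version keeps the same rows
both_and   <- filter(toy, name == "Garrett" & year == 1880)

# | keeps rows where at least one condition is TRUE
either <- filter(toy, name == "Garrett" | year == 1880)

# ! negates a condition
not_garrett <- filter(toy, !(name == "Garrett"))

nrow(both_comma)
nrow(either)
nrow(not_garrett)
```

Writing the conditions as separate arguments or joining them with & is a matter of style; | and ! have to be written out explicitly.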
Answered Same Day, Oct 30, 2021


Pooja answered on Oct 31 2021
library(tidyr)
library(dplyr)
library(ggplot2)
library(nycflights13)
library(openair)
nycflights13::flights
#summary(flights)
#View(flights)
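# 2a: count flights per destination and sort the counts in ascending order
table1 <- table(flights$dest)
sort(table1, decreasing = FALSE)  # sort() has no `ascending` argument
# 2b: arrival delay minus departure delay (negative values mean time was made up in the air)
flights$gained_time <- flights$arr_delay - flights$dep_delay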
plot(flights$dep_delay,...