Ch05 Book

A digest of some information from Chapter 5.

Note: three assignments are recommended: 5.3-5.5, 5.6, 5.7. Combining the last two was too much work.

General

Many summary functions return NA if the data contains NA. Setting na.rm = TRUE makes the function ignore the missing values, or drop_na() (from tidyr) removes the affected rows earlier in the pipe. There are other methods we will learn later.
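A minimal sketch of both approaches. The scores table and its column names are hypothetical, made up only for illustration.

```r
library(dplyr)
library(tidyr)  # drop_na() lives in tidyr

# Hypothetical toy data with one missing score
scores <- tibble(
  student = c("a", "b", "c", "d"),
  score   = c(90, NA, 70, 80)
)

# Without na.rm the NA propagates to the result
mean_with_na <- mean(scores$score)

# na.rm = TRUE skips the NA when computing the mean
mean_no_na <- mean(scores$score, na.rm = TRUE)

# drop_na() removes rows with an NA in the named column before summarising
mean_dropped <- scores %>%
  drop_na(score) %>%
  summarise(avg = mean(score)) %>%
  pull(avg)
```

The two approaches give the same mean here. They differ in that drop_na() removes the whole row, which also affects counts such as n().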

Summarizing Functions (5.6)

mean
median
sd = standard deviation (will learn later if you do not know)
n(): how many values are in the group
n_distinct: how many distinct values
sum
min, max
quantile(data, 0.25): produces a single number, a value that is greater than 25% of the data values and less than the remaining 75%. Read ?quantile for a surprising amount of technical information.
ntile(data, n): produces a column of numbers giving which of the n roughly equal-sized bins each input number falls into. Outputs numbers from 1 to n inclusive.
mad = median absolute deviation
IQR = interquartile range (Q3-Q1)
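The functions listed above are typically used inside summarise() after group_by(). The following is a sketch on a hypothetical toy table (the names d, g and x are assumptions, not from the chapter).

```r
library(dplyr)

# Hypothetical toy data: two groups of four measurements each
d <- tibble(
  g = c("a", "a", "a", "a", "b", "b", "b", "b"),
  x = c(1, 2, 3, 4, 10, 10, 20, 40)
)

# One row per group, one column per summary
by_group <- d %>%
  group_by(g) %>%
  summarise(
    count    = n(),
    distinct = n_distinct(x),
    avg      = mean(x),
    med      = median(x),
    spread   = sd(x),
    mad_x    = mad(x),
    q1       = quantile(x, 0.25, names = FALSE),
    iqr      = IQR(x),
    lo       = min(x),
    hi       = max(x)
  )

# ntile() returns one bin number per input value, so it belongs in mutate()
binned <- d %>% mutate(bin = ntile(x, 4))
```

Note that ntile() is not a summary: it returns as many values as it receives, which is why it appears in mutate() rather than summarise().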

Detail Notes: a boolean TRUE counts as the number 1 and FALSE counts as 0. Summing booleans gives how many trues. Finding the mean of booleans gives the proportion that are true.

is.na returns a boolean for each value, so sum(is.na(x)) counts how many NAs are in the data and mean(is.na(x)) gives the proportion that are NA.
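A short base R sketch of counting with booleans. The vector x is made up for illustration.

```r
x <- c(5, NA, 12, 7, NA, 20)

n_missing    <- sum(is.na(x))              # how many NAs
prop_missing <- mean(is.na(x))             # proportion of NAs
n_big        <- sum(x > 6, na.rm = TRUE)   # how many known values exceed 6
prop_big     <- mean(x > 6, na.rm = TRUE)  # proportion of known values that exceed 6
```

The comparison x > 6 is itself NA wherever x is NA, so na.rm = TRUE is still needed in the last two lines.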

Window functions

Source: window-functions vignette.

These functions are most useful on grouped data, inside mutate() or filter(), although they also work on plain vectors.

Common accumulating functions

cumsum: the cumulative sum
cummean: the cumulative mean
cummin: least value so far
cummax: greatest value so far
cumall: boolean, true if every value so far is true (like “and”; once false it stays false)
cumany: boolean, true if any value so far is true (like “or”; once true it stays true)
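A small sketch of the accumulating functions on a made-up vector. cumsum, cummin and cummax are base R; cummean, cumall and cumany come from dplyr.

```r
library(dplyr)

x <- c(3, 1, 4, 1, 5)

run_sum  <- cumsum(x)      # 3 4 8 9 14
run_mean <- cummean(x)     # running total divided by how many values so far
run_min  <- cummin(x)      # 3 1 1 1 1
run_max  <- cummax(x)      # 3 3 4 4 5
all_lt4  <- cumall(x < 4)  # TRUE TRUE FALSE FALSE FALSE
any_gt3  <- cumany(x > 3)  # FALSE FALSE TRUE TRUE TRUE
```

The fourth value (1) is less than 4, but all_lt4 stays FALSE there because the third value already broke the run.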

Common other functions

min_rank: rank of each value within the window (tied values share the smallest rank)
lag: the previous item in the vector
lead: the next item in the vector
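A sketch of these three functions in a grouped mutate(). The sales table and its columns are hypothetical, invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: yearly sales for two stores
sales <- tibble(
  store = c("a", "a", "a", "b", "b", "b"),
  year  = c(2020, 2021, 2022, 2020, 2021, 2022),
  sold  = c(10, 15, 12, 7, 7, 9)
)

changes <- sales %>%
  group_by(store) %>%
  arrange(year, .by_group = TRUE) %>%  # lag/lead depend on row order
  mutate(
    previous  = lag(sold),             # NA for each store's first year
    following = lead(sold),            # NA for each store's last year
    change    = sold - lag(sold),      # year-over-year difference
    rank      = min_rank(desc(sold))   # 1 = best year within the store
  ) %>%
  ungroup()
```

Because the data is grouped by store, lag() and lead() never reach across from one store into the other, and the ranks restart at 1 for each store.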

Uncommon: Ranking Functions

There are a ton of ranking functions because:

there are many ways ties can be broken
everybody wants to rank everything

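row_number: ties are broken by position, so every rank is unique
min_rank: tied values share the smallest rank, leaving gaps afterwards
dense_rank: tied values share a rank, with no gaps afterwards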
cume_dist: proportion of values <= current value
percent_rank: number of values < current value, divided by (n - 1); this is min_rank rescaled to run from 0 to 1

Example from source that shows the difference between these functions.

library(dplyr)  # provides the ranking functions

x <- c(1, 1, 2, 2, 2)
row_number(x)
#> [1] 1 2 3 4 5
min_rank(x)
#> [1] 1 1 3 3 3
dense_rank(x)
#> [1] 1 1 2 2 2
cume_dist(x)
#> [1] 0.4 0.4 1.0 1.0 1.0
percent_rank(x)
#> [1] 0.0 0.0 0.5 0.5 0.5
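
The last two results can be reproduced by hand, which may make the definitions easier to remember. This is a sketch of that check; the names cume_by_hand and pct_by_hand are made up for illustration.

```r
library(dplyr)

x <- c(1, 1, 2, 2, 2)
n <- length(x)

# cume_dist: share of values less than or equal to each value
cume_by_hand <- sapply(x, function(v) mean(x <= v))

# percent_rank: min_rank shifted to start at 0, then divided by (n - 1)
pct_by_hand <- (min_rank(x) - 1) / (n - 1)

cume_matches <- isTRUE(all.equal(cume_dist(x), cume_by_hand))
pct_matches  <- isTRUE(all.equal(percent_rank(x), pct_by_hand))
```

Dividing by n - 1 rather than n is why percent_rank gives 0.5 for the value 2 here, even though only 2 of the 5 values (40%) are smaller.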

Last modified August 18, 2023: 2022-2023 End State (7352e87)