3  Base R Fundamentals

Author
Affiliation

Ryan McShane, Ph.D.

The University of Chicago

Published

Oct. 7th, 2024

Modified

Oct. 10th, 2024

3.0.1 Overview

  1. Preface
  2. The Very Basics
  3. Packages and Help Pages
  4. R Objects
  5. R Notation
  6. Modifying Values
  7. Environments
  8. Programs
  9. S3
  10. Loops
  11. Speed

3.1 HoPR Ch 0: Preface

3.1.1 R History

  • R is a scripting language for statistical data wrangling and analysis.
  • It was inspired by, and is mostly compatible with, the statistical language S developed by AT&T.
  • The name S, for statistics, was an allusion to another programming language with a one-letter name developed at AT&T—the famous C language.
  • S later was sold to a small firm, which added a graphical user interface (GUI) and named the result S-Plus.
  • R has become far more popular than S or S-Plus, both because it’s free and because more people are contributing to it.

3.1.2 About R

  • R is a free, open-source (GPL-licensed) implementation of the well-regarded S statistical language, and the R/S platform is a de facto standard among professional statisticians.
  • R is comparable, and often superior, in power to commercial products in most of the significant senses—variety of operations available, programmability, graphics, etc.
  • In addition to providing statistical operations, R is a general-purpose programming language, so you can use it to automate analyses and create new functions that extend the existing language features.
  • R incorporates features found in object-oriented programming and functional programming languages.
  • Because R is open source software, it’s easy to get help from the user community. Also, a lot of new functions are contributed by users, many of whom are prominent statisticians.

3.1.3 Grolemund Treats R like a Programming Language

“Learning to program is like learning to speak another language – you progress faster when you practice. In fact, learning to program is learning to speak another language. You will get the best results if you follow along with the examples in the book and experiment whenever an idea strikes you.” – Grolemund


“Using the functions in R is like riding a bus. Writing programs in R is like driving a car.”

“Busses are very easy to use, you just need to know which bus to get on, where to get on, and where to get off […]. Cars, on the other hand, require much more work: you need to have [directions], you need to put gas in every now and then, you need to know the rules of the road […]. The big advantage of the car is that it can take you a bunch of places that the bus does not go and it is quicker for some trips that would require transferring between busses. [… P]rograms like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed.

R is a 4-wheel drive SUV […] with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back. R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.” - Greg Snow, 2006

3.1.4 Hadley’s Foreword: GUIs Hamper Three Properties for Good Data Analysis

Reproducibility

The ability to re-create a past analysis, which is crucial for good science.

Automation

The ability to rapidly re-create an analysis when data changes (as it always does).

Communication

Code is just text, so it is easy to communicate. When learning, this makes it easy to get help.

3.1.5 Object Oriented Programming 1

The advantages of object orientation can be explained by example: Consider statistical regression.

  • When you perform a regression analysis with other statistical packages, like SAS or SPSS, you get a mountain of output on the screen.
  • By contrast, if you call the lm() regression function in R, the function returns an object containing all the results.
    • Estimated coefficients
    • Standard Errors
    • Residuals, and so on.
  • You then pick and choose, programmatically, which parts of that object to extract.
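
A minimal sketch of this pick-and-choose workflow. The built-in mtcars data set and the model (mpg regressed on wt) are assumptions chosen only for illustration:

```r
# Fit a simple linear regression on a built-in data set (illustrative choice)
fit = lm(mpg ~ wt, data = mtcars)

# The fit is an object; extract only the pieces you need
coefs = coef(fit)                                        # estimated coefficients
std_errors = summary(fit)$coefficients[, "Std. Error"]   # standard errors
resids = residuals(fit)                                  # one residual per observation

coefs
std_errors
head(resids, n = 3)
```

Nothing is printed until you ask for it, which is what makes the results easy to reuse in later code.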

3.1.6 Object Oriented Programming 2

  • You will see that R’s approach makes programming much easier, partly because it offers a certain uniformity of access to data.
  • This uniformity stems from the fact that R is polymorphic, which means that a single function can be applied to different types of inputs, which the function processes in the appropriate way. Such a function is called a generic function. If you are a C++ programmer, you have seen a similar concept in virtual functions.
  • For instance, consider the summary() function.
    • If you apply it to a numeric vector, you get the five-number summary plus the mean.
    • If you apply it to a factor, you get a frequency table.
    • Indeed, you can use the summary() function on just about any object produced by R.
    • This is nice, since it means that you, as a user, have fewer commands to remember!
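
A short sketch of summary() adapting to its input. The vectors here are made up for illustration. Note that in current versions of R it is a factor that gets tabulated, while a plain character vector only reports its length, class, and mode:

```r
num = c(2, 4, 4, 10)
fac = factor(c("a", "b", "a"))
chr = c("a", "b", "a")

summary(num)   # quartiles plus the mean
summary(fac)   # counts per level
summary(chr)   # length, class, and mode only

# The same call also works on a fitted model object
class(summary(lm(mpg ~ wt, data = mtcars)))
## [1] "summary.lm"
```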

3.1.7 Functional Programming

  • As is typical in functional programming languages, a common theme in R programming is avoidance of explicit iteration.
  • Instead of coding loops, you exploit R’s functional features, which let you express iterative behavior implicitly.
  • This can lead to code that executes much more efficiently, and it can make a huge timing difference when running R on large data sets.
  • The functional programming nature of the R language offers many advantages:
    • Clearer, more compact code
    • Potentially much faster execution speed
    • Less debugging, because the code is simple
    • Easier transition to parallel programming
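
To make the contrast concrete, here is a small sketch that computes the same squares three ways. The variable names are hypothetical, and vapply() is just one of several functional tools base R offers:

```r
x = c(1, 2, 9, 12)

# Explicit iteration: fill a preallocated result one element at a time
squares_loop = numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] = x[i]^2
}

# Vectorized: the arithmetic operator already works element-wise
squares_vec = x^2

# Functional: apply a function to each element; vapply() checks the result type
squares_fun = vapply(X = x, FUN = function(v) v^2, FUN.VALUE = numeric(1))

identical(squares_loop, squares_vec)
## [1] TRUE
identical(squares_vec, squares_fun)
## [1] TRUE
```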

3.2 HoPR Ch 1: The Very Basics

3.2.1 Console Output

1:4
## [1] 1 2 3 4
1:40
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
  • The console is interactive; it is R at its core.
    • Use the up and down arrows to cycle through the most recent commands.
    • The > prompt (new line of code), the + prompt (continuation of code), and the [1] prefix (index of the first element on an output line) are still commonly shown to indicate R code and its output, but more recent publications have shifted away from this style.
    • Today, the console is good for one-off code (e.g., install.packages() to install a new R package), but should otherwise be avoided: you generally want to run and re-run your code, either in an R script or in a .qmd/.Rmd document.

3.2.2 Influence of the Console on Code and knitr Output

  • ggplot2
    • The + operator is spiritually inspired by the + continuation prompt in the console.
    • Hadley has said that, if he could go back, he would change the + operator to a pipe operator.
  • knitr
    • show > for first line of code and + for the continuation with prompt = TRUE
    • hide the ## in the output with comment = ""
```{r comment = "", prompt = TRUE}
> if (TRUE) {
+   2 + 2
+ }
```
[1] 4

vs. default

```{r}
if (TRUE) {
  2 + 2
}
```
## [1] 4

3.2.3 Aside on Style Guides

Your code styling should be consistent. Your code should be readable, well-organized, and aligned with your collaborators’ style.

  • In STAT2/37815, I will make style rules and recommendations.

3.2.4 Assignment Operators

  • <- and = are left-assign operators: x = 5 stores the value 5 into the R object x.
    • The tidyverse style guide recommends <- over =. The case against = is trivial.
      • We could hypothetically write a <- b <- c <- 4, which simultaneously sets all of a, b, and c to 4, and similarly, x = y = z = 5 does the same, but mixing <- and = does not always work (e.g., a <- b = 4 is an error). However, simultaneous assignment is itself poor style.
    • I will use = exclusively, just as Yihui Xie does, but you are welcome to use <-.
      • = is easier to read/type than <- and = is the left-assign operator used in many languages (e.g., C, SAS, Python, and Java).
  • -> is the right-assign operator: 5 -> x stores the value 5 into the R object x.
    • It makes it difficult to visually parse code when both left-assign and right-assign operators are in play. The Google style guide agrees.

3.2.5 Names

Illegal

  • _a
  • 1a
  • $
  • !a

Tidyverse

  • Good
    • a
    • a_g
  • Bad
    • G
    • aG
  • ???
    • a1
    • .a

Preventing Chaos

  • Don’t reuse names.
  • Don’t give names that are used by common functions (e.g., c, t, mean, etc).
  • The tidyverse style guide recommends lowercase only, with words separated by underscores. This is fine, except sometimes this rule collides with, e.g., kable for column names.
  • Use janitor::clean_names() to clean column names in a data frame.
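
If janitor is not available, base R's make.names() gives a quick way to see whether a string is a legal name. This is only a sketch of the legality rules, not a substitute for a naming convention, and the candidate strings are made up:

```r
# Candidate names; the first three are not legal R names as written
candidates = c("_a", "1a", "a b", "a_g")

# make.names() rewrites each string into a syntactically valid name
make.names(candidates)
## [1] "X_a" "X1a" "a.b" "a_g"

# A string that make.names() leaves unchanged was already a legal name
candidates == make.names(candidates)
## [1] FALSE FALSE FALSE  TRUE
```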

3.2.6 Scalars (single values) and Vectors

  • All scalars are vectors with a single value in R
  • We can generate an integer sequence quickly with the colon:
5:10
## [1]  5  6  7  8  9 10
# And also rational numbers
(0:5)/5    
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
# And in reverse
5:0
## [1] 5 4 3 2 1 0
  • Use the c() function to combine scalars:
base::c(3, 1, 2)
## [1] 3 1 2
  • We can combine multiple vectors:
c(1, 5:3)
## [1] 1 5 4 3
c(5:3, c(7, 1))
## [1] 5 4 3 7 1
v1 = 5:3
v2 = 6:9
c(v1, v2)
## [1] 5 4 3 6 7 8 9
# we can even name vector elements
(v3 = c("ab" = v1, "xy" = v2))
## ab1 ab2 ab3 xy1 xy2 xy3 xy4 
##   5   4   3   6   7   8   9
v3["xy1"]
## xy1 
##   6

3.2.7 More on Vector Creation

  • The length() function counts the number of elements in a vector:
base::length(x = 5:10)
## [1] 6
tenths = (0:10)/10
length(tenths)
## [1] 11
# Index vector creation temptation
1:length(tenths)
##  [1]  1  2  3  4  5  6  7  8  9 10 11
# But does it work with an empty vector?
(empty_vector = vector())
## logical(0)
length(empty_vector)
## [1] 0
1:length(empty_vector)
## [1] 1 0
  • Failure! This should be an empty vector!
  • seq_along() creates an index vector of the same length as the input vector, along.with:
base::seq_along(along.with = 5:10)
## [1] 1 2 3 4 5 6
seq_along(v3)
## [1] 1 2 3 4 5 6 7
seq_along(empty_vector)
## integer(0)
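
The difference matters most inside loops. A small sketch, where count_iterations() is a hypothetical helper that simply counts how many times a loop body runs for a given way of building the index:

```r
count_iterations = function(v, index_fun) {
  n = 0
  for (i in index_fun(v)) {
    n = n + 1
  }
  return(n)
}

empty_vector = vector()

# 1:length(v) yields c(1, 0) for an empty vector, so the loop body runs twice
count_iterations(empty_vector, function(v) 1:length(v))
## [1] 2

# seq_along(v) yields integer(0), so the loop body never runs
count_iterations(empty_vector, seq_along)
## [1] 0
```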

3.2.8 Functions

  • As in most programming languages, the heart of R programming consists of writing functions.
  • A function is a group of instructions that takes inputs, uses them to compute other values, and returns a result.

3.2.9 Example Function oddcount()

Let’s define a function named oddcount(), whose purpose is to count the odd numbers in a vector of integers. Note that the modulo operator for remainder arithmetic is %% in R.

13 %% 12
## [1] 1
# counts odd integers in x
oddcount = function(x) {
  k = sum(x %% 2 == 1)
  return(k)
}
oddcount(x = c(1, 3, 5))
## [1] 3
oddcount(x = c(1, 2, 9))
## [1] 2
  • First, we told R that we wanted to define a function named oddcount with one argument, x.
    • The left brace demarcates the start of the body of the function.
    • We wrote one R statement per line.
    • Specify what oddcount() returns with return().
  • After defining the function, we evaluated two calls to oddcount().
    • Because there are three odd numbers in the vector c(1, 3, 5), the call oddcount(x = c(1, 3, 5)) returns the value 3.
    • Similarly, the call with x = c(1, 2, 9) returns 2.

3.2.10 Investigating Example Function oddcount()

Let’s see what happens with the following code:

x = c(1, 3, 4)      # (1)
x %% 2              # (2)
x %% 2 == 1         # (3)
sum(x %% 2 == 1)    # (4)
## [1] 1 1 0
## [1]  TRUE  TRUE FALSE
## [1] 2
  1. Assigns the vector c(1, 3, 4) to x.
  2. The modulo operator is vectorized: it is applied to every element in x and returns a vector of remainders.
  3. Checks whether each element of (x %% 2) is 1 or not. Returns TRUE if so and FALSE if not.
  4. Sums over all of the TRUEs and FALSEs; the former are converted to 1 and the latter to 0.

3.2.11 Aside on Looping

C/C++ programmers might be tempted to write k = sum(x %% 2 == 1) like this:

k = 0
for (i in 1:length(x)) {
  if (x[i] %% 2 == 1) k = k + 1
}
k
## [1] 2
  • Here, length(x) is the number of elements in x.
    • Suppose there are 25 elements.
    • Then 1:length(x) means 1:25, which in turn means 1, 2, 3, …, 25.
    • This code would also work (unless x were to have length 0), but one of the major themes of R programming is to avoid loops if possible; if not, keep loops simple.
  • In this case, we didn’t need to loop at all – a vectorized solution was simple and straightforward.

3.2.12 More on return() in oddcount()

At the end of the code, we use the return() statement:

return(k)
  • This has the function return the computed value of k to the code that called it.
  • However, simply writing the following also works:
k
  • R functions will return the last value computed if there is no explicit return() call.
  • However, this approach must be used with care!
  • The tidyverse style guide says to use return() only for early returns.
  • The Google and MLR3 style guides both agree that we should use return().
  • Hadley is correct that return() slows down the function because return() is an additional function call. The tidyverse is production-level code and it also relies heavily on many nested function calls.
  • However, you’re not writing production-level code – you’re writing code that you want others to read.
  • TL;DR: always use return()!
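
As a sketch of explicit returns in practice, here is a hypothetical variant of oddcount() (the name oddcount_safe is made up) that uses return() both for an early exit and for the final value:

```r
# Hypothetical variant of oddcount() that guards against empty input
oddcount_safe = function(x) {
  if (length(x) == 0) {
    return(0)   # early exit: nothing to count
  }
  k = sum(x %% 2 == 1)
  return(k)     # explicit final return
}

oddcount_safe(x = c(1, 2, 9))
## [1] 2
oddcount_safe(x = integer(0))
## [1] 0
```

With both exits marked, a reader can find every way out of the function by searching for return.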

3.2.13 Arguments in oddcount()

  • In programming language terminology, x is the formal argument (or formal parameter) of the function oddcount().
  • In the first function call in the preceding example, c(1, 3, 5) is referred to as the actual argument.
  • These terms allude to the fact that x in the function definition is just a placeholder, whereas c(1, 3, 5) is the value actually used in the computation.
  • Similarly, in the second function call, c(1, 2, 9) is the actual argument.

3.2.14 Variable Scope: Local

  • A variable that is visible only within a function body is said to be local to that function.
  • In oddcount(), k is a local variable. It disappears after the function returns:
oddcount(x = c(1,2,3,7,9))
## [1] 4
k
## Error: object 'k' not found
  • It’s very important to note that the formal parameters in an R function are local variables!
  • Suppose we make the following function call:
z = c(2,6,7)
oddcount(x = z)
## [1] 1
  • Now suppose that the code of oddcount() changes x.
  • Then z would not change–after the call to oddcount(), z would have the same value as before.
  • To evaluate a function call, R copies each actual argument to the corresponding local parameter variable, and changes to that variable are not visible outside the function.
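
A small sketch of this behavior, using a hypothetical function zero_first() that deliberately overwrites its own parameter:

```r
# Hypothetical function that overwrites its own parameter
zero_first = function(x) {
  x[1] = 0      # changes the local copy only
  return(x)
}

z = c(2, 6, 7)
zero_first(x = z)
## [1] 0 6 7
z               # the caller's vector is untouched
## [1] 2 6 7
```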

3.2.15 Variable Scope: Global

Variables created outside functions are global and are available within functions as well. Here’s an example:

f = function(x) return(x+y)
y = 3
f(x = 5)
## [1] 8
  • Here, y is a global variable.
  • A global variable can be written to from within a function by using R’s superassignment operator, <<-. However, this should be avoided.

This is expected behavior:

f = function(x) {
  y = 0
  return(x + y)
}
y = 3
f(x = 5)
## [1] 5
y
## [1] 3
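
One way to avoid depending on globals at all is to pass every input as an argument. A minimal sketch, where f2 is a hypothetical name introduced only for this example:

```r
# y is now a formal argument, so the result no longer depends on the workspace
f2 = function(x, y) {
  return(x + y)
}

y = 3
f2(x = 5, y = 10)   # uses the supplied y, not the global one
## [1] 15
y                   # the global y is unchanged
## [1] 3
```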

3.2.16 Default Arguments

R also makes frequent use of default arguments. Consider a function definition like this:

g = function(x, y = 2, z = TRUE) { ... }
  • Here y will be initialized to 2 if the programmer does not specify y in the call.
  • Similarly, z will have the default value TRUE.

Now consider this call:

g(12, z = FALSE)
  • Here, the value 12 is the actual argument for x, and we accept the default value of 2 for y, but we override the default for z, setting its value to FALSE.
  • The preceding example also demonstrates that, like many programming languages, R has a Boolean type; that is, it has the logical values TRUE and FALSE.
  • R allows TRUE and FALSE to be abbreviated to T and F. However, you should not.
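
A runnable sketch of both points. The functions g_demo() and shadow_T() are hypothetical; g_demo() merely stands in for the elided body of g above:

```r
# Hypothetical function with two defaults
g_demo = function(x, y = 2, z = TRUE) {
  if (z) {
    return(x * y)
  }
  return(x)
}

g_demo(12)             # y = 2, z = TRUE
## [1] 24
g_demo(12, z = FALSE)  # default y accepted, z overridden
## [1] 12
g_demo(12, y = 5)      # y overridden, default z accepted
## [1] 60

# T is an ordinary variable, so it can be silently redefined; TRUE cannot
shadow_T = function() {
  T = FALSE
  return(T)
}
shadow_T()
## [1] FALSE
```

Because T and F can be reassigned like any other name, code that relies on them can change meaning without warning.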

3.3 HoPR Ch 2: Packages and Help Pages

3.3.1 Packages and Help Pages

  • You virtually never need to call library(base), library(stats), etc., as these packages load automatically when R starts up.
  • Sometimes it’s important to load packages selectively (especially in production-level code). E.g., library(tidyverse) will load 40+ R packages (its dependencies and their dependencies, etc.). Use, e.g., library(ggplot2) and/or library(dplyr) instead, for faster code and faster renders.
  • Update packages regularly (I do once per week).
  • Help page quality varies – do check examples.
  • Also, help pages include the ability to execute examples directly in your console (for better and worse – tread lightly).
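
A brief sketch of working with packages without attaching more than you need. It uses only packages that ship with R, and assumes a default R session:

```r
# Packages attached at startup appear on the search path
c("package:base", "package:stats") %in% search()

# pkg::fun() calls a function without attaching its package
stats::median(x = c(1, 3, 10))
## [1] 3

# Check whether a package is installed before relying on it
requireNamespace("utils", quietly = TRUE)
## [1] TRUE

# Help pages: ?median opens the page; example(median) runs its examples
```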
pak::pkg_deps("tidyverse") |>
  dplyr::select(ref) |>
  dplyr::n_distinct()
## [1] 103
pak::pkg_deps_tree("tidyverse")
## tidyverse 2.0.0 [new][dl] (431.35 kB)
## ├─broom 1.0.7 [new][dl] (1.93 MB)
## │ ├─backports 1.5.0 [new][dl] (122.68 kB)
## │ ├─dplyr 1.1.4 [new][dl] (1.58 MB)
## │ │ ├─cli 3.6.3 [new][dl] (1.36 MB)
## │ │ ├─generics 0.1.3 [new][dl] (83.69 kB)
## │ │ ├─glue 1.8.0 [new][dl] (183.78 kB)
## │ │ ├─lifecycle 1.0.4 [new][dl] (140.93 kB)
## │ │ │ ├─cli
## │ │ │ ├─glue
## │ │ │ └─rlang 1.1.4 [new][dl] (1.62 MB)
## │ │ ├─magrittr 2.0.3 [new][dl] (229.42 kB)
## │ │ ├─pillar 1.9.0 [new][dl] (663.33 kB)
## │ │ │ ├─cli
## │ │ │ ├─fansi 1.0.6 [new][dl] (322.97 kB)
## │ │ │ ├─glue
## │ │ │ ├─lifecycle
## │ │ │ ├─rlang
## │ │ │ ├─utf8 1.2.4 [new][dl] (150.80 kB)
## │ │ │ └─vctrs 0.6.5 [new][dl] (1.36 MB)
## │ │ │   ├─cli
## │ │ │   ├─glue
## │ │ │   ├─lifecycle
## │ │ │   └─rlang
## │ │ ├─R6 2.5.1 [new][dl] (84.98 kB)
## │ │ ├─rlang
## │ │ ├─tibble 3.2.1 [new][dl] (695.05 kB)
## │ │ │ ├─fansi
## │ │ │ ├─lifecycle
## │ │ │ ├─magrittr
## │ │ │ ├─pillar
## │ │ │ ├─pkgconfig 2.0.3 [new][dl] (22.81 kB)
## │ │ │ ├─rlang
## │ │ │ └─vctrs
## │ │ ├─tidyselect 1.2.1 [new][dl] (228.15 kB)
## │ │ │ ├─cli
## │ │ │ ├─glue
## │ │ │ ├─lifecycle
## │ │ │ ├─rlang
## │ │ │ ├─vctrs
## │ │ │ └─withr 3.0.1 [new][dl] (231.30 kB)
## │ │ └─vctrs
## │ ├─generics
## │ ├─glue
## │ ├─lifecycle
## │ ├─purrr 1.0.2 [new][dl] (510.64 kB)
## │ │ ├─cli
## │ │ ├─lifecycle
## │ │ ├─magrittr
## │ │ ├─rlang
## │ │ └─vctrs
## │ ├─rlang
## │ ├─stringr 1.5.1 [new][dl] (323.42 kB)
## │ │ ├─cli
## │ │ ├─glue
## │ │ ├─lifecycle
## │ │ ├─magrittr
## │ │ ├─rlang
## │ │ ├─stringi 1.8.4 [new][dl] (15.03 MB)
## │ │ └─vctrs
## │ ├─tibble
## │ └─tidyr 1.3.1 [new][dl] (1.27 MB)
## │   ├─cli
## │   ├─dplyr
## │   ├─glue
## │   ├─lifecycle
## │   ├─magrittr
## │   ├─purrr
## │   ├─rlang
## │   ├─stringr
## │   ├─tibble
## │   ├─tidyselect
## │   └─vctrs
## ├─conflicted 1.2.0 [new][dl] (57.51 kB)
## │ ├─cli
## │ ├─memoise 2.0.1 [new][dl] (51.10 kB)
## │ │ ├─rlang
## │ │ └─cachem 1.1.0 [new][dl] (73.84 kB)
## │ │   ├─rlang
## │ │   └─fastmap 1.2.0 [new][dl] (135.36 kB)
## │ └─rlang
## ├─cli
## ├─dbplyr 2.5.0 [new][dl] (1.26 MB)
## │ ├─blob 1.2.4 [new][dl] (49.54 kB)
## │ │ ├─rlang
## │ │ └─vctrs
## │ ├─cli
## │ ├─DBI 1.2.3 [new][dl] (937.78 kB)
## │ ├─dplyr
## │ ├─glue
## │ ├─lifecycle
## │ ├─magrittr
## │ ├─pillar
## │ ├─purrr
## │ ├─R6
## │ ├─rlang
## │ ├─tibble
## │ ├─tidyr
## │ ├─tidyselect
## │ ├─vctrs
## │ └─withr
## ├─dplyr
## ├─dtplyr 1.3.1 [new][dl] (358.61 kB)
## │ ├─cli
## │ ├─data.table 1.16.2 [new][bld][cmp][dl] (5.49 MB)
## │ ├─dplyr
## │ ├─glue
## │ ├─lifecycle
## │ ├─rlang
## │ ├─tibble
## │ ├─tidyselect
## │ └─vctrs
## ├─forcats 1.0.0 [new][dl] (428.20 kB)
## │ ├─cli
## │ ├─glue
## │ ├─lifecycle
## │ ├─magrittr
## │ ├─rlang
## │ └─tibble
## ├─ggplot2 3.5.1 [new][dl] (5.01 MB)
## │ ├─cli
## │ ├─glue
## │ ├─gtable 0.3.5 [new][dl] (227.41 kB)
## │ │ ├─cli
## │ │ ├─glue
## │ │ ├─lifecycle
## │ │ └─rlang
## │ ├─isoband 0.2.7 [new][dl] (1.93 MB)
## │ ├─lifecycle
## │ ├─MASS 7.3-60.2 -> 7.3-61 [upd][dl] (1.17 MB)
## │ ├─mgcv 1.9-1 
## │ │ ├─nlme 3.1-164 -> 3.1-166 [upd][dl] (2.39 MB)
## │ │ │ └─lattice 0.22-6 
## │ │ └─Matrix 1.7-0 
## │ │   └─lattice
## │ ├─rlang
## │ ├─scales 1.3.0 [new][dl] (714.75 kB)
## │ │ ├─cli
## │ │ ├─farver 2.1.2 [new][dl] (1.52 MB)
## │ │ ├─glue
## │ │ ├─labeling 0.4.3 [new][dl] (63.36 kB)
## │ │ ├─lifecycle
## │ │ ├─munsell 0.5.1 [new][dl] (244.68 kB)
## │ │ │ └─colorspace 2.1-1 [new][dl] (2.67 MB)
## │ │ ├─R6
## │ │ ├─RColorBrewer 1.1-3 [new][dl] (54.47 kB)
## │ │ ├─rlang
## │ │ └─viridisLite 0.4.2 [new][dl] (1.30 MB)
## │ ├─tibble
## │ ├─vctrs
## │ └─withr
## ├─googledrive 2.1.1 [new][dl] (1.91 MB)
## │ ├─cli
## │ ├─gargle 1.5.2 [new][dl] (805.60 kB)
## │ │ ├─cli
## │ │ ├─fs 1.6.4 [new][dl] (413.27 kB)
## │ │ ├─glue
## │ │ ├─httr 1.4.7 [new][dl] (496.83 kB)
## │ │ │ ├─curl 5.2.3 [new][dl] (3.22 MB)
## │ │ │ ├─jsonlite 1.8.9 [new][dl] (1.11 MB)
## │ │ │ ├─mime 0.12 [new][dl] (40.92 kB)
## │ │ │ ├─openssl 2.2.2 [new][dl] (3.40 MB)
## │ │ │ │ └─askpass 1.2.1 [new][dl] (74.69 kB)
## │ │ │ │   └─sys 3.4.3 [new][dl] (47.84 kB)
## │ │ │ └─R6
## │ │ ├─jsonlite
## │ │ ├─lifecycle
## │ │ ├─openssl
## │ │ ├─rappdirs 0.3.3 [new][dl] (52.59 kB)
## │ │ ├─rlang
## │ │ └─withr
## │ ├─glue
## │ ├─httr
## │ ├─jsonlite
## │ ├─lifecycle
## │ ├─magrittr
## │ ├─pillar
## │ ├─purrr
## │ ├─rlang
## │ ├─tibble
## │ ├─uuid 1.2-1 [new][dl] (52.93 kB)
## │ ├─vctrs
## │ └─withr
## ├─googlesheets4 1.1.1 [new][dl] (523.73 kB)
## │ ├─cellranger 1.1.0 [new][dl] (106.58 kB)
## │ │ ├─rematch 2.0.0 [new][dl] (19.27 kB)
## │ │ └─tibble
## │ ├─cli
## │ ├─curl
## │ ├─gargle
## │ ├─glue
## │ ├─googledrive
## │ ├─httr
## │ ├─ids 1.0.1 [new][dl] (126.15 kB)
## │ │ ├─openssl
## │ │ └─uuid
## │ ├─lifecycle
## │ ├─magrittr
## │ ├─purrr
## │ ├─rematch2 2.1.2 [new][dl] (48.77 kB)
## │ │ └─tibble
## │ ├─rlang
## │ ├─tibble
## │ ├─vctrs
## │ └─withr
## ├─haven 2.5.4 [new][dl] (768.44 kB)
## │ ├─cli
## │ ├─forcats
## │ ├─hms 1.1.3 [new][dl] (105.34 kB)
## │ │ ├─lifecycle
## │ │ ├─pkgconfig
## │ │ ├─rlang
## │ │ └─vctrs
## │ ├─lifecycle
## │ ├─readr 2.1.5 [new][dl] (1.19 MB)
## │ │ ├─cli
## │ │ ├─clipr 0.8.0 [new][dl] (55.53 kB)
## │ │ ├─crayon 1.5.3 [new][dl] (165.17 kB)
## │ │ ├─hms
## │ │ ├─lifecycle
## │ │ ├─R6
## │ │ ├─rlang
## │ │ ├─tibble
## │ │ ├─vroom 1.6.5 [new][dl] (1.34 MB)
## │ │ │ ├─bit64 4.5.2 [new][dl] (510.32 kB)
## │ │ │ │ └─bit 4.5.0 [new][dl] (1.18 MB)
## │ │ │ ├─cli
## │ │ │ ├─crayon
## │ │ │ ├─glue
## │ │ │ ├─hms
## │ │ │ ├─lifecycle
## │ │ │ ├─rlang
## │ │ │ ├─tibble
## │ │ │ ├─tidyselect
## │ │ │ ├─tzdb 0.4.0 [new][dl] (1.02 MB)
## │ │ │ ├─vctrs
## │ │ │ └─withr
## │ │ └─tzdb
## │ ├─rlang
## │ ├─tibble
## │ ├─tidyselect
## │ └─vctrs
## ├─hms
## ├─httr
## ├─jsonlite
## ├─lubridate 1.9.3 [new][dl] (987.20 kB)
## │ ├─generics
## │ └─timechange 0.3.0 [new][dl] (514.48 kB)
## ├─magrittr
## ├─modelr 0.1.11 [new][dl] (202.78 kB)
## │ ├─broom
## │ ├─magrittr
## │ ├─purrr
## │ ├─rlang
## │ ├─tibble
## │ ├─tidyr
## │ ├─tidyselect
## │ └─vctrs
## ├─pillar
## ├─purrr
## ├─ragg 1.3.3 [new][dl] (1.97 MB)
## │ ├─systemfonts 1.1.0 [new][dl] (1.34 MB)
## │ │ └─lifecycle
## │ └─textshaping 0.4.0 [new][dl] (1.21 MB)
## │   ├─lifecycle
## │   └─systemfonts
## ├─readr
## ├─readxl 1.4.3 [new][dl] (1.20 MB)
## │ ├─cellranger
## │ └─tibble
## ├─reprex 2.1.1 [new][dl] (504.87 kB)
## │ ├─callr 3.7.6 [new][dl] (473.92 kB)
## │ │ ├─processx 3.8.4 [new][dl] (688.49 kB)
## │ │ │ ├─ps 1.8.0 [new][dl] (643.05 kB)
## │ │ │ └─R6
## │ │ └─R6
## │ ├─cli
## │ ├─clipr
## │ ├─fs
## │ ├─glue
## │ ├─knitr 1.48 [new][dl] (1.22 MB)
## │ │ ├─evaluate 1.0.1 [new][bld][dl] (34.87 kB)
## │ │ ├─highr 0.11 [new][dl] (44.22 kB)
## │ │ │ └─xfun 0.48 [new][dl] (559.14 kB)
## │ │ ├─xfun
## │ │ └─yaml 2.3.10 [new][dl] (119.45 kB)
## │ ├─lifecycle
## │ ├─rlang
## │ ├─rmarkdown 2.28 [new][dl] (2.70 MB)
## │ │ ├─bslib 0.8.0 [new][dl] (5.59 MB)
## │ │ │ ├─base64enc 0.1-3 [new][dl] (33.12 kB)
## │ │ │ ├─cachem
## │ │ │ ├─fastmap
## │ │ │ ├─htmltools 0.5.8.1 [new][dl] (363.20 kB)
## │ │ │ │ ├─base64enc
## │ │ │ │ ├─digest 0.6.37 [new][dl] (223.14 kB)
## │ │ │ │ ├─fastmap
## │ │ │ │ └─rlang
## │ │ │ ├─jquerylib 0.1.4 [new][dl] (526.06 kB)
## │ │ │ │ └─htmltools
## │ │ │ ├─jsonlite
## │ │ │ ├─lifecycle
## │ │ │ ├─memoise
## │ │ │ ├─mime
## │ │ │ ├─rlang
## │ │ │ └─sass 0.4.9 [new][dl] (2.61 MB)
## │ │ │   ├─fs
## │ │ │   ├─rlang
## │ │ │   ├─htmltools
## │ │ │   ├─R6
## │ │ │   └─rappdirs
## │ │ ├─evaluate
## │ │ ├─fontawesome 0.5.2 [new][dl] (1.35 MB)
## │ │ │ ├─rlang
## │ │ │ └─htmltools
## │ │ ├─htmltools
## │ │ ├─jquerylib
## │ │ ├─jsonlite
## │ │ ├─knitr
## │ │ ├─tinytex 0.53 [new][dl] (143.13 kB)
## │ │ │ └─xfun
## │ │ ├─xfun
## │ │ └─yaml
## │ ├─rstudioapi 0.16.0 [new][dl] (339.29 kB)
## │ └─withr
## ├─rlang
## ├─rstudioapi
## ├─rvest 1.0.4 [new][dl] (308.62 kB)
## │ ├─cli
## │ ├─glue
## │ ├─httr
## │ ├─lifecycle
## │ ├─magrittr
## │ ├─rlang
## │ ├─selectr 0.4-2 [new][dl] (501.28 kB)
## │ │ ├─stringr
## │ │ └─R6
## │ ├─tibble
## │ └─xml2 1.3.6 [new][dl] (1.61 MB)
## │   ├─cli
## │   └─rlang
## ├─stringr
## ├─tibble
## ├─tidyr
## └─xml2
## 
## Key:  [new] new | [upd] update | [dl] download | [bld] build | [cmp] compile

3.4 HoPR Ch 3: R Objects

3.4.1 Vectors: Double/Numeric vs Integer

vector_numeric = (1:5)/5*5
vector_integer1 = 1:5
vector_integer2 = seq(from = 1L, 
                      to = 5L, 
                      by = 1L)
base::typeof(x = vector_numeric)
## [1] "double"
typeof(vector_integer1)
## [1] "integer"
typeof(vector_integer2)
## [1] "integer"
base::is.numeric(x = vector_numeric)
## [1] TRUE
is.numeric(vector_integer1)
## [1] TRUE
base::is.integer(x = vector_numeric)
## [1] FALSE
is.integer(vector_integer1)
## [1] TRUE
lobstr::obj_size(vector_numeric)
## 96 B
lobstr::obj_size(vector_integer1)
## 680 B
lobstr::obj_size(vector_integer2)
## 80 B
  • All integer vectors are numeric, but not all numeric vectors are integer vectors.
  • Most R code for quantitative vectors assumes numeric input, although it will also work on integer vectors. (Why? More later.)
  • (lobstr::obj_size more accurately reports how much memory an object uses compared to utils::object.size.)
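
A few one-liners that show where integers come from and where they quietly turn into doubles (a sketch; the literal values are arbitrary):

```r
typeof(1)          # a bare number is a double
## [1] "double"
typeof(1L)         # the L suffix requests an integer
## [1] "integer"
typeof(1:3)        # the colon operator yields integers here
## [1] "integer"
typeof(5L / 2L)    # division always returns a double
## [1] "double"
typeof(5L %/% 2L)  # integer division keeps integers
## [1] "integer"
1L == 1            # equal in value ...
## [1] TRUE
identical(1L, 1)   # ... but not the same type
## [1] FALSE
```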

3.4.2 Character and Logical Vectors

Characters

  • Everyone’s favorite character vectors come with R!
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
length(letters)
## [1] 26
length("abcdefghijklmnopqrstuvwxyz")
## [1] 1
length(c("abcdefghijklm", "nopqrstuvwxyz"))
## [1] 2
  • These are character vectors with one and two elements, respectively. The elements are strings.

Logicals

(vector_logical = c(FALSE, TRUE, TRUE))
## [1] FALSE  TRUE  TRUE
is.logical(vector_logical)
## [1] TRUE
is.logical(vector_integer1)
## [1] FALSE
is.logical(vector_numeric)
## [1] FALSE
is.integer(vector_logical)
## [1] FALSE

3.4.3 Less Common Types

Complex and Raw

is.complex(3)
## [1] FALSE
is.complex(3 + 0i)
## [1] TRUE
is.complex(3 + 4i)
## [1] TRUE
base::as.raw(x = FALSE)
## [1] 00
as.raw(TRUE)
## [1] 01
as.raw(9)
## [1] 09
as.raw(15)
## [1] 0f
as.raw(16)
## [1] 10
as.raw(255)
## [1] ff

Factors

  • Factors are common in statistics, but not in programming. We’ll discuss these more in R4DS Ch 16.
(cba = as.factor(c("c", "b", "a")))
## [1] c b a
## Levels: a b c
as.integer(cba)
## [1] 3 2 1
abc = ordered(
    x = c("c", "b", "a"), 
    levels = c("c", "b", "a")
  )
as.integer(abc)
## [1] 1 2 3

Dates

  • Dates and date-times are common in data, but less so in programming. We’ll discuss these more in R4DS Ch 17.
(eoy = as.Date("2024-12-31"))
## [1] "2024-12-31"
eoy + 1
## [1] "2025-01-01"
as.POSIXct(
    x = "2019-08-23 11:00", 
    tz = "CST6CDT"
  )
## [1] "2019-08-23 11:00:00 CDT"

3.4.4 Coercion

  • All integers can be stored as numeric, but numeric values cannot, in general, be cast to integer without loss of information.
sqrt(vector_numeric)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
as.integer(sqrt(vector_numeric))
## [1] 1 1 1 2 2
  • We can recast objects to other types, sometimes with disastrous results (even if we think it “fixes” our code because it runs now).

  • Coercion should be an approach of last resort, although sometimes it is unavoidable.
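
Coercion also happens implicitly whenever c() combines different types: R promotes everything to the most flexible type present. A short sketch with arbitrary values:

```r
typeof(c(TRUE, 2L))   # logical + integer
## [1] "integer"
typeof(c(2L, 2.5))    # integer + double
## [1] "double"
typeof(c(2.5, "a"))   # double + character
## [1] "character"
c(TRUE, 2.5, "a")     # everything becomes a string
## [1] "TRUE" "2.5"  "a"
```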

3.4.5 Coercion Examples

#A character vector of numbers....
a = c("1", "1", "2", "3", "1", "2")
#I can't do this
a + 1
## Error in a + 1: non-numeric argument to binary operator
#But I can do this:
as.numeric(a) + 1
## [1] 2 2 3 4 2 3

#What about this?
b = c("1", "1", "2", "C", "A", "B")
#What will happen here?
as.numeric(b) + 1
## [1]  2  2  3 NA NA NA

#Convert character to numeric
#Works if the character is a number
a = "5"
class(a)
## [1] "character"
a = as.numeric(a)
a
## [1] 5
class(a)
## [1] "numeric"

#Convert a character to a numeric
#This doesn't work
a = "five"
class(a)
## [1] "character"
a = as.numeric(a)
a
## [1] NA
class(a)
## [1] "numeric"

#Convert a numeric to a character
a = 5
class(a)
## [1] "numeric"
a = as.character(a)
a
## [1] "5"
class(a)
## [1] "character"

3.4.6 Vector Arithmetic

Addition/Subtraction

(ratings = 6:10)
## [1]  6  7  8  9 10
(ones = rep(x = 1, times = 5))
## [1] 1 1 1 1 1
ratings + ones
## [1]  7  8  9 10 11
ratings - ones
## [1] 5 6 7 8 9
ratings - 1
## [1] 5 6 7 8 9
ratings - 1:3
## [1] 5 5 5 8 8
  • Recycling a scalar is fine, but recycling longer vectors tends to cause bugs, and R warns only when the longer length is not a multiple of the shorter one.
  • Don’t rely on vector recycling!
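
The dangerous case is the silent one. The sketch below shows it, along with a hypothetical helper, add_strict(), that refuses to recycle anything but a scalar:

```r
# Lengths 6 and 2: recycled silently because 6 is a multiple of 2
1:6 + 1:2
## [1] 2 4 4 6 6 8

# Hypothetical helper that refuses to recycle anything but a scalar
add_strict = function(a, b) {
  stopifnot(length(a) == length(b) || length(b) == 1)
  return(a + b)
}

add_strict(6:10, 1)
## [1]  7  8  9 10 11
```

Calling add_strict(6:10, 1:3) stops with an error instead of returning a recycled result.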

Multiplication

(1:2)*(3:4) # Element-wise product
## [1] 3 8
(1:2)%*%(3:4) # Dot product
##      [,1]
## [1,]   11
(1:2)%o%(3:4) # Outer product
##      [,1] [,2]
## [1,]    3    4
## [2,]    6    8
base::factorial(x = 5) #5!
## [1] 120
# Product of entire vector
base::prod(x = 1:5) 
## [1] 120
base::Reduce(x = 1:5, f = '*') # Same
## [1] 120
prod(5:8)
## [1] 1680
  • Division, /, works element-wise.

3.4.7 Working with Vectors

Subsetting vectors

letters[1]
## [1] "a"
letters[1:5]
## [1] "a" "b" "c" "d" "e"

#Don't need to be consecutive
letters[c(1, 4, 7)]
## [1] "a" "d" "g"
x = -3:2
#Check which elements are 0
x == 0
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE
#Class is logical
class(x == 0)
## [1] "logical"
str(x == 0)
##  logi [1:6] FALSE FALSE FALSE TRUE FALSE FALSE

#If I want them to be 0's and 1's.
(x == 0) + 0
## [1] 0 0 0 1 0 0
class((x == 0) + 0)
## [1] "numeric"

#How many are 0's?
sum(x == 0)
## [1] 1

#Logical statements can be combined
#This returns TRUE/FALSE/NA
x > 5 | x == 0
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE


#Then can be used to subset a vector
#This returns elements of the vector
x[x > 5 | x == 0]
## [1] 0
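
A few more subsetting idioms in the same spirit, shown as a sketch on the same vector: & for "and", ! for negation, and which() to turn a logical vector into positions.

```r
x = -3:2

# & requires both conditions to hold, element by element
x[x < 0 & x %% 2 == 0]
## [1] -2

# which() converts a logical vector into the positions of its TRUE values
which(x == 0)
## [1] 4

# ! negates a logical vector
x[!(x < 0)]
## [1] 0 1 2
```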

3.4.8 Special Values

Missing values

  • R uses NA to indicate a missing value.
x[1] = NA
x
## [1] NA -2 -1  0  1  2

#check which elements are missing
is.na(x)
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

#This is not what you want to do!
x == NA
## [1] NA NA NA NA NA NA

#Pull out elements that are not missing
x
## [1] NA -2 -1  0  1  2
x[!is.na(x)]
## [1] -2 -1  0  1  2
  • R also has NaN (Not a number)
#Let's look at NaN
x[2] = 0 / 0
x
## [1]  NA NaN  -1   0   1   2

#Both NA and NaN are treated as missing
is.na(x)
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
#There is also this function
is.nan(x)
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

#Is infinity missing?
is.na(Inf)
## [1] FALSE
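
Missing values propagate through most summaries, so many functions take an na.rm argument. A small sketch on a made-up vector y:

```r
y = c(4, NA, 6)

mean(y)                 # a single NA makes the result NA
## [1] NA
mean(y, na.rm = TRUE)   # drop missing values before computing
## [1] 5
sum(is.na(y))           # count the missing values
## [1] 1
anyNA(y)                # quick check for any missing value
## [1] TRUE
```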

3.4.9 Special Functions

#floored quotient
floor(7 / 2)
## [1] 3
#ceiling quotient
ceiling(7 / 2)
## [1] 4
#Round
round(x = 1/3, digits = 3)
## [1] 0.333
round(x = 1/3, digits = 8)
## [1] 0.3333333
round(x = 1/3, digits = 0)
## [1] 0
#absolute value
abs(-5)
## [1] 5
#exponents
3^4
## [1] 81
3**4
## [1] 81
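
Two behaviors of these functions often surprise newcomers: round() sends exact halves to the nearest even number, and integer division floors toward negative infinity rather than truncating toward zero. A brief sketch:

```r
round(0.5)      # rounds to the even neighbor, not up
## [1] 0
round(2.5)
## [1] 2
7 %/% 2         # integer division is the floored quotient
## [1] 3
(-7) %/% 2      # floors toward negative infinity
## [1] -4
trunc(-7 / 2)   # trunc() instead drops the fractional part
## [1] -3
```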

3.4.10 paste and [con]cat[enate]

#Paste joins character strings
f = c("x1", "x2", "x3", "x4", "x5")
f
## [1] "x1" "x2" "x3" "x4" "x5"

#Instead of typing it all out, you can do
f = base::paste("x", 1:5, sep = "")
f
## [1] "x1" "x2" "x3" "x4" "x5"

#Same thing: paste0 defaults to sep = ""
f = base::paste0("x", 1:5)
f
## [1] "x1" "x2" "x3" "x4" "x5"

ex_func = function(){
  for (j in 1:2) { 
    paste0("Currently, i = ", j, "\n")
  }
}
ex_func()
  • cat prints a string directly to the user.
cat(f)
## x1 x2 x3 x4 x5
  • This is useful for providing messages to yourself when debugging code.
for (i in 1:2) {
  cat(paste0("Currently, i = ", i, "\n"))
}
## Currently, i = 1
## Currently, i = 2

ex_func = function(){
  for (j in 1:2) { 
    cat(paste0("Currently, i = ", j, "\n"))
  }
}
ex_func()
## Currently, i = 1
## Currently, i = 2
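
Two related tools are worth knowing when building messages: the collapse argument of paste(), and sprintf() for templated strings. A short sketch with made-up values:

```r
f = paste0("x", 1:5)

# collapse glues the elements of one vector into a single string
paste(f, collapse = " + ")
## [1] "x1 + x2 + x3 + x4 + x5"

# sprintf() fills a template; %d expects an integer, %.2f a double
sprintf("Iteration %d of %d took %.2f seconds", 1L, 2L, 0.5)
## [1] "Iteration 1 of 2 took 0.50 seconds"
```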

3.4.11 Matrices

  • Matrices have two indexes:
    • first index is rows
    • second index is columns
  • By default matrices are populated by columns:
(mat = matrix(data = 1:9, ncol = 3))
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
  • To input by row:
(mat = matrix(data = 1:9, ncol = 3, byrow = TRUE))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Subsetting matrices

Need to specify a row and a column

mat[1, 1]
## [1] 1
# First row
mat[1, ]
## [1] 1 2 3
# Third column
mat[ , 3]
## [1] 3 6 9
  • First and third row with the second column removed
mat[c(1, 3), -2]
##      [,1] [,2]
## [1,]    1    3
## [2,]    7    9
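
Two details are worth checking by hand: filling by row gives the transpose of filling by column, and pulling out a single row drops the result to a plain vector unless `drop = FALSE` is used. A sketch with illustrative names:

```r
by_col = matrix(data = 1:9, ncol = 3)
by_row = matrix(data = 1:9, ncol = 3, byrow = TRUE)

# Filling by row is the transpose of filling by column
same_as_transpose = identical(by_row, t(by_col))  # TRUE

# A single row loses its dimensions by default
row_vec = by_row[1, ]
row_mat = by_row[1, , drop = FALSE]
dim(row_vec)  # NULL
dim(row_mat)  # 1 3
```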

3.4.12 cbind and rbind

  • cbind stacks matrices next to each other
  • rbind stacks matrices on top of one another
mat1 = matrix(1, ncol = 2, nrow = 2)
mat2 = matrix(2, ncol = 2, nrow = 2)
mat3 = matrix(3, ncol = 2, nrow = 2)
#cbind
cbind(mat1, mat2, mat3)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    2    2    3    3
## [2,]    1    1    2    2    3    3
#rbind
rbind(mat1, mat2, mat3)
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    2    2
## [4,]    2    2
## [5,]    3    3
## [6,]    3    3
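
cbind and rbind also accept plain vectors, and they refuse to combine matrices whose shapes do not line up. A minimal sketch (the object names are hypothetical):

```r
m = matrix(1:4, ncol = 2)

wider  = cbind(m, 9:10)     # adds a column: 2 x 3
taller = rbind(m, c(0, 0))  # adds a row:    3 x 2
dim(wider)
dim(taller)

# Matrices with different numbers of rows cannot be cbind-ed
mismatch = tryCatch(
  cbind(m, matrix(1, nrow = 3, ncol = 1)),
  error = function(e) "error"
)
mismatch  # "error"
```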

3.4.13 Combining

(wide = cbind(mat1, mat2, mat3))
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    2    2    3    3
## [2,]    1    1    2    2    3    3
(tall = rbind(mat1, mat2, mat3))
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    2    2
## [4,]    2    2
## [5,]    3    3
## [6,]    3    3
rbind(wide, cbind(tall, tall, tall))
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    2    2    3    3
## [2,]    1    1    2    2    3    3
## [3,]    1    1    1    1    1    1
## [4,]    1    1    1    1    1    1
## [5,]    2    2    2    2    2    2
## [6,]    2    2    2    2    2    2
## [7,]    3    3    3    3    3    3
## [8,]    3    3    3    3    3    3

3.4.14 Arrays

#Quick note on arrays
arr = array(1:12, dim = c(2, 2, 3))
arr
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12
dim(arr)
## [1] 2 2 3
#three indexes
arr[1, , ]
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    3    7   11
arr[1, , 1]
## [1] 1 3
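
An array is stored as one long vector, filled with the first index varying fastest. The sketch below works out the position of element `[2, 1, 3]` by hand for the 2 x 2 x 3 array above (`i`, `j`, `k`, and `linear` are illustrative names):

```r
arr = array(1:12, dim = c(2, 2, 3))

i = 2; j = 1; k = 3
# Position in the underlying vector: rows first, then columns, then slices
linear = i + (j - 1) * 2 + (k - 1) * 2 * 2  # 10

by_three_indexes = arr[i, j, k]  # 10
by_one_index     = arr[linear]   # 10
```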

3.4.15 Lists

  • Lists are very flexible R objects
  • The items of a list can be of any class, and a single list can mix many different classes
  • Lists are said to be recursive because an element of a list could potentially be a list itself
  • Lists are indexed with double brackets “[[]]” or with a “$name”
#Create a list
l =  list(3, 
          rep(0, 3), 
          matrix(c(1:4), ncol = 2), 
          paste("X", c(1:5), sep = ""))
l
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 0 0 0
## 
## [[3]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## [[4]]
## [1] "X1" "X2" "X3" "X4" "X5"

#Another way to create this list
#initialize an empty list
l = list()
l[[1]] = 3
l[[2]] = rep(0, 3)

3.4.16 Named Elements

l$three = matrix(c(1:4), ncol = 2)
l$four = paste("X", c(1:5), sep = "")
l
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 0 0 0
## 
## $three
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $four
## [1] "X1" "X2" "X3" "X4" "X5"

l[[4]]
## [1] "X1" "X2" "X3" "X4" "X5"
l$four
## [1] "X1" "X2" "X3" "X4" "X5"
l[["four"]]
## [1] "X1" "X2" "X3" "X4" "X5"

#Let's look at the names
names(l)
## [1] ""      ""      "three" "four"

3.4.17 More on Lists

#assign names
names(l)[1:2] = c("one", "two")
names(l)
## [1] "one"   "two"   "three" "four"
#Let's look at the structure
str(l)
## List of 4
##  $ one  : num 3
##  $ two  : num [1:3] 0 0 0
##  $ three: int [1:2, 1:2] 1 2 3 4
##  $ four : chr [1:5] "X1" "X2" "X3" "X4" ...
#length of a list is the number of elements in the list
length(l)
## [1] 4
#Call an element of a list in two ways:
#By index
l[[3]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

#by name
l$three
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
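
Single brackets also work on lists, but they return a smaller list rather than the element itself. A short sketch with a hypothetical list:

```r
# Hypothetical list (illustration only)
lst_demo = list(one = 3, two = rep(0, 3))

sub_list = lst_demo["two"]    # still a list, of length 1
element  = lst_demo[["two"]]  # the numeric vector itself

class(sub_list)  # "list"
class(element)   # "numeric"
```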

3.4.18 Lists of lists

l2 = list(list(3, rep(0,3)), list(5, rep(1,6)))
l2
## [[1]]
## [[1]][[1]]
## [1] 3
## 
## [[1]][[2]]
## [1] 0 0 0
## 
## 
## [[2]]
## [[2]][[1]]
## [1] 5
## 
## [[2]][[2]]
## [1] 1 1 1 1 1 1
#First element is a list
l2[[1]]
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 0 0 0
#list of lists can be called by indexing as much as needed
l2[[1]][[1]]
## [1] 3
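
Chained double brackets can also be written as a single `[[` call with a vector of indexes, which R applies recursively. A sketch on a copy of the list above (named `nested` here for illustration):

```r
nested = list(list(3, rep(0, 3)), list(5, rep(1, 6)))

step_by_step = nested[[2]][[2]]   # second element of the second list
one_call     = nested[[c(2, 2)]]  # same element, recursive indexing

identical(step_by_step, one_call)  # TRUE
```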

3.4.19 Data Frames

  • Data frames are just lists where each element must be a vector of the same length
  • Much of the data that we are interested in can easily be analyzed as a data frame
  • Data frames have properties of a matrix and a list
  • They can be subset by index OR name
#create a data frame
df = data.frame(V1 = 1:10, 
                V2 = rep(1, 10),
                V3 = seq(1, 20, 2),
                V4 = c(rep("A", 3), rep("B", 7)),
                V5 = rnorm(10, 0, 5),
                V6 = paste0("X", c(1, 1, 2, 3, 1, 3, 1, 2, 3, 4)))

head(df)
##   V1 V2 V3 V4        V5 V6
## 1  1  1  1  A 3.0553278 X1
## 2  2  1  3  A 1.6535375 X1
## 3  3  1  5  A 1.4776035 X2
## 4  4  1  7  B 3.4165959 X3
## 5  5  1  9  B 2.4695397 X1
## 6  6  1 11  B 0.3894594 X3

#These are equivalent
names(df)
## [1] "V1" "V2" "V3" "V4" "V5" "V6"
colnames(df)
## [1] "V1" "V2" "V3" "V4" "V5" "V6"
#Row names default to consecutive integers
rownames(df)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

#Pull out first column
df[,1]
##  [1]  1  2  3  4  5  6  7  8  9 10

#pull out first row
df[1,]
##   V1 V2 V3 V4       V5 V6
## 1  1  1  1  A 3.055328 X1

#Everything without the first three rows
df[-c(1:3),]
##    V1 V2 V3 V4         V5 V6
## 4   4  1  7  B  3.4165959 X3
## 5   5  1  9  B  2.4695397 X1
## 6   6  1 11  B  0.3894594 X3
## 7   7  1 13  B -3.2291396 X1
## 8   8  1 15  B -3.0526225 X2
## 9   9  1 17  B  0.7943039 X3
## 10 10  1 19  B  0.4331010 X4

#Rows 3, 6, and 8 with columns 2 and 4
df[c(3,6,8),c(2,4)]
##   V2 V4
## 3  1  A
## 6  1  B
## 8  1  B

#Rows can be added to a data.frame by using rbind
#Careful though!  This makes everything a character!
test = rbind(df, c(1,1,1,"A",1,"X2"))
str(test)
## 'data.frame':    11 obs. of  6 variables:
##  $ V1: chr  "1" "2" "3" "4" ...
##  $ V2: chr  "1" "1" "1" "1" ...
##  $ V3: chr  "1" "3" "5" "7" ...
##  $ V4: chr  "A" "A" "A" "B" ...
##  $ V5: chr  "3.05532783365282" "1.65353747776376" "1.47760353907312" "3.41659592659188" ...
##  $ V6: chr  "X1" "X1" "X2" "X3" ...

#The "right" way to do this!
test = rbind(df, data.frame(V1 = 11,
                             V2 = 1,
                             V3 = 1,
                             V4 = "A",
                             V5 = 1,
                             V6 = "X2"))
test
##    V1 V2 V3 V4         V5 V6
## 1   1  1  1  A  3.0553278 X1
## 2   2  1  3  A  1.6535375 X1
## 3   3  1  5  A  1.4776035 X2
## 4   4  1  7  B  3.4165959 X3
## 5   5  1  9  B  2.4695397 X1
## 6   6  1 11  B  0.3894594 X3
## 7   7  1 13  B -3.2291396 X1
## 8   8  1 15  B -3.0526225 X2
## 9   9  1 17  B  0.7943039 X3
## 10 10  1 19  B  0.4331010 X4
## 11 11  1  1  A  1.0000000 X2

#Columns can be added using cbind 
cbind(df,100:109)
##    V1 V2 V3 V4         V5 V6 100:109
## 1   1  1  1  A  3.0553278 X1     100
## 2   2  1  3  A  1.6535375 X1     101
## 3   3  1  5  A  1.4776035 X2     102
## 4   4  1  7  B  3.4165959 X3     103
## 5   5  1  9  B  2.4695397 X1     104
## 6   6  1 11  B  0.3894594 X3     105
## 7   7  1 13  B -3.2291396 X1     106
## 8   8  1 15  B -3.0526225 X2     107
## 9   9  1 17  B  0.7943039 X3     108
## 10 10  1 19  B  0.4331010 X4     109

#Or this:
df$new_col = 100:109

#You can derive new columns in this way
df$sumV1_V2 = df$V1 + df$V2
df
##    V1 V2 V3 V4         V5 V6 new_col sumV1_V2
## 1   1  1  1  A  3.0553278 X1     100        2
## 2   2  1  3  A  1.6535375 X1     101        3
## 3   3  1  5  A  1.4776035 X2     102        4
## 4   4  1  7  B  3.4165959 X3     103        5
## 5   5  1  9  B  2.4695397 X1     104        6
## 6   6  1 11  B  0.3894594 X3     105        7
## 7   7  1 13  B -3.2291396 X1     106        8
## 8   8  1 15  B -3.0526225 X2     107        9
## 9   9  1 17  B  0.7943039 X3     108       10
## 10 10  1 19  B  0.4331010 X4     109       11
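
Since a data frame has properties of both a matrix and a list, rows can be filtered with a logical vector and columns can be pulled out by name. A minimal sketch on a small, made-up data frame (`scores` is hypothetical):

```r
# Hypothetical data frame (illustration only)
scores = data.frame(id    = 1:4,
                    group = c("A", "A", "B", "B"),
                    value = c(2.5, -1, 0.5, 3))

# Matrix-style: logical vector for rows, names for columns
b_rows   = scores[scores$group == "B", ]
positive = scores[scores$value > 0, c("id", "value")]

# List-style: both return the column as a vector
by_dollar  = scores$value
by_bracket = scores[["value"]]
```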

3.5 HoPR Ch 4: R Notation & Ch 5: Modifying Values

3.5.1 Subsetting

By Index

Vectors

(nums = 6:10)
## [1]  6  7  8  9 10
nums[3:5]
## [1]  8  9 10

Matrices

(mat = matrix(data = 12:15, ncol = 2))
##      [,1] [,2]
## [1,]   12   14
## [2,]   13   15
mat[1, ]
## [1] 12 14

Lists

lst = list(nums = nums, mat = mat)
lst[[1]]
## [1]  6  7  8  9 10

By Name

Vectors

names(nums) = LETTERS[16:20]
nums[c("S", "R")]
## S R 
## 9 8

Matrices

colnames(mat) = c("Joe", "Amy")
mat[ , "Joe"]
## [1] 12 13
mat[ , "Joe", drop = FALSE]
##      Joe
## [1,]  12
## [2,]  13

Lists

lst[["nums"]]
## [1]  6  7  8  9 10

3.5.2 Subsetting by Logical Vector

By Index

Vectors

nums[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
##  P  R  T 
##  6  8 10

Matrices

binary = c(FALSE, TRUE)
mat[binary, binary, drop = FALSE]
##      Amy
## [1,]  15

Lists

lst[c(TRUE, TRUE)]
## $nums
## [1]  6  7  8  9 10
## 
## $mat
##      [,1] [,2]
## [1,]   12   14
## [2,]   13   15

Helper Functions

%in% for logical subsetting
set.seed(1234)
(rand_let_vec = sample(LETTERS, 60, replace = TRUE))
##  [1] "P" "Z" "V" "E" "L" "O" "I" "E" "F" "P" "D" "B" "G" "V" "Z" "F" "O" "N" "T"
## [20] "N" "X" "D" "D" "U" "H" "T" "X" "C" "D" "Z" "E" "B" "O" "H" "T" "P" "L" "C"
## [39] "W" "I" "S" "V" "D" "H" "J" "K" "B" "U" "O" "V" "Q" "F" "X" "S" "F" "Q" "Q"
## [58] "Y" "H" "Z"
(let_binary = LETTERS %in% rand_let_vec)
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE
LETTERS[!let_binary]
## [1] "A" "M" "R"


which for index subsetting
(let_index = base::which(!let_binary))
## [1]  1 13 18
LETTERS[let_index]
## [1] "A" "M" "R"
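
The same idea on a smaller, hand-checkable example: `%in%` gives a logical mask, which gives positions, and base::setdiff returns the unmatched values directly. The vectors `pool` and `drawn` are made up for illustration:

```r
# Hypothetical vectors (illustration only)
pool  = LETTERS[1:5]
drawn = c("B", "D", "B", "E")

missing_mask  = !(pool %in% drawn)   # TRUE FALSE TRUE FALSE FALSE
by_mask       = pool[missing_mask]   # "A" "C"
missing_index = which(missing_mask)  # 1 3
by_setdiff    = setdiff(pool, drawn) # "A" "C"
```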

3.5.3 See also: any, all, and xor

any

bool = c(TRUE, FALSE, TRUE, TRUE)
base::any(bool)
## [1] TRUE
TRUE | FALSE | TRUE | TRUE
## [1] TRUE
(date_val = lubridate::today())
## [1] "2024-10-10"
(b1 = is.character(date_val))
## [1] FALSE
(b2 = is.numeric(date_val))
## [1] FALSE
(b3 = lubridate::is.Date(date_val))
## [1] TRUE
any(c(b1, b2, b3))
## [1] TRUE

all

base::all(bool)
## [1] FALSE
TRUE & FALSE & TRUE & TRUE
## [1] FALSE
rep(TRUE, 100) |> all()
## [1] TRUE

xor

xor(TRUE, FALSE)
## [1] TRUE
xor(TRUE, TRUE)
## [1] FALSE
xor(bool)
## Error in xor(bool): argument "y" is missing, with no default
sum(bool) == 1
## [1] FALSE
sum(!bool) == 1
## [1] TRUE
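
Unlike any and all, xor takes exactly two arguments and works elementwise. One possible way to fold it over a whole vector is Reduce, which then answers "is an odd number of elements TRUE?" rather than "is exactly one TRUE?". A sketch, with `other` as a hypothetical second vector:

```r
bool  = c(TRUE, FALSE, TRUE, TRUE)
other = c(TRUE, TRUE, FALSE, FALSE)  # hypothetical second vector

# Elementwise exclusive or
elementwise = xor(bool, other)  # FALSE TRUE TRUE TRUE

# Folding xor over a vector tests for an odd number of TRUEs
odd_count = Reduce(xor, bool)     # TRUE (three TRUEs)
same_test = sum(bool) %% 2 == 1   # TRUE
```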

3.6 HoPR Ch 6: Environments

3.6.1 Most Important Take-Away: Don’t Walk on the Flowers

Zelda: Breath of the Wild:

If you don’t, someone might get very angry:

  • Don’t walk on my flowers! You never know when setting a global variable will have unintended consequences.
  • Don’t mess around with global variables.
  • There are two times changing global variables makes sense:
    • Debugging your function. cat is limited in its ability to print, e.g., matrices. When you’re done debugging, delete the global variable-setting code.
    • You are writing a function which explicitly sets a global variable (see, for example: base::options() or ggplot2::theme_set()).
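
To make the warning concrete, here is a minimal sketch of the difference between assigning inside a function (safe) and using the superassignment operator `<<-` (walking on the flowers). The function names are hypothetical:

```r
counter = 0

# Ordinary assignment creates a local copy; the outer counter is untouched
bump_local = function() {
  counter = counter + 1
  counter
}

# <<- modifies the variable in the enclosing environment
bump_outer = function() {
  counter <<- counter + 1
  counter
}

local_result  = bump_local()  # 1
after_local   = counter       # still 0
outer_result  = bump_outer()  # 1
after_outer   = counter       # now 1
```

Returning a value and assigning it at the call site (`counter = bump_local()`) is usually the cleaner design.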

3.7 HoPR Ch 7: Programming

3.7.1 Conditional Statements

a = TRUE
b = FALSE
4 + a
## [1] 5

d = 0
if (d == FALSE) {
  print("d is FALSE")
}
## [1] "d is FALSE"

#also works
d = TRUE
if (d) {
  print("d is TRUE")
}
## [1] "d is TRUE"

#Also works
d = 1
if (d) {
  print("d is TRUE")
}
## [1] "d is TRUE"

if (-then)

sky = "sunny"
if (sky == "sunny") {
  print("Leave your umbrella at home!")
}
## [1] "Leave your umbrella at home!"

sky = "cloudy"
if (sky == "sunny") {
  print("Leave your umbrella at home!")
}

if then or else

sky = "cellphone"
if (sky == "sunny") {
  print("Leave your umbrella at home!")
} else {
  print("Bring your umbrella")
}
## [1] "Bring your umbrella"

3.7.2 Conditional Statements

Nested Conditional Statements

sky = "sunny"
if (sky == "sunny") {
  print("Leave your umbrella at home!")
} else {
  if (sky == "cloudy") {
    print("Bring your umbrella")
  } else {
    if (sky == "snowing") {
      print("Grab your parka")
    } else {
      print("Your guess is as good as mine")
    }
  }
}
## [1] "Leave your umbrella at home!"

Sequential Conditional Statements

sky = "dsfsad"
if (sky == "sunny") {
  print("Leave your umbrella at home!")
} else if (sky == "cloudy") {
  print("Bring your umbrella")
} else if (sky == "snowing") {
  print("Grab your parka")
} else {
  print("Your guess is as good as mine")
}
## [1] "Your guess is as good as mine"

ifelse

sky = "sunny"
ifelse(test = sky == "sunny", 
       yes = "Sunny", no = "not sunny")
## [1] "Sunny"
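
The main reason to reach for ifelse is that it is vectorized: it tests every element, whereas `if` expects a single TRUE or FALSE. For a chain of equality tests on one value, base::switch is a compact alternative to sequential `else if`. A sketch (the `pick_gear` function is hypothetical):

```r
skies = c("sunny", "cloudy", "snowing")

# ifelse checks each element of the vector
advice = ifelse(test = skies == "sunny",
                yes = "No umbrella", no = "Umbrella")
advice  # "No umbrella" "Umbrella" "Umbrella"

# switch matches one string; the unnamed last argument is the default
pick_gear = function(sky) {
  switch(sky,
         sunny   = "Leave your umbrella at home!",
         cloudy  = "Bring your umbrella",
         snowing = "Grab your parka",
         "Your guess is as good as mine")
}
gear_snow    = pick_gear("snowing")
gear_unknown = pick_gear("dsfsad")
```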

3.8 HoPR Ch 8: S3 System

3.8.1 Object-Oriented Programming Systems in R

There are many systems in R:

  • S3: the only system used in base:: and stats::
  • S4: seen a lot on Bioconductor
  • R6: soon to be superseded by R7
  • R7 (since renamed S7): under development by the R Consortium and Hadley Wickham
  • ggproto: specific to ggplot2 and its spin-offs

3.8.2 Attributes

A matrix, x, with dimnames.

x = cbind(a = 5:6, p = 2L)
attributes(x)
## $dim
## [1] 2 2
## 
## $dimnames
## $dimnames[[1]]
## NULL
## 
## $dimnames[[2]]
## [1] "a" "p"
#strip an object's attributes:
attributes(x) = NULL
attributes(x)
## NULL
x # now just a vector of length 4
## [1] 5 6 2 2
attributes(x) = list(
  comment = "really special", 
  dim = c(2,2), names = paste(1:4),
  dimnames = list(LETTERS[1:2], 
                  letters[1:2])
  )
x
##   a b
## A 5 2
## B 6 2
## attr(,"names")
## [1] "1" "2" "3" "4"
names(x)
## [1] "1" "2" "3" "4"
dim(x)
## [1] 2 2
comment(x)
## [1] "really special"
dimnames(x)
## [[1]]
## [1] "A" "B"
## 
## [[2]]
## [1] "a" "b"

3.8.3 What does attributes() tell us about matrices, etc.?

x = cbind(a = 5:6, p = 2L)
x
##      a p
## [1,] 5 2
## [2,] 6 2
attributes(x)
## $dim
## [1] 2 2
## 
## $dimnames
## $dimnames[[1]]
## NULL
## 
## $dimnames[[2]]
## [1] "a" "p"
attributes(x) = NULL
x
## [1] 5 6 2 2
x = data.frame(a = 1:3, f = 4)
x
##   a f
## 1 1 4
## 2 2 4
## 3 3 4
attributes(x)
## $names
## [1] "a" "f"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3
attributes(x) = NULL
x
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] 4 4 4
x = array(data = 1:24, dim = 2:4)
x
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   13   15   17
## [2,]   14   16   18
## 
## , , 4
## 
##      [,1] [,2] [,3]
## [1,]   19   21   23
## [2,]   20   22   24
attributes(x)
## $dim
## [1] 2 3 4
attributes(x) = NULL
x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

3.8.4 If we use the analogy that a vector in R is a dog…

[Images: Vector, Scalar, Matrix, Array, Data Frame]

Analogy decoded:

  • Vector \(\rightarrow\) dog
  • Scalar \(\rightarrow\) small dog (a vector with one element)
  • Matrix \(\rightarrow\) one dog in a trench coat (a vector with a dim attribute of length 2)
  • Array \(\rightarrow\) one dog in a trench coat (a vector with a dim attribute of length 3 or more)
  • Data Frame \(\rightarrow\) at least one dog in a trench coat (a list of vectors)
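
The trench coat can be put on and taken off by hand: setting the dim attribute turns a plain vector into a matrix, and removing it turns the matrix back. A minimal sketch (`v` is an illustrative name):

```r
v = 1:6
before = is.matrix(v)   # FALSE: just a vector

dim(v) = c(2, 3)        # put on the trench coat
during = is.matrix(v)   # TRUE
corner = v[2, 3]        # 6: matrix-style indexing now works

dim(v) = NULL           # take it off again
after = is.matrix(v)    # FALSE
```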

3.8.5 Nested Attributes

xyzv = list(x = cbind(a = 5:6, p = 2L), 
     y = data.frame(a = 1:3, f = 4), 
     z = array(data = 1:24, dim = 2:4), 
     v = vector(length = 5L))
attributes(xyzv)
## $names
## [1] "x" "y" "z" "v"
  • Here, we see that attributes only applies to the top level of an object (in this case, the list).
  • So, how do we get the next level?
  • One option is to use lapply (more on apply and its cousins later).
lapply(X = xyzv, FUN = attributes)
## $x
## $x$dim
## [1] 2 2
## 
## $x$dimnames
## $x$dimnames[[1]]
## NULL
## 
## $x$dimnames[[2]]
## [1] "a" "p"
## 
## 
## 
## $y
## $y$names
## [1] "a" "f"
## 
## $y$class
## [1] "data.frame"
## 
## $y$row.names
## [1] 1 2 3
## 
## 
## $z
## $z$dim
## [1] 2 3 4
## 
## 
## $v
## NULL

3.9 HoPR Ch 9: Loops

3.9.1 for, while, repeat: Just Don’t if you can Avoid it

  • Attempt to vectorize your code instead.
  • base:: has the apply family of functions (apply, lapply, sapply, tapply, mapply, replicate, etc).
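
As a small illustration of the advice above, here is the same computation written three ways: an explicit loop, sapply, and a fully vectorized expression. The names are hypothetical.

```r
values = 1:5

# Explicit loop: preallocate, then fill element by element
squares_loop = numeric(length(values))
for (k in seq_along(values)) {
  squares_loop[k] = values[k]^2
}

# apply family: the function is applied to each element
squares_sapply = sapply(X = values, FUN = function(v) v^2)

# Vectorized: arithmetic already works on whole vectors
squares_vec = values^2

all.equal(squares_loop, squares_vec)  # TRUE
```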

3.9.2 While Loops

  • Note: x += 1 doesn’t work in R!
x = 1
while (x <= 7) {
  x = x + 1
  print(x)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
cat("Last x is", x)
## Last x is 8
print(paste0("Last x is ",x))
## [1] "Last x is 8"
x = 1
while (x < 5){
  print("Hello")
  x = x*2
}
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"

cat("last value of x = ", x)
## last value of x =  8
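
For completeness, repeat has no condition of its own and runs until break is called. A sketch of the doubling loop above rewritten that way (`steps` is an added, illustrative counter):

```r
x = 1
steps = 0
repeat {
  x = x * 2
  steps = steps + 1
  # Without this break, the loop would never end
  if (x >= 5) break
}
x      # 8
steps  # 3
```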

3.9.3 For Loops

#Consecutive indexes
for (i in 1:10) {
  print(i)
  i = i + 1 #has no effect on the next iteration: for resets i each time
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

for (i in 10:1){
  print(i)
}
## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1

#Don't need to be consecutive!
for (i in c(2, 3, 5, 7, 11)){
  print(i)
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
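
One common pitfall when looping over the indexes of a vector is that `1:length(x)` counts down to 0 when `x` is empty, so the body runs twice. base::seq_along avoids this. A minimal sketch with hypothetical names:

```r
empty = c()

# 1:0 is c(1, 0), so this loop body runs twice
count_bad = 0
for (idx in 1:length(empty)) {
  count_bad = count_bad + 1
}

# seq_along(empty) has length 0, so the body never runs
count_good = 0
for (idx in seq_along(empty)) {
  count_good = count_good + 1
}

count_bad   # 2
count_good  # 0
```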

3.10 HoPR Ch 10: Speed

3.10.1 Speed

set.seed(1234)
integer_vec = rnorm(n = 10^5, sd = 5) |> 
  as.integer()
numeric_vec = as.numeric(integer_vec)
print_frame = data.frame(
    integer = integer_vec, 
    numeric = numeric_vec
  )
dplyr::glimpse(print_frame)
## Rows: 100,000
## Columns: 2
## $ integer <int> -6, 1, 5, -11, 2, 2, -2, -2, -2, -4, -2, -4, -3, 0, 4, 0, -2, …
## $ numeric <dbl> -6, 1, 5, -11, 2, 2, -2, -2, -2, -4, -2, -4, -3, 0, 4, 0, -2, …
sum(integer_vec) |> bench::mark()
## # A tibble: 1 × 6
##   expression            min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 sum(integer_vec)     29µs   29.7µs    29548.        0B        0
sum(numeric_vec) |> bench::mark()
## # A tibble: 1 × 6
##   expression            min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 sum(numeric_vec)   57.8µs     59µs    15999.        0B        0
integer_vec |> lobstr::obj_size()
## 400.05 kB
numeric_vec |> lobstr::obj_size()
## 800.05 kB
  • The bench package is currently the standard package for testing speed in R, and lobstr::obj_size does a better job of measuring object size than its base equivalent, object.size.
  • As we can see here, storing a variable as an integer type (when the variable only takes on integer values) allows for faster calculations and takes up less space in memory.
  • Perhaps to no one’s surprise, the double type takes up about double the space and double the time.
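
If bench and lobstr are not installed, the memory comparison can be approximated with base R alone. The sketch below uses made-up vectors and utils::object.size, which is only an estimate; it also shows one trade-off of the integer type, namely that it overflows to NA.

```r
# Hypothetical vectors (illustration only)
int_vec = rep(c(1L, 2L), times = 50000)
dbl_vec = as.numeric(int_vec)

typeof(int_vec)  # "integer"
typeof(dbl_vec)  # "double"

# Base R size estimates, in bytes
size_int = as.numeric(utils::object.size(int_vec))
size_dbl = as.numeric(utils::object.size(dbl_vec))
size_ratio = size_dbl / size_int  # roughly 2

# The trade-off: integer arithmetic overflows to NA (with a warning)
overflowed = suppressWarnings(.Machine$integer.max + 1L)
is.na(overflowed)  # TRUE
```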