1  Getting Started in R

These notes will help get you started with R. They are mostly quite non-graphical, focusing instead on the basics of the R language.

1.1 Setting up R and RStudio

To get started, you need to do two things:

  1. Download R CRAN
  2. Download and Install RStudio

R is the software. RStudio is what is known as an Integrated Development Environment (IDE). Basically, an IDE is a way of running base software the provides various tools to make the coding experience easier (e.g., an editor with syntax highlighting, debugger, and interactive graphics facilities).

Here is a brief video explaining what to do:

What follows are some old notes that are very Base-R focused. The thing is, even though most people use the tidyverse tools these days, it’s still valuable to understand Base-R (since that’s what everything else is built upon!). This is particularly true for the material that forms the bulk of these notes. Making theoretical scientific figures (chapter Chapter 3) is simply easier in Base-R. You typically don’t need to do complex data wrangling to make a theoretical figure, so the powerful data-manipulation tools of dplyr, for instance, are unnecessary. Sometimes, we want to utilize the grid package, which underlies the tidyverse graphics library ggplot2, directly. Again, I find this more straightforward in Base-R.

The igraph package for drawing graphs (a.k.a., “networks”) that we discuss in chapter Chapter 4 also runs in Base-R.

1.2 What Is R?

  • R is statistical numerical software

  • R is a “dialect” of the S statistical programming language

  • R is a system for interactive data analysis

  • R is a high-level programming language

  • R is free

  • R is state-of-the-art in statistical computing. It is what many (most?) research statisticians use in their work

1.3 Why Use R?

  • R is FREE! That, by itself, is almost enough. No complicated licensing. Broad dissemination of research methodologies and results, etc.

  • R is available for a variety of computer platforms (e.g., Linux, MacOS, Windows).

  • R is widely used by professional statisticians, social scientists, biologists, demographers, and other scientists. This increases the likelihood that code will exist to do a calculation you might want to do.

  • R has remarkable online help lists, tutorials, etc.

  • R represents the state-of-the-art in statistical computing.

1.4 Wouldn’t Something Menu-Driven Be Easier?

  • Fallacious thinking

    • For teaching, text-based input is always better

    • Example code can be copied and input exactly; you can then tweak it and see what happens, facilitating the learning process

  • An example

  • What follows is a pretty complicated graph of the grooming interactions of a group of rhesus monkeys, Macaca mulatta, observed by Sade (1972)

  • With the code I used to generate this graph, you can recreate the figure exactly. Try it! It doesn’t matter if you have no idea what you’re doing yet. That’s the point.

  • The only thing you need to make this figure is to install and load the library igraph

1.5 Rhesus Money Grooming Network

require(igraph)
rhesus <- read.table("./data/sade1.txt", skip=1, header=FALSE)
rhesus <- as.matrix(rhesus)
nms <- c("066", "R006", "CN", "ER", "CY", "EC", "EZ", "004", "065", "022", "076", 
         "AC", "EK", "DL", "KD", "KE")
sex <- c(rep("M",7), rep("F",9))
dimnames(rhesus)[[1]] <- nms
dimnames(rhesus)[[2]] <- nms
grhesus <- graph_from_adjacency_matrix(rhesus, weighted=TRUE)
V(grhesus)$sex <- sex

rhesus.layout <- layout.kamada.kawai(grhesus)
plot(grhesus, 
     edge.width=log10(E(grhesus)$weight)+1, 
     edge.arrow.width=0.5,
     vertex.label=V(grhesus)$name,
     vertex.label.family="Helvetica",
     vertex.color=as.numeric(V(grhesus)$sex=="F")+5, 
     layout=rhesus.layout)

Rhesus monkey grooming network (Slade 1972).
  • A note on loading data: the above code loads a data file apparently called "./data/sade1.txt"

  • What does that mean?

  • The one dot followed by a slash, ./, means to go into the sub-directory called data, which is in our current working directory, and read the text file called sade1.txt

  • If the data sub-directory was actually in, say, the same directory where our working directory is located (i.e., they were two sub-directories of the same higher-level folder), we would use two dots, ../, which means to go out one directory in the hierarchy

  • This is actually not R but the underlying OS file system

  • Learning about the file/directory structure of your computer is actually an important (and under-appreciated) data-science skill

  • Check out the fantastic MIT course The Missing Semester of Your CS Education for information on various tools that can really improve your workflows and general skill-level

1.6 A Few Conventions and Other Helpful Bits

  • There are some things that you will see over and over in the code embedded in this document

  • The assignment operator <- is used to assign a value to a name.

  • The value is on the right-hand side of the operator and the name is on the left side

  • You can use = for assignment, but I don’t recommend it (it doesn’t work at all levels, makes the code harder to read, etc.)

  • Different environments make it more or less easy to use <-. In RStudio, hit the option key and the minus sign simultaneously

  • Comments are marked by #: anything following the hash will be ignored by R

  • Use comments liberally to help you (and others) understand your code

  • In these notes, the output that you would see on your own command line will be white following a grey box (the input). Frequently, it will begin with a [1], which indicates the first element of a vector

  • Sometimes I enclose a command in parentheses; this is simply to force R to echo the output (for pedagogical purposes)

# a comment
x <- c(1,2,3)
(y <- c(1,2,3))
[1] 1 2 3
  • You will probably want to seek help on functions. At the command line, simply type a question mark followed immediately by the function you want to query, ?function.name

    • In Rstudio, you may be given auto-complete suggestions, which you can click on to save typing
  • When you are done with your R session, type q() at the command line

  • R will ask you if you want to save your workspace. For now, you probably don’t.

  • Check out the much more comprehensive Introduction to R for all the language details.

1.7 R as a Calculator

# addition
2+2
[1] 4
# multiplication
2*3
[1] 6
a <- 2
b <- 3
a*b
[1] 6
# division
2/3
[1] 0.6666667
b/a
[1] 1.5
1/b/a
[1] 0.1666667
# note order of operations!
1/(b/a)
[1] 0.6666667
# parentheses can override order of operations
# an exponential
exp(-2)
[1] 0.1353353
# why we age
r <- 0.02
exp(-r*45)
[1] 0.4065697
# something more tricky
exp(log(2))
[1] 2
# generate 20 normally distributed random numbers
rnorm(20)
 [1]  0.2009012  2.2207369 -0.3607866  1.4697813 -1.4384434  0.8721387
 [7] -0.3946359  0.7073727  0.5261874  0.6818287 -0.4126075  0.5687733
[13] -0.6435988  0.3862529  1.0785412  0.1752227  0.2863084 -0.0291453
[19]  0.6750149 -0.8180922

1.8 Data Types

Numeric

All numbers in R are of the form double (i.e., double-precision floating point numbers). This can be a bit confusing for people who are used to languages with integer data types (like, most languages!). Entering something that looks like an integer doesn’t mean it is.

# it looks like an integer, but don't be fooled!
a <- 2
is.numeric(a)
[1] TRUE
is.integer(a)
[1] FALSE
is.double(a)
[1] TRUE
Integer

OK, technically R does have an integer class, but it is used very rarely and many functions will convert integers into doubles anyway. If you really must have an integer (e.g., because you are passing output to external C or FORTRAN code that expects it), add the suffix L to the entered number.

a <- 2L
is.integer(a)
[1] TRUE
Character

Strings are represented by the character data class.

(countries <- c("Uganda", "Tanzania", "Kenya", "Rwanda"))
[1] "Uganda"   "Tanzania" "Kenya"    "Rwanda"  
as.character(1:5)
[1] "1" "2" "3" "4" "5"
Factor

Factors are a data type for encoding categorical data. Notice that factors are printed without the quotes. This is because R stores them as a set of codes. Data of type “factor” are different from data of type “character” (which is what plain text is). Note the difference below between factor and character data. Because factors get used in statistical models, they are actually represented as numbers (the levels) that have associated names. Vectors, on the other hand, are just lists of numbers.

countries <- factor(c("Uganda", "Tanzania", "Kenya", "Rwanda"))
countries
[1] Uganda   Tanzania Kenya    Rwanda  
Levels: Kenya Rwanda Tanzania Uganda
# a trick to get some insight into how factors are handled by R
unclass(countries)
[1] 4 3 1 2
attr(,"levels")
[1] "Kenya"    "Rwanda"   "Tanzania" "Uganda"  
countries1 <- c("Uganda", "Tanzania", "Kenya", "Rwanda")
countries1 == unclass(countries1)
[1] TRUE TRUE TRUE TRUE
countries == unclass(countries)
[1] FALSE FALSE FALSE FALSE
Logical

TRUE and FALSE are reserved keywords, while T and F are global constants set to these. These logical variables are essential tools for subsetting data. You also use them extensively in setting optional arguments of functions.

t.or.f <- c(T,F,F,T,T)
is.logical(t.or.f)
[1] TRUE
aaa <- c(1,2,3,4,5)
# subset
aaa[t.or.f]
[1] 1 4 5
List

You can mix different types of data in a list using the command list(). This is useful when you write your own functions and want to output multiple things. Use the function str() to give you information about a list.

child1 <- list(name="mary", child.age=6,
status="foster",mother.alive=F, father.alive=T, parents.ages=c(24,35))
str(child1)
List of 6
 $ name        : chr "mary"
 $ child.age   : num 6
 $ status      : chr "foster"
 $ mother.alive: logi FALSE
 $ father.alive: logi TRUE
 $ parents.ages: num [1:2] 24 35

1.9 Coercion

  • Sometimes you have data in one type but need it in a different type

  • R provides a variety of methods to coerce data from one type to another

  • These methods are carried out by functions that begin with as.xxx, where xxx is the data type to which you are coercing

countries <- factor(c("Uganda", "Tanzania", "Kenya", "Rwanda"))
as.character(countries)
[1] "Uganda"   "Tanzania" "Kenya"    "Rwanda"  
as.numeric(countries)
[1] 4 3 1 2
# werk it backwards
countries1 <- c("Uganda", "Tanzania", "Kenya", "Rwanda")
as.factor(countries1)
[1] Uganda   Tanzania Kenya    Rwanda  
Levels: Kenya Rwanda Tanzania Uganda
# sometimes you want your numbers to actually be strings (e.g., when you make labels or column names)
as.character(1:5)
[1] "1" "2" "3" "4" "5"
# there actually is an integer class; it just doesn't get used much at all
a <- 2
is.integer(a)
[1] FALSE
is.integer(as.integer(a))
[1] TRUE
  • You can check the class of an object using functions that begin with is.xxx, where xxx is the data type you are querying (like is.integer() above)

1.10 Creating Vectors

  • A vector is a list of numbers – it turns out everything in R is represented as a vector but that doesn’t affect your life much.

  • In order to create a vector, you use the the function c(), which concatenates a list of items (hence the “c”).

  • You will use this a lot and it’s a super-common mistake to forget the c() when putting together a list of numbers, factors, etc.

  • If you do forget it, you will get a syntax error

  • Often we want either regularly spaced vectors or a vector of one value repeated. R has a number of facilities to perform these operations.

( manual <- c(1,3,5,7,9))
[1] 1 3 5 7 9
( count <- 1:20 )
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
( ages <- seq(0,85,by=5) )
 [1]  0  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
( ones <- rep(1,10) )
 [1] 1 1 1 1 1 1 1 1 1 1
( fourages <- rep(c(1,2),c(5,10)) )
 [1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
( equalspace <- seq(1,5, length=20) )
 [1] 1.000000 1.210526 1.421053 1.631579 1.842105 2.052632 2.263158 2.473684
 [9] 2.684211 2.894737 3.105263 3.315789 3.526316 3.736842 3.947368 4.157895
[17] 4.368421 4.578947 4.789474 5.000000
  • You can use rep() to repeat values.

  • Sometimes this can be tricky: the second argument tells R how many repetitions.

  • This argument can be a vector and this, along with the possibility of a vector of the items you want repeated too, allows you to create quite complex patterns very easily.

rep(2,10)
 [1] 2 2 2 2 2 2 2 2 2 2
rep(c(1,2),10)
 [1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
rep(c(1,2), c(5,10))
 [1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
rep("R roolz!", 3)
[1] "R roolz!" "R roolz!" "R roolz!"

1.11 Creating Matrices

  • As we said, a vector is a list of numbers

  • A matrix is a rectangular array of numbers – it is a vector of vectors, with the numbers indexed by row and column.

  • One way to create matrices is to “bind” columns together using the commands cbind() or rbind().

# age distribution of Gombe chimps in 1980 and 1986
cx1980 <- c(7, 13, 8, 13, 5, 35, 9)
cx1988 <- c(9, 11, 15, 8, 9, 38, 0)
( C <- cbind(cx1980, cx1988) )
     cx1980 cx1988
[1,]      7      9
[2,]     13     11
[3,]      8     15
[4,]     13      8
[5,]      5      9
[6,]     35     38
[7,]      9      0
# another way
C <- c(cx1980, cx1988)
(  C <- matrix(C, nrow=7, ncol=2) )
     [,1] [,2]
[1,]    7    9
[2,]   13   11
[3,]    8   15
[4,]   13    8
[5,]    5    9
[6,]   35   38
[7,]    9    0
  • What happens if we try to bind columns of different lengths?
# age distribution at Tai; Boesch uses fewer age classes
cxboesch <- c(18,10,15,30)
( C <- cbind(C,cxboesch) )
Warning in cbind(C, cxboesch): number of rows of result is not a multiple of
vector length (arg 2)
           cxboesch
[1,]  7  9       18
[2,] 13 11       10
[3,]  8 15       15
[4,] 13  8       30
[5,]  5  9       18
[6,] 35 38       10
[7,]  9  0       15
  • Both the warning message and the output can seem a little odd to the uninitiated

  • R uses a recycling rule for filling out vectors and matrices

  • When you try to put together things that are neither the same length nor multiples of each other, you get a warning

  • We can use the recycling rule to make a matrix of ones:

( X <- matrix(1,nr=3,nc=3) )
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1
  • Note that using the short version of nrow, nr, is sufficient. This is often true – you can use the minimum name that is unambiguous.

  • The matrix() command requires at least 3 arguments: (1) a vector of numbers that will form the elements of the matrix, (2) the number of rows, and (3) the number of columns.

  • For small matrices, you might want to enter the vectors of values manually

  • If you do this, it’s important to know that R fills matrices column-wise (the standard for FORTRAN and definitely not the way most people actually work!).

  • Use the optional argument byrow=TRUE to make R read in the data row-wise

# cross-classified data on hair/eye color 
freq <- c(32,11,10,3,  38,50,25,15,  10,10,7,7,  3,30,5,8)
hair <- c("Black", "Brown", "Red", "Blond")
eyes <- c("Brown", "Blue", "Hazel", "Green")
freqmat <- matrix(freq, nr=4, nc=4, byrow=TRUE)
dimnames(freqmat)[[1]] <- hair
dimnames(freqmat)[[2]] <- eyes
freqmat
      Brown Blue Hazel Green
Black    32   11    10     3
Brown    38   50    25    15
Red      10   10     7     7
Blond     3   30     5     8
# might as well do something with it
mosaicplot(freqmat)

1.12 Data Frames

  • A data frame is an R object which stores a data matrix. A data frame is essentially a list of variables which are all the same length. A single data frame can hold different types of variables.

  • To access a variable contained in a data frame, use the data frame name followed by the variable name, separated by a dollar sign, $.

# five columns of data
satu <- c(1,2,3,4,5)
dua <- c("a","b","c","d","e")
tiga <- sample(c(TRUE,FALSE),5,replace=TRUE)
empat <- LETTERS[7:11]
lima <- rnorm(5)
# construct a data frame
(collection <- data.frame(satu,dua,tiga,empat,lima))
  satu dua  tiga empat        lima
1    1   a FALSE     G -0.24012272
2    2   b  TRUE     H  1.62867872
3    3   c FALSE     I  1.34427662
4    4   d  TRUE     J  0.04494688
5    5   e FALSE     K  0.11910426
# extract the third variable
collection$tiga
[1] FALSE  TRUE FALSE  TRUE FALSE
  • by default, data.frame() will produce row numbers (seen to the left of the first column in the data frame collection above)

1.13 Directories and Paths

  • R uses a working directory. The default can be set in the Preferences or using an initialization file (i.e., a file that is always read when R starts up).

  • If you read in a file without specifying a path, R will search in the working directory; if there is no file matching the name you provide, you receive an error message

  • We can query the working directory using the command getwd() and we can change it using setwd()

  • You can always load a file by giving either a full or relative path

getwd()
[1] "/Users/jhj1/Teaching/graphics"
#setwd("/Users/jhj1/Projects/git/AABA2023_Workshop/Markdown")
## can't actually change it because it screws up the rendering!
  • Setting the working directory is actually not recommended

  • It is not a good scientific practice that favors replicability/interoperability/etc.

  • It’s generally better to use R Projects in RStudio (as we do in this workshop)

  • To start an R Project, either double-click on the .RProj file in the project’s directory or clicking on the R Project menu button in the upper right corner of your RStudio frame

  • To share the work you have done in an R Project with collaborators, students, or scientists looking to replicate your work, simply share the folder containing the .RProj file

  • When you quit R, you will be asked if you want to save your R session

  • If a session has previously been saved in your working directory, there will be a copy of the workspace in the R binary format named .RData

  • When R is started in a particular directory, if there is an .RData file in that directory, it will load automatically

  • This can lead to some surprising behavior if you don’t know that it can happen

  • Automatically saving and loading workspaces is also not recommended

  • Best scientific practice involves constructing your workspace using broadly-interoperable data formats (e.g., .csv files) and scripts

1.14 Reading Files

  • There are a number of ways to read data into R. Probably the easiest and most frequently used involves reading data from plain-text (ASCII) files. These files can be space, tab, or comma delimited.

  • You can create these files in a spreadsheet program like Excel or output them from most other statistical packages.

  • You can read these from a local directory or from an internet source

  • R expects delimited files to be “white-space delimited” with values separated by either tabs or spaces and rows separated by carriage returns

  • It’s always a good idea to specify whether or not you have a header (i.e., column names). If you don’t, say header=FALSE; if you do, obviously, say header=TRUE

# read a space-delimitted file (a sociomatrix of kids 17 kids aggressive acts toward each other)
(kids <- read.table("./data/strayer_strayer1976-fig2.txt", header=FALSE))
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1   0  1  3  4  1  0  0  1  1   0   1   0   7   0   1   0   0
2   1  0  7  8  2  1  1 12  3   0   1   1   4   1   0   0   2
3   1  4  0  7  3  2  2  0  1   0   8   1   5   5   0   0   1
4   3  3  2  0  3  1 13  3  5   1   0   0   8   3   0   2   1
5   1  0  0  3  0  4  6  0  8   5   1   0   1   3   0   2   1
6   0  0  0  0  0  0  2  8 11   0   4   0   4   3   0   1   0
7   1  0  1  9  3  4  0  2  0   0   1   0   7   9   1   1   0
8   0  0  0  1  1  1  2  0  7   5   1   1   1   0   0   0   0
9   1  1  1  2  5 11  0  3  0   0   0   0   1   0   0   0   0
10  0  0  0  0  0  0  1  0  0   0   0  11   0   1   0   1   4
11  4  0  4  3  3  2  1  0  0   0   0   3  11   5   0   2   2
12  0  0  0  0  0  0  0  0  0   2   0   0   2   0   8   0   0
13  0  1  9  3  0  3  6  0  0   0  11   2   0   1   0   7   5
14  0  1  4  0  1  2  1  0  0   0   1   0   1   0   0   0   0
15  0  0  0  0  0  0  0  0  0   0   0   3   0   0   0   0   0
16  0  0  0  0  0  1  0  0  0   0   0   1   1   0   0   0   0
17  0  0  0  0  0  0  0  0  1   1   0   0   0   0   0   0   0
  • If your file is delimited by something other than spaces, it is a good idea to use a slightly different function, read.delim() and specify exactly what the delimiter is

  • Frequently, there will be non-tabular information at the top of a file (e.g., meta-data describing the data set). Use the skip=n option, where n is the number of lines you want skipped.

quercus <- read.delim("./data/quercus.txt", skip=24, sep="\t", header=TRUE)
head(quercus)
                    Species   Region Range acorn.size tree.height
1           Quercus alba L. Atlantic 24196        1.4          27
2  Quercus bicolor Willd.   Atlantic  7900        3.4          21
3 Quercus macrocarpa Michx. Atlantic 23038        9.1          25
4  Quercus prinoides Willd. Atlantic 17042        1.6           3
5         Quercus Prinus L. Atlantic  7646       10.5          24
6    Quercus stellata Wang. Atlantic 19938        2.5          17

1.15 The Workspace

  • R handles data in a manner that is different than many statistical packages.

  • In particular, you are not limited to a single rectangular data matrix at a time.

  • The workspace holds all the objects (e.g., data frames, variables, functions) that you have created or read in.

  • You can essentially have as many data frames as your machine’s memory will allow.

  • To find out what lurks in your workspace, use objects() command.

  • To remove an object, use rm().

  • If you really want to clear your whole workspace, you can use the following syntax: rm(list=ls()). Beware, though. Once you do this, you don’t get the data back.

objects()
 [1] "a"             "aaa"           "ages"          "b"            
 [5] "C"             "child1"        "collection"    "count"        
 [9] "countries"     "countries1"    "cx1980"        "cx1988"       
[13] "cxboesch"      "dua"           "empat"         "equalspace"   
[17] "eyes"          "fourages"      "freq"          "freqmat"      
[21] "grhesus"       "hair"          "kids"          "lima"         
[25] "manual"        "nms"           "ones"          "quercus"      
[29] "r"             "rhesus"        "rhesus.layout" "satu"         
[33] "sex"           "t.or.f"        "tiga"          "x"            
[37] "X"             "y"            
rm(aaa)
rm(list=ls())
objects()
character(0)

1.16 Scope

  • Because the R workspace can contain many different variables and even multiple data frames, you must be aware of scope

  • When we extract columns of a data frame (e.g., if we wanted to plot them) we need to use the syntax data.frame$col.name

## load it again because we cleared all objects!
quercus <- read.delim("./data/quercus.txt", skip=24, sep="\t", header=TRUE)
plot(quercus$tree.height, quercus$acorn.size, pch=16, col="red", xlab="Tree Height (m)", ylab="Acorn Size (cm3)")

  • It can be a hassle having to type the data frame name (and dollar sign) over and over again

  • With the with() function, we can set up a local scoping rule that allows us to drop the need to type the data frame name (and dollar sign) to access columns of a data frame

with(quercus, plot(tree.height, acorn.size, pch=16, col="blue", xlab="Tree Height (m)", ylab="Acorn Size (cm3)"))

  • Apparently, there are R users who gladly use with and those who hate its use. I fall into the former category.

  • Note that this is a very Base-R perspective. the tidyverse (e.g., ggplot2, etc.) changes many of these issues.

1.17 Indexing and Subsetting

  • Index (and access) the elements of a vector using square brackets. myvec[1] takes the first element of a vector called myvec.

  • Use the colon (:) operator for sequences. myvec[1:5] takes the first five elements of myvec.

  • R is unusual in that it allows negative indexing: myvec[-1] takes all elements of except the first one. To exclude a sequence, you need to place the sequence within parentheses: myvec[-(1:5)].

  • Vector indices don’t have to be consecutive: myvec[c(2,5,1,11)].

myvec <- c(1,2,3,4,5,6,66,77,7,8,9,10)
myvec[1]
[1] 1
myvec[1:5]
[1] 1 2 3 4 5
myvec[-1]
 [1]  2  3  4  5  6 66 77  7  8  9 10
myvec[-(1:5)]
[1]  6 66 77  7  8  9 10
# try without the parentheses
#myvec[-1:5]
myvec[c(2,5,1,11)]
[1] 2 5 1 9
  • Access the elements of a data frame using the dollar sign. Subsetting anything other than a data frame uses square brackets.
dim(quercus)
[1] 39  5
size <- quercus$acorn.size
size[1:3] #first 3 elements
[1] 1.4 3.4 9.1
size[17]  #only element 17
[1] 4.8
size[-39] #all but the last element
 [1]  1.4  3.4  9.1  1.6 10.5  2.5  0.9  6.8  1.8  0.3  0.9  0.8  2.0  1.1  0.6
[16]  1.8  4.8  1.1  3.6  1.1  1.1  3.6  8.1  3.6  1.8  0.4  1.1  1.2  4.1  1.6
[31]  2.0  5.5  5.9  2.6  6.0  1.0 17.1  0.4
size[c(3,6,9)] # elements 3,6,9
[1] 9.1 2.5 1.8
size[quercus$Region=="California"] # use a logical test to subset
 [1]  4.1  1.6  2.0  5.5  5.9  2.6  6.0  1.0 17.1  0.4  7.1
quercus[3,4] # access an element of an array or data frame by X[row,col]
[1] 9.1
quercus[,"tree.height"]
 [1] 27.0 21.0 25.0  3.0 24.0 17.0 15.0  0.3 24.0 11.0 15.0 23.0 24.0  3.0 13.0
[16] 30.0  9.0 27.0  9.0 24.0 23.0 27.0 24.0 23.0 18.0  9.0  9.0  4.0 18.0  6.0
[31] 17.0 20.0 30.0 23.0 26.0 21.0 15.0  1.0 18.0
  • The comma with nothing in front of it means take every row in the column named "tree.height".

1.18 More Subsetting

  • Positive indices include, negative indices exclude elements

  • 1:3 means a sequence from 1 to 3

  • You can only use a single negative subscript, i.e., you can’t use quercus$acorn.size[-1:3]

  • Of course, you can get around this by enclosing the vector in parentheses quercus$acorn.size[-(1:3)]

  • The logical operators are == (equal), != (not equal), and the various greater than/less than symbols: >, >=, <, <=

  • Further logicals are & (and), | (or), ! (not), && (another and), || (another or)

  • & and ! work elementwise on vectors: element 1 is compared in the two vectors, then element 2, and so on

  • && and || are tricky. These logical tests evaluate left to right, examining only the first element of each vector (they go until a result is determined for ||).

    • Why would you want that?? In general, you don’t. It makes some calculations faster.
  • When you refer to a variable in a data frame, you must specify the data frame name followed a dollar sign and the variable name quercus$acorn.size

  • Testing for equality is just a special case of a logical test. We frequently want to identify numbers either above or below some criterion.

myvec <- c(1,2,3,4,5,6,66,77,7,8,9,10)
myvec <- myvec[myvec<=10]
myvec
 [1]  1  2  3  4  5  6  7  8  9 10
x <- 1:7
# elements that are greater than 2 but less than 6
(x>2) & (x<6)
[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
is.even <- rep(c(FALSE, TRUE),5)
(evens <- myvec[is.even])
[1]  2  4  6  8 10

1.19 Missing Values

  • NA is a special code for missing data.

  • NA pretty much means “Don’t Know.”

  • The presence of NA values in your data set can lead to some surprising consequences.

  • You can’t test for a NA the way you would test for any other value (i.e., using the == operator) since variable==NA is like asking in English, is the variable equal to some number I don’t know? How could you know that?!

  • It also doesn’t make any sense to add one to something you don’t know what it is – 1+NA is meaningless!

  • R therefore provides the function is.na() that allows us to subset using logicals.

aaa <- c(1,2,3,NA,4,5,6,NA,NA,7,8,9,NA,10)
aaa <- aaa[!is.na(aaa)]
aaa
 [1]  1  2  3  4  5  6  7  8  9 10
  • Here we used the not-operator (!) to index everything that is not an NA

  • This is actually probably the most common way of using is.na().

aaa <- c(1,2,3,NA,4,5,6,NA,NA,7,8,9,NA,10)
is.na(aaa)
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
[13]  TRUE FALSE
!is.na(aaa)
 [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[13] FALSE  TRUE
  • There are a couple other special values of objects

  • One is Inf, which means “infinity.”” It may result from dividing zero by zero.

  • Another is NaN, which means “not a number.” You will get this is, e.g., you try to take a logarithm of a negative number.

1.20 Summarizing Data

  • The function table() is a very useful way of exploring data
# generate 100 Poisson random numbers with mean/variance=5
aaa <- rpois(100,5)
table(aaa)
aaa
 1  2  3  4  5  6  7  8  9 10 11 12 
 5  9 14 13 17 16  9 10  3  2  1  1 
donner <- read.table("./data/donner.dat", header=TRUE, skip=2)
# survival=0 == died; male=0 == female
with(donner, table(male,survival))
    survival
male  0  1
   0  5 10
   1 20 10
# table along 3 dimensions
with(donner, table(male,survival,age))
, , age = 15

    survival
male 0 1
   0 0 1
   1 1 0

, , age = 18

    survival
male 0 1
   0 0 0
   1 0 1

, , age = 20

    survival
male 0 1
   0 0 1
   1 0 1

, , age = 21

    survival
male 0 1
   0 0 1
   1 0 0

, , age = 22

    survival
male 0 1
   0 0 1
   1 0 0

, , age = 23

    survival
male 0 1
   0 0 1
   1 2 1

, , age = 24

    survival
male 0 1
   0 0 1
   1 1 0

, , age = 25

    survival
male 0 1
   0 1 1
   1 5 1

, , age = 28

    survival
male 0 1
   0 0 0
   1 2 2

, , age = 30

    survival
male 0 1
   0 0 0
   1 3 1

, , age = 32

    survival
male 0 1
   0 0 2
   1 0 1

, , age = 35

    survival
male 0 1
   0 0 0
   1 1 0

, , age = 40

    survival
male 0 1
   0 0 1
   1 1 1

, , age = 45

    survival
male 0 1
   0 2 0
   1 0 0

, , age = 46

    survival
male 0 1
   0 0 0
   1 0 1

, , age = 47

    survival
male 0 1
   0 1 0
   1 0 0

, , age = 50

    survival
male 0 1
   0 1 0
   1 0 0

, , age = 57

    survival
male 0 1
   0 0 0
   1 1 0

, , age = 60

    survival
male 0 1
   0 0 0
   1 1 0

, , age = 62

    survival
male 0 1
   0 0 0
   1 1 0

, , age = 65

    survival
male 0 1
   0 0 0
   1 1 0
cage <- rep(0,length(donner$age))
# simplify by defining 2 age classes: over/under 25
cage[donner$age<=25] <- 1
cage[donner$age>25] <- 2
donner <- data.frame(donner,cage=cage)
with(donner, table(male,survival,cage))
, , cage = 1

    survival
male  0  1
   0  1  7
   1  9  4

, , cage = 2

    survival
male  0  1
   0  4  3
   1 11  6
  • It’s sometimes useful to sort a vector
aaa <- rpois(100,5)
sort(aaa)
  [1]  0  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3
 [26]  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4  5  5
 [51]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6
 [76]  6  6  6  7  7  7  7  7  7  7  7  7  8  8  8  8  9  9  9  9  9 10 11 11 17
# decreasing order
sort(aaa,decreasing=TRUE)
  [1] 17 11 11 10  9  9  9  9  9  8  8  8  8  7  7  7  7  7  7  7  7  7  6  6  6
 [26]  6  6  6  6  6  6  6  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
 [51]  5  5  4  4  4  4  4  4  4  4  4  4  4  4  3  3  3  3  3  3  3  3  3  3  3
 [76]  3  3  3  3  3  3  3  3  3  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0
  • Sorting data frames is a bit more involved, but still straightforward

  • use the function order()

# five columns of data again
satu <- c(1,2,3,4,5)
dua <- c("a","b","c","d","e")
tiga <- sample(c(TRUE,FALSE),5,replace=TRUE)
empat <- LETTERS[7:11]
lima <- rnorm(5)
# construct a data frame
(collection <- data.frame(satu,dua,tiga,empat,lima))
  satu dua  tiga empat       lima
1    1   a  TRUE     G -0.5836921
2    2   b FALSE     H  0.9042409
3    3   c  TRUE     I -0.1804774
4    4   d FALSE     J  1.3276193
5    5   e  TRUE     K  0.3944727
o <- order(collection$lima)
collection[o,]
  satu dua  tiga empat       lima
1    1   a  TRUE     G -0.5836921
3    3   c  TRUE     I -0.1804774
5    5   e  TRUE     K  0.3944727
2    2   b FALSE     H  0.9042409
4    4   d FALSE     J  1.3276193
  • there are definitely better ways to do this using tidy tools like dplyr::arrange()!

1.21 Naming Data

  • The matrix of aggressive interactions among kids had neither column nor row names

  • We can add the codes used in the Strayer and Strayer (1976) paper

## load it again because we cleared all objects!
kids <- read.table("./data/strayer_strayer1976-fig2.txt", header=FALSE)
kid.names <- c("Ro","Ss","Br","If","Td","Sd","Pe","Ir","Cs","Ka",
                "Ch","Ty","Gl","Sa", "Me","Ju","Sh")
colnames(kids) <- kid.names
rownames(kids) <- kid.names
kids
   Ro Ss Br If Td Sd Pe Ir Cs Ka Ch Ty Gl Sa Me Ju Sh
Ro  0  1  3  4  1  0  0  1  1  0  1  0  7  0  1  0  0
Ss  1  0  7  8  2  1  1 12  3  0  1  1  4  1  0  0  2
Br  1  4  0  7  3  2  2  0  1  0  8  1  5  5  0  0  1
If  3  3  2  0  3  1 13  3  5  1  0  0  8  3  0  2  1
Td  1  0  0  3  0  4  6  0  8  5  1  0  1  3  0  2  1
Sd  0  0  0  0  0  0  2  8 11  0  4  0  4  3  0  1  0
Pe  1  0  1  9  3  4  0  2  0  0  1  0  7  9  1  1  0
Ir  0  0  0  1  1  1  2  0  7  5  1  1  1  0  0  0  0
Cs  1  1  1  2  5 11  0  3  0  0  0  0  1  0  0  0  0
Ka  0  0  0  0  0  0  1  0  0  0  0 11  0  1  0  1  4
Ch  4  0  4  3  3  2  1  0  0  0  0  3 11  5  0  2  2
Ty  0  0  0  0  0  0  0  0  0  2  0  0  2  0  8  0  0
Gl  0  1  9  3  0  3  6  0  0  0 11  2  0  1  0  7  5
Sa  0  1  4  0  1  2  1  0  0  0  1  0  1  0  0  0  0
Me  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  0  0
Ju  0  0  0  0  0  1  0  0  0  0  0  1  1  0  0  0  0
Sh  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0
  • colnames() and rownames() are convenience functions

1.22 Working on Lists

  • apply() applies a function along the margins of a matrix

  • lapply() applies a function to a list and generates a list as its output

  • sapply() is similar to lapply() but generates a vector as its output

# cross-tabulation of sex partners by race/ethnicity from NHSLS
sextable <- read.csv("./data/nhsls_sextable.txt", header=FALSE)
dimnames(sextable)[[1]] <- c("white","black","hispanic","asian","other")
dimnames(sextable)[[2]] <- c("white","black","hispanic","asian","other")
# take a peek at it
sextable
         white black hispanic asian other
white     1131    12       16     3    15
black        5   268        5     0     0
hispanic    39     1      115     0     3
asian       12     0        0    10     4
other        7     0        1     0    18
# calculate marginals
(row.sums <- apply(sextable,1,sum))
   white    black hispanic    asian    other 
    1177      278      158       26       26 
(col.sums <- apply(sextable,2,sum))
   white    black hispanic    asian    other 
    1194      281      137       13       40 
# using sapply() gives similar output
sapply(sextable,sum)
   white    black hispanic    asian    other 
    1194      281      137       13       40 
# create a list -- each element of the list has a different length
aaa <- list(alpha = 1:10, beta = rnorm(50), x = sample(1:100, 100, replace=TRUE))
lapply(aaa,mean)
$alpha
[1] 5.5

$beta
[1] -0.0248892

$x
[1] 50.63
# more compact as a vector
sapply(aaa,mean)
     alpha       beta          x 
 5.5000000 -0.0248892 50.6300000 
# compare the output of sapply() to lapply()
lapply(sextable,sum)
$white
[1] 1194

$black
[1] 281

$hispanic
[1] 137

$asian
[1] 13

$other
[1] 40
  • The apply family of functions used to be more widely used and have been largely supplanted by the immensely powerful tools in dplyr and the tidyverse more generally

1.23 Flow Control: if

  • if allows you to conditionally evaluate expressions.

  • The basic syntax of an if statement is: if(condition) true.branch else false.branch

  • The else part of the statement is optional

(coin <- sample(c("heads","tails"),1))
[1] "tails"
if(coin=="tails") b <- 1 else b <- 0
b
[1] 1
  • Sometimes you can use the very efficient ifelse statement

  • ifelse takes three arguments: (1) the logical test, (2) the result if TRUE, (3) the result if FALSE

x <- 4:-2
# sqrt(x) produces warnings, but using ifelse to check works without producing warings
sqrt(ifelse(x >= 0, x, NA))
[1] 2.000000 1.732051 1.414214 1.000000 0.000000       NA       NA

1.24 Flow Control: for

  • If you want to repeat an action over and over again you need a loop

  • Loops are mostly generated using for statements

  • The basic syntax of a for loop is: for(item in sequence) statement(s)

x <- 1:5
for(i in 1:5) print(x[i])
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
  • That’s a pretty silly for loop – there are much more important uses of for loops!

  • If there are multiple statements executed by a for loop, those statements must be enclosed in curly braces, {}

  • We need to be careful with for loops because they can slow code down, particularly when they are nested and the number of iterations is very large.

  • Vectorizing and using mapping functions like apply and its relatives can greatly speed your code up

1.25 Using Packages

  • Much of the functionality of R comes from the many contributed packages

  • To use a package, you must first install it

  • This can be done at the command line using install.packages(package_name)

  • It is often more convenient to use a menu command

  • in RStudio this is under Tools>Install Packages...

  • Once a package is installed, you must load it in order to use it

  • Do this using the library() command

library(igraph)
#| warning: false
#| message: false
# might as well do something with it
# a small graph
g <- make_graph( c(1,2, 1,3, 2,3, 3,5), n=5 )
plot(g)