require(igraph)
rhesus <- read.table("./data/sade1.txt", skip=1, header=FALSE)
rhesus <- as.matrix(rhesus)
nms <- c("066", "R006", "CN", "ER", "CY", "EC", "EZ", "004", "065", "022", "076",
         "AC", "EK", "DL", "KD", "KE")
sex <- c(rep("M",7), rep("F",9))
dimnames(rhesus)[[1]] <- nms
dimnames(rhesus)[[2]] <- nms
grhesus <- graph_from_adjacency_matrix(rhesus, weighted=TRUE)
V(grhesus)$sex <- sex
rhesus.layout <- layout.kamada.kawai(grhesus)
plot(grhesus,
     edge.width=log10(E(grhesus)$weight)+1,
     edge.arrow.width=0.5,
     vertex.label=V(grhesus)$name,
     vertex.label.family="Helvetica",
     vertex.color=as.numeric(V(grhesus)$sex=="F")+5,
     layout=rhesus.layout)
1 Getting Started in R
These notes will help get you started with R. They are mostly quite non-graphical, focusing instead on the basics of the R language.
1.1 Setting up R and RStudio
To get started, you need to do two things: install R and install RStudio.
R is the software. RStudio is what is known as an Integrated Development Environment (IDE). Basically, an IDE is a way of running base software that provides various tools to make the coding experience easier (e.g., an editor with syntax highlighting, a debugger, and interactive graphics facilities).
What follows are some old notes that are very Base-R focused. The thing is, even though most people use the tidyverse tools these days, it’s still valuable to understand Base-R (since that’s what everything else is built upon!). This is particularly true for the material that forms the bulk of these notes. Making theoretical scientific figures (Chapter 3) is simply easier in Base-R. You typically don’t need to do complex data wrangling to make a theoretical figure, so the powerful data-manipulation tools of dplyr, for instance, are unnecessary. Sometimes, we want to use the grid package, which underlies the tidyverse graphics library ggplot2, directly. Again, I find this more straightforward in Base-R.
The igraph package for drawing graphs (a.k.a. “networks”) that we discuss in Chapter 4 also runs in Base-R.
1.2 What Is R?
R is statistical numerical software.
R is a “dialect” of the S statistical programming language.
R is a system for interactive data analysis.
R is a high-level programming language.
R is free.
R is state-of-the-art in statistical computing. It is what many (most?) research statisticians use in their work.
1.3 Why Use R?
R is FREE! That, by itself, is almost enough. No complicated licensing. Broad dissemination of research methodologies and results, etc.
R is available for a variety of computer platforms (e.g., Linux, MacOS, Windows).
R is widely used by professional statisticians, social scientists, biologists, demographers, and other scientists. This increases the likelihood that code will exist to do a calculation you might want to do.
R has remarkable online help lists, tutorials, etc.
R represents the state-of-the-art in statistical computing.
1.5 Rhesus Monkey Grooming Network
A note on loading data: the above code loads a data file apparently called "./data/sade1.txt". What does that mean?
The one dot followed by a slash, ./, refers to our current working directory; the rest of the path says to go into the sub-directory called data and read the text file called sade1.txt.
If the data sub-directory were actually in, say, the same directory where our working directory is located (i.e., they were two sub-directories of the same higher-level folder), we would use two dots, ../, which means to go out one directory in the hierarchy.
This is actually not R but the underlying OS file system.
Learning about the file/directory structure of your computer is actually an important (and under-appreciated) data-science skill.
Check out the fantastic MIT course The Missing Semester of Your CS Education for information on various tools that can really improve your workflows and general skill level.
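As a small illustration of the relative paths described above, the following sketch builds both paths with file.path(), which joins path components with a forward slash. Nothing is read from disk, and the sibling-directory layout is only a hypothetical example.

```r
# path to a file in the data sub-directory of the working directory
in_subdir <- file.path(".", "data", "sade1.txt")
# hypothetical layout: data sits next to (not inside) the working directory
in_sibling <- file.path("..", "data", "sade1.txt")
in_subdir
in_sibling
# file.exists() reports whether a path points at a real file, without reading it
file.exists(in_subdir)
```

Checking file.exists() before calling read.table() can make a wrong working directory easier to diagnose.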
1.6 A Few Conventions and Other Helpful Bits
There are some things that you will see over and over in the code embedded in this document.
The assignment operator <- is used to assign a value to a name.
The value is on the right-hand side of the operator and the name is on the left side.
You can use = for assignment, but I don’t recommend it (it doesn’t work at all levels, makes the code harder to read, etc.).
Different environments make it more or less easy to use <-. In RStudio, hit the option key (Alt on Windows and Linux) and the minus sign simultaneously.
Comments are marked by #: anything following the hash will be ignored by R.
Use comments liberally to help you (and others) understand your code.
In these notes, the output that you would see on your own command line will be white following a grey box (the input). Frequently, it will begin with a [1], which indicates the first element of a vector.
Sometimes I enclose a command in parentheses; this is simply to force R to echo the output (for pedagogical purposes).
# a comment
x <- c(1,2,3)
(y <- c(1,2,3))
[1] 1 2 3
You will probably want to seek help on functions. At the command line, simply type a question mark followed immediately by the function you want to query, ?function.name
- In RStudio, you may be given auto-complete suggestions, which you can click on to save typing
When you are done with your R session, type q() at the command line.
R will ask you if you want to save your workspace. For now, you probably don’t.
Check out the much more comprehensive Introduction to R for all the language details.
1.7 R as a Calculator
# addition
2+2
[1] 4
# multiplication
2*3
[1] 6
a <- 2
b <- 3
a*b
[1] 6
# division
2/3
[1] 0.6666667
b/a
[1] 1.5
1/b/a
[1] 0.1666667
# note order of operations!
1/(b/a)
[1] 0.6666667
# parentheses can override order of operations
# an exponential
exp(-2)
[1] 0.1353353
# why we age
r <- 0.02
exp(-r*45)
[1] 0.4065697
# something more tricky
exp(log(2))
[1] 2
# generate 20 normally distributed random numbers
rnorm(20)
[1] 0.2009012 2.2207369 -0.3607866 1.4697813 -1.4384434 0.8721387
[7] -0.3946359 0.7073727 0.5261874 0.6818287 -0.4126075 0.5687733
[13] -0.6435988 0.3862529 1.0785412 0.1752227 0.2863084 -0.0291453
[19] 0.6750149 -0.8180922
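Because rnorm() draws random numbers, your output will differ from what is printed above. A minimal sketch of how to make random draws repeatable, assuming you want that: fix the seed with set.seed() before drawing. The seed value 42 is an arbitrary choice.

```r
set.seed(42)        # 42 is arbitrary; any integer works
first <- rnorm(5)
set.seed(42)        # reset to the same seed
second <- rnorm(5)
identical(first, second)   # the two draws match exactly
```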
1.8 Data Types
- Numeric
All numbers in R are of the form double (i.e., double-precision floating point numbers). This can be a bit confusing for people who are used to languages with integer data types (like, most languages!). Entering something that looks like an integer doesn’t mean it is.
# it looks like an integer, but don't be fooled!
a <- 2
is.numeric(a)
[1] TRUE
is.integer(a)
[1] FALSE
is.double(a)
[1] TRUE
- Integer
OK, technically R does have an integer class, but it is used very rarely and many functions will convert integers into doubles anyway. If you really must have an integer (e.g., because you are passing output to external C or FORTRAN code that expects it), add the suffix L to the entered number.
a <- 2L
is.integer(a)
[1] TRUE
- Character
Strings are represented by the character data class.
(countries <- c("Uganda", "Tanzania", "Kenya", "Rwanda"))
[1] "Uganda" "Tanzania" "Kenya" "Rwanda"
as.character(1:5)
[1] "1" "2" "3" "4" "5"
- Factor
Factors are a data type for encoding categorical data. Notice that factors are printed without the quotes. This is because R stores them as a set of integer codes. Data of type “factor” are different from data of type “character” (which is what plain text is). Note the difference below between factor and character data. Because factors get used in statistical models, they are actually represented as integer codes, each associated with a named level. Plain vectors, on the other hand, just store their values directly.
countries <- factor(c("Uganda", "Tanzania", "Kenya", "Rwanda"))
countries
[1] Uganda Tanzania Kenya Rwanda
Levels: Kenya Rwanda Tanzania Uganda
# a trick to get some insight into how factors are handled by R
unclass(countries)
[1] 4 3 1 2
attr(,"levels")
[1] "Kenya" "Rwanda" "Tanzania" "Uganda"
countries1 <- c("Uganda", "Tanzania", "Kenya", "Rwanda")
countries1 == unclass(countries1)
[1] TRUE TRUE TRUE TRUE
countries == unclass(countries)
[1] FALSE FALSE FALSE FALSE
- Logical
TRUE and FALSE are reserved keywords, while T and F are global constants set to these. These logical variables are essential tools for subsetting data. You also use them extensively in setting optional arguments of functions.
t.or.f <- c(T,F,F,T,T)
is.logical(t.or.f)
[1] TRUE
aaa <- c(1,2,3,4,5)
# subset
aaa[t.or.f]
[1] 1 4 5
- List
You can mix different types of data in a list using the command list(). This is useful when you write your own functions and want to output multiple things. Use the function str() to give you information about a list.
child1 <- list(name="mary", child.age=6,
               status="foster", mother.alive=F, father.alive=T, parents.ages=c(24,35))
str(child1)
List of 6
$ name : chr "mary"
$ child.age : num 6
$ status : chr "foster"
$ mother.alive: logi FALSE
$ father.alive: logi TRUE
$ parents.ages: num [1:2] 24 35
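Once a list exists, you will usually want to pull individual pieces back out. The sketch below uses a shortened, hypothetical version of the list above to show three common access styles: the dollar sign, double square brackets, and single square brackets.

```r
child <- list(name="mary", child.age=6, parents.ages=c(24,35))
child$name              # by name, with the dollar sign
child[["child.age"]]    # by name, with double brackets
child[[3]][2]           # second element of the third component
child["name"]           # single brackets return a (shorter) list, not the element
```

The distinction between [[ ]] (the element itself) and [ ] (a sub-list) is a frequent source of confusion, so it is worth trying both.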
1.9 Coercion
Sometimes you have data in one type but need it in a different type.
R provides a variety of methods to coerce data from one type to another.
These methods are carried out by functions that begin with as.xxx, where xxx is the data type to which you are coercing.
countries <- factor(c("Uganda", "Tanzania", "Kenya", "Rwanda"))
as.character(countries)
[1] "Uganda" "Tanzania" "Kenya" "Rwanda"
as.numeric(countries)
[1] 4 3 1 2
# werk it backwards
countries1 <- c("Uganda", "Tanzania", "Kenya", "Rwanda")
as.factor(countries1)
[1] Uganda Tanzania Kenya Rwanda
Levels: Kenya Rwanda Tanzania Uganda
# sometimes you want your numbers to actually be strings (e.g., when you make labels or column names)
as.character(1:5)
[1] "1" "2" "3" "4" "5"
# there actually is an integer class; it just doesn't get used much at all
a <- 2
is.integer(a)
[1] FALSE
is.integer(as.integer(a))
[1] TRUE
- You can check the class of an object using functions that begin with is.xxx, where xxx is the data type you are querying (like is.integer() above)
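One coercion deserves extra care: as.numeric() applied to a factor returns the integer codes, not the numbers its labels may spell out. The sketch below uses a small made-up factor to show the difference and the usual workaround of going through as.character() first.

```r
f <- factor(c("10", "20", "10"))   # hypothetical factor whose labels look like numbers
as.numeric(f)                      # the integer codes: 1 2 1
as.numeric(as.character(f))        # the values the labels spell out: 10 20 10
```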
1.10 Creating Vectors
A vector is a list of numbers. It turns out everything in R is represented as a vector, but that doesn’t affect your life much.
In order to create a vector, you use the function c(), which concatenates a list of items (hence the “c”).
You will use this a lot and it’s a super-common mistake to forget the c() when putting together a list of numbers, factors, etc.
If you do forget it, you will get a syntax error.
Often we want either regularly spaced vectors or a vector of one value repeated. R has a number of facilities to perform these operations.
(manual <- c(1,3,5,7,9))
[1] 1 3 5 7 9
(count <- 1:20)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
(ages <- seq(0,85,by=5))
[1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
(ones <- rep(1,10))
[1] 1 1 1 1 1 1 1 1 1 1
(fourages <- rep(c(1,2),c(5,10)))
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
(equalspace <- seq(1,5, length=20))
[1] 1.000000 1.210526 1.421053 1.631579 1.842105 2.052632 2.263158 2.473684
[9] 2.684211 2.894737 3.105263 3.315789 3.526316 3.736842 3.947368 4.157895
[17] 4.368421 4.578947 4.789474 5.000000
You can use rep() to repeat values.
Sometimes this can be tricky: the second argument tells R how many repetitions.
This argument can be a vector and this, along with the possibility of a vector of the items you want repeated too, allows you to create quite complex patterns very easily.
rep(2,10)
[1] 2 2 2 2 2 2 2 2 2 2
rep(c(1,2),10)
[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
rep(c(1,2), c(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
rep("R roolz!", 3)
[1] "R roolz!" "R roolz!" "R roolz!"
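A few related helpers are worth knowing alongside the patterns above. This sketch contrasts the each= and times= arguments of rep(), and shows seq() with length.out= and seq_len(); the example values are arbitrary.

```r
by.each  <- rep(c("a", "b"), each=3)    # a a a b b b
by.times <- rep(c("a", "b"), times=3)   # a b a b a b
grid <- seq(0, 1, length.out=5)         # five equally spaced values from 0 to 1
idx  <- seq_len(4)                      # the integers 1 to 4
by.each
by.times
grid
idx
```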
1.11 Creating Matrices
As we said, a vector is a list of numbers.
A matrix is a rectangular array of numbers, indexed by row and column (it is a vector of vectors).
One way to create matrices is to “bind” columns or rows together using the commands cbind() or rbind().
# age distribution of Gombe chimps in 1980 and 1986
cx1980 <- c(7, 13, 8, 13, 5, 35, 9)
cx1988 <- c(9, 11, 15, 8, 9, 38, 0)
(C <- cbind(cx1980, cx1988))
cx1980 cx1988
[1,] 7 9
[2,] 13 11
[3,] 8 15
[4,] 13 8
[5,] 5 9
[6,] 35 38
[7,] 9 0
# another way
C <- c(cx1980, cx1988)
(C <- matrix(C, nrow=7, ncol=2))
[,1] [,2]
[1,] 7 9
[2,] 13 11
[3,] 8 15
[4,] 13 8
[5,] 5 9
[6,] 35 38
[7,] 9 0
- What happens if we try to bind columns of different lengths?
# age distribution at Tai; Boesch uses fewer age classes
cxboesch <- c(18,10,15,30)
(C <- cbind(C,cxboesch))
Warning in cbind(C, cxboesch): number of rows of result is not a multiple of
vector length (arg 2)
cxboesch
[1,] 7 9 18
[2,] 13 11 10
[3,] 8 15 15
[4,] 13 8 30
[5,] 5 9 18
[6,] 35 38 10
[7,] 9 0 15
Both the warning message and the output can seem a little odd to the uninitiated.
R uses a recycling rule for filling out vectors and matrices.
When you try to put together things that are neither the same length nor multiples of each other, you get a warning.
We can use the recycling rule to make a matrix of ones:
(X <- matrix(1,nr=3,nc=3))
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
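The recycling rule also applies to ordinary vector arithmetic. A minimal sketch with made-up values: when the shorter vector's length divides the longer one's evenly, it is reused silently.

```r
short <- c(10, 20)
long  <- 1:6
recycled <- long + short   # short is reused three times
recycled                   # 11 22 13 24 15 26
```

If the lengths were not multiples of each other (e.g., adding a vector of length 2 to one of length 5), R would still recycle but would issue a warning like the one from cbind() above.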
Note that using the short version of nrow, nr, is sufficient. This is often true: you can use the shortest argument name that is unambiguous.
The matrix() command typically takes three arguments: (1) a vector of numbers that will form the elements of the matrix, (2) the number of rows, and (3) the number of columns.
For small matrices, you might want to enter the vectors of values manually.
If you do this, it’s important to know that R fills matrices column-wise (the standard for FORTRAN and definitely not the way most people actually work!).
Use the optional argument byrow=TRUE to make R read in the data row-wise.
# cross-classified data on hair/eye color
freq <- c(32,11,10,3, 38,50,25,15, 10,10,7,7, 3,30,5,8)
hair <- c("Black", "Brown", "Red", "Blond")
eyes <- c("Brown", "Blue", "Hazel", "Green")
freqmat <- matrix(freq, nr=4, nc=4, byrow=TRUE)
dimnames(freqmat)[[1]] <- hair
dimnames(freqmat)[[2]] <- eyes
freqmat
Brown Blue Hazel Green
Black 32 11 10 3
Brown 38 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
# might as well do something with it
mosaicplot(freqmat)
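To see the effect of byrow directly, it can help to fill the same small vector both ways. This is a sketch with arbitrary values, not part of the hair/eye example.

```r
v <- 1:6
by.col <- matrix(v, nrow=2, ncol=3)               # default: fill down the columns
by.row <- matrix(v, nrow=2, ncol=3, byrow=TRUE)   # fill across the rows
by.col
by.row
```

In by.col the first row holds 1, 3, 5; in by.row it holds 1, 2, 3.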
1.12 Data Frames
A data frame is an R object which stores a data matrix. A data frame is essentially a list of variables which are all the same length. A single data frame can hold different types of variables.
To access a variable contained in a data frame, use the data frame name followed by the variable name, separated by a dollar sign, $.
# five columns of data
satu <- c(1,2,3,4,5)
dua <- c("a","b","c","d","e")
tiga <- sample(c(TRUE,FALSE),5,replace=TRUE)
empat <- LETTERS[7:11]
lima <- rnorm(5)
# construct a data frame
(collection <- data.frame(satu,dua,tiga,empat,lima))
satu dua tiga empat lima
1 1 a FALSE G -0.24012272
2 2 b TRUE H 1.62867872
3 3 c FALSE I 1.34427662
4 4 d TRUE J 0.04494688
5 5 e FALSE K 0.11910426
# extract the third variable
collection$tiga
[1] FALSE TRUE FALSE TRUE FALSE
- By default, data.frame() will produce row numbers (seen to the left of the first column in the data frame collection above)
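The dollar sign is not the only way to reach a column. The sketch below, using a small hypothetical data frame, shows three equivalent forms and uses str() for a compact overview of the whole object.

```r
df <- data.frame(id=1:3, site=c("a", "b", "c"))   # made-up example data
df$site          # dollar sign
df[["site"]]     # double brackets, as for a list
df[, "site"]     # matrix-style: all rows, the column named "site"
str(df)          # compact summary of column names and types
```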
1.13 Directories and Paths
R uses a working directory. The default can be set in the Preferences or using an initialization file (i.e., a file that is always read when R starts up).
If you read in a file without specifying a path, R will search in the working directory; if there is no file matching the name you provide, you receive an error message.
We can query the working directory using the command getwd() and we can change it using setwd().
You can always load a file by giving either a full or relative path.
getwd()
[1] "/Users/jhj1/Teaching/graphics"
#setwd("/Users/jhj1/Projects/git/AABA2023_Workshop/Markdown")
## can't actually change it because it screws up the rendering!
Setting the working directory is actually not recommended.
It works against good scientific practice, which favors replicability, interoperability, etc.
It’s generally better to use R Projects in RStudio (as we do in this workshop).
To start an R Project, either double-click on the .Rproj file in the project’s directory or click on the R Project menu button in the upper right corner of your RStudio frame.
To share the work you have done in an R Project with collaborators, students, or scientists looking to replicate your work, simply share the folder containing the .Rproj file.
When you quit R, you will be asked if you want to save your R session.
If a session has previously been saved in your working directory, there will be a copy of the workspace in the R binary format named .RData
When R is started in a particular directory, if there is an .RData file in that directory, it will load automatically.
This can lead to some surprising behavior if you don’t know that it can happen.
Automatically saving and loading workspaces is also not recommended
Best scientific practice involves constructing your workspace using broadly-interoperable data formats (e.g., .csv files) and scripts
1.14 Reading Files
There are a number of ways to read data into R. Probably the easiest and most frequently used involves reading data from plain-text (ASCII) files. These files can be space, tab, or comma delimited.
You can create these files in a spreadsheet program like Excel or output them from most other statistical packages.
You can read these from a local directory or from an internet source
By default, read.table() expects delimited files to be “white-space delimited” with values separated by either tabs or spaces and rows separated by line breaks.
It’s always a good idea to specify whether or not you have a header (i.e., column names). If you don’t, say header=FALSE; if you do, obviously, say header=TRUE
# read a space-delimited file (a sociomatrix of 17 kids' aggressive acts toward each other)
(kids <- read.table("./data/strayer_strayer1976-fig2.txt", header=FALSE))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 0 1 3 4 1 0 0 1 1 0 1 0 7 0 1 0 0
2 1 0 7 8 2 1 1 12 3 0 1 1 4 1 0 0 2
3 1 4 0 7 3 2 2 0 1 0 8 1 5 5 0 0 1
4 3 3 2 0 3 1 13 3 5 1 0 0 8 3 0 2 1
5 1 0 0 3 0 4 6 0 8 5 1 0 1 3 0 2 1
6 0 0 0 0 0 0 2 8 11 0 4 0 4 3 0 1 0
7 1 0 1 9 3 4 0 2 0 0 1 0 7 9 1 1 0
8 0 0 0 1 1 1 2 0 7 5 1 1 1 0 0 0 0
9 1 1 1 2 5 11 0 3 0 0 0 0 1 0 0 0 0
10 0 0 0 0 0 0 1 0 0 0 0 11 0 1 0 1 4
11 4 0 4 3 3 2 1 0 0 0 0 3 11 5 0 2 2
12 0 0 0 0 0 0 0 0 0 2 0 0 2 0 8 0 0
13 0 1 9 3 0 3 6 0 0 0 11 2 0 1 0 7 5
14 0 1 4 0 1 2 1 0 0 0 1 0 1 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0
16 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0
17 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
If your file is delimited by something other than spaces, it is a good idea to use a slightly different function, read.delim(), and specify exactly what the delimiter is.
Frequently, there will be non-tabular information at the top of a file (e.g., meta-data describing the data set). Use the skip=n option, where n is the number of lines you want skipped.
quercus <- read.delim("./data/quercus.txt", skip=24, sep="\t", header=TRUE)
head(quercus)
Species Region Range acorn.size tree.height
1 Quercus alba L. Atlantic 24196 1.4 27
2 Quercus bicolor Willd. Atlantic 7900 3.4 21
3 Quercus macrocarpa Michx. Atlantic 23038 9.1 25
4 Quercus prinoides Willd. Atlantic 17042 1.6 3
5 Quercus Prinus L. Atlantic 7646 10.5 24
6 Quercus stellata Wang. Atlantic 19938 2.5 17
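If you want to experiment with header= and skip= without a data file on disk, the reading functions accept the file contents as a string through the text= argument. The sketch below uses read.csv() on a tiny, entirely made-up comma-delimited example with one line of metadata at the top.

```r
# hypothetical file contents: one metadata line, then a header and two rows
raw <- "made-up tree measurements
species,height
oak,27
pine,21"
trees <- read.csv(text=raw, skip=1, header=TRUE)   # skip the metadata line
trees
```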
1.15 The Workspace
R handles data in a manner that is different from many statistical packages.
In particular, you are not limited to a single rectangular data matrix at a time.
The workspace holds all the objects (e.g., data frames, variables, functions) that you have created or read in.
You can essentially have as many data frames as your machine’s memory will allow.
To find out what lurks in your workspace, use the objects() command.
To remove an object, use rm().
If you really want to clear your whole workspace, you can use the following syntax: rm(list=ls()). Beware, though. Once you do this, you don’t get the data back.
objects()
[1] "a" "aaa" "ages" "b"
[5] "C" "child1" "collection" "count"
[9] "countries" "countries1" "cx1980" "cx1988"
[13] "cxboesch" "dua" "empat" "equalspace"
[17] "eyes" "fourages" "freq" "freqmat"
[21] "grhesus" "hair" "kids" "lima"
[25] "manual" "nms" "ones" "quercus"
[29] "r" "rhesus" "rhesus.layout" "satu"
[33] "sex" "t.or.f" "tiga" "x"
[37] "X" "y"
rm(aaa)
rm(list=ls())
objects()
character(0)
1.16 Scope
Because the R workspace can contain many different variables and even multiple data frames, you must be aware of scope.
When we extract columns of a data frame (e.g., if we wanted to plot them) we need to use the syntax data.frame$col.name
## load it again because we cleared all objects!
quercus <- read.delim("./data/quercus.txt", skip=24, sep="\t", header=TRUE)
plot(quercus$tree.height, quercus$acorn.size, pch=16, col="red", xlab="Tree Height (m)", ylab="Acorn Size (cm3)")
It can be a hassle having to type the data frame name (and dollar sign) over and over again.
With the with() function, we can set up a local scoping rule that allows us to drop the need to type the data frame name (and dollar sign) to access columns of a data frame.
with(quercus, plot(tree.height, acorn.size, pch=16, col="blue", xlab="Tree Height (m)", ylab="Acorn Size (cm3)"))
Apparently, there are R users who gladly use with() and those who hate its use. I fall into the former category.
Note that this is a very Base-R perspective. The tidyverse (e.g., ggplot2, etc.) changes many of these issues.
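with() is not limited to plotting; any expression can be evaluated inside it. A minimal sketch on a small hypothetical data frame (the column names and values are invented for illustration):

```r
df <- data.frame(height=c(27, 21, 25), size=c(1.4, 3.4, 9.1))   # made-up values
m1 <- mean(df$height)              # explicit data frame name and dollar sign
m2 <- with(df, mean(height))       # same calculation inside with()
ratio <- with(df, size / height)   # columns can be combined freely
m1
m2
ratio
```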
1.17 Indexing and Subsetting
Index (and access) the elements of a vector using square brackets. myvec[1] takes the first element of a vector called myvec.
Use the colon (:) operator for sequences. myvec[1:5] takes the first five elements of myvec.
R is unusual in that it allows negative indexing: myvec[-1] takes all elements of myvec except the first one. To exclude a sequence, you need to place the sequence within parentheses: myvec[-(1:5)].
Vector indices don’t have to be consecutive: myvec[c(2,5,1,11)].
myvec <- c(1,2,3,4,5,6,66,77,7,8,9,10)
myvec[1]
[1] 1
myvec[1:5]
[1] 1 2 3 4 5
myvec[-1]
[1] 2 3 4 5 6 66 77 7 8 9 10
myvec[-(1:5)]
[1] 6 66 77 7 8 9 10
# try without the parentheses
#myvec[-1:5]
myvec[c(2,5,1,11)]
[1] 2 5 1 9
- Access the columns of a data frame by name using the dollar sign. Subsetting by position or by logical test, for vectors, matrices, and data frames alike, uses square brackets.
dim(quercus)
[1] 39 5
size <- quercus$acorn.size
size[1:3] #first 3 elements
[1] 1.4 3.4 9.1
size[17] #only element 17
[1] 4.8
size[-39] #all but the last element
[1] 1.4 3.4 9.1 1.6 10.5 2.5 0.9 6.8 1.8 0.3 0.9 0.8 2.0 1.1 0.6
[16] 1.8 4.8 1.1 3.6 1.1 1.1 3.6 8.1 3.6 1.8 0.4 1.1 1.2 4.1 1.6
[31] 2.0 5.5 5.9 2.6 6.0 1.0 17.1 0.4
size[c(3,6,9)] # elements 3,6,9
[1] 9.1 2.5 1.8
size[quercus$Region=="California"] # use a logical test to subset
[1] 4.1 1.6 2.0 5.5 5.9 2.6 6.0 1.0 17.1 0.4 7.1
quercus[3,4] # access an element of an array or data frame by X[row,col]
[1] 9.1
quercus[,"tree.height"]
[1] 27.0 21.0 25.0 3.0 24.0 17.0 15.0 0.3 24.0 11.0 15.0 23.0 24.0 3.0 13.0
[16] 30.0 9.0 27.0 9.0 24.0 23.0 27.0 24.0 23.0 18.0 9.0 9.0 4.0 18.0 6.0
[31] 17.0 20.0 30.0 23.0 26.0 21.0 15.0 1.0 18.0
- The comma with nothing in front of it means take every row in the column named "tree.height".
1.18 More Subsetting
Positive indices include, negative indices exclude elements.
1:3 means a sequence from 1 to 3.
You can’t mix negative and positive subscripts, i.e., you can’t use quercus$acorn.size[-1:3], because -1:3 is the sequence from -1 to 3.
Of course, you can get around this by enclosing the sequence in parentheses: quercus$acorn.size[-(1:3)]
The logical operators are == (equal), != (not equal), and the various greater than/less than symbols: >, >=, <, <=
Further logicals are & (and), | (or), ! (not), && (another and), || (another or).
&, |, and ! work elementwise on vectors: for & and |, element 1 is compared in the two vectors, then element 2, and so on.
&& and || are tricky. These logical tests evaluate left to right and expect a single TRUE or FALSE on each side; they stop as soon as the result is determined (e.g., || stops at the first TRUE). Older versions of R silently used only the first element of a longer vector; since R 4.3.0 that is an error.
- Why would you want that?? In general, you don’t. It makes some calculations faster.
When you refer to a variable in a data frame, you must specify the data frame name followed by a dollar sign and the variable name:
quercus$acorn.size
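The differences between the elementwise and the short-circuit operators, and between -1:3 and -(1:3), are easiest to see by running them. This is a small sketch with an arbitrary vector; the && test is deliberately given single values on both sides.

```r
x <- 1:7
elementwise <- (x > 2) & (x < 6)    # one TRUE/FALSE per element of x
elementwise
ok <- length(x) > 0 && x[1] == 1    # single-value test, as used inside if()
ok
FALSE && stop("never reached")      # short-circuit: the right-hand side is not evaluated
-1:3                                # the sequence -1, 0, 1, 2, 3
dropped <- x[-(1:3)]                # drops the first three elements
dropped
```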
Testing for equality is just a special case of a logical test. We frequently want to identify numbers either above or below some criterion.
myvec <- c(1,2,3,4,5,6,66,77,7,8,9,10)
myvec <- myvec[myvec<=10]
myvec
[1] 1 2 3 4 5 6 7 8 9 10
x <- 1:7
# elements that are greater than 2 but less than 6
(x>2) & (x<6)
[1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE
is.even <- rep(c(FALSE, TRUE),5)
(evens <- myvec[is.even])
[1] 2 4 6 8 10
1.19 Missing Values
NA is a special code for missing data. NA pretty much means “Don’t Know.”
The presence of NA values in your data set can lead to some surprising consequences.
You can’t test for an NA the way you would test for any other value (i.e., using the == operator) since variable==NA is like asking in English, is the variable equal to some number I don’t know? How could you know that?!
It also doesn’t make any sense to add one to something you don’t know; 1+NA is meaningless (R returns NA)!
R therefore provides the function is.na(), which allows us to subset using logicals.
aaa <- c(1,2,3,NA,4,5,6,NA,NA,7,8,9,NA,10)
aaa <- aaa[!is.na(aaa)]
aaa
[1] 1 2 3 4 5 6 7 8 9 10
Here we used the not-operator (!) to index everything that is not an NA.
This is actually probably the most common way of using is.na().
aaa <- c(1,2,3,NA,4,5,6,NA,NA,7,8,9,NA,10)
is.na(aaa)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[13] TRUE FALSE
!is.na(aaa)
[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[13] FALSE TRUE
There are a couple of other special values of objects.
One is Inf, which means “infinity.” It may result from dividing a nonzero number by zero (e.g., 1/0).
Another is NaN, which means “not a number.” You will get this if, e.g., you divide zero by zero or try to take the logarithm of a negative number.
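The behavior of these special values can be checked at the command line. The sketch below uses a short made-up vector; the na.rm=TRUE argument shown here is how many summary functions, such as mean(), are told to drop missing values.

```r
v <- c(1, NA, 3)
v == NA                          # NA for every element, never TRUE or FALSE
is.na(v)                         # FALSE TRUE FALSE
mean(v)                          # NA, because one value is unknown
mean.ok <- mean(v, na.rm=TRUE)   # drop the NA first
mean.ok
1/0                              # Inf
0/0                              # NaN
is.nan(0/0)
```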
1.20 Summarizing Data
- The function table() is a very useful way of exploring data
# generate 100 Poisson random numbers with mean/variance=5
aaa <- rpois(100,5)
table(aaa)
aaa
1 2 3 4 5 6 7 8 9 10 11 12
5 9 14 13 17 16 9 10 3 2 1 1
donner <- read.table("./data/donner.dat", header=TRUE, skip=2)
# survival=0 == died; male=0 == female
with(donner, table(male,survival))
survival
male 0 1
0 5 10
1 20 10
# table along 3 dimensions
with(donner, table(male,survival,age))
, , age = 15
survival
male 0 1
0 0 1
1 1 0
, , age = 18
survival
male 0 1
0 0 0
1 0 1
, , age = 20
survival
male 0 1
0 0 1
1 0 1
, , age = 21
survival
male 0 1
0 0 1
1 0 0
, , age = 22
survival
male 0 1
0 0 1
1 0 0
, , age = 23
survival
male 0 1
0 0 1
1 2 1
, , age = 24
survival
male 0 1
0 0 1
1 1 0
, , age = 25
survival
male 0 1
0 1 1
1 5 1
, , age = 28
survival
male 0 1
0 0 0
1 2 2
, , age = 30
survival
male 0 1
0 0 0
1 3 1
, , age = 32
survival
male 0 1
0 0 2
1 0 1
, , age = 35
survival
male 0 1
0 0 0
1 1 0
, , age = 40
survival
male 0 1
0 0 1
1 1 1
, , age = 45
survival
male 0 1
0 2 0
1 0 0
, , age = 46
survival
male 0 1
0 0 0
1 0 1
, , age = 47
survival
male 0 1
0 1 0
1 0 0
, , age = 50
survival
male 0 1
0 1 0
1 0 0
, , age = 57
survival
male 0 1
0 0 0
1 1 0
, , age = 60
survival
male 0 1
0 0 0
1 1 0
, , age = 62
survival
male 0 1
0 0 0
1 1 0
, , age = 65
survival
male 0 1
0 0 0
1 1 0
cage <- rep(0,length(donner$age))
# simplify by defining 2 age classes: over/under 25
cage[donner$age<=25] <- 1
cage[donner$age>25] <- 2
donner <- data.frame(donner,cage=cage)
with(donner, table(male,survival,cage))
, , cage = 1
survival
male 0 1
0 1 7
1 9 4
, , cage = 2
survival
male 0 1
0 4 3
1 11 6
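An alternative to assigning the age classes by hand is the base function cut(), which bins a numeric vector at the break points you supply. This is only a sketch on a few made-up ages, not the Donner data; the labels are invented for illustration.

```r
ages <- c(15, 25, 26, 60)   # hypothetical ages
# intervals are (0,25] and (25,Inf], so an age of exactly 25 falls in the first class
age.class <- cut(ages, breaks=c(0, 25, Inf), labels=c("25 or under", "over 25"))
age.class
table(age.class)
```

The result is a factor, so it can go straight into table() alongside other variables.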
- It’s sometimes useful to sort a vector
aaa <- rpois(100,5)
sort(aaa)
[1] 0 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3
[26] 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 5 5
[51] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6
[76] 6 6 6 7 7 7 7 7 7 7 7 7 8 8 8 8 9 9 9 9 9 10 11 11 17
# decreasing order
sort(aaa,decreasing=TRUE)
[1] 17 11 11 10 9 9 9 9 9 8 8 8 8 7 7 7 7 7 7 7 7 7 6 6 6
[26] 6 6 6 6 6 6 6 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[51] 5 5 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3
[76] 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 0
Sorting data frames is a bit more involved, but still straightforward: use the function order().
# five columns of data again
satu <- c(1,2,3,4,5)
dua <- c("a","b","c","d","e")
tiga <- sample(c(TRUE,FALSE),5,replace=TRUE)
empat <- LETTERS[7:11]
lima <- rnorm(5)
# construct a data frame
(collection <- data.frame(satu,dua,tiga,empat,lima))
satu dua tiga empat lima
1 1 a TRUE G -0.5836921
2 2 b FALSE H 0.9042409
3 3 c TRUE I -0.1804774
4 4 d FALSE J 1.3276193
5 5 e TRUE K 0.3944727
o <- order(collection$lima)
collection[o,]
satu dua tiga empat lima
1 1 a TRUE G -0.5836921
3 3 c TRUE I -0.1804774
5 5 e TRUE K 0.3944727
2 2 b FALSE H 0.9042409
4 4 d FALSE J 1.3276193
- There are definitely better ways to do this using tidy tools like dplyr::arrange()!
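Staying in Base-R for a moment, order() also accepts several sort keys and a decreasing= argument. A minimal sketch on a small invented data frame:

```r
df <- data.frame(group=c("b", "a", "b", "a"), value=c(2, 5, 9, 1))   # made-up data
by.group <- df[order(df$group, df$value), ]        # sort by group, then by value within group
by.group
largest.first <- df[order(df$value, decreasing=TRUE), ]
largest.first
```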
1.21 Naming Data
The matrix of aggressive interactions among kids had neither column nor row names
We can add the codes used in the Strayer and Strayer (1976) paper
## load it again because we cleared all objects!
kids <- read.table("./data/strayer_strayer1976-fig2.txt", header=FALSE)
kid.names <- c("Ro","Ss","Br","If","Td","Sd","Pe","Ir","Cs","Ka",
               "Ch","Ty","Gl","Sa", "Me","Ju","Sh")
colnames(kids) <- kid.names
rownames(kids) <- kid.names
kids
Ro Ss Br If Td Sd Pe Ir Cs Ka Ch Ty Gl Sa Me Ju Sh
Ro 0 1 3 4 1 0 0 1 1 0 1 0 7 0 1 0 0
Ss 1 0 7 8 2 1 1 12 3 0 1 1 4 1 0 0 2
Br 1 4 0 7 3 2 2 0 1 0 8 1 5 5 0 0 1
If 3 3 2 0 3 1 13 3 5 1 0 0 8 3 0 2 1
Td 1 0 0 3 0 4 6 0 8 5 1 0 1 3 0 2 1
Sd 0 0 0 0 0 0 2 8 11 0 4 0 4 3 0 1 0
Pe 1 0 1 9 3 4 0 2 0 0 1 0 7 9 1 1 0
Ir 0 0 0 1 1 1 2 0 7 5 1 1 1 0 0 0 0
Cs 1 1 1 2 5 11 0 3 0 0 0 0 1 0 0 0 0
Ka 0 0 0 0 0 0 1 0 0 0 0 11 0 1 0 1 4
Ch 4 0 4 3 3 2 1 0 0 0 0 3 11 5 0 2 2
Ty 0 0 0 0 0 0 0 0 0 2 0 0 2 0 8 0 0
Gl 0 1 9 3 0 3 6 0 0 0 11 2 0 1 0 7 5
Sa 0 1 4 0 1 2 1 0 0 0 1 0 1 0 0 0 0
Me 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0
Ju 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0
Sh 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
colnames() and rownames() are convenience functions for getting and setting column and row names.
1.22 Working on Lists
apply() applies a function along the margins of a matrix.
lapply() applies a function to a list and generates a list as its output.
sapply() is similar to lapply() but generates a vector as its output.
# cross-tabulation of sex partners by race/ethnicity from NHSLS
sextable <- read.csv("./data/nhsls_sextable.txt", header=FALSE)
dimnames(sextable)[[1]] <- c("white","black","hispanic","asian","other")
dimnames(sextable)[[2]] <- c("white","black","hispanic","asian","other")
# take a peek at it
sextable
white black hispanic asian other
white 1131 12 16 3 15
black 5 268 5 0 0
hispanic 39 1 115 0 3
asian 12 0 0 10 4
other 7 0 1 0 18
# calculate marginals
(row.sums <- apply(sextable,1,sum))
white black hispanic asian other
1177 278 158 26 26
(col.sums <- apply(sextable,2,sum))
white black hispanic asian other
1194 281 137 13 40
# using sapply() gives similar output
sapply(sextable,sum)
white black hispanic asian other
1194 281 137 13 40
# create a list -- each element of the list has a different length
aaa <- list(alpha = 1:10, beta = rnorm(50), x = sample(1:100, 100, replace=TRUE))
lapply(aaa,mean)
$alpha
[1] 5.5
$beta
[1] -0.0248892
$x
[1] 50.63
# more compact as a vector
sapply(aaa,mean)
alpha beta x
5.5000000 -0.0248892 50.6300000
# compare the output of sapply() to lapply()
lapply(sextable,sum)
$white
[1] 1194
$black
[1] 281
$hispanic
[1] 137
$asian
[1] 13
$other
[1] 40
- The apply family of functions used to be more widely used; they have been largely supplanted by the immensely powerful tools in dplyr and the tidyverse more generally.
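The function passed to apply() need not be a built-in one. The sketch below, on a small arbitrary matrix, passes an anonymous function to compute the range of each column and compares apply() over rows with the dedicated rowSums() helper.

```r
m <- matrix(1:6, nrow=2)    # rows are 1 3 5 and 2 4 6
row.totals <- apply(m, 1, sum)   # margin 1 = rows
row.totals
rowSums(m)                       # built-in shortcut for the same thing
col.ranges <- apply(m, 2, function(col) max(col) - min(col))   # margin 2 = columns
col.ranges
list.means <- sapply(list(a=1:3, b=4:6), mean)   # named vector of means
list.means
```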
1.23 Flow Control: if
if allows you to conditionally evaluate expressions.
The basic syntax of an if statement is: if(condition) true.branch else false.branch
The else part of the statement is optional.
(coin <- sample(c("heads","tails"),1))
[1] "tails"
if(coin=="tails") b <- 1 else b <- 0
b
[1] 1
Sometimes you can use the very efficient ifelse() function.
ifelse() takes three arguments: (1) the logical test, (2) the result if TRUE, (3) the result if FALSE.
x <- 4:-2
# sqrt(x) produces warnings, but using ifelse to check works without producing warnings
sqrt(ifelse(x >= 0, x, NA))
[1] 2.000000 1.732051 1.414214 1.000000 0.000000 NA NA
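When a branch contains more than one statement, or simply for readability, if and else are usually written with curly braces. The sketch below contrasts that form, which tests a single value, with ifelse(), which works on a whole vector at once; the values are arbitrary.

```r
x <- 7
if (x %% 2 == 0) {        # %% is the remainder operator
  parity <- "even"
} else {                  # else must follow the closing brace on the same line
  parity <- "odd"
}
parity

v <- 1:4
labels <- ifelse(v %% 2 == 0, "even", "odd")   # one label per element
labels
```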
1.24 Flow Control: for
If you want to repeat an action over and over again you need a loop.
Loops are mostly generated using for statements.
The basic syntax of a for loop is: for(item in sequence) statement(s)
x <- 1:5
for(i in 1:5) print(x[i])
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
That’s a pretty silly for loop; there are much more important uses of for loops!
If there are multiple statements executed by a for loop, those statements must be enclosed in curly braces, {}.
We need to be careful with for loops because they can slow code down, particularly when they are nested and the number of iterations is very large.
Vectorizing and using mapping functions like apply and its relatives can greatly speed your code up.
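To make the braces and the vectorizing advice concrete, here is a sketch that computes a running total twice: once with a multi-statement for loop, and once with the vectorized built-in cumsum(). The input values are arbitrary.

```r
x <- c(3, 1, 4, 1, 5)
running <- numeric(length(x))   # pre-allocate the result vector
total <- 0
for (i in seq_along(x)) {       # braces group the two statements in the loop body
  total <- total + x[i]
  running[i] <- total
}
running
cumsum(x)                       # the vectorized equivalent, in one call
```

For a vector this short the speed difference is negligible; the vectorized form mainly pays off when the number of iterations is large.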
1.25 Using Packages
Much of the functionality of R comes from the many contributed packages.
To use a package, you must first install it.
This can be done at the command line using install.packages("package_name").
It is often more convenient to use a menu command; in RStudio this is under Tools > Install Packages...
Once a package is installed, you must load it in order to use it.
Do this using the library() command.
library(igraph)
# might as well do something with it
# a small graph
g <- make_graph( c(1,2, 1,3, 2,3, 3,5), n=5 )
plot(g)