3 Data

3.1 Data Formats

We all know that guy who thinks he’s smart because he keeps his relational data in the form of a sociomatrix.¹ But is he actually that smart? If you have \(n\) individuals in your population, the size of your sociomatrix will be \(n^2\). If \(n\) is large, this can create quite a memory hog of a file. Suppose you have 1000 vertices in your graph. You then have a \(1000 \times 1000 \times 4\) bytes, which is 4Mb just for your relational data. This doesn’t include the memory allotted to the delimiter, which will typically double the size of the file. Now, 1000 nodes is not actually that big, of course; this can become a real problem when you move into the realm of big data.

The thing is, most social networks are quite sparse. Graph densities of 1-2% are not unusual for many social relations. This means your matrix has lots of zeros in it and each of those zeros occupies the same amount of memory as the ones. A smarter way to represent data such as those contained in a sociomatrix is to capitalize on the sparseness and simply indicate the location of non-zero entries. There are a number of tools for sparse-matrix representation for high-performance computing. Interestingly, probably the most useful representation of network data is essentially a sparse-matrix format, known in social network analysis as an edgelist. At its minimum, an edgelist is a two-column matrix. Column one contains the ID for egos, while the second column contains the IDs for the egos’ respective alters.

3.2 Importing Data to `R`

Getting your network data into the right format to analyze it can be a surprisingly difficult task that is not often addressed in social network analysis textbooks and courses. Here we go over some of the basics of formatting and importing network data.

For this tutorial, we use networks of support between Hogwarts students, as coded by Bossart and Meidert (2013). Their data for the first six Harry Potter books are available to download from Tom Snijders’ Siena page at Oxford. These unvalued networks contains a directed tie between two Hogwarts students if one student provided verbal support to the other. We’ve reformatted the data to illustrate a couple different methods for handling data. Download our version of the student attribute data, the book-five edgelist, and the book-five sociomatrix.

3.3 Edgelists

An edgelist is usually formatted as a table where the first two columns contain the IDs of a pair of nodes in the network that have a tie between them. Optional additional columns may contain properties of the relationship between the nodes (e.g., the value of a tie). Any pair of nodes that does not have a tie between them is usually not included in an edgelist. This property is what makes edgelists a more efficient network data storage format than sociomatrices (see below). Unobserved edges can be encoded in edgelist format by including “NA” in the value column. Here’s an example of a simple edgelist table with a value column:

require(igraph)
library(knitr)

p1 <- c("Harry", "Harry", "Hermione")
p2 <- c("Hermione", "Ron", "Ron")

values <- c(5,6,5)

schoolmodel <- data.frame(p1, p2, values)
colnames(schoolmodel) <- c("Ego", "Alter", "Value")
knitr::kable(schoolmodel)

Ego	Alter	Value
Harry	Hermione	5
Harry	Ron	6
Hermione	Ron	5

The columns in an edgelist table are usually ordered “Ego” (often the person who completed the interview or who was the subject of a focal follow) followed by “Alter” (the person that the focal individual named or interacted with). In the case of undirected network data, the ordering of these columns does not matter, but in directed data, it does. igraph and statnet software will encode directed edgelist data with the arrow pointing from the first to the second column, so if the ties you recorded have reversed directionality from ego to alter (e.g., the alter gave something to ego) you should flip the order of the columns before converting the data to a network in order to get the edges properly directed (if you want your network to show the direction that support flows in the network). Most directed ties are straightforward to interpret but sometimes it gets complicated, depending on your research design.

The final column in our imaginary edgelist contains a value for the edge. It might be the number of years the pair have known each other, or some measure of the strength or quality of their relationship.

3.3.0.1 Creating your edgelist

A critical first step in constructing an edgelist is to ensure that all individuals in the dataset have unique identifiers. Note that names, even first and last names together, are frequently not unique identifiers, even in relatively small communities. Using some form of ID number or a code generated using a combination of individual characteristics (e.g., initials, location, last four digits of their telephone number) is safer. Implementing a good system for obtaining uniquely identifying information about both egos and alters at the beginning of data collection is important for avoiding serious entity resolution problems later on.

To code up an edgelist from fieldnotes or other unstructured datasets, each time you find a tie between two individuals, simply enter a row with their ID numbers into the edgelist, in the correct order. If the same two individuals are observed to interact multiple times (which may or may not be possible depending on your study design), the best procedure is to record an additional edge in your edgelist every time that interaction is observed, and to maintain an additional column with the date and/or time the interaction was observed, or some other identifying information. It is easy to aggregate or remove these duplicates in R at later stages.

Here’s a few examples of supportive interactions between Hogwarts students, and how I would code them up as support ties:

“Oh, yes,” said Luna [to Harry], “I’ve been able to see them since my first day here. They’ve always pulled the carriages. Don’t worry. You’re just as sane as I am.” Chapter 10, p. 180
“Harry, you’re the best in the year at Defence against the Dark Arts,” said Hermione. Chapter 15, p. 292
“That was quite good,” Harry lied, but when she [Cho] raised her eyebrows he said, “Well, no, it was lousy, but I know you can do it propoerly, I was watching from over there.” Chapter 18, p. 350

From <- c("Luna Lovegood", "Hermione Granger", "Harry James Potter") 
To <- c("Harry James Potter", "Harry James Potter", "Cho Chang")
Chapter <- c(10, 15, 18)
Page <- c(180, 292, 350)
supegs <- data.frame(From, To, Chapter, Page)
knitr::kable(supegs)

From	To	Chapter	Page
Luna Lovegood	Harry James Potter	10	180
Hermione Granger	Harry James Potter	15	292
Harry James Potter	Cho Chang	18	350

The columns “Chapter” and “Page” might not be necessary for any analysis we want to do, but they help us find the original data again in the future, in case we needed to double-check something, and they also allow us to distinguish duplicate ties between characters. If we were coding such data for a real project, we would probably also record some details about the specific nature and nature of the interaction.

A similar procedure could be used to code up networks from field notes, although, obviously, you will have to make careful decisions about how to classify interactions, and about how you are sampling interactions among individuals. Such decisions should usually be made a priori (ideally with the benefit of a pilot study). If you are encoding relations that you view as constituting different networks, the easiest solution is usually to maintain separate edgelists for each type of relation.

As a side note here, we cannot stress enough how important your sampling of interactions in a network is to the conclusions that you can draw from the data. For example, although the Harry Potter data we’re examining here is a complete sample of student support interactions in the books, if this were a real ethnographic study of support ties among students in a school, the data would have a severe sampling bias. This is because this data set only contains interactions among individuals who occur in the presence of Harry Potter. It’s essentially as if we did a very long focal follow on Harry. We don’t know what the students might say to each other when Harry isn’t around; so we can almost be certain that this network will show that Gryffindors are more supportive of each other than Slytherins are. Slytherins might actually be very supportive of each other, but because we’ve only sampled what Harry experiences, the sample is biased towards interactions between Gryffindors and of course, interactions with Harry himself. Although this may seem like a silly example, this is actually an important point to remember for any fieldworker planning to construct networks from observational data. How do your research participants (human or otherwise) interact when you’re not around? Like Harry Potter, you are the center of your own universe and the interactions you observe may not be a representative sample. You need to carefully design your study and sampling strategy to address these potential biases (see Altmann (1974) for a great discussion of these issues).

Finally, if your edgelist is based on a name generator or other survey-based network data collection method, it should be straightforward to generate an edgelist—the main difficulty with these methods is usually entity resolution.

3.3.1 Importing an edgelist

Usually, you will save your edgelist as a tab- or comma-delimited file (.txt or .csv) and then import it to R.

hp5edges <- read.table("data/hp5edgelist.txt", sep="\t", header=TRUE)

#take a look
head(hp5edges)

         X.From.              X.To.
1 Alicia Spinnet     Alicia Spinnet
2 Alicia Spinnet   Angelina Johnson
3 Alicia Spinnet       Fred Weasley
4 Alicia Spinnet     George Weasley
5 Alicia Spinnet Harry James Potter
6 Alicia Spinnet         Katie Bell

#convert to an igraph network
hp5edgesmat <- as.matrix(hp5edges) #igraph wants our data in matrix format
hp5net <- graph_from_edgelist(hp5edgesmat, directed=TRUE)

#extract first names from list of names to make nicer labels. 
##FYI this is insane R notation to extract elements from a list, fear not if you don't get it
firsts <- unlist(lapply(strsplit(V(hp5net)$name,  " "), '[[', 1))

#let's take a look
plot(hp5net, vertex.shape="none", vertex.label.cex=0.6, edge.arrow.size=0.4,
     vertex.label=firsts, layout=layout.kamada.kawai)

Nice. Harry is in the center of the network, as we might expect. But there are lots of self-loops that make the graph hard to read. It’s great that Hogwarts students seem to support themselves, but we are more interested in when they support others.

Removing loops

The function simplify() in igraph handily removes self-loops from a network.

hp5netsimple <- simplify(hp5net)
plot(hp5netsimple, vertex.shape="none",vertex.label.cex=0.7, edge.arrow.size=0.4,
     vertex.label=firsts, layout=layout.kamada.kawai)

Adding isolates not in the edgelist

A flaw in the edgelist format is that nodes that are isolates are not listed in the edgelist. So, if the network contains isolates, these must be added afterwards. There are several characters who appear in Book 5 who do not have any ties in the network:

book5isolates <- c("Lavender Brown", "Millicent Bulstrode", "Michael Corner", 
                   "Roger Davies", "Theodore Nott", "Zacharias Smith")

hp5netfull <- add_vertices(hp5netsimple, nv=length(book5isolates), attr=list(name=book5isolates))
#note that here we add as many vertices as in our list of isolates, and assign them the attribute "name" which is stored in our list of isolates

#regenerate our list of first names to update it to include these new characters
firsts <- unlist(lapply(strsplit(V(hp5netfull)$name,  " "), '[[', 1))

plot(hp5netfull, vertex.shape="none", vertex.label.cex=0.6, edge.arrow.size=0.4, vertex.label=firsts, layout=layout.kamada.kawai)

Duplicate ties

A common problem is that you might have an edgelist where the same edges are present more than once. For instance, you might have observed two monkeys groom each other on multiple occasions, and recorded each occurrence separately; or, if you interviewed both the ego and alter, the same tie might appear more than once (e.g., Ron said he was friends with Harry, and Harry reported the same thing).

If you just want to remove duplicate edges, you can use the same simplify() functions used above to remove self-loops. But, you might prefer to count the number of times an edge appeared and use it as an edge value.

Let’s add some duplicates edges to our Book 5 edgelist, then use the aggregate() functions to count occurrences.

hp5newedges <- data.frame(rbind(hp5edges, c("Harry James Potter", "Ronald Weasley"), 
                                c("Ronald Weasley", "Harry James Potter"), 
                                c("Hermione Granger", "Ronald Weasley"), 
                                c("Hermione Granger", "Harry James Potter"), 
                                c("Hermione Granger", "Harry James Potter")))

#this takes a list of ones and takes the sum of them for each unique row in the edgelist
hp5edgeweights <- aggregate(list(count=rep(1,nrow(hp5newedges))), hp5newedges, FUN=sum)

#take a look at the edges that appear more than once
hp5edgeweights[which(hp5edgeweights$count>1),]

               X.From.              X.To. count
81    Hermione Granger Harry James Potter     3
88      Ronald Weasley Harry James Potter     2
139 Harry James Potter     Ronald Weasley     2
140   Hermione Granger     Ronald Weasley     2

#create a network and assign the counts as an edge weight
hp5netweight <- graph_from_edgelist(as.matrix(hp5edgeweights[,1:2]), directed=TRUE)
E(hp5netweight)$weight <- hp5edgeweights$count

Then you can proceed to remove self-loops and add isolates, as before; and use the edge weight attribute in plotting and analysis.

3.3.1.1 Loading up individual attributes

Normally, in addition to your edgelist (and possibly, edge values) you will have a table of attributes for the nodes in your network; variables like age and gender, for example. Loading these data is just like importing any other data in R:

attributes <- read.csv("data/hpindividuals.csv", header=TRUE)
head(attributes)

  id                   name schoolyear gender house
1 12         Demelza Robins       1993      2     1
2  3       Angelina Johnson       1989      2     1
3 34            Lucian Bole       1988      1     4
4 13         Dennis Creevey       1994      1     1
5 56         Ronald Weasley       1991      1     1
6 28 Justin Finch-Fletchley       1991      1     2

This table appears to contain individuals who are not in the Book 5 network. Let’s use the %in% operator to extract just the individuals in the Book 5 network from our attributes dataframe.

book5students <- attributes[attributes$name %in% V(hp5netfull)$name,]

#then reorder the attributes to match the order that the individuals appear as network vertices
book5students <- book5students[match(V(hp5netfull)$name, book5students$name), ]

#now we can simply assign the dataframe columns as vertex attributes, e.g., 
V(hp5netfull)$house <- book5students$house

#but let's use a factor to get nice colours
housecolours <- c("firebrick", "darkgoldenrod", "darkslateblue", "darkgreen")
V(hp5netfull)$housecolours <-as.character(factor(book5students$house, levels=1:4, labels=housecolours))

plot(hp5netfull, vertex.shape="none", vertex.label.color=V(hp5netfull)$housecolours, vertex.label.cex=0.6, edge.arrow.size=0.3, vertex.label=firsts, layout=layout.kamada.kawai)

3.3.2 Sociomatrices

Network data may also be formatted as a sociomatrix, which is a square matrix where the rows and columns represent individuals in the network and the cell values represent the ties between them (valued or 0/1 for unvalued). In general, it is impractical to record data in this format. In most cases, trying to record data in a sociomatric means you will be constantly scrambling to add more columns and rows that will be mostly filled with zeroes. Much better to maintain a list of individuals and edges as they appear.

Sociomatrices are more useful when there is a non-zero value in most cells (i.e., the data are not sparse). Such is the case when the relationship between nodes involves some kind of distance metric (e.g., physical distance between households or relatedness coefficients). Distance matrices like these are very useful to include as edge covariates in an ERGM, for example, but you will not usually generate such matrices by hand. Instead, you will likely generate these matrices in R or some other software program using calculations based off other data (e.g., latitude and longitude coordinates).

Here’s how to import a sociomatrix (and get it correctly lined up with your attribute data).

#read in data
hp5df <- read.table("data/hp5matrix.txt", sep="\t", header=FALSE) #note that this file has no row or column names

#inspect it
head(hp5df)

                                                                                                                               V1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

dim(hp5df)

[1] 64  1

Uh-oh, R thinks our data frame has only one column. That’s not right. Taking a look at the data in a spreadsheet viewer suggests maybe this dataset is formatted differently. Let’s try a different separator when we read in the data.

#read in data
hp5df <- read.table("data/hp5matrix.txt", sep=" ", header=FALSE) 

#inspect
dim(hp5df) #much better

[1] 64 64

Now the data are loaded correctly but we still have a problem. This sociomatrix has dimensions of 64x64, meaning there are 64 students in the network, but from before, we know that only 34 students actually appear in Book 5. This discrepancy is because Bossart and Meidert included all students who appeared in support relationships in ANY of the first six books so that they could do a temporal analysis. We’d prefer to just look at the students who appear in Book 5, so we need to remove these other students.

#start by adding row and column names to our sociomatrix
#we know from context that the students in this sociomatrix are ordered by ID number
studentsIDs <- sort(attributes$id) #get an ordered list of students IDS
colnames(hp5df) <- rownames(hp5df) <- studentsIDs #assign them as row/col names of the sociomatrix

#extract out the students based on our list of individuals from Book 5
hp5only <- hp5df[book5students$id, book5students$id]
#note that this operation conveniently put the students in the same order as our attributes table from before

#now we can make our network
hp5mat <- as.matrix(hp5only)
hp5net2 <- graph_from_adjacency_matrix(hp5mat)

#get first names again
firsts2 <- unlist(lapply(strsplit(as.character(book5students$name),  " "), '[[', 1))

plot(hp5net2, vertex.shape="none", vertex.label.cex=0.6, edge.arrow.size=0.3,
     vertex.label=firsts2, layout=layout.kamada.kawai)

Now we can proceed as before to remove self-loops and assign attributes.

4 From affiliation data to a network (bipartite networks)

Sometimes networks may be generated from data on the presence or absence of individuals at certain place or time. Sociomatrices are square. If there are \(n\) individuals in the sample, the resulting sociomatrix will be of rank \(n \times n\). When you have matrix where you observe \(n\) people at \(k\) events or associated with \(k\) institutions, the resulting matrix representing these relationships will be \(n \times k\), which will not necessarily be square. This matrix is known as an incidence or affiliation matrix. To convert an affiliation (person-by-event) matrix to a person-to-person network, simply multiply the affiliation matrix by its transpose.

Before doing this though, carefully consider what an affiliation matrix means in terms of relationships between persons. Not all group, place, or event affiliations correspond to relationships between persons that are meaningful with respect to your research question.

4.1 Dual Networks

davismat <- as.matrix(read.table("data/davismat.txt",header=TRUE))
southern <- graph_from_incidence_matrix(davismat)

Warning: `graph_from_incidence_matrix()` was deprecated in igraph 1.6.0.
ℹ Please use `graph_from_biadjacency_matrix()` instead.

southern

IGRAPH 8163e36 UN-B 32 89 -- 
+ attr: type (v/l), name (v/c)
+ edges from 8163e36 (vertex names):
 [1] EVELYN   --E1 EVELYN   --E2 EVELYN   --E3 EVELYN   --E4 EVELYN   --E5
 [6] EVELYN   --E6 EVELYN   --E8 EVELYN   --E9 LAURA    --E1 LAURA    --E2
[11] LAURA    --E3 LAURA    --E5 LAURA    --E6 LAURA    --E7 LAURA    --E8
[16] THERESA  --E2 THERESA  --E3 THERESA  --E4 THERESA  --E5 THERESA  --E6
[21] THERESA  --E7 THERESA  --E8 THERESA  --E9 BRENDA   --E1 BRENDA   --E3
[26] BRENDA   --E4 BRENDA   --E5 BRENDA   --E6 BRENDA   --E7 BRENDA   --E8
[31] CHARLOTTE--E3 CHARLOTTE--E4 CHARLOTTE--E5 CHARLOTTE--E7 FRANCES  --E3
[36] FRANCES  --E5 FRANCES  --E6 FRANCES  --E8 ELEANOR  --E5 ELEANOR  --E6
+ ... omitted several edges

V(southern)$type

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

V(southern)$shape <- c(rep("circle",18), rep("square",14))   
V(southern)$color <- c(rep("blue",18), rep("red", 14))
plot(southern, layout=layout.bipartite)

For perhaps obvious reasons, one typically doesn’t see a lot of bipartite graph plots. It is difficult to see the relational dependencies that are present in such a plot and we are often interested not in the bipartite structure itself but in the relationships that are induced by such a structure. For instance, we may be interested in the political or social opportunities afforded two actors by attending a common social event. By two actors, \(i\) and \(j\) attending the same party (i.e., having mutual ties to \(k\)), we can imagine them having a social relationship which is induced by the shared event. Breiger (1974) wrote about this in his classic paper on the duality of people and groups. Breiger notes, groups and individuals entail a type of duality:

think of each tie between two groups as a set of persons who form the ‘intersection’ of the groups’ memberships. In the dual case, think of each membership tie between two persons as the set of groups in the ‘intersection’ of their individual affiliations.” (Breiger 1974: 182).

Breiger, in fact, analyzed the Davis Southern Women data set. Given an \(n \times k\) incidence matrix \(\mathbf{X}\), where there are \(n\) actors and \(k\) “events” with which actors are associated (\(x_{ij}=1\) means that actor \(i\) participated in event \(j\)), we can perform a projection to acquire the induced sociomatrix \(\mathbf{A}\) in which a tie indicates that two actors share a common event.

\[ \mathbf{A} = \mathbf{X}\, \mathbf{X}^T. \]

That is, the sociomatrix is simply the product of the bipartite graph \(\mathbf{X}\) and its transpose. Remember that the rules of matrix multiplication show that the rank of the product of a conformable matrix multiplication is the outer dimensions of the matrices being multiplied. In this case, we are multiplying a \(n \times k\) and its transpose which has dimensions \(k \times n\). Thus, the resulting matrix is \(n \times n\). It is a square matrix associating actors according to a specific social relation (i.e., sharing an event). In other words, it is a sociomatrix.

In a parallel fashion, we can construct a matrix which associates events with each other – i.e., two events share an edge if they share at least one actor in common.

\[ \mathbf{E} = \mathbf{X}^T\, \mathbf{X}. \]

So, to get the \(18 \times 18\) matrix of women co-attendance, do some matrix multiplication matrix multiplication by %*% (inner product). To calculate the inner product, the inner dimensions must match and the resulting matrix has outer dimensions (in this case) of \(18 \times 14 \cdot 14 \times 18 \Rightarrow 18 \times 18\) (i.e., person-by-person).

#Sociomatrix
(f2f <- davismat %*% t(davismat))

          EVELYN LAURA THERESA BRENDA CHARLOTTE FRANCES ELEANOR PEARL RUTH
EVELYN         8     6       7      6         3       4       3     3    3
LAURA          6     7       6      6         3       4       4     2    3
THERESA        7     6       8      6         4       4       4     3    4
BRENDA         6     6       6      7         4       4       4     2    3
CHARLOTTE      3     3       4      4         4       2       2     0    2
FRANCES        4     4       4      4         2       4       3     2    2
ELEANOR        3     4       4      4         2       3       4     2    3
PEARL          3     2       3      2         0       2       2     3    2
RUTH           3     3       4      3         2       2       3     2    4
VERNE          2     2       3      2         1       1       2     2    3
MYRNA          2     1       2      1         0       1       1     2    2
KATHERINE      2     1       2      1         0       1       1     2    2
SYLVIA         2     2       3      2         1       1       2     2    3
NORA           2     2       3      2         1       1       2     2    2
HELEN          1     2       2      2         1       1       2     1    2
DOROTHY        2     1       2      1         0       1       1     2    2
OLIVIA         1     0       1      0         0       0       0     1    1
FLORA          1     0       1      0         0       0       0     1    1
          VERNE MYRNA KATHERINE SYLVIA NORA HELEN DOROTHY OLIVIA FLORA
EVELYN        2     2         2      2    2     1       2      1     1
LAURA         2     1         1      2    2     2       1      0     0
THERESA       3     2         2      3    3     2       2      1     1
BRENDA        2     1         1      2    2     2       1      0     0
CHARLOTTE     1     0         0      1    1     1       0      0     0
FRANCES       1     1         1      1    1     1       1      0     0
ELEANOR       2     1         1      2    2     2       1      0     0
PEARL         2     2         2      2    2     1       2      1     1
RUTH          3     2         2      3    2     2       2      1     1
VERNE         4     3         3      4    3     3       2      1     1
MYRNA         3     4         4      4    3     3       2      1     1
KATHERINE     3     4         6      6    5     3       2      1     1
SYLVIA        4     4         6      7    6     4       2      1     1
NORA          3     3         5      6    8     4       1      2     2
HELEN         3     3         3      4    4     5       1      1     1
DOROTHY       2     2         2      2    1     1       2      1     1
OLIVIA        1     1         1      1    2     1       1      2     2
FLORA         1     1         1      1    2     1       1      2     2

gf2f <- graph_from_adjacency_matrix(f2f, mode="undirected", diag=FALSE)
gf2f <- simplify(gf2f)
plot(gf2f, vertex.color="skyblue2")

The elements of the matrix count the number of mutual events the row and column individual share. The diagonal elements of the matrix count the number of events attended by that individual. Who is the most central in this dense graph?

cb <- igraph::betweenness(gf2f)
# simple function to rescale values in range [a b] to [c d]
# this is very useful for scaling vertex size for visualization
rescale <- function(x,a,b,c,d) c + (x-a)/(b-a)*(d-c)
plot(gf2f,vertex.size=rescale(cb,0, 1.38, 5, 20), vertex.color="skyblue2")

Turns out that there isn’t actually that much variation in centrality. Now events:

### this gives you the number of women at each event (diagonal) or mutually at 2 events
(e2e <- t(davismat) %*% davismat)

    E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 E13 E14
E1   3  2  3  2  3  3  2  3  1   0   0   0   0   0
E2   2  3  3  2  3  3  2  3  2   0   0   0   0   0
E3   3  3  6  4  6  5  4  5  2   0   0   0   0   0
E4   2  2  4  4  4  3  3  3  2   0   0   0   0   0
E5   3  3  6  4  8  6  6  7  3   0   0   0   0   0
E6   3  3  5  3  6  8  5  7  4   1   1   1   1   1
E7   2  2  4  3  6  5 10  8  5   3   2   4   2   2
E8   3  3  5  3  7  7  8 14  9   4   1   5   2   2
E9   1  2  2  2  3  4  5  9 12   4   3   5   3   3
E10  0  0  0  0  0  1  3  4  4   5   2   5   3   3
E11  0  0  0  0  0  1  2  1  3   2   4   2   1   1
E12  0  0  0  0  0  1  4  5  5   5   2   6   3   3
E13  0  0  0  0  0  1  2  2  3   3   1   3   3   3
E14  0  0  0  0  0  1  2  2  3   3   1   3   3   3

ge2e <- graph_from_adjacency_matrix(e2e, mode="undirected", diag=FALSE)
ge2e <- simplify(ge2e)
plot(ge2e, vertex.color="skyblue2")

4.1.1 Visualizing Incidence Matrices

The “Spy Matrix” is a handy way to visualize the overall structure of a matrix. These visualizations are particularly useful for incidence matrices for bipartite graphs. Define a function spyR():

spyR <- function(X, pixcol=c("white","black"), xl="",yl="",
                 draw.labs=FALSE,labs=NA,box=TRUE,las=2){
        s <- dim(X)
        x <- 1:s[1]
        y <- 1:s[2]
        yr <- rev(y)
        Xr <- t(X)[,yr]
        image(x,y,Xr,col=pixcol,axes=FALSE,xlab=xl,ylab=yl,asp=NA)
        if(draw.labs){
                axis(1,at=labs,labels=labs,tick=FALSE)
                axis(3,at=labs,labels=rev(labs),tick=FALSE)
        }
        if(box) box()
}

First visualize the nested incidence matrix:

nest <- matrix( c(1,1,1,1,1,
                  1,1,1,1,0,
                  1,1,1,0,0,
                  1,0,0,0,0,
                  1,0,0,0,0), nr=5, nc=5, byrow=TRUE)
dimnames(nest)[[1]] <- c("1","2","3","4","5")
dimnames(nest)[[2]] <- c("A","B","C","D","E")

spyR(nest, pixcol=c("white","blue4"))
axis(3,at=1:5,labels=LETTERS[1:5],tick=FALSE)
axis(2,at=1:5,labels=5:1,tick=FALSE, las=2)

Its associated bipartite graph:

gnest <- graph_from_incidence_matrix(nest,mode="all")
V(gnest)$vertex.color <- c(rep("cyan",5), rep("magenta",5))
plot(gnest, layout=layout_as_bipartite(gnest), 
     vertex.color=V(gnest)$vertex.color, vertex.size=20)

Now look at the projected graph. An \(n\)-star in a bipartite graph (e.g., species 1) becomes a complete graph in the projection.

nums <- nest %*% t(nest)

gnums <- graph_from_adjacency_matrix(nums, mode='undirected', diag=FALSE)
gnums <- simplify(gnums)
plot(gnums, vertex.color="cyan", vertex.size=20)

4.1.2 Visualizing Modular Networks

Completely modular network. Species A-E interact (e.g., prey upon) species 6-10 and so on. Modularity of antagonistic networks dissipates trophic cascades, which can be highly destabilizing. As above, this network has been permuted to show the structure.

mod  <- matrix( c(0,0,0,0,0,0,0,1,1,1,
                  0,0,0,0,0,0,0,1,1,1,
                  0,0,0,0,0,0,0,1,1,1,
                  0,0,0,0,0,1,1,0,0,0,
                  0,0,0,0,0,1,1,0,0,0,
                  1,1,1,1,1,0,0,0,0,0,
                  1,1,1,1,1,0,0,0,0,0,
                  1,1,1,1,1,0,0,0,0,0,
                  1,1,1,1,1,0,0,0,0,0,
                  1,1,1,1,1,0,0,0,0,0), nr=10, nc=10, byrow=TRUE)

spyR(mod, pixcol=c("white","blue4"))
axis(3,at=1:10,labels=LETTERS[1:10],tick=FALSE)
axis(2,at=1:10,labels=10:1,tick=FALSE, las=2)

Now look at the associated bipartite graph:

gmod <- graph_from_biadjacency_matrix(mod,mode="all")
V(gmod)$vertex.color <- c(rep("cyan",10), rep("magenta",10))
plot(gmod, layout=layout_as_bipartite(gmod), 
     vertex.color=V(gmod)$vertex.color, vertex.size=20)

Now look at the projected graph. Disconnected components in the bipartite graph become disconnected components in in the projection, not surprisingly.

nums1 <- mod %*% t(mod)

gnums1 <- graph_from_adjacency_matrix(nums1, mode='undirected', diag=FALSE)
gnums1 <- simplify(gnums1)
plot(gnums1, vertex.color="cyan", vertex.size=20)

5 Reading a Network from a Data Frame

igraph has the capacity to construct a graph object from one or two data frames. It is quite convenient to use the option with two data frames, where the first is simply the edgelist for the network and the second holds all the associated vertex-covariate data. The two data frames are linked by the symbolic vertex names. This function is similar in principle to the base R function merge for joining data frames.

Read in the Colorado Springs high-risk network (e.g., Klovdahl et al. 1994) using graph_from_data_frame().

csel <- read.table("data/colorado_springs/edges.tsv",header=TRUE)
csppl <- read.table("data/colorado_springs/nodes.tsv", header=TRUE)
cs <- graph_from_data_frame(csel,vertices=csppl,directed=FALSE)
cs <- simplify(cs)
## extract just the giant component to plot
cs1 <- induced_subgraph(cs, subcomponent(cs,1))
plot(cs1, vertex.color="skyblue2", vertex.size=2, vertex.label=NA,
     layout=layout_with_fr(cs1))

That’s a whole lotta risk. The network is large enough and the density high enough that it is quite difficult to make out the structure of the network from such a plot. Outputting the graphic to pdf format and large size helps tremendously.

There’s a good chance thay your matrix-guy uses Matlab, which, for a variety of reasons, is not a good language for scientific computing.↩︎