Random Graphs for Data Science

# Random Graphs for Data Science
### Phil Chodrow (MIT → UCLA)

---

---

# The 30-Second Talk

1. Interpreting measurements in network science often requires .embolden[null random graph modeling].
2. Null modeling is a bit of a science and a bit of an art. 
3. Doing it .embolden[rigorously] raises some interesting math problems. 
4. Careful solutions to these problems can alter our perspectives on existing questions and pose interesting new ones. Two extended examples: 
  - Random hypergraphs.
  - Moments of uniform random graphs.

---

# A Universal Workflow for Complex Networks

1) Obtain some network data.

2) Measure a thing. 
- Degree distribution
- Triadic closure
- Max modularity

3) .embolden[Explain] the measurement.

3.5) Call the network "complex."

4) Publish!1!!

---

.large[Measurements need explaining when they indicate structure over and above our **expectations.**]

---

# A Familiar Example

---

# Another Example

`$Q_{\text{max}} \approx 0.25$` in an ER random graph.

`$\rightarrow$` High modularity isn't always even *prima facie* evidence of community structure. 
]]
.pull-right[
.centered-image[
![:scale 100%](img/ER_modularity.png)
]]

How to tell if a modularity score is "real" or "interesting"?

<!-- ---

background-color: #000000
background-image: url(img/context.jpg)
background-size: contain
 -->
---

# Random Graph Nulls

To see if your result needs explaining, compare it to a random graph - **a probability distribution over graphs.**

This probability distribution should be "boring" but not "too boring."

- ER: "too boring." 
- Configuration models: often "just boring enough."
- Lots of other options. 
- Choice should be guided by domain knowledge.

---

How do we **choose** "good" random graph benchmarks for a given data set?

How can we practically do **data science** with these models?]

---

# "Good" Random Graph Models

**Relevance**: The support of the model (set of possible realizations) should be close to the support of the hypothesized generating process.

We often need to balance these against each other.

---

# Today

.embolden[Configuration Models of Random Hypergraphs.] 
- Mostly about **relevance** in the context of polyadic data.

.embolden[Moments of Uniformly Random Multigraphs with Fixed Degree Sequences] 
- Mostly about .highlight[tractability] in the context of dyadic data.

---

# .midi[.large[Configuration Models of Random Hypergraphs]]

---

# Polyadic Networks

Lots of systems we care about contain interactions between embolden[groups] of agents:

- **Collaboration / coauthorship**: Congressional cosponsorship, the network of scholars studying the network of...
- **Co-presence**: drug combinations, patent classifications,
- **Communication**: emails, tweets, etc.

---

# The Dyadic Projection

*"Polyadic structures are hard, but I know about graphs..."*

---

# Dyadic Configuration Models (Graphs)

Observe a graph with degree sequence `$\mathbf{d} \in \mathbb{Z}^n$`.

The (vertex-labeled1) *configuration model* is the uniform probability distribution `$\eta_\mathbf{d}$` over the space 
`$$\mathcal{G}_\mathbf{d} = \{G:\mathrm{deg}(G) = \mathbf{d}\}\;.$$`

.embolden[Common pattern]: project, use `$\mathbf{d}$` from projected graph, randomize with configuration model.

.footnote[1Fosdick et al., "Configuring Random Graph Models with Fixed Degree Sequences", *SIAM Review* (2018)]

---

# Problems with the Dyadic Projection

![](img/projection.png)

Projection distorts meaningful quantities: number of edges, degree.

Fails the **relevance** criterion: never generates multiway interactions. (A world of only 2-author papers)

---

# Representing Polyadic Networks

- .embolden[Bipartite graphs]
  - .midi[Flexible.]
  - .midi[Some probability distributions are hard to formulate.]
- .embolden[Simplicial complexes]
  - .midi[Good data compression and topological properties.]  
  - .midi[Face inclusion is appropriate for some (not all) data generating processes.]  
- .embolden[Hypergraphs]
  - .midi[Flexible and intuitive.]
  - .midi[Few existing null models...]

---

# Hypergraphs

.pull-left[
- `$H = (V,E)$`
- Edges can have *any number* of nodes. 
- Parallel edges are ok `$(\{a,b,c\}, \{a,b,c\})$`
- Degenerate edges are not `$(\{a,a,b,\ldots\})$`
]

---

# Configuration Models (Hypergraphs)

Observe a hypergraph with degree sequence `$\mathbf{d} \in \mathbb{Z}^n$` and **dimension sequence** `$\mathbf{k} \in \mathbb{Z}^m$`.

The (vertex-labeled) *configuration model* is the uniform probability distribution `$\eta_{\mathbf{d}, \mathbf{k}}$` over the space 
`$$\mathcal{H}_{\mathbf{d}, \mathbf{k}} = \{H:\mathrm{deg}(H) = \mathbf{d}\;,\;\color{#0F4C81}{\mathrm{dim}(H) = \mathbf{k}}\}\;.$$`

---

# MCMC Sampling

**Idea**: sample from `$\eta_{\mathbf{d}, \mathbf{k}}$` by crawling through `$\mathcal{H}_{\mathbf{d}, \mathbf{k}}$`.

1. .highlight[Sample]: Pick two edges `$\Delta$` and `$\Gamma$` u.a.r.
2. .highlight[Proposal]: Randomly resample nodes between `$\Delta$` and `$\Gamma$` while avoiding degeneracy
3. .highlight[Transition]: With probability `$m_{\Delta}^{-1}m_{\Gamma}^{-1}$`, accept swap, otherwise start over.

**Theorem**: This Markov chain is ergodic and its invariant measure is the configuration model `$\eta_{\mathbf{d}, \mathbf{k}}$`.  To sample, run this chain a "for long time."

---

---

# Triadic Closure

`$$\bar{C} = \frac{1}{n}\sum_{v \in \mathcal{V}} \frac{T_v}{W_v}$$`

- `$T_v$`: number of triangles incident to `$v$`. 
- `$W_v$`: number of wedges at `$v$`.

Conventional wisdom since Watts-Strogatz (1998): many empirical networks are more clustered than can be explained by (dyadic) configuration models. 
 
What about hypergraph configuration models?

---

# Triadic Closure

| | `$C$` | `$<C>_h$` | `$<C>_{2}$` |
| ------------------------------ | :------: | :-----: | :-------: |
| **congress-bills** | 0.61 | 0.60 | 0.45 | 
| **coauth-Geology** | 0.82 | 0.82 | 0.00 | 
| .highlight[email-Enron] | 0.66 | 0.84 | 0.64 | 
| .highlight[email-Eu] | 0.54 | 0.57 | 0.40 | 
| .highlight[tags-ask-ubuntu] | 0.57 | 0.61 | 0.18 | 
| .highlight[threads-math-sx] | 0.29 | 0.44 | 0.09 |

.footnote[Data from Benson et al. "Simplicial Closure and Higher-Order Link Prediction," *PNAS* (2018)]

---

## Triadic Closure

.midi[
| | `$C$` | `$<C>_h$` | `$<C>_{2}$` |
| ------------------------------ | :------: | :-----: | :-------: |
| **congress-bills** | 0.61 | 0.60 | 0.45 | 
| **coauth-Geology** | 0.82 | 0.82 | 0.00 | 
| .highlight[email-Enron] | 0.66 | 0.84 | 0.64 | 
| .highlight[email-Eu] | 0.54 | 0.57 | 0.40 | 
| .highlight[tags-ask-ubuntu] | 0.57 | 0.61 | 0.18 | 
| .highlight[threads-math-sx] | 0.29 | 0.44 | 0.09 | 
]

Hypergraph configuration model distinguishes **collaboration** networks from .highlight[communication] networks.

---

# The Intersection Profile

How "similar" are two interactions?

Intersection profile:

`$$r(j) = \langle \mathbb{I}(\Delta \cap \Gamma) = j\rangle$$`

"Among all pairs of edges, what proportion have intersection of size `$j$`?"

Measures repeated interactions:  productive **teams** of researchers, chatty **groups** on email, etc.

---

# Enron Intersection Profile

![](img/intersection_profile_single.png)

---

# Approximate Intersection Profiles

Computing null intersection profiles can be very difficult for large data sets. 
Fortunately, we can approximate:

**Theorem:*** With high probability as `$n$` grows large,

.midi[$$r_\eta(j) = (1 + O(n^{-1}))j!\left \langle
\binom{\vert \Delta\vert}{j}\right\rangle^2 \left(\frac{1}{n} \frac{\mathbb{E}[D^2] - \mathbb{E}[D]}{\mathbb{E}[D]^2}\right)^j$$]

*Terms and conditions apply.

---

# Profiles for Big Data™

---

# Summing Up

When the data generating process generates hypergraph-like data, hypergraph configuration models can be a good choice.

Using these models, we can

1. Ask old questions in new ways (triadic closure).
2. As new questions (intersection profiles).

---

# Moments of Uniformly Random Multigraphs with Fixed Degree Sequences

---

# .midi[Our Good Friend, Modularity...]

`$$Q(\mathbf{G}) = \frac{1}{2m}\sum_{ij}\left[W_{ij} - \color{#C49377}{\frac{d_id_j}{2m}}\right]\delta(g_i, g_j)$$`

.highlight[Second term]: expected number of edges between nodes `$i$` and `$j$`, fixing the degree sequence `$d$`.

Maximize `$Q(\mathbf{G})$` to find communities in networks (terms and conditions apply).

---

# A Folk Theorem

.alert[
*The expected number of edges `$w_{ij}$` between two nodes in **"random"** graph **"with degree sequence `$\mathbf{d}$`"** is *

$$ \omega_{ij} = \frac{d_id_j}{2m} + \text{small terms}\;.$$
]

This is right if you choose your random graph model .highlight[conveniently].

- Chung-Lu model (**d** fixed in expectation)
- Stub-matching (biased, not entropy-maximizing)

---

# Which random graph?...

We .highlight[say]: "(entropy-maximizing) random graph."

We .highlight[mean]: "random graph that makes calculations convenient."

.footnote[Fosdick et al., "Configuring Random Graph Models with Fixed Degree Sequences", *SIAM Review* (2018)]

What if we *actually* used the entropy-maximizing random graph for calculations?

---
class: middle

# .large[The Uniform Model]

---

# Which Random Graph?

Let `$\mathcal{G}_\mathbf{d}$` be the set of all .embolden[multigraphs] with fixed degree sequence `$\mathbf{d}$`.

The **uniform random graph model** `$\eta_\mathbf{d}$` is a probability distribution that assigns equal weight to each element of `$\mathcal{G}_\mathbf{d}$`.

`$\eta_\mathbf{d}$` is the entropy-maximizing random graph with fixed degree sequence `$\mathbf{d}$`.

---

# Edge-Swap Monte Carlo

![](img/swaps.png)

We can sample from `$\eta_\mathbf{d}$` by crawling through `$\mathcal{G}_\mathbf{d}$`.

(This is just the dyadic version of the Markov chain we saw for hypergraphs)

.footnote[Image from Fosdick et al., "Configuring Random Graph Models with Fixed Degree Sequences", *SIAM Review* (2018)]

---

# Edge-Swap Monte Carlo

![](img/swaps.png)

1. .embolden[Sample edges] `$(i,j)$` and `$(k,\ell)$` with random orientations. 
2. .embolden[Propose swap] `$(i,j), (k,\ell) \mapsto (i,\ell), (k,j)$`.  
3. .embolden[Accept swap] with probability `$w_{ij}^{-1}w_{k\ell}^{-1}$`.

**Theorem:** This works.1

.footnote[1Fosdick et al., "Configuring Random Graph Models with Fixed Degree Sequences", *SIAM Review* (2018)]

---

Estimating `$\omega_{ij} = \mathbb{E}[W_{ij}]$` via Edge-Swap Monte Carlo is **expensive** (~3 days on my laptop).

How can we estimate efficiently expectations under `$\eta$`?

---

# Approach

Edge-Swap Monte Carlo is a stochastic process with known equilibrium distribution `$\eta_\mathbf{d}$`.

At equilibrium, all **expectations** are fixed.

`$$\mathbb{E}_\eta[f(G_{t+1}) - f(G_t)] = 0\;.$$`

If we choose `$f(G) = w_{ij}^p$` for `$p = 1,\ldots$` and handle .embolden[a lot] of algebra, we can get analytical approximations.

.footnote[PSC, "Moments of Uniformly Random Multigraphs with Fixed Degree Sequences, `arxiv:1909.09037` (2020)]
---

# Expected Number of Neighbors

Let `$\beta_i = \mathbb{E}_\eta[\text{# of distinct neighbors of } i]$`.

.alert[
**Theorem**: `$$\chi_{ij}\triangleq \mathbb{E}_\eta[\mathbb{I}(W_{ij} > 0)] = \frac{\beta_i\beta_j}{\sum_\ell \beta_\ell} + \text{small terms}$$`
]

---

# A Parallel

.funding[.embolden[Folk Theorem]
`$$\omega_{ij} \triangleq \mathbb{E}_\eta[W_{ij}] \stackrel{?}{=} \frac{d_id_j}{\sum_\ell d_\ell} + \text{small terms}$$`
]

.alert[
**Actual Theorem**: `$$\chi_{ij}\triangleq \mathbb{E}_\eta[\mathbb{I}(W_{ij} > 0)] = \frac{\beta_i\beta_j}{\sum_\ell \beta_\ell} + \text{small terms}$$`
]

---

# Expected Edge Densities

.alert[
**Theorem**: `$$\omega_{ij}\triangleq \mathbb{E}_\eta[W_{ij}] = \frac{\chi_{ij}}{1 - \chi_{ij}} + \text{small terms}$$` 
]

Can also get higher-order moments: variances, etc.

---

# Determining `$\beta$`
.midi[
The vector `$\beta$` approximately solves

`$$\sum_{j}\frac{\beta_i\beta_j}{\sum_\ell\beta_\ell - \beta_i\beta_j} = d_i \quad \forall i\;.$$`

.alert[
**Theorem**: Solution (if it exists) is unique on the set 
`$$\mathcal{B} = \left\{\beta: \beta \geq \mathbf{0},\; \beta_i^2 < \sum_\ell \beta_\ell \quad \forall i\right\}\;.$$`
] 
]

# Proof Sketch
.midi[
Let `$h(\beta)_i = \sum_{j}\frac{\beta_i\beta_j}{\sum_\ell\beta_\ell - \beta_i\beta_j}$`. We want to solve `$\mathbf{h}(\beta) = \mathbf{d}$`. 
Equivalently, `$$\mathcal{L}(\beta) \triangleq \lVert \mathbf{h}(\beta) - \mathbf{d} \rVert^2 = \mathbf{0}$$`

1. The Jacobian `$\mathbf{J}$` of `$\mathbf{h}$` has strictly positive eigenvalues on `$\mathcal{B}$` (linear algebra tricks).
2. The Hessian `$\mathcal{H}$` of `$\mathcal{L}$` is positive-definite at all critical points of `$\mathcal{L}$` `$\implies$` all critical points are isolated local minima (more linear algebra tricks). 
3. .embolden[Mountain Pass Theorem]:1 If two or more, `$\exists$` a third which is not a local minimum. 
4. So, `$\exists$` at most one local minimizer of `$\mathcal{L}$` on `$\mathcal{B}$`. 
]

# .large[Experiments]

---

# High School Contact Data

Collected by the SocioPatterns project.

---

## Evaluating Approximations

---

## Estimating `$\mathbb{E}[W]$`

---

## Downstream Consequences: Modularity Maximization

---

# Summing Up

**Rigor** means 
- Carefully articulating our models 
- Checking that our heuristics work.

Entropy-maximizing random graphs with fixed degree sequences break our standard heuristics.

Using a dynamical approach, we can get **rigorous approximations** with **excellent performance** on real data.

---

# The 30-Second Talk

---

# Thanks! See you soon....?

.pull-left-narrow[.thanks[
.centered-image[![:scale 100%](img/bio-photo.jpg)]
Phil Chodrow
.small[
 MIT `$\rightarrow$` UCLA
 
 [philchodrow.com](https://www.philchodrow.com)

[@philchodrow](https://twitter.com/PhilChodrow)
]]]

Configuration Models `arXiv:1902.09302`

Moments of Multigraphs `arxiv:1909.09037`
]
.funding[
![:scale 120px](img/nsf_logo.png)
![:scale 220px](img/mit_logo.png)
]]

---

# Supplementary Slides

---

# A Note on Stub-Matching

Sometimes, "the configuration model" is defined via a stub-matching process.

This is related to the uniform model for .embolden[simple] graphs and hypergraphs.

**Very few data sets are natively simple.**

]]

![](img/stub-matching.png)
]

.footnote[1Fosdick et al., "Configuring Random Graph Models with Fixed Degree Sequences", *SIAM Review* (2018)]