5  Clustering and Community Detection

These notes are sparser than the previous ones, and build on the introduction of the modularity maximization problem in Newman 14.1 and 14.2.

5.1 Another Perspective on Modularity

Recall the defintion of the modularity \(Q\) of a graph \(G\) with label vector \(\mathbf{z}\):

Newman 7.54 or 14.1

\[ Q(G, \mathbf{z}) \triangleq \frac{1}{2m}\sum_{i,j \in N}\left[a_{ij} - \frac{k_ik_j}{2m}\right]\delta(z_i, z_j)\;. \qquad(5.1)\]

This expression highlights two things:

  • First, we are comparing the actual adjacency matrix \(\mathbf{A}\) of the graph to the expected adjacency matrix with entries \(k_ik_j/2m\) under the configuration model.
  • Second, we are performing this comparison only on the edges in which \(\delta(z_i,z_j) = 1\). These are the edges on which \(z_i = z_j\); i.e. the edge joins two nodes in the same community.

The modularity maximization heuristic now says that, in order to find an interesting clustering of our graph, we should find a label vector \(\mathbf{z}\) to make \(Q\) large.

The idea of comparing to a configuration model random graph is a pretty useful way to think about the modularity, but there are also others. Let’s take another point of view.

This derivation follows Newman eq. 7.55 through 7.58

Let \(Z\) be the set of possible group labels. For example, \(Z = 1,2,\ldots,k\) for some \(k\). For each label \(\ell \in Z\), define \[ e_\ell \triangleq \frac{1}{2m}\sum_{i,j\in N}a_{ij}\delta(z_i, \ell)\delta(z_j, \ell) \quad \text{and} \quad f_\ell \triangleq \frac{1}{2m}\sum_{i\in N} k_i \delta(z_i, \ell)\;. \]

We’re going to find copies of these expressions in \(Q\). The “trick” is to note that we can do fancy things with the \(\delta\)-function, like this: \[ \delta(z_i, z_j) = \sum_{\ell \in Z}\delta(z_i,\ell)\delta(z_j,\ell) \qquad(5.2)\] Inserting Equation 5.2 and doing some algebra, we find \[ \begin{aligned} Q(G, \mathbf{z}) &= \frac{1}{2m}\sum_{i,j \in N}\left[a_{ij} - \frac{k_ik_j}{2m}\right]\delta(z_i, z_j) \\ &= \frac{1}{2m}\sum_{i,j \in N}\left[a_{ij} - \frac{k_ik_j}{2m}\right]\sum_{\ell \in Z}\delta(z_i,\ell)\delta(z_j,\ell) \\ &= \frac{1}{2m}\sum_{\ell \in Z}\sum_{i,j \in N}\left[a_{ij}\delta(z_i,\ell)\delta(z_j,\ell) - \frac{k_ik_j}{2m}\delta(z_i,\ell)\delta(z_j,\ell)\right] \\ &= \sum_{\ell \in Z}\left[e_\ell - \frac{1}{(2m)^2}\sum_{i,j \in N}k_i\delta(z_i,\ell)k_j\delta(z_j,\ell)\right] \\ &= \sum_{\ell \in Z}\left[e_\ell - \frac{1}{(2m)^2}\sum_{i \in N}k_i\delta(z_i,\ell)\sum_{j \in N}k_j\delta(z_j,\ell)\right] \\ &= \sum_{\ell \in Z}\left[e_\ell - f_\ell^2\right]\;. \\ \end{aligned} \qquad(5.3)\] This compact expression for the modularity helps us interpret the expression in a new way. To make \(Q(G, \mathbf{z})\) large, we should try to make \(e_\ell\) large while keeping \(f_\ell\) small. This is a pretty common kind of balancing act in many optimization settings. What does it mean to do it here?

Take a minute to figure out why this is true.

Well, \(\sum_{\ell \in Z} e_\ell\) is the fraction of all edges that join nodes in the same group. A good clustering should make this large. Reasonable, right? Now, if we only wanted to make this part of the objective large, there’s an easy answer – put all nodes in the same group! Then, all edges join nodes in the same group, and so \(\sum_{\ell \in Z} e_\ell = 1\). Not very useful. The second term gives us a hand in this. Suppose now that we only focused on that term, and tried to make it small. That is, we’d like to find labels to make \(\sum_{\ell} f_\ell^2\) small. We know that each \(f_\ell\) is a fraction of a whole, and so \(\sum_{\ell}f_\ell = 1\). Consider the related optimization problem \[ \min_{\mathbf{x}\in \mathbb{R}^k} \sum_{\ell\in Z}x_\ell^2 \quad such \; that \quad \sum_{\ell \in Z}x_\ell = 1\;. \] We can solve this problem using Lagrange multipliers, obtaining the solution \(x_\ell = 1/k\) for each \(\ell\). That is, if we focused on making the second term in the modularity small, what we would do is aim for clusters that have approximately equal sizes, where “size” here means “number of edges attached to nodes in this community.”

So, to summarize, Equation 5.3 says that the modularity maximization problem is, roughly:

  1. Try to arrange the labels so that many edges are contained within the same cluster but
  2. Try to make all the groups roughly the same size – none too big nor too small.

This can be done by choosing the labeling so that \(f_\ell \approx f_{\ell'}\) for any choice of groups \(\ell\) and \(\ell'\).

5.2 Resolution Limit

This tradeoff—make clusters with lots of edges in them, but don’t let the cluster sizes be too different—offers us a useful persepctive on an important algorithmic limitation of the modularity objective. This is the resolution limit. The resolution limit in modularity maximization is a phenomenon in which maximization of the objective function Equation 5.1 is not able to find communities when they are small in comparison to the overall size of the network.

Fortunato and Barthelemy (2007)

Let’s work an example . Consider Newman’s case in which we have a network consisting of one large connected component and two \(k\)-cliques, connected by a single edge. We are going to argue that, if \(k\) is too small, then modularity maximization would rather view these two cliques as a single community, rather than separate ones. To see this, we can compare the terms \(e_\ell\) and \(f_\ell\) as they appear in the modularity in each case.

For a different argument with the same punchline, see Newman 14.2

Consider first the case in which the two cliques are in separate communities, which I’ll call \(u\) and \(v\). If we then combine the cliques into a single community, called \(w\), then the change in modularity according to Equation 5.3 is \[ \Delta Q = e_w - (e_u + e_v) - (f_w^2 - (f_u^2 + f_v^2)) \] Modularity maximization prefers to merge the two clusters if \(\Delta Q > 0\) and prefers to keep them seperate if \(\Delta Q < 0\).

The first term is direct to calculate: if we merge the two clusters, there’s exactly one new edge that now lives in a single cluster. The formula for \(e_2\) counts each edge twice. So, \(e_w - (e_u + e_v) = \frac{2}{2m}\).

The second term will require some more detailed algebra. First, since the cliques are the same size and in separate clusters, we have \[ f_u = f_v = \frac{1}{2m}(k^2 - k + 1) \triangleq \frac{1}{2m}s\; , \] where we have defined \(s = k^2 - k + 1\). The \(k^2 - k\) counts all the edges in the clique (twice), and the \(+1\) counts the edge that joins the two cliques. On the other hand, \(f_w = f_v + f_u = 2f_u\), since cluster \(w\) has all the edges from clusters \(u\) and \(v\). So, the inequality \(\Delta Q > 0\) reads \[ \frac{2}{2m} - 4f_u^2 + (f_u^2 + f_u^2) = \frac{1}{2m} - 2f_u^2 > 0\;. \] Inserting the definition of \(f_u\), this becomes \[ \frac{2}{2m} > \frac{1}{(2m)^2}s^2 \] or \[ 2m > s^2 = (k^2 - k + 1)^2\;. \]

This says that \(\Delta Q > 0\) if and only if \(s < \sqrt{2m}\). In other words, if \(s\) is too small relative to \(m\), modularity maximization actually prefers to merge the two cliques into one community, even if they “clearly” should be different.

How small do the clusters have to be before we start running into the resolution limit? Consider a graph with \(n = 5 \times 10^6\) nodes and mean degree \(c = 20\), which gives \(2m = nc = 10^8\).

Using the approximation \(s = k^2\), we find that the resolution limit occurs at \(k < \sqrt[4]{2m} = \sqrt[4]{10^8} = 10^2\). So, in this example, we would expect to be unable to find any communities smaller than 100 nodes using modularity maximization.

5.3 Louvain Algorithm

Our discussion of the Louvain algorithm in lecture closely follows Newman 14.2.5.