The Short Story of Bregman Information for Measuring Segregation
This is a post for folks who are interested in some of the most important mathematical properties of Bregman divergences. Bregman divergences offer a flexible platform from which to derive measures of diversity and segregation, and folks who wish to engineer such measures may benefit from understanding some useful properties.
The Big Picture
A Bregman divergence is a measure of difference between two vectors “through the lens” of a convex function \(f\). A Bregman information is a measure of how variable a set of vectors is, again through the lens of \(f\). Many methods for measuring segregation from the sociology literature can be expressed simply in terms of Bregman informations. These include:
- The Information Theory Index of Theil and Finizza (1971) and Reardon and O’Sullivan (2004).
- The Ordinal Information Theory, Variation Ratio, and Square Root Indices of Reardon (2009).
- The Generalized Neighborhood Sorting Index of Jargowsky and Kim (2005).
Understanding the mathematical properties of Bregman information is very helpful because it lets us unify the properties of all of these indices at once. Two properties in particular are extremely helpful in multiscale analysis.
- Termwise interpretation (Theorem 4): a Bregman information can be interpreted as a difference between a measure of local diversity and global diversity. This makes Bregman informations extremely natural as quantifications of certain concepts of segregation.
- Chain rule (Theorem 5): when coarse-graining data, a Bregman information decomposes into terms corresponding to the information contained in the coarse-graining and the information left over. This makes Bregman informations extremely amenable to multiscale spatial analysis, and allows them to quantify the significance of spatial boundaries in separating groups with different demographic features.
Additionally, Bregman informations are related to information geometries, which turn out to be one useful route for incorporating space into our analytical toolbox. In particular, the Bregman information can be used to quantify how quickly demographics shift in space. This turns out to be because they are related to the curvature of a so-called information manifold. We won’t discuss this point too much here, but there’s much more discussion in Chodrow (2017b).
Background
The primary audience for these notes consists of folks with training in math and/or machine learning theory. I’ll assume familiarity with:
- The gradient operator \(\nabla\).
- Joint, marginal, and conditional probability distributions on discrete sets.
- Convex functions.
Prior background in Shannon information theory is helpful but not assumed. I tell this story in my preferred sequence in Chodrow (2017a).
Introducing Bregman Divergences
Given a convex, differentiable function \(f\) defined on a convex set \(\mathcal{P}\), the Bregman divergence generated by \(f\) is \[ d_f(p, q) = f(p) - f(q) - \langle \nabla f(q), p - q\rangle\;. \] You can think of \(d_f\) as a function that compares the two elements \(p\) and \(q\) of \(\mathcal{P}\). Bregman divergences are not distances, but they do have some similar properties. Here are some of them:
Theorem 1 tells us that we can think of \(d_f\) as a comparison operator, but it is important to remember that it is not in general a metric on \(\mathcal{P}\). Proofs for Theorem 1 are outlined on Wikipedia.
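To make \(d_f\) concrete, here is a minimal numerical sketch in Python (the helper names are mine, not from any library) showing two classic special cases: \(f(p) = \lVert p \rVert^2\) generates the squared Euclidean distance, and the negative entropy \(f(p) = \sum_i p_i \log p_i\) generates the KL divergence on the probability simplex.

```python
import numpy as np

def bregman_divergence(f, grad_f, p, q):
    """d_f(p, q) = f(p) - f(q) - <grad f(q), p - q>."""
    return f(p) - f(q) - np.dot(grad_f(q), p - q)

# f(p) = ||p||^2 generates the squared Euclidean distance.
sq_norm      = lambda p: np.dot(p, p)
grad_sq_norm = lambda p: 2 * p

# f(p) = sum_i p_i log p_i (negative entropy) generates KL(p || q) on the simplex.
neg_entropy      = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda p: np.log(p) + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(bregman_divergence(sq_norm, grad_sq_norm, p, q), np.sum((p - q) ** 2))
print(bregman_divergence(neg_entropy, grad_neg_entropy, p, q), np.sum(p * np.log(p / q)))
```

In each case the two printed numbers agree, since the Bregman construction recovers the familiar formula exactly.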
Bregman Information
Motivation From Segregation Measurement
What does it mean to say that a city, for example, is segregated by class? One natural image is of a city separated by a line, with high-wealth individuals on one side and low-wealth individuals on another. One framing of this situation is in terms of information and correlation:
- Knowing where someone lives makes you more able to guess their wealth. (“They live on the east side, have you seen those houses?”)
- Knowing someone’s wealth makes you more able to guess where they live. (“Folks with that salary tend to settle on the east side.”)
The concept of Bregman information can be used to make precise what it means for one attribute (like a residential location) to give information about another attribute (like wealth).
Information As Approximation
Let \(\mathcal{X}\) and \(\mathcal{Y}\) be two finite sets. Let \(p_{XY}\) be a probability distribution on \(\mathcal{X}\times \mathcal{Y}\). That is, if I pick a random object from \(\mathcal{X}\times \mathcal{Y}\), the probability that this object has property \(x \in \mathcal{X}\) and \(y \in \mathcal{Y}\) is \(p_{XY}(x,y)\). Concretely, think of \(X\) as being a demographic feature like wealth and \(Y\) as being a spatial location. Then, \(p_{XY}(\text{high wealth}, \mathrm{Providence})\) is the fraction of all people who have high wealth and live in Providence. We’ll develop an information measure to quantify the extent to which \(X\) and \(Y\) are dependent or correlated. I’ll treat the extent of dependence or correlation as an operating definition of “demographic spatial segregation.”
We develop the idea of \(X\) and \(Y\) as being dependent by considering the conditional distributions \(p_{X|y}(x) = \frac{p_{XY}(x,y)}{p_Y(y)}\). Here, \(p_Y(y) = \sum_{x\in \mathcal{X}} p_{XY}(x,y)\) is the marginal distribution of \(Y\). You can think of \(p_{X|y}(x)\) as the probability that an individual in location \(y\) has demographic attribute \(x\). The distribution \(p_{X|y}\) is an element of \(\mathcal{P}_{\mathcal{X}}\), the set of probability distributions over \(\mathcal{X}\). We have one of those distributions for each possible choice of \(y \in \mathcal{Y}\). If these distributions are very different – if they depend strongly on the choice of \(y\) – then knowing \(y\) would make a big difference in what we know about \(x\). We would say that \(X\) and \(Y\) are highly dependent. How do we measure how different the collection of distributions \(\{p_{X|y}\}\) is? A standard, two-step approach is:
- First, find a point \(\bar{p}_X\) that best approximates all of the conditional distributions \(p_{X|y}\).
- Second, compute the average divergence of the conditionals from \(\bar{p}_X\): \[ I_f(X:Y;\bar{p}_X) = \sum_{y \in \mathcal{Y}} p_Y(y) d_f(p_{X|y}, \bar{p}_X) \] This is a gnarly expression, but you can think of it as asking: on average, how far (in the sense of the divergence \(d_f\)) is a typical local distribution \(p_{X|y}\) from its approximator \(\bar{p}_X\)? If the answer is small, then all the local distributions are close to \(\bar{p}_X\), and therefore can’t be too far from each other, either.
As it turns out, there’s a single best choice of \(\bar{p}_X\): the marginal distribution \(p_X = \sum_{y \in \mathcal{Y}} p_Y(y)\, p_{X|y}\). This result is Proposition 1 of Banerjee et al. (2005). With this choice, we define the Bregman information \[ I_f(X:Y) = \sum_{y \in \mathcal{Y}} p_Y(y)\, d_f(p_{X|y}, p_X)\;. \]
Proof. Recall that \(I_f(X:Y) = \sum_{y \in \mathcal{Y}} p_Y(y) d_f(p_{X|y}, p_X)\). Each divergence in this sum is nonnegative, and \(p_Y(y) > 0\) for each \(y \in \mathcal{Y}\) by hypothesis, so \(I_f(X:Y) = 0\) forces \(d_f(p_{X|y}, p_X) = 0\) for each \(y\). If \(f\) is strictly convex, then by Theorem 1, \(d_f(p,q) = 0\) iff \(p = q\). So, we must have \(p_{X|y} = p_X\) for each \(y\), which implies that \(X\) and \(Y\) are independent.
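As a quick numerical sanity check, here’s a short sketch in Python (the helper names are my own) that computes \(I_f(X:Y)\) with \(f\) taken to be the negative entropy, once for a joint distribution in which \(X\) and \(Y\) are independent and once for a joint distribution in which wealth concentrates in one location. The first value comes out to (numerically) zero and the second is strictly positive, as the results above predict.

```python
import numpy as np

def neg_entropy(p):
    return np.sum(p * np.log(p))

def d_f(p, q):
    # Bregman divergence generated by negative entropy, i.e. KL(p || q) on the simplex.
    return neg_entropy(p) - neg_entropy(q) - np.dot(np.log(q) + 1, p - q)

def bregman_information(p_xy):
    """I_f(X:Y) = sum_y p_Y(y) d_f(p_{X|y}, p_X); rows indexed by x, columns by y."""
    p_y = p_xy.sum(axis=0)            # marginal of Y
    p_x = p_xy.sum(axis=1)            # marginal of X (the best approximator)
    cond = p_xy / p_y                 # column j is p_{X|y_j}
    return sum(p_y[j] * d_f(cond[:, j], p_x) for j in range(p_xy.shape[1]))

# Independent: p_XY(x, y) = p_X(x) p_Y(y)  =>  I_f = 0.
p_independent = np.outer([0.6, 0.4], [0.3, 0.7])

# Dependent: wealth concentrates in the second location  =>  I_f > 0.
p_dependent = np.array([[0.25, 0.05],
                        [0.05, 0.65]])

print(bregman_information(p_independent))   # ~0.0
print(bregman_information(p_dependent))     # strictly positive
```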
There is another formula for the Bregman information that helps to interpret its meaning. This formula is the version emphasized in this post, which expresses global diversity as a sum of local diversity and global segregation.
Proof. We can directly write out \[ \begin{aligned} I_f(X:Y) &= \sum_{y \in \mathcal{Y}} p_Y(y) d_f(p_{X|y}, p_X) \\ &= \sum_{y \in \mathcal{Y}} p_Y(y) \left[f(p_{X|y}) - f(p_X) - \langle \nabla f(p_X), p_{X|y} - p_X\rangle \right] \\ \end{aligned} \] Next, we use that \(\sum_{y \in \mathcal{Y}} p_Y(y) p_X = \sum_{y \in \mathcal{Y}} p_Y(y) p_{X|y} = p_X\), which causes the inner product to vanish. The only remaining terms are \[ I_f(X:Y) = \sum_{y \in \mathcal{Y}} p_Y(y) f(p_{X|y}) - f(p_X)\;, \] completing the proof.
Theorem 4 is often used as the definition of the mutual information in the context of Shannon information theory; indeed, it’s the definition that I emphasize in my previous post on Shannon information theory. The interpretation of this theorem as relating global segregation, global diversity, and local diversity translates directly.
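Here’s a brief numerical illustration of Theorem 4 (another sketch, with names of my own choosing), this time with \(f(p) = \lVert p \rVert^2\), whose Bregman divergence is the squared Euclidean distance: the definition of \(I_f(X:Y)\) as an average divergence and the termwise formula from Theorem 4 return the same number.

```python
import numpy as np

f   = lambda p: np.dot(p, p)               # f(p) = ||p||^2
d_f = lambda p, q: np.dot(p - q, p - q)    # its Bregman divergence: squared Euclidean distance

p_xy = np.array([[0.25, 0.05],             # rows: x (e.g. wealth), columns: y (location)
                 [0.05, 0.65]])
p_y  = p_xy.sum(axis=0)
p_x  = p_xy.sum(axis=1)
cond = p_xy / p_y                          # column j is p_{X|y_j}

# Definition: average divergence of the local distributions from the global one.
lhs = sum(p_y[j] * d_f(cond[:, j], p_x) for j in range(len(p_y)))

# Theorem 4: average local value of f, minus the global value of f.
rhs = sum(p_y[j] * f(cond[:, j]) for j in range(len(p_y))) - f(p_x)

print(lhs, rhs)    # the two numbers agree up to floating-point error
```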
The Chain Rule
From a practical perspective, one of the most important properties of the Bregman information is the chain rule. We’ll now state and prove it.
Let \(\mathcal{Z}\) be a finite set, and let \(g: \mathcal{Y}\rightarrow \mathcal{Z}\). Let \(Z = g(Y)\). We write \(p_{X|z}(x) = \frac{\sum_{y \in g^{-1}(z)} p_{XY}(x,y)}{\sum_{y \in g^{-1}(z)} p_Y(y)}\) for the distribution of \(X\) conditioned on the event \(Z = z\). We similarly write \(p_{Y|z}\) for the distribution of \(Y\) conditioned on \(Z = z\).
You can think of \(I_f(X:Y|z) = \sum_{y \in g^{-1}(z)} p_{Y|z}(y)\, d_f(p_{X|y}, p_{X|z})\) as the information that \(Y\) contains about \(X\) in a world in which \(g(Y)\) is always equal to \(z\). The conditional information \(I_f(X:Y|Z) = \sum_{z \in \mathcal{Z}} p_Z(z)\, I_f(X:Y|z)\) is the average of these across values of \(z\).
Proof. We’ll start by writing
\[ \begin{aligned} I_f(X:Y) &= \sum_{y \in \mathcal{Y}} p_Y(y) d_f(p_{X|y}, p_X) \\ &= \sum_{z \in \mathcal{Z}} \sum_{y \in g^{-1}(z)} p_Y(y) d_f(p_{X|y}, p_X) \\ &= \sum_{z \in \mathcal{Z}} \sum_{y \in g^{-1}(z)} p_Y(y) \left[d_f(p_{X|y}, p_{X|z}) + d_f(p_{X|z}, p_X) + \langle \nabla f(p_{X|z}) - \nabla f(p_X), p_{X|y} - p_{X|z}\rangle \right] \end{aligned} \]
This last line is the “law of cosines” for Bregman divergences and follows by just rearranging some terms. Let’s break this final line down into three terms.
First,
\[ \begin{aligned} \sum_{z \in \mathcal{Z}} \sum_{y \in g^{-1}(z)} p_Y(y) d_f(p_{X|y}, p_{X|z}) &= \sum_{z \in \mathcal{Z}} p_{Z}(z) \sum_{y \in g^{-1}(z)} \frac{p_Y(y)}{p_Z(z)} d_f(p_{X|y}, p_{X|z}) \\ &= \sum_{z \in \mathcal{Z}} p_{Z}(z) \sum_{y \in g^{-1}(z)} p_{Y|z}(y) d_f(p_{X|y}, p_{X|z}) \\ &= I_f(X:Y|Z)\;. \end{aligned} \]
Second,
\[ \begin{aligned} \sum_{z \in \mathcal{Z}} \sum_{y \in g^{-1}(z)} p_Y(y)d_f(p_{X|z}, p_X) = \sum_{z \in \mathcal{Z}} p_Z(z)d_f(p_{X|z}, p_X) = I_f(X:Z)\;. \end{aligned} \]
Third and finally, the very last term is equal to 0 because, for each \(z \in \mathcal{Z}\),
\[ \begin{aligned} \sum_{y \in g^{-1}(z)}p_Y(y)\,p_{X|y}(x) = p_Z(z)\,p_{X|z}(x) \end{aligned} \] for all \(x\), so that \(\sum_{y \in g^{-1}(z)} p_Y(y)\left(p_{X|y} - p_{X|z}\right) = 0\) and the inner product vanishes.
So, we have shown that \(I_f(X:Y) = I_f(X:Z) + I_f(X:Y|Z)\), completing the proof.
You can think of the chain rule as saying the following:
The information in \(Y\) about \(X\) is equal to the information that \(Z\) holds about \(X\), plus the additional information that \(Y\) holds in a world in which you already know \(Z\).
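Here’s a numerical check of the chain rule (again a sketch; the coarse-graining \(g\) and the helper names are just for illustration): four locations are grouped into two regions, and with \(f\) the negative entropy, \(I_f(X:Y)\) decomposes exactly into \(I_f(X:Z) + I_f(X:Y|Z)\).

```python
import numpy as np

f   = lambda p: np.sum(p * np.log(p))                          # negative entropy
d_f = lambda p, q: f(p) - f(q) - np.dot(np.log(q) + 1, p - q)  # KL(p || q) on the simplex

rng  = np.random.default_rng(0)
p_xy = rng.random((3, 4))            # 3 demographic categories, 4 locations
p_xy /= p_xy.sum()                   # normalize into a joint distribution

g = np.array([0, 0, 1, 1])           # coarse-graining: locations {0,1} -> region 0, {2,3} -> region 1

p_y    = p_xy.sum(axis=0)
p_x    = p_xy.sum(axis=1)
cond_y = p_xy / p_y                  # column j is p_{X|y_j}

def info(conds, weights, base):
    """sum_k weights[k] * d_f(conds[:, k], base)"""
    return sum(w * d_f(conds[:, k], base) for k, w in enumerate(weights))

I_xy = info(cond_y, p_y, p_x)                                   # I_f(X:Y)

# Coarse-grained quantities for Z = g(Y).
p_z    = np.array([p_y[g == z].sum() for z in (0, 1)])
cond_z = np.stack([p_xy[:, g == z].sum(axis=1) / p_z[z] for z in (0, 1)], axis=1)
I_xz   = info(cond_z, p_z, p_x)                                 # I_f(X:Z)

# Conditional information I_f(X:Y|Z): average over z of the within-region information.
I_xy_given_z = sum(
    p_z[z] * info(cond_y[:, g == z], p_y[g == z] / p_z[z], cond_z[:, z])
    for z in (0, 1)
)

print(I_xy, I_xz + I_xy_given_z)     # equal up to floating-point error
```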
Bregman Information Geometry
A Bregman divergence is not a metric. Theorem 1 highlights some ways in which a Bregman divergence is nevertheless like a metric, and there is another important one. In brief, a Bregman divergence between “nearby” vectors behaves like a squared Riemannian distance, and therefore allows us to do geometry. Here’s a concrete version of this idea:
Proof. Positive (semi)-definiteness of \(H(q)\) follows from standard results about the Hessians of convex functions. To prove the equation, we Taylor-expand \(d_f(p, q)\) in the first argument around the point \(q\), which we can do in an \(\epsilon\)-neighborhood for \(\epsilon\) chosen sufficiently small.
\[ d_f(p, q) = d_f(q, q) + \langle \nabla_1 d_f(q, q),p - q \rangle + \frac{1}{2} \langle p-q, \bar{H}(p-q)\rangle + O(\lVert p - q \rVert^3)\;. \]
Here, \(\nabla_1 d_f(q, q)\) indicates that derivatives are taken only in the first argument of \(d_f\). \(\bar{H}\) is the Hessian matrix of \(d_f\), again with derivatives taken only in the first argument.
Let’s evaluate terms. We have \(d_f(q, q) = 0\) by Theorem 1. Moving to the second term, we have \[ \begin{aligned} \langle \nabla_1 d_f(q, q), p-q \rangle = \langle \nabla f(q) - \nabla f(q), p-q \rangle = 0\;. \end{aligned} \]
It remains to calculate \(\bar{H}\). When taking two derivatives with respect to the first argument of \(d_f\), only the term corresponding to \(f(p)\) survives, and we therefore have \(\bar{H} = H(q)\), the Hessian of \(f\) evaluated at \(q\). Finally, noting that \(\lVert p - q \rVert < \epsilon\) by construction completes the argument.
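A quick numerical check of this quadratic approximation (a sketch, with \(f\) taken to be the negative entropy, so that \(H(q)\) is the diagonal matrix with entries \(1/q_i\)):

```python
import numpy as np

f   = lambda p: np.sum(p * np.log(p))                          # negative entropy
d_f = lambda p, q: f(p) - f(q) - np.dot(np.log(q) + 1, p - q)  # KL(p || q) on the simplex
H   = lambda q: np.diag(1.0 / q)                               # Hessian of f at q

q = np.array([0.2, 0.3, 0.5])
for eps in (1e-1, 1e-2, 1e-3):
    v = eps * np.array([1.0, -2.0, 1.0])   # perturbation summing to zero, so p stays on the simplex
    p = q + v
    print(eps, d_f(p, q), 0.5 * v @ H(q) @ v)   # the last two columns agree to roughly O(eps^3)
```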
One reason that Theorem 6 is so good is that it allows us to actually get “real” metrics from divergences. From the simplest perspective, we know that, for \(p\) and \(q\) close, \(d_f(p,q) \approx \frac{1}{2} \langle p-q,H(q)(p-q)\rangle\), which is a squared distance under a positive-definite quadratic form whenever \(f\) is strictly convex with nonsingular Hessian. On the other hand, we can also use the perspective of Riemannian geometry, under which \(H\) plays the role of an ambient metric tensor. Many problems in statistical inference can be framed in terms of exploring the structure of a manifold of probability distributions. Any smooth parameterized family of distributions defines such a manifold; for example, the Poisson distribution \(p_K(k;\lambda) = \frac{e^{-\lambda}\lambda^k}{k!}\) is parameterized by a single parameter \(\lambda\). We can compare these distributions using the ambient metric, and the result is a Riemannian metric induced on the 1-dimensional manifold of distributions.
For more on information geometry, see Nielsen (2020) or Amari (2016).
Information Geometry and Spatial Analysis
The information geometry point can feel pretty abstract, but there’s a specific reason to be interested in this perspective when studying spatial segregation. In particular, consider the following idealized model of Census data. Let \(S\) be a connected, closed region in \(\mathbb{R}^2\). For each point \(s \in S\), we have a probability distribution \(p_{X|s}\) over some set of demographic variables \(\mathcal{X}\).
We now impose the assumption that \(p_{X|s}\) is smooth as a function of \(s\). This allows us to apply Theorem 6, which in turn allows us to treat \(S\) as a Riemannian manifold and do geometric calculations in a metric that is adapted to the demographic structure. This lets us do things like measure “demographic distance” between points in space, and also enables some novel approaches to characterizing the spatial scale of segregation (Chodrow 2017b).
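To make “demographic distance” a bit more tangible, here is a toy sketch of the idea (my own illustrative construction under the assumptions above, not the specific method of Chodrow 2017b): a two-group demographic field varies smoothly along a one-dimensional transect, and its Riemannian length under the Hessian metric of the negative entropy quantifies how much the demographics shift along the path.

```python
import numpy as np

# A smooth two-group demographic field along a transect s in [0, 1]:
# the share of group A rises from 20% to 80% across the transect.
def p_field(s):
    a = 0.2 + 0.6 * s
    return np.array([a, 1.0 - a])

# Hessian metric of f(p) = sum_i p_i log p_i, i.e. H(p) = diag(1 / p_i).
def metric(p):
    return np.diag(1.0 / p)

# Riemannian length of the transect: integrate sqrt( dp^T H(p) dp ) along the path.
s_grid = np.linspace(0.0, 1.0, 1001)
length = 0.0
for s0, s1 in zip(s_grid[:-1], s_grid[1:]):
    dp  = p_field(s1) - p_field(s0)          # displacement of the local demographics
    mid = p_field(0.5 * (s0 + s1))
    length += np.sqrt(dp @ metric(mid) @ dp)

print(length)   # total "demographic distance" traversed along the transect
```

A transect along which the demographics barely change would have a much smaller length under this metric, even if its Euclidean length in space were the same.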