The Short Story of Bregman Information for Measuring Segregation
This is a post for folks who are interested in some of the most important mathematical properties of Bregman divergences. Bregman divergences offer a flexible platform from which to derive measures of diversity and segregation, and folks who wish to engineer such measures may benefit from understanding some useful properties.
The Big Picture
A Bregman divergence is a measure of difference between two vectors “through the lens” of a convex function \(f\). A Bregman information is a measure of how variable a set of vectors is, again through the lens of \(f\). Many methods for measuring segregation from the sociology literature can be expressed simply in terms of Bregman informations. These include:
- The Information Theory Index of Theil and Finizza (1971) and Reardon and O’Sullivan (2004).
- The Ordinal Information Theory, Variation Ratio, and Square Root Indices of Reardon (2009).
- The Generalized Neighborhood Sorting Index of Jargowsky and Kim (2005).
Understanding the mathematical properties of Bregman information is very helpful because it lets us unify the properties of all of these indices at once. Two properties in particular are extremely helpful in multiscale analysis.
- Termwise interpretation (Theorem 4): a Bregman information can be interpreted as a difference between a measure of local diversity and global diversity. This makes Bregman informations extremely natural as quantifications of certain concepts of segregation.
- Chain rule (Theorem 5): when coarse-graining data, a Bregman information decomposes into terms corresponding to the information contained in the coarse-graining and the information left over. This makes Bregman informations extremely amenable to multiscale spatial analysis, and allows them to quantify the significance of spatial boundaries in separating groups with different demographic features.
Additionally, Bregman informations are related to information geometries, which turn out to be one useful route for incorporating space into our analytical toolbox. In particular, the Bregman information can be used to quantify how quickly demographics shift in space. This turns out to be because they are related to the curvature of a so-called information manifold. We won’t discuss this point too much here, but there’s much more discussion in Chodrow (2017b).
Background
The primary audience for these notes consists of folks with training in math and/or machine learning theory. I’ll assume familiarity with:
- The gradient operator \(\nabla\).
- Joint, marginal, and conditional probability distributions on discrete sets.
- Convex functions.
Prior background in Shannon information theory is helpful but not assumed. I tell this story in my preferred sequence in Chodrow (2017a).
Introducing Bregman Divergences
Given a convex, differentiable function \(f\) defined on a convex set \(\mathcal{P}\), the Bregman divergence generated by \(f\) is \[ d_f(p, q) = f(p) - f(q) - \langle \nabla f(q), p - q\rangle\;. \] You can think of \(d_f\) as a function that compares the two elements \(p\) and \(q\) of \(\mathcal{P}\). Bregman divergences are not distances, but they do have some similar properties. Here are some of them:
Theorem 1 tells us that we can think of \(d_f\) as a comparison operator, but it is important to remember that it is not in general a metric on \(\mathcal{P}\). Proofs for Theorem 1 are outlined on Wikipedia.
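To make \(d_f\) concrete, here is a minimal numerical sketch in Python (the helper names are mine, not from any library) showing two classic special cases: \(f(p) = \lVert p \rVert^2\) generates the squared Euclidean distance, and the negative entropy \(f(p) = \sum_i p_i \log p_i\) generates the KL divergence on the probability simplex.

```python
import numpy as np

def bregman_divergence(f, grad_f, p, q):
    """d_f(p, q) = f(p) - f(q) - <grad f(q), p - q>."""
    return f(p) - f(q) - np.dot(grad_f(q), p - q)

# f(p) = ||p||^2 generates the squared Euclidean distance.
sq_norm      = lambda p: np.dot(p, p)
grad_sq_norm = lambda p: 2 * p

# f(p) = sum_i p_i log p_i (negative entropy) generates KL(p || q) on the simplex.
neg_entropy      = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda p: np.log(p) + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(bregman_divergence(sq_norm, grad_sq_norm, p, q), np.sum((p - q) ** 2))
print(bregman_divergence(neg_entropy, grad_neg_entropy, p, q), np.sum(p * np.log(p / q)))
```

In each case the two printed numbers agree, since the Bregman construction recovers the familiar formula exactly.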
Bregman Information
Motivation From Segregation Measurement
What does it mean to say that a city, for example, is segregated by class? One natural image is of a city separated by a line, with high-wealth individuals on one side and low-wealth individuals on another. One framing of this situation is in terms of information and correlation:
- Knowing where someone lives makes you more able to guess their wealth. (“They live on the east side, have you seen those houses?”)
- Knowing someone’s wealth makes you more able to guess where they live. (“Folks with that salary tend to settle on the east side.”)
The concept of Bregman information can be used to make precise what it means for one attribute (like a residential location) to give information about another attribute (like wealth).
Information As Approximation
Let \(\mathcal{X}\) and \(\mathcal{Y}\) be two finite sets. Let \(p_{XY}\) be a probability distribution on \(\mathcal{X}\times \mathcal{Y}\). That is, if I pick a random object from \(\mathcal{X}\times \mathcal{Y}\), the probability that this object has property \(x \in \mathcal{X}\) and \(y \in \mathcal{Y}\) is \(p_{XY}(x,y)\). Concretely, think of \(X\) as being a demographic feature like wealth and \(Y\) as being a spatial location. Then, \(p_{XY}(\text{high wealth}, \mathrm{Providence})\) is the fraction of all people who have high wealth and live in Providence. We’ll develop an information measure to quantify the extent to which \(X\) and \(Y\) are dependent or correlated. I’ll treat the extent of dependence or correlation as an operating definition of “demographic spatial segregation.”
We develop the idea of \(X\) and \(Y\) as being dependent by considering the conditional distributions \(p_{X|y}(x) = \frac{p_{XY}(x,y)}{p_Y(y)}\). Here, \(p_Y(y) = \sum_{x\in \mathcal{X}} p_{XY}(x,y)\) is the marginal distribution of \(Y\). You can think of \(p_{X|y}(x)\) as the probability that an individual in location \(y\) has demographic attribute \(x\). The distribution \(p_{X|y}\) is an element of \(\mathcal{P}_{\mathcal{X}}\), the set of probability distributions over \(\mathcal{X}\). We have one of those distributions for each possible choice of \(y \in \mathcal{Y}\). If these distributions are very different – if they depend strongly on the choice of \(y\) – then knowing \(y\) would make a big difference in what we know about \(x\). We would say that \(X\) and \(Y\) are highly dependent. How do we measure how different the collection of distributions \(\{p_{X|y}\}\) is? A standard, two-step approach is:
- First, find a point \(\bar{p}_X\) that best approximates all of the conditional distributions \(p_{X|y}\).
- Second, compute the average divergence of the conditionals from \(\bar{p}_X\): \[ I_f(X:Y;\bar{p}_X) = \sum_{y \in \mathcal{Y}} p_Y(y) d_f(p_{X|y}, \bar{p}_X) \] This is a gnarly expression, but you can think of it as asking: on average, how far (in the sense of the divergence \(d_f\)) is a typical local distribution \(p_{X|y}\) from its approximator \(\bar{p}_X\)? If the answer is small, then all the local distributions are close to \(\bar{p}_X\), and therefore can’t be too far from each other, either.
As it turns out, there’s a single best choice of \(\bar{p}_X\): the marginal distribution \(p_X = \sum_{y \in \mathcal{Y}} p_Y(y)\, p_{X|y}\). This result is Proposition 1 of Banerjee et al. (2005). With this choice, we define the Bregman information \[ I_f(X:Y) = \sum_{y \in \mathcal{Y}} p_Y(y)\, d_f(p_{X|y}, p_X)\;. \]
Proof. Recall that \(I_f(X:Y) = \sum_{y \in \mathcal{Y}} p_Y(y) d_f(p_{X|y}, p_X)\). Each divergence in this sum is nonnegative, and \(p_Y(y) > 0\) for each \(y \in \mathcal{Y}\) by hypothesis, so \(I_f(X:Y) = 0\) forces \(d_f(p_{X|y}, p_X) = 0\) for each \(y\). If \(f\) is strictly convex, then by Theorem 1, \(d_f(p,q) = 0\) iff \(p = q\). So, we must have \(p_{X|y} = p_X\) for each \(y\), which implies that \(X\) and \(Y\) are independent.
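As a quick numerical sanity check, here’s a short sketch in Python (the helper names are my own) that computes \(I_f(X:Y)\) with \(f\) taken to be the negative entropy, once for a joint distribution in which \(X\) and \(Y\) are independent and once for a joint distribution in which wealth concentrates in one location. The first value comes out to (numerically) zero and the second is strictly positive, as the results above predict.

```python
import numpy as np

def neg_entropy(p):
    return np.sum(p * np.log(p))

def d_f(p, q):
    # Bregman divergence generated by negative entropy, i.e. KL(p || q) on the simplex.
    return neg_entropy(p) - neg_entropy(q) - np.dot(np.log(q) + 1, p - q)

def bregman_information(p_xy):
    """I_f(X:Y) = sum_y p_Y(y) d_f(p_{X|y}, p_X); rows indexed by x, columns by y."""
    p_y = p_xy.sum(axis=0)            # marginal of Y
    p_x = p_xy.sum(axis=1)            # marginal of X (the best approximator)
    cond = p_xy / p_y                 # column j is p_{X|y_j}
    return sum(p_y[j] * d_f(cond[:, j], p_x) for j in range(p_xy.shape[1]))

# Independent: p_XY(x, y) = p_X(x) p_Y(y)  =>  I_f = 0.
p_independent = np.outer([0.6, 0.4], [0.3, 0.7])

# Dependent: wealth concentrates in the second location  =>  I_f > 0.
p_dependent = np.array([[0.25, 0.05],
                        [0.05, 0.65]])

print(bregman_information(p_independent))   # ~0.0
print(bregman_information(p_dependent))     # strictly positive
```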
There is another formula for the Bregman information that helps to interpret its meaning. This formula is the version emphasized in this post, which expresses global diversity as a sum of local diversity and global segregation.
Proof. We can directly write out \[ \begin{aligned} I_f(X:Y) &= \sum_{y \in \mathcal{Y}} p_Y(y) d_f(p_{X|y}, p_X) \\ &= \sum_{y \in \mathcal{Y}} p_Y(y) \left[f(p_{X|y}) - f(p_X) - \langle \nabla f(p_X), p_{X|y} - p_X\rangle \right] \\ \end{aligned} \] Next, we use that \(\sum_{y \in \mathcal{Y}} p_Y(y) p_X = \sum_{y \in \mathcal{Y}} p_Y(y) p_{X|y} = p_X\), which causes the inner product to vanish. The only remaining terms are \[ I_f(X:Y) = \sum_{y \in \mathcal{Y}} p_Y(y) f(p_{X|y}) - f(p_X)\;, \] completing the proof.
Theorem 4 is often used as the definition of the mutual information in the context of Shannon information theory; indeed, it’s the definition that I emphasize in my previous post on Shannon information theory. The interpretation of this theorem as relating global segregation, global diversity, and local diversity translates directly.
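Here’s a brief numerical illustration of Theorem 4 (another sketch, with names of my own choosing), this time with \(f(p) = \lVert p \rVert^2\), whose Bregman divergence is the squared Euclidean distance: the definition of \(I_f(X:Y)\) as an average divergence and the termwise formula from Theorem 4 return the same number.

```python
import numpy as np

f   = lambda p: np.dot(p, p)               # f(p) = ||p||^2
d_f = lambda p, q: np.dot(p - q, p - q)    # its Bregman divergence: squared Euclidean distance

p_xy = np.array([[0.25, 0.05],             # rows: x (e.g. wealth), columns: y (location)
                 [0.05, 0.65]])
p_y  = p_xy.sum(axis=0)
p_x  = p_xy.sum(axis=1)
cond = p_xy / p_y                          # column j is p_{X|y_j}

# Definition: average divergence of the local distributions from the global one.
lhs = sum(p_y[j] * d_f(cond[:, j], p_x) for j in range(len(p_y)))

# Theorem 4: average local value of f, minus the global value of f.
rhs = sum(p_y[j] * f(cond[:, j]) for j in range(len(p_y))) - f(p_x)

print(lhs, rhs)    # the two numbers agree up to floating-point error
```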
The Chain Rule
From a practical perspective, one of the most important properties of the Bregman information is the chain rule. We’ll now state and prove it.
Let \(\mathcal{Z}\) be a finite set, and let \(g: \mathcal{Y}\rightarrow \mathcal{Z}\). Let \(Z = g(Y)\). We write \(p_{X|z}(x) = \frac{\sum_{y \in g^{-1}(z)} p_{XY}(x,y)}{\sum_{y \in g^{-1}(z)} p_Y(y)}\) for the distribution of \(X\) conditioned on the event \(Z = z\). We similarly write \(p_{Y|z}\) for the distribution of \(Y\) conditioned on \(Z = z\).
You can think of \(I_f(X:Y|z) = \sum_{y \in g^{-1}(z)} p_{Y|z}(y)\, d_f(p_{X|y}, p_{X|z})\) as the information that \(Y\) contains about \(X\) in a world in which \(g(Y)\) is always equal to \(z\). The conditional information \(I_f(X:Y|Z) = \sum_{z \in \mathcal{Z}} p_Z(z)\, I_f(X:Y|z)\) is the average of these across values of \(z\).
Proof. We’ll start by writing
\[ \begin{aligned} I_f(X:Y) &= \sum_{y \in \mathcal{Y}} p_Y(y) d_f(p_{X|y}, p_X) \\ &= \sum_{z \in \mathcal{Z}} \sum_{y \in g^{-1}(z)} p_Y(y) d_f(p_{X|y}, p_X) \\ &= \sum_{z \in \mathcal{Z}} \sum_{y \in g^{-1}(z)} p_Y(y) \left[d_f(p_{X|y}, p_{X|z}) + d_f(p_{X|z}, p_X) + \langle \nabla f(p_{X|z}) - \nabla f(p_X), p_{X|y} - p_{X|z}\rangle \right] \end{aligned} \]
This last line is the “law of cosines” for Bregman divergences and follows by just rearranging some terms. Let’s break this final line down into three terms.
First,
\[ \begin{aligned} \sum_{z \in \mathcal{Z}} \sum_{y \in g^{-1}(z)} p_Y(y) d_f(p_{X|y}, p_{X|z}) &= \sum_{z \in \mathcal{Z}} p_{Z}(z) \sum_{y \in g^{-1}(z)} \frac{p_Y(y)}{p_Z(z)} d_f(p_{X|y}, p_{X|z}) \\ &= \sum_{z \in \mathcal{Z}} p_{Z}(z) \sum_{y \in g^{-1}(z)} p_{Y|z}(y) d_f(p_{X|y}, p_{X|z}) \\ &= I_f(X:Y|Z)\;. \end{aligned} \]
Second,
\[ \begin{aligned} \sum_{z \in \mathcal{Z}} \sum_{y \in g^{-1}(z)} p_Y(y)d_f(p_{X|z}, p_X) = \sum_{z \in \mathcal{Z}} p_Z(z)d_f(p_{X|z}, p_X) = I_f(X:Z)\;. \end{aligned} \]
Third and finally, the very last term is equal to 0 because, for each \(z \in \mathcal{Z}\),
\[ \begin{aligned} \sum_{y \in g^{-1}(z)}p_Y(y)\,p_{X|y}(x) = p_Z(z)\,p_{X|z}(x) \end{aligned} \] for all \(x\), so that \(\sum_{y \in g^{-1}(z)} p_Y(y)\left(p_{X|y} - p_{X|z}\right) = 0\) and the inner product vanishes.
So, we have shown that \(I_f(X:Y) = I_f(X:Z) + I_f(X:Y|Z)\), completing the proof.
You can think of the chain rule as saying the following:
The information in \(Y\) about \(X\) is equal to the information that \(Z\) holds about \(X\), plus the additional information that \(Y\) holds in a world in which you already know \(Z\).
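Here’s a numerical check of the chain rule (again a sketch; the coarse-graining \(g\) and the helper names are just for illustration): four locations are grouped into two regions, and with \(f\) the negative entropy, \(I_f(X:Y)\) decomposes exactly into \(I_f(X:Z) + I_f(X:Y|Z)\).

```python
import numpy as np

f   = lambda p: np.sum(p * np.log(p))                          # negative entropy
d_f = lambda p, q: f(p) - f(q) - np.dot(np.log(q) + 1, p - q)  # KL(p || q) on the simplex

rng  = np.random.default_rng(0)
p_xy = rng.random((3, 4))            # 3 demographic categories, 4 locations
p_xy /= p_xy.sum()                   # normalize into a joint distribution

g = np.array([0, 0, 1, 1])           # coarse-graining: locations {0,1} -> region 0, {2,3} -> region 1

p_y    = p_xy.sum(axis=0)
p_x    = p_xy.sum(axis=1)
cond_y = p_xy / p_y                  # column j is p_{X|y_j}

def info(conds, weights, base):
    """sum_k weights[k] * d_f(conds[:, k], base)"""
    return sum(w * d_f(conds[:, k], base) for k, w in enumerate(weights))

I_xy = info(cond_y, p_y, p_x)                                   # I_f(X:Y)

# Coarse-grained quantities for Z = g(Y).
p_z    = np.array([p_y[g == z].sum() for z in (0, 1)])
cond_z = np.stack([p_xy[:, g == z].sum(axis=1) / p_z[z] for z in (0, 1)], axis=1)
I_xz   = info(cond_z, p_z, p_x)                                 # I_f(X:Z)

# Conditional information I_f(X:Y|Z): average over z of the within-region information.
I_xy_given_z = sum(
    p_z[z] * info(cond_y[:, g == z], p_y[g == z] / p_z[z], cond_z[:, z])
    for z in (0, 1)
)

print(I_xy, I_xz + I_xy_given_z)     # equal up to floating-point error
```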
Bregman Information Geometry
A Bregman divergence is not a metric. Theorem 1 highlights some ways in which a Bregman divergence is nevertheless like a metric, and there is another important one. In brief, a Bregman divergence between “nearby” vectors behaves like a squared Riemannian distance, and therefore allows us to do geometry. Here’s a concrete version of this idea:
Proof. Positive (semi)-definiteness of \(H(q)\) follows from standard results about the Hessians of convex functions. To prove the equation, we Taylor-expand \(d_f(p, q)\) in the first argument around the point \(q\), which we can do in an \(\epsilon\)-neighborhood for \(\epsilon\) chosen sufficiently small.
\[ d_f(p, q) = d_f(q, q) + \langle \nabla_1 d_f(q, q),p - q \rangle + \frac{1}{2} \langle p-q, \bar{H}(p-q)\rangle + O(\lVert p - q \rVert^3)\;. \]
Here, \(\nabla_1 d_f(q, q)\) indicates that derivatives are taken only in the first argument of \(d_f\). \(\bar{H}\) is the Hessian matrix of \(d_f\), again with derivatives taken only in the first argument.
Let’s evaluate terms. We have \(d_f(q, q) = 0\) by Theorem 1. Moving to the second term, we have \[ \begin{aligned} \langle \nabla_1 d_f(q, q), p-q \rangle = \langle \nabla f(q) - \nabla f(q), p-q \rangle = 0\;. \end{aligned} \]
It remains to calculate \(\bar{H}\). When taking two derivatives with respect to the first argument of \(d_f\), only the term corresponding to \(f(p)\) survives, and we therefore have \(\bar{H} = H(q)\), the Hessian of \(f\) evaluated at \(q\). Finally, noting that \(\lVert p - q \rVert < \epsilon\) by construction completes the argument.
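A quick numerical check of this quadratic approximation (a sketch, with \(f\) taken to be the negative entropy, so that \(H(q)\) is the diagonal matrix with entries \(1/q_i\)):

```python
import numpy as np

f   = lambda p: np.sum(p * np.log(p))                          # negative entropy
d_f = lambda p, q: f(p) - f(q) - np.dot(np.log(q) + 1, p - q)  # KL(p || q) on the simplex
H   = lambda q: np.diag(1.0 / q)                               # Hessian of f at q

q = np.array([0.2, 0.3, 0.5])
for eps in (1e-1, 1e-2, 1e-3):
    v = eps * np.array([1.0, -2.0, 1.0])   # perturbation summing to zero, so p stays on the simplex
    p = q + v
    print(eps, d_f(p, q), 0.5 * v @ H(q) @ v)   # the last two columns agree to roughly O(eps^3)
```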
One reason that Theorem 6 is so good is that it allows us to actually get “real” metrics from divergences. From the simplest perspective, we know that, for \(p\) and \(q\) close, \(d_f(p,q) \approx \frac{1}{2} \langle p-q,H(q)(p-q)\rangle\), which is a squared distance under a positive-definite quadratic form whenever \(f\) is strictly convex with nonsingular Hessian. On the other hand, we can also use the perspective of Riemannian geometry, under which \(H\) plays the role of an ambient metric tensor. Many problems in statistical inference can be framed in terms of exploring the structure of a manifold of probability distributions. Any smooth parameterized family of distributions defines such a manifold; for example, the Poisson distribution \(p_K(k;\lambda) = \frac{e^{-\lambda}\lambda^k}{k!}\) is parameterized by a single parameter \(\lambda\). We can compare these distributions using the ambient metric, and the result is a Riemannian metric induced on the 1-dimensional manifold of distributions.
For more on information geometry, see Nielsen (2020) or Amari (2016).
Information Geometry and Spatial Analysis
The information geometry point can feel pretty abstract, but there’s a specific reason to be interested in this perspective when studying spatial segregation. In particular, consider the following idealized model of Census data. Let \(S\) be a connected, closed region in \(\mathbb{R}^2\). For each point \(s \in S\), we have a probability distribution \(p_{X|s}\) over some set of demographic variables \(\mathcal{X}\).
We now impose the assumption that \(p_{X|s}\) is smooth as a function of \(s\). This allows us to apply Theorem 6, which in turn allows us to treat \(S\) as a Riemannian manifold and do geometric calculations in a metric that is adapted to the demographic structure. This lets us do things like measure “demographic distance” between points in space, and also enables some novel approaches to characterizing the spatial scale of segregation (Chodrow 2017b).
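To make “demographic distance” a bit more tangible, here is a toy sketch of the idea (my own illustrative construction under the assumptions above, not the specific method of Chodrow 2017b): a two-group demographic field varies smoothly along a one-dimensional transect, and its Riemannian length under the Hessian metric of the negative entropy quantifies how much the demographics shift along the path.

```python
import numpy as np

# A smooth two-group demographic field along a transect s in [0, 1]:
# the share of group A rises from 20% to 80% across the transect.
def p_field(s):
    a = 0.2 + 0.6 * s
    return np.array([a, 1.0 - a])

# Hessian metric of f(p) = sum_i p_i log p_i, i.e. H(p) = diag(1 / p_i).
def metric(p):
    return np.diag(1.0 / p)

# Riemannian length of the transect: integrate sqrt( dp^T H(p) dp ) along the path.
s_grid = np.linspace(0.0, 1.0, 1001)
length = 0.0
for s0, s1 in zip(s_grid[:-1], s_grid[1:]):
    dp  = p_field(s1) - p_field(s0)          # displacement of the local demographics
    mid = p_field(0.5 * (s0 + s1))
    length += np.sqrt(dp @ metric(mid) @ dp)

print(length)   # total "demographic distance" traversed along the transect
```

A transect along which the demographics barely change would have a much smaller length under this metric, even if its Euclidean length in space were the same.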