9  Link Prediction and Feedback Loops

In this set of lectures, we’ll study an important task in network data science: link prediction. The link prediction task is to, given a current network and possibly some additional data, predict future edges. This task has many applications:

Link prediction was popularized as a task in network analysis and machine learning by Liben-Nowell and Kleinberg (2007)

In the first part of these lecture notes, we’ll implement a simple link prediction model. In the second part, we’ll do a simple simulation to learn about how link prediction models can change the structure of social networks.

9.2 Impact of Algorithmic Recommendations on Social Networks

Link prediction algorithms are often used by apps and platforms to make recommendations. When Twitter suggests a new profile for you to follow, for example, they often do this on the basis of a link prediction algorithm: users like you have often followed profiles like that one in the past, and so they think that you might like to follow it now. From the perspective of the company making these recommendations, the overall purpose is to increase “engagement” on their platform. More engagement leads to more time spent scrolling, which leads to more time watching money-making ads.

But what happens to the structure of social networks under the influence of link-prediction algorithms? The details of course here depend on the algorithm, but let’s use a version of the one we used in the previous section. We’re going to wrap the whole thing up in a Python class in order to be able to keep track of the current state of the network.

class LinkPredictionSimulator:

    def __init__(self, edge_df, **kwargs):
        self.edge_df = edge_df.copy()
        self.G = nx.from_pandas_edgelist(self.edge_df)
        self.kwargs = kwargs
        self.node_list = list(self.G.nodes)
        self.comm_dict, self.comms = louvain_communities(self.G, True)

    # ---------------------------------------------
    # ---------------------------------------------

    def prep_data(self):
        add negative examples and compute features on the current data frame of edges, using stored community labels for community features. 
        self.train_df = add_negative_examples(self.edge_df)
        self.train_df = compute_features(self.train_df, comm_dict = self.comm_dict, **self.kwargs)

        # store the names of the feature columns for later
        self.feature_cols = [col for col in self.train_df.columns if col not in ["source", "target", "link"]]
    def train_model(self):
        Train a logistic classifier on the current data after features have been added. 
        X = self.train_df[self.feature_cols]
        y = self.train_df["link"]

        self.model = LogisticRegression(solver = "liblinear")
        self.model.fit(X, y)
    def get_predicted_edges(self, m_replace):
        Return a data frame containing the m_replace most likely new edges that are not already present in the graph. 
        # data frame of candidate pairs
        pairs = pd.DataFrame(product(self.node_list, self.node_list), columns = ["source", "target"])
        pairs = pairs[pairs["source"] < pairs["target"]]

        # add features to the candidate pairs
        pairs = compute_features(pairs, comm_dict = self.comm_dict, G = self.G, **self.kwargs)

        # add the model predictions
        pairs["edge_score"] = self.model.predict_proba(pairs[self.feature_cols])[:,1]

        # remove pairs that already present in the graph
        pairs = pd.merge(pairs, self.edge_df, on = ["source", "target"], indicator = True, how = "outer")
        pairs = pairs[pairs._merge == "left_only"]

        # get the m_replace pairs with the highest predicted probability
        # and return them
        pairs = pairs.sort_values("edge_score", ascending = False).head(m_replace)
        return pairs[["source", "target"]]

    def update_edges(self, m_replace):
        removes m_replace edges from the current graph, and replaces them with m_replace predicted edges from get_predicted_edges. 

        # remove m_replace random edges
        self.edge_df = self.edge_df.sample(len(self.edge_df) - m_replace)

        # add m_replace recommended edges 
        new_edges = self.get_predicted_edges(m_replace)
        self.edge_df = pd.concat([self.edge_df, new_edges])

    def step(self, m_replace = 1, train = True):
        main simulation function. In each step, we do the data preparation steps, train the model, and update the graph. 
        self.G = nx.from_pandas_edgelist(self.edge_df)

    # ---------------------------------------------
    # ---------------------------------------------

    def degree_gini(self):
        The Gini coefficient is a measure of inequality. We are going to use it to measure the extent of inequality in the degree distribution. Higher Gini = more inequality in the degree distribution. 

        code from https://stackoverflow.com/questions/39512260/calculating-gini-coefficient-in-python-numpy
        degs = np.array([self.G.degree[i] for i in self.G.nodes])
        mad = np.abs(np.subtract.outer(degs, degs)).mean()
        rmad = mad/np.mean(degs)
        g = 0.5 * rmad
        return g

    def modularity(self):
        modularity of the stored partition
        return nx.algorithms.community.modularity(self.G, self.comms)

Phew, that’s a lot of code! Let’s now create a simulator, using the entire contact network.

edges = contact.groupby(["source", "target"]).count().reset_index()
LPS = LinkPredictionSimulator(edges)

Now we’re going to conduct our simulation. Along the way, I’ve set up code so that we can see the graph (and its community partition) before and after the simulation. While we do the simulation, I’m also going to collect the modularity and degree Gini coefficients.

# LPS.prep_data()


fig, axarr = plt.subplots(1, 2)

pos = nx.fruchterman_reingold_layout(LPS.G)

louvain_plot(LPS.G, comm_dict = LPS.comm_dict, ax = axarr[0], pos = pos)

Q    = []
gini = []

LPS.step(0, train = True)
for i in range(50):
    LPS.step(100, train = True)

louvain_plot(LPS.G, comm_dict = LPS.comm_dict, ax = axarr[1], pos = pos)

Here’s what happened to the modularity and the degree Gini coefficient as the simulation progressed:

plt.plot(Q, label = "Modularity of original partition")
plt.plot(gini, label = "Gini inequality in degrees")

plt.gca().set(xlabel = "Timestep")
[Text(0.5, 0, 'Timestep')]

Evolution of the modularity and the Gini coefficient of degrees during our simulation of algorithmic link prediction.

In this simulation, with this model, algorithmic recommendations led to a network that has:

  • More closed-off, insular communities, indicated by the high modularity.
  • Increased inequality of influence, at least as measured by degrees.

It’s important to note that these results have multiple interpretations. Tighter communities could just mean that the platform is better at helping people connect to their interests, and in some cases this might be harmless. On the other hand, such tight communities also smack of echo chambers; in cases related to opinion exchange or debate, it might be difficult for people to actually encounter contrary opinions in this setting. Equality of influence might seem like a good thing, but could also indicate that people with extreme or repugnant viewpoints have become mainstreamed. So, while it’s clear that the algorithm has significantly changed the overall structure of the social network, it’s important to think about this in specific contexts in order to determine whether this is a bad thing or not.

Overall, our findings suggest that the influence of automated recommendation algorithms have the possibility to change the overall shape of social networks in possibly harmful ways. For some perspectives on how algorithmic influence shapes collective behavior, and what this might imply, see Bak-Coleman et al. (2021).