Efficiently computing shared neighbors in graph

Problem background

Assume that we have a large graph containing many nodes and edges. For this graph, we need to compute the number of shared neighbors for each pair of nodes.

An instance example

For nodes 0 and 1, the result is 1 as the nodes share a single neighbor, which is node 2. For nodes 0 and 3, the result is 2 as both of them share nodes 2 and 4 as neighbors.

Naive solution

Assume that the input graph is represented in python networkx graphs. A trivial solution is to compute for all pairs the number of shared neighbors by neighbors intersection:

from itertools import combinations
import networkx as nx
from networkx.generators.random_graphs import gnp_random_graph

g: nx.Graph = gnp_random_graph(1000, 0.2)  # get a graph
shared_neighbors = {}
for node_1, node_2 in combinations(g.nodes, r=2):
    shared_neighbors[(node_1, node_2)] = (
        sum(1 for _ in nx.common_neighbors(g, node_1, node_2))
    )

This code is relatively trivial. The most expensive part of the code is getting the common neighbors. This code can be improved by computing shared neighbors faster.

Almost naive solution

A straight forward approach to optimize the performance is by indexing the neighbors beforehand in sets for each node. Using that, we don’t need to access the graph’s data structure, and avoid an expensive iteration.

from itertools import combinations
import networkx as nx
from networkx.generators.random_graphs import gnp_random_graph

g: nx.Graph = gnp_random_graph(1000, 0.2)  # get a graph
neighbors = {node: set(g.neighbors(node)) for node in g.nodes}
for node_1, node_2 in combinations(g.nodes, r=2):
    shared_neighbors[(node_1, node_2)] = (
        len(neighbors[node_1] & neighbors[node_2])
    )

The difference between the naive solution and the almost naive one is significant. On average, an example of 1000 nodes takes 90 seconds using the naive solution, and 3 seconds using the almost naive solution.

Now, can we do any better?

Optimizing using an adjacency matrix

Firstly, let us refresh our memory regarding how graphs are represented in a matrix.

Adjacency matrix

For undirected graphs, a common representation is an adjacency matrix. Denote by n the number of nodes, and denote by A a matrix of size n×n. The cell A[i,j]=1 if the nodes i and j are adjacent, and 0 otherwise.

For example, the following graph:

Is represented by:

0101
1001
0001
1101
Adjacency matrix

Simply stating, each cell corresponds to a “potential” edge. If an edge connecting the ith node to a jth node exists in the graph, the cell value is 1, otherwise it is 0.

Intersection using multiplication

Let us also remember how matrix multiplication is performed. Given matrices A and B, both of size n×n, the (i,j) cell in the result is the sum of A[i,k]•B[k,j] over all values of k from 0 to n-1. Since our values are binarized, then A[i,k]•B[k,j] is either 0 or 1, where 1 means that A[i,k]=1 and B[k,j]=1, otherwise A[i,k]=0 or B[k,j]=0. Since in adjacency matrix A[i,j] indicates if i and j are adjacent, then the sum A[i,k]•B[k,j] over k is a counter of the shared neighbors of i and j.

Using the adjacency matrix above, set i=0 and j=1: A[0,0]•B[0,1]+A[0,1]•B[1,1]+A[0,2]•B[2,1]+A[0,3]B[3,1]=0+0+0+1=1, which corresponds to the single shared neighbor, node 3.

It is easy to see that if we multiply A by itself, each cell A[i,j] will contain the number of shared neighbors by i and j.

Now, let us look at a code that does exactly that:

A = nx.to_scipy_sparse_matrix(g)
A = (A**2).todok()

The code does three simple operations:

  • Convert the graph to adjacency matrix (a csr matrix in this case)
  • Multiply the adjacency matrix by itself
  • Convert the result matrix to dok matrix (a dictionary of keys representation)

The first two steps perform and the expected operations described for the optimizations. The last step is performed to make the comparison to the previous versions valid, as they yielded a dictionary that allowed access in O(1) to the intersection size.

On average, an example of 1000 nodes takes 3 seconds using the almost naive solution, and 0.8 seconds using the matrix-based solution. This is more than 3 times faster, and more than 100 times faster than the naive solution.

We won’t be going into two essential details of why matrix-based operations are faster, and representation by sparse matrix. Yet, we’ll assume for now that you already possess the processor-based intuition for it.

Summary

In this post, we presented a common problem of nodes close-neighborhood intersection. We presented how easy is it to get a working solution, and how to optimize by a factor of more than a 100 using a conversion of the representation to an adjacency matrix.

In general, many problems in graphs are easily translatable to matrix problems. When a performance issue arises in the original representation, a conversion to a matrix may ease the issue significantly.

Easy Wins Optimizations – HashSet

HashSet is a data structure introduced in .Net 3.5 framework. The main advantage of HashSet is that it adds and fetches items in very low complexity (constant amortized time). We’ll try to see how much faster than List it can be. In the example we’ll go over code that stores all historical lottery results and checks if some results appear among the historical ones.

Registering results

public abstract class ResultsRegistrarBase
{
protected ICollection<long> codes;

public void AddResult(int num1, int num2, int num3, int num4, int num5, int num6)
{
long code = ConvertToCode(num1, num2, num3, num4, num5, num6);

codes.Add(code);
}

public bool ContainsResult(int num1, int num2, int num3, int num4, int num5, int num6)
{
long code = ConvertToCode(num1, num2, num3, num4, num5, num6);

return codes.Contains(code);
}

private static long ConvertToCode(int num1, int num2, int num3, int num4, int num5, int num6)
{
return num1 + (long) Math.Pow(10, 0) + num2 * (long) Math.Pow(10, 2) +
num3 * (
long) Math.Pow(10, 4) + num4 * (long) Math.Pow(10, 6) +
num5 * (
long) Math.Pow(10, 8) + num6 * (long) Math.Pow(10, 10);
}
}

This class is in charge of storing and finding the results. It assumes it gets the numbers sorted and the ConvertToCode method maps all the numbers to one “big” number”. For example: 1,5,20,30,45,49 will be mapped to 494530200501. The codes collection will be initialized in a base class:
The naive code:

public class NaiveResultsRegistrar : ResultsRegistrarBase
{
public NaiveResultsRegistrar()
{
codes =
new List<long>();
}
}

The code based on HashSet:

public class HashSetResultsRegistrar : ResultsRegistrarBase
{
public HashSetResultsRegistrar()
{
codes =
new HashSet<long>();
}
}

How much faster is it than the naive code?

The difference is huge. In a simple case where we have 250,000 historical results and we’re querying the collection 50,000 times the average running time of the naive code is 1:28 minutes while the code based on HashSet takes only 0.2 seconds. The HashSet code run ~360 times faster than the naive code. The reason for this major difference is that List has to iterate over all the items in order to determine if it contains the requested item. HashSet uses a hash method, without getting to implementation details, it enables the Contains method look in very few entries in order to determine whether or not an item is contained in the collection.
This major improvement was received with minimal change in the code. So in cases where we query a large collection about whether some items are in it, usage of HashSet can give us major running time  improvement with minimal change.

Easy Wins Optimizations – Sorted List

This is the first post of easy wins optimizations posts. The need for these posts comes from the fact the most of us hardly ever need it 🙂 Sounds weird? It should. What it means is that we don’t need it most of the time; the most naive code does the best work for us. When we encounter the case that requires optimization we hardly remember the simplest ways of doing it.
In this post on sorted list we’ll go over an example of optimizing a simple case.

How to find a person in a collection whose age is closest to mine?

Assume we have a person class with age defined on it:

public class Person
{
public double Age { get; private set; }

public Person(double age)
{
Age = age;
}
}

What we want is to collect many people into a list, and any time we asked about an age, we find the closest person to this age. We can think of it as an application for a dating service that matches people by age.

The naive solution

So what can be simpler than collecting all the people in a list? Probably nothing, this is how we do it:

public class NaivePersonFinder : IPersonFinder
{
private readonly List<Person> persons = new List<Person>();

public void Add(Person person)
{
persons.Add(person);
}

public Person FindPeopleInRange(double requiredAge)
{
Person closestPerson = null;
double smallestDiffrence = double.MaxValue;
foreach (var person in persons)
{
double difference = Math.Abs(person.Age - requiredAge);
if (difference < smallestDiffrence)
{
smallestDiffrence = difference;
closestPerson = person;
}
}

return closestPerson;
}
}

The code iterates over the collection in the most straightforward way there is and updates the closest person whenever it encounter a closer person. Simple! Isn’t it?

But I need faster results!

The sorted list solution
How does sorted list help us here? Sorted list allow us to search items using Binary Search. Binary search saves a lot of time since it has to query very few items in the collection in order to find the closest one. For example, to find the closest item in a collection of 1,000 items it’ll have to query ~10 and saves iteration over the hole 1,000.

Some code

public class SortedPersonFinder : IPersonFinder
{
private readonly SortedList<double, Person> persons = new SortedList<double, Person>();

public void Add(Person person)
{
persons.Add(person.Age, person);
}

public Person FindPeopleInRange(double requiredAge)
{
if (persons.ContainsKey(requiredAge))
{
return persons[requiredAge];
}

IList<double> sortedAges = persons.Keys;

int index = BinarySearchIndex(requiredAge);

var closestPersons = new List<Person>(3);

if (index > 0)
{
closestPersons.Add(persons[sortedAges[index -
1]]);
}

if (index < sortedAges.Count - 1)
{
closestPersons.Add(persons[sortedAges[index +
1]]);
}

closestPersons.Add(persons[sortedAges[index]]);

return closestPersons.OrderBy(person => Math.Abs(person.Age - requiredAge)).First();
}

private int BinarySearchIndex(double requiredAge)
{
var sortedAges = persons.Keys;

var scopeStart = 0;
var scopeEnd = sortedAges.Count - 1;

int index = 0;
while (scopeEnd > scopeStart )
{
index = (scopeStart + scopeEnd) /
2;
double ageAtIndex = sortedAges[index];
if (requiredAge > ageAtIndex)
{
scopeStart = index +
1;
}
else
{
scopeEnd = index;
}
}
return index;
}
}

The code is more complex than the naive version, but it finds the results much faster. It uses .Net Sorted List collection. This is a collection that updates a collection of  keys whenever the collection changes. The keys are kept ordered and we can easily access them and binary search them.

How much faster is it?

In a collection of 100,000 people and 50,000 queries the running time of the sorted list is much more faster than the naive:
The naive solution did it in average time of ~100 seconds
The sorted list solution had average time of ~6 seconds. This is a major improvement; it means that it ran more the 15 times faster than the naive solution.

Conclusion

If we have a collection in which order can be applied AND we also need quick results for queries based on the order, sorted list can easily improve running time. We must always remember that the running time is usually negligible even in the most naive code. If no optimization required – don’t optimize, more gain will be earned from simple and maintainable code.