Efficiently computing shared neighbors in graph

Posted by Eli Shalom on December 25, 2020

Problem background

Assume that we have a large graph containing many nodes and edges. For this graph, we need to compute the number of shared neighbors for each pair of nodes.

An instance example

For nodes 0 and 1, the result is 1 as the nodes share a single neighbor, which is node 2. For nodes 0 and 3, the result is 2 as both of them share nodes 2 and 4 as neighbors.

Naive solution

Assume that the input graph is represented in python networkx graphs. A trivial solution is to compute for all pairs the number of shared neighbors by neighbors intersection:

from itertools import combinations
import networkx as nx
from networkx.generators.random_graphs import gnp_random_graph

g: nx.Graph = gnp_random_graph(1000, 0.2)  # get a graph
shared_neighbors = {}
for node_1, node_2 in combinations(g.nodes, r=2):
    shared_neighbors[(node_1, node_2)] = (
        sum(1 for _ in nx.common_neighbors(g, node_1, node_2))
    )

This code is relatively trivial. The most expensive part of the code is getting the common neighbors. This code can be improved by computing shared neighbors faster.

Almost naive solution

A straight forward approach to optimize the performance is by indexing the neighbors beforehand in sets for each node. Using that, we don’t need to access the graph’s data structure, and avoid an expensive iteration.

from itertools import combinations
import networkx as nx
from networkx.generators.random_graphs import gnp_random_graph

g: nx.Graph = gnp_random_graph(1000, 0.2)  # get a graph
neighbors = {node: set(g.neighbors(node)) for node in g.nodes}
for node_1, node_2 in combinations(g.nodes, r=2):
    shared_neighbors[(node_1, node_2)] = (
        len(neighbors[node_1] & neighbors[node_2])
    )

The difference between the naive solution and the almost naive one is significant. On average, an example of 1000 nodes takes 90 seconds using the naive solution, and 3 seconds using the almost naive solution.

Now, can we do any better?

Optimizing using an adjacency matrix

Firstly, let us refresh our memory regarding how graphs are represented in a matrix.

Adjacency matrix

For undirected graphs, a common representation is an adjacency matrix. Denote by n the number of nodes, and denote by A a matrix of size n×n. The cell A[i,j]=1 if the nodes i and j are adjacent, and 0 otherwise.

For example, the following graph:

Is represented by:

0	1	1
1	0	1
0	0	1
1	1	1

Adjacency matrix

Simply stating, each cell corresponds to a “potential” edge. If an edge connecting the ith node to a jth node exists in the graph, the cell value is 1, otherwise it is 0.

Intersection using multiplication

Let us also remember how matrix multiplication is performed. Given matrices A and B, both of size n×n, the (i,j) cell in the result is the sum of A[i,k]•B[k,j] over all values of k from 0 to n-1. Since our values are binarized, then A[i,k]•B[k,j] is either 0 or 1, where 1 means that A[i,k]=1 and B[k,j]=1, otherwise A[i,k]=0 or B[k,j]=0. Since in adjacency matrix A[i,j] indicates if i and j are adjacent, then the sum A[i,k]•B[k,j] over k is a counter of the shared neighbors of i and j.

Using the adjacency matrix above, set i=0 and j=1: A[0,0]•B[0,1]+A[0,1]•B[1,1]+A[0,2]•B[2,1]+A[0,3]•B[3,1]=0+0+0+1=1, which corresponds to the single shared neighbor, node 3.

It is easy to see that if we multiply A by itself, each cell A[i,j] will contain the number of shared neighbors by i and j.

Now, let us look at a code that does exactly that:

A = nx.to_scipy_sparse_matrix(g)
A = (A**2).todok()

The code does three simple operations:

Convert the graph to adjacency matrix (a csr matrix in this case)
Multiply the adjacency matrix by itself
Convert the result matrix to dok matrix (a dictionary of keys representation)

The first two steps perform and the expected operations described for the optimizations. The last step is performed to make the comparison to the previous versions valid, as they yielded a dictionary that allowed access in O(1) to the intersection size.

On average, an example of 1000 nodes takes 3 seconds using the almost naive solution, and 0.8 seconds using the matrix-based solution. This is more than 3 times faster, and more than 100 times faster than the naive solution.

We won’t be going into two essential details of why matrix-based operations are faster, and representation by sparse matrix. Yet, we’ll assume for now that you already possess the processor-based intuition for it.

Summary

In this post, we presented a common problem of nodes close-neighborhood intersection. We presented how easy is it to get a working solution, and how to optimize by a factor of more than a 100 using a conversion of the representation to an adjacency matrix.

In general, many problems in graphs are easily translatable to matrix problems. When a performance issue arises in the original representation, a conversion to a matrix may ease the issue significantly.

QuickSelect in CoffeeScript

Posted by Eli Shalom on June 21, 2014

QuickSelect is a known and simple algorithm for finding the kth smallest element in an array. The advantages of this algorithm are its linear average performance and the constant memory it requires, in addition to simple implementation.

The simplest solution is of course sorting the array using a built-in method of any language and then accessing the kth element in the sorted array. But, it may become an issue when working with large sets – O(n*log n) vs. O(n).

Implementation

class QuickSelect
  constructor: (@arr) ->

  kth: (k) ->
    @_select(0, @arr.length - 1, k)

  median: ->
    @kth Math.floor(@arr.length / 2)

  _swap: (i, j) ->
    tmp = @arr[i]
    @arr[i] = @arr[j]
    @arr[j] = tmp

  _partition: (left, right, pivotIndex) ->
    pivotValue = @arr[pivotIndex]
    @_swap pivotIndex, right
    storeIndex = left
    for i in [left..right]
      if @arr[i] < pivotValue
        @_swap storeIndex, i
        storeIndex++
    @_swap right, storeIndex
    return storeIndex

  _choose_random_pivot: (left, right) ->
    left + Math.floor(Math.random() * (right - left + 1))

  _select: (left, right, k) ->
    if left == right
      return @arr[left]

    while true
      pivotIndex = @_choose_random_pivot left, right
      pivotIndex = @_partition(left, right, pivotIndex)
      if k == pivotIndex
        return @arr[k]
      else if k < pivotIndex
        right = pivotIndex - 1
      else
        left = pivotIndex + 1

Usage examples

describe "QuickSelect", ->
  describe "1 item", ->
    selector = new QuickSelect [1]
    it "Should return the item", -> selector.kth(0).should.equal(1)

  describe "2 items", ->
    selector = new QuickSelect [2,1]
    it "Min should be 1", -> selector.kth(0).should.equal(1)
    it "Max should be 2", -> selector.kth(1).should.equal(2)

  describe "3 items", ->
    selector = new QuickSelect [2,1,3]
    it "Min should be 1", -> selector.kth(0).should.equal(1)
    it "Second should be 2", -> selector.kth(1).should.equal(2)
    it "Max should be 3", -> selector.kth(2).should.equal(3)

These simple tests explains how the access to kth element is done. The tests are written also in CoffeeScript.

Performance comparison

Measuring time for 10,000 arrays, each accessed once at random location:

Array size	5,000	10,000	20,000
QuickSelect	1758ms	3393ms	6912ms
Sort	14302ms	32136ms	70566ms

As can be easily seen, the access to the kth smallest element is much faster. In cases where large arrays are used and repeated accesses are performed, this could be a bottleneck which can be easily solved.

Improving performance using page guards

Posted by Eli Shalom on December 21, 2013

The problems we’re facing today is, a little bit, unique. Given:

n contiguous arrays
Each array has m cells
Each cell is a Boolean flag

We receive a stream of signals, each signal is an absolute offset from the first array. For each signal we need to set the correct flag AND the first flag of the array. The motivation for setting the first flag is to enable quick filtering of arrays having some flags set.
For example, we have a usage tracking system for n websites and m users. If user i visited website j we’d like to signal that by setting the ith flag in the jth array. After some time, we’d like to query which sites had any visit and who visited them.

The intuitive solution

Assuming you don’t care too much for the performance the solution is straight forward. Whenever setting a flag in an array set also the array in offset 0. If the input is index, then the array index is index / m and the item index is index % m. Pretty simple. For simplicity the source of indexes will be an array named items and the address of the first array will be baseAddress:

for (int i = 0; i < numOfItems; ++i)
{
    char* hitAddress = baseAddress + items[i];
    *hitAddress = 1;
    char* blockStartAddress = hitAddress - (hitAddress - baseAddress) % dwPageSize;
    *blockStartAddress = 1;
}

It is clear that the first action, *hitAddress = 1, is impossible to avoid. But, what about the set of the signal at index 0? We can replace it with a condition but it is clear it won’t affect much the performance. So, how can we improve that part?

Enabling page guards

Windows provides several memory protections, one of them is the page guard. When allocating a new memory scope we can declare it as protected. Defining it as protected means that each page (page is an arbitrary partition of the memory based on OS page size) will throw an exception on the first access to it. After throwing the exception the protection is removed. We would like to use this mechanism to avoid re-setting the flag at index 0.

In order to define such a scope, we will use the VirtualAlloc method:

VirtualAlloc(NULL, TOTAL_SIZE,
             MEM_RESERVE | MEM_COMMIT, 
             PAGE_READWRITE | PAGE_GUARD)

It returns a pointer to the memory scope with size of TOTAL_SIZE in bytes. If a page size P then the new scope has TOTAL_SIZE / P pages.

Tracking page hit

As mentioned, at the first time the memory inside a page is accessed an exception is being thrown. We would like to catch it. In order to do so in the fastest way, we will use windows SetUnhandledExceptionFilter API. The filter is a simple method receiving the exception information and deciding how to treat it. Treating it has three options:

Handling it
Handling it and continue the code execution
Pass the decision to other handler

As a simple filter we can request the runtime to ignore all page guards exceptions:

LONG WINAPI SmartFilter(_EXCEPTION_POINTERS *ep)
{
    if (ep->ExceptionRecord->ExceptionCode != STATUS_GUARD_PAGE_VIOLATION)
    {
        return EXCEPTION_CONTINUE_SEARCH;
    }

    return EXCEPTION_CONTINUE_EXECUTION;
}

So after setting it as the filter all page guards exceptions will be ignored:

SetUnhandledExceptionFilter(&SmartFilter);

Extending the exceptions filter logic

Let’s assume that all arrays are smaller than the opration system page size and we’ll assume that we don’t care about reserving extra space to pad each array. We’ll denote the page size with dwPageSize.
Now, we can make our SmartFilter really smart. We will add to it the logic for setting the first flag on each array. Assuming baseAddress is some global variable:

LONG WINAPI SmartFilter(_EXCEPTION_POINTERS *ep)
{
    if (ep->ExceptionRecord->ExceptionCode != STATUS_GUARD_PAGE_VIOLATION)
    {
        return EXCEPTION_CONTINUE_SEARCH;
    }

    char* hitAddress = (char*)ep->ExceptionRecord->ExceptionInformation[1];
    char* blockStartAddress = hitAddress - (hitAddress - baseAddress) % dwPageSize;

    *blockStartAddress = 1;

    return EXCEPTION_CONTINUE_EXECUTION;
}

We extract the exact address being touched by accessing ep->ExceptionRecord->ExceptionInformation[1]. Through it it’s easy to get the start address of the page. When having this filter method registered we can be sure that whenever we set a flag in the array the first flag will be set too.
Now, we can alter the original code which was in charge of setting the first signal whenever a flag was set:

for (int i = 0; i < numOfItems; ++i)
{
    char* hitAddress = baseAddress + items[i];
    *hitAddress = 1;
}

Comparing the results

In order to make our comparison interesting let’s assume that we have 10000 arrays (websites in the tracking system) and each array has 25000 flags (users for example). In order to make it intense we’ll assume that during a short period 10% of the arrays were visited, for exmaple having 250000000 signals sent through the stream (repeating actions in a website by same users are allowed). On average, the time it took to run:

	Seconds
Straight forward	1.744
Page guards	0.78

As can easily seen, the page guards solution saves ~50% of the runtime.

Conclusion

The operating system provides a few very fast facilities which can be exploited. Even though most of those facilities are designed for different purposes they can still be useful in different cases, like the one here requiring single time signal for a scope. Since Windows puts a lot of focus in being backward compatible, those exploitations are not too risky. As usual – if it doesn’t require performance optimization, don’t do it. The price of maintaining the code might not worth it.

Conditional attribute and arguments evaluation

Posted by Eli Shalom on November 24, 2013

What is the conditional attribute?

The conditional attribute enables including/omitting methods calls during compilation depending on compilation symbols. For example, we can condition that a specific log method calls will be included only when we compile in debug. The compiler in this case will omit the calls to the method. Looking at the next code:

public class Logger
{
    [Conditional("DEBUG")]
    public void LogDebugMessage(string str)
    {

    }
}

And the code calling it:

class MyClass
{
    private readonly Logger logger = new Logger();

    public void Foo()
    {
        logger.LogDebugMessage("Foo");
    }
}

We expect the compiler to omit the body of Foo(). As we can see with a disassembler this is exactly what happens:

.method public hidebysig instance void  Foo() cil managed
{
  // Code size       1 (0x1)
  .maxstack  8
  IL_0000:  ret
} // end of method MyClass::Foo

How method arguments are treated?

Temp variable assignment optimization

Regardless the conditional attribute, in release mode the compiler performs many optimization which one of them is skipping local variable assignment. You’re most likely to notice it when you assign a value into a local variable and pass it to a method as an argument (while this is the only variable usage). For example:

public void Foo()
{
    var foo = "Foo";
    logger.LogDebugMessage(foo);
}

Translates into:

0527C028  mov         edx,dword ptr ds:[2EE78B0h]  	// Load the string address
0527C02E  mov         ecx,dword ptr [ecx+4]             // Load the logger instance
0527C031  cmp         dword ptr [ecx],ecx  		// Null check
0527C033  call        dword ptr ds:[50C5BD8h]  		// Call LogDebugMessage
0527C039  ret

While:

public void Foo()
{
    logger.LogDebugMessage("Foo");
}

Translates into:

0552C028  mov         ecx,dword ptr [ecx+4]  		// Load the logger instance
0552C02B  mov         edx,dword ptr ds:[31478B0h]  	// Load the string address
0552C031  cmp         dword ptr [ecx],ecx  		// Null check
0552C033  call        dword ptr ds:[5375C30h]  		// Call LogDebugMessage
0552C039  ret

Which are basically the same. So in case we’re not using the conditional attribute we shouldn’t care about local assignment. We can expect to have no difference in runtime.

Temp variable sent to omitted call optimization?

So an interesting question is what happens to an argument we’re about to send to a conditional method? If call to LogDebugMessage are omitted, what should we expect in this case:

public void Foo()
{
    var method = MethodBase.GetCurrentMethod().Name;
    logger.LogDebugMessage(method);
}

And in this case:

public void Foo()
{
    logger.LogDebugMessage(MethodBase.GetCurrentMethod().Name);
}

The answer can be easily found by looking at the methods IL. The first version with temp assignment to a variable compiles into:

.method public hidebysig instance void  Foo() cil managed
{
  // Code size       12 (0xc)
  .maxstack  8
  IL_0000:  call       class [mscorlib]System.Reflection.MethodBase [mscorlib]System.Reflection.MethodBase::GetCurrentMethod()
  IL_0005:  callvirt   instance string [mscorlib]System.Reflection.MemberInfo::get_Name()
  IL_000a:  pop
  IL_000b:  ret
} // end of method MyClass::Foo

While the second version compiles into:

.method public hidebysig instance void  Foo() cil managed
{
  // Code size       1 (0x1)
  .maxstack  8
  IL_0000:  ret
} // end of method MyClass::Foo

As we can see, in this case the argument was not even evaluated and the whole statement was omitted from the IL. Meaning that in this case, inlining the variable would have influence on the performance. It didn’t happen by chance, this is the defined behavior of the compiler as stated in the Conditional attribute documentation:

“If the symbol is defined, the call is included; otherwise, the call (including evaluation of the parameters of the call) is omitted.”

Conclusion

The most common scenario in which the conditional attribute is involved is logging. Since the main advantage of omitting the logs is usually to avoid performance hit in production it is important to take into consideration the price of evaluating the arguments values. The simplest solution is to inline the variable. This can be done easily when the argument is string.Format() or similar. In case it is more complicated or unreadable it can always be solved by preprocessor directive such as #if.

Retrieving property value by name using dynamic method

Posted by Eli Shalom on February 19, 2012

In the previous post we compared some alternatives of the dynamic keyword. One important and very interesting alternative is based on reflection emit. Reflection emit enables us to generate code using IL at runtime, compile it and execute it straightaway.
In this post we’ll see how to extract a string property named ‘Name’ from an unknown type using a dynamic method.

The code

public static string GetNameByDynamicMethod(object arg)
{
    Type type = arg.GetType();

    Func<object, string> getterDelegate;
    if (!typeToEmitDelegateMap.TryGetValue(type, out getterDelegate))
    {
        string typeName = type.Name;

        PropertyInfo nameProperty = type.GetProperty("Name");
        Type returnType = typeof (string);
                
        // Define a new dynamic method
        // The method returns a string type
        // The method expects single parameter
        var method = new DynamicMethod("GetNameFrom" + typeName, 
                                       returnType, new[] {typeof(object)});
                
        ILGenerator ilGenerator = method.GetILGenerator();

        // Load to the stack the first method argument.
        //In our case, this is an object whose type we already know
        ilGenerator.Emit(OpCodes.Ldarg_0);
        
        // Cast the object to the type we already know
        ilGenerator.Emit(OpCodes.Castclass, type);

        // Call the getter method on the casted instance
        ilGenerator.EmitCall(OpCodes.Call, nameProperty.GetGetMethod(), null);

        // Return the value from Name property
        ilGenerator.Emit(OpCodes.Ret);

        // Compile the method and create a delegate to the new method
        getterDelegate = (Func<object, string>)method.CreateDelegate(typeof(Func<object, string>));

        typeToEmitDelegateMap.Add(type, getterDelegate);
    }

    return getterDelegate(arg);
}

What we did here was to define a new method, generate its code with IL, compile it and execute it. This new method is equivalent in many ways to a method we had generated in the original program. This new method will be hosted in a dynamic module in the memory.

The advantage of this kind of method over reflection is that it compiles the code once and doesn’t need to explore the type again whenever we need to get the property value.

Performance

A quick comparison for calling these alternatives 10,000,000 times each:

	Seconds	Ratio to directly
Directly	0.0131	1
Dynamic	0.4609	35
Expression	0.9154	70
Reflection emit	0.9832	75

As can be seen, using the dynamic keyword works much faster than compiling an expression or a dynamic method at runtime.

Another interesting data set shows the time that each alternative takes to set up (The time to perform the first call):

	Seconds
Directly	0.00003
Dynamic	0.08047
Expression	0.00114
Reflection emit	0.02169

Performance of the dynamic keyword

Posted by Eli Shalom on February 17, 2012

In .NET 4.0 a new keyword was introduced: the dynamic keyword. One of the things it allows is calling methods on an instance and bypassing the compile time type checks. It can be useful in many scenarios, for example duck typing.
In this post, we’ll see that in some cases the keyword might have an unnecessary performance hit. Another thing we’ll see is how to save some of that time.

Simple performance measure

Let’s compare the performance of 3 ways of getting a property value – directly, using dynamic and using reflection:

public static string GetName(Student arg)
{
    return arg.Name;
}
public static string GetNameByDynamic(dynamic arg)
{
    return arg.Name;
}
public static string GetNameByReflection(object arg)
{
    Type type = arg.GetType();

    MethodInfo getter;
    if (!typeToMethodMap.TryGetValue(type, out getter))
    {
        PropertyInfo property = type.GetProperty("Name");
        getter = property.GetGetMethod();
        typeToMethodMap.Add(type, getter);
    }

    return (string) getter.Invoke(arg, null);
}

Calling each method 10,000,000 times sums to: GetName=0.02 seconds, GetNameByDynamic=0.47 seconds, GetNameByReflection=15.41. Meaning, dynamic compared to strong type call is ~20 times slower.

Improving performance using interface

One way to deal with this performance hit is to extract an interface from all possible objects, through using it we can get back to work with strong type:

public interface INameProvider{
    string Name { get; set; }
}

And change our method to:

public static string GetNameByInterface(INameProvider arg)
{
    return arg.Name;
}

Luckily this code runs at 0.07 seconds, which is ~7 times faster than the dynamic version. The conclusion from this is that if our code is in a critical performance area, we better extract an interface (as long as it makes sense – don’t abuse the interface if the types have no logical commonality!).

Improving reflection version using expressions

What should we do if our code is written in pre-.NET 4.0 version and our solution is based on reflection? In this case, our code runs ~750 times slower than the strong type version. Since we can’t use dynamic, which was introduced first at .NET 4.0, we should find some other solution. A simple one is generating a method using expressions. The main advantage of expressions here is that they can be compiled into a new method which we can reuse.

public static string GetNameByExpression(object arg)
{
    Type type = arg.GetType();

    Func<object, string> getterDelegate;
    if (!typeToDelegateMap.TryGetValue(type, out getterDelegate))
    {
        var parameterExpression = Expression.Parameter(typeof (object), "arg");
        var castExpression = Expression.TypeAs(parameterExpression, type);
        MemberExpression memberExpression = Expression.Property(castExpression, "Name");
        var lambda = Expression.Lambda<Func<object, string>>(memberExpression, parameterExpression);

        getterDelegate = lambda.Compile();
        typeToDelegateMap.Add(type, getterDelegate);
    }

    return getterDelegate(arg);
}

This code here is basically equivalent to generating a lambda which looks like:

(object arg) => ((Student)arg).Name;

After we compile the code once we can skip the reflection invocation each time and end with much faster code. Running this method times 10,000,000 takes 0.86 seconds, which is ~18 times faster than the reflection solution.

Conclusion

If you are writing code which must run as fast as possible, this is the performance summary:

	Seconds	Ratio to directly
Directly	0.02	1
Through interface	0.07	3.5
Using dynamic	0.47	23.5
Using expression	0.86	43
Reflection	15.41	770

	Eli Shalom on Shortest path on sparse g…
	anon on Shortest path on sparse g…
	kopiluwak nya on Chicken, Shut Up!
	Omer Mor on Improving performance using pa…
	ravyoli on Improving performance using pa…

Theory to Practice to Theory

Category Archives: Performance

Efficiently computing shared neighbors in graph

Problem background

An instance example

Naive solution

Almost naive solution

Optimizing using an adjacency matrix

Adjacency matrix

Intersection using multiplication

Summary

QuickSelect in CoffeeScript

Implementation

Usage examples

Performance comparison

Improving performance using page guards

The intuitive solution

Enabling page guards

Tracking page hit

Extending the exceptions filter logic

Comparing the results

Conclusion

Conditional attribute and arguments evaluation

What is the conditional attribute?

How method arguments are treated?

Temp variable assignment optimization

Temp variable sent to omitted call optimization?

Conclusion

Retrieving property value by name using dynamic method

The code

Performance

Performance of the dynamic keyword

Simple performance measure

Improving performance using interface

Improving reflection version using expressions

Conclusion