# Efficiently computing shared neighbors in graph

## Problem background

Assume that we have a large graph containing many nodes and edges. For this graph, we need to compute the number of shared neighbors for each pair of nodes.

### An instance example

For nodes 0 and 1, the result is 1 as the nodes share a single neighbor, which is node 2. For nodes 0 and 3, the result is 2 as both of them share nodes 2 and 4 as neighbors.

## Naive solution

Assume that the input graph is represented in python networkx graphs. A trivial solution is to compute for all pairs the number of shared neighbors by neighbors intersection:

```from itertools import combinations
import networkx as nx
from networkx.generators.random_graphs import gnp_random_graph

g: nx.Graph = gnp_random_graph(1000, 0.2)  # get a graph
shared_neighbors = {}
for node_1, node_2 in combinations(g.nodes, r=2):
shared_neighbors[(node_1, node_2)] = (
sum(1 for _ in nx.common_neighbors(g, node_1, node_2))
)
```

This code is relatively trivial. The most expensive part of the code is getting the common neighbors. This code can be improved by computing shared neighbors faster.

## Almost naive solution

A straight forward approach to optimize the performance is by indexing the neighbors beforehand in sets for each node. Using that, we don’t need to access the graph’s data structure, and avoid an expensive iteration.

```from itertools import combinations
import networkx as nx
from networkx.generators.random_graphs import gnp_random_graph

g: nx.Graph = gnp_random_graph(1000, 0.2)  # get a graph
neighbors = {node: set(g.neighbors(node)) for node in g.nodes}
for node_1, node_2 in combinations(g.nodes, r=2):
shared_neighbors[(node_1, node_2)] = (
len(neighbors[node_1] & neighbors[node_2])
)
```

The difference between the naive solution and the almost naive one is significant. On average, an example of 1000 nodes takes 90 seconds using the naive solution, and 3 seconds using the almost naive solution.

Now, can we do any better?

## Optimizing using an adjacency matrix

Firstly, let us refresh our memory regarding how graphs are represented in a matrix.

For undirected graphs, a common representation is an adjacency matrix. Denote by n the number of nodes, and denote by A a matrix of size n×n. The cell A[i,j]=1 if the nodes i and j are adjacent, and 0 otherwise.

For example, the following graph:

Is represented by:

Simply stating, each cell corresponds to a “potential” edge. If an edge connecting the ith node to a jth node exists in the graph, the cell value is 1, otherwise it is 0.

### Intersection using multiplication

Let us also remember how matrix multiplication is performed. Given matrices A and B, both of size n×n, the (i,j) cell in the result is the sum of A[i,k]•B[k,j] over all values of k from 0 to n-1. Since our values are binarized, then A[i,k]•B[k,j] is either 0 or 1, where 1 means that A[i,k]=1 and B[k,j]=1, otherwise A[i,k]=0 or B[k,j]=0. Since in adjacency matrix A[i,j] indicates if i and j are adjacent, then the sum A[i,k]•B[k,j] over k is a counter of the shared neighbors of i and j.

Using the adjacency matrix above, set i=0 and j=1: A[0,0]•B[0,1]+A[0,1]•B[1,1]+A[0,2]•B[2,1]+A[0,3]B[3,1]=0+0+0+1=1, which corresponds to the single shared neighbor, node 3.

It is easy to see that if we multiply A by itself, each cell A[i,j] will contain the number of shared neighbors by i and j.

Now, let us look at a code that does exactly that:

```A = nx.to_scipy_sparse_matrix(g)
A = (A**2).todok()
```

The code does three simple operations:

• Convert the graph to adjacency matrix (a csr matrix in this case)
• Multiply the adjacency matrix by itself
• Convert the result matrix to dok matrix (a dictionary of keys representation)

The first two steps perform and the expected operations described for the optimizations. The last step is performed to make the comparison to the previous versions valid, as they yielded a dictionary that allowed access in O(1) to the intersection size.

On average, an example of 1000 nodes takes 3 seconds using the almost naive solution, and 0.8 seconds using the matrix-based solution. This is more than 3 times faster, and more than 100 times faster than the naive solution.

We won’t be going into two essential details of why matrix-based operations are faster, and representation by sparse matrix. Yet, we’ll assume for now that you already possess the processor-based intuition for it.

## Summary

In this post, we presented a common problem of nodes close-neighborhood intersection. We presented how easy is it to get a working solution, and how to optimize by a factor of more than a 100 using a conversion of the representation to an adjacency matrix.

In general, many problems in graphs are easily translatable to matrix problems. When a performance issue arises in the original representation, a conversion to a matrix may ease the issue significantly.

# Shortest path on sparse graphs

This post is an introduction to SciPy sparse graphs. It will present a variation of a known problem followed by a simple solution and implementation.

## Problem background

We are given a directed acyclic graph (DAG) with dynamic edge costs. The graph is large in regard to nodes, it is expected to have millions of nodes. Additionally, the graph is expected to have very few edges, so the average degree is very small.

We need to construct a data structure that given two states, a source and a target, can figure our efficiently if there’s a path from the source to the target. In case it can, what would be the minimum cost of such path?

Our goal is to reduce the query time, where we expected the pairs of source and target to be uniformly picked.

For example, this representation could be of a system with many workflows, mostly independent, where each step in a workflow requires different effort to finish.

### An instance example

The following graph is an example of such graph, where the edge labels represent the costs: A query could be: Can 2 reach 6? The answer, in this case, is yes. The cost of reaching 6 from 2 is 7 (2→6). Please note the cost is the current cost. The path will always exist, but the edges price may change in the future.

Another query could be: Can 1 reach 2? The answer is no. No path from 1 to 2 exists. This answer will stay no forever, regardless any cost changes.

## Problem characteristics

1. Since the input is a graph, then any shortest-path algorithm could work. For example, Dijkstra’s algorithm. Yet, for any target node, the expected query time is at least as the number of nodes that can reach the target node.
2. Remember that most states have very few transitions and that the graph is a DAG. This means that for a random pair of states, it is very unlikely that a path will exist.
3. The edge costs are dynamic. This means that any solution that involves pre-calculation of the results will require re-calculation.
4. The nodes set is large. Therefore, any exhaustive caching requires a lot of memory. If the number of nodes is 1,000,000 and each pair is represented using one bit, then cache would require ~125gb.

## Solution guidelines

A straightforward approach would aim to reduce the number of shortest-path algorithm executions as most would yield “no path”. Therefore, some sort of method to predict paths existence is needed.

Assuming we want to initialize a cache, we would store in the cache the fact that i is reachable from rather than the distance itself since the prices are dynamic. In the case where the average degree is very low, we can expect most pairs to be false. Therefore, we can store only the pairs whose value is true. For example, in the case of 1M nodes and an average degree of 2, we expect <<1% to be true.

## OK, let’s implement!

### Sparse graph representation

A common representation of graphs is weighted adjacency matrix. Denote the matrix with A, and we say that the cost of the edge from i to price is A[i, jif the edge from i to j exists and 0 otherwise. 0 1 2 3 0 0 10 0 11 1 0 0 3 18 2 0 0 0 0 3 0 0 0 0

In this example, has an edge to 1, so A[0, 1] = 10. 0 and 2 are not directly connected, so A[0, 2] = 0. The rows of 2 and 3 are all zeros since both are leaves, meaning their out degree is 0.

#### SciPy sparse matrix

We expect the majority of cells in the matrix to be 0. In this case, we can take advantage of a sparse matrix representation. The format which we will use is compressed sparse row (CSR), which supports efficient matrices multiplication. It requires O(num of edges) memory. Luckily, SciPy provides an implementation to the CSR matrix.

Given an input in a form of dictionary of edges to weight:

```edges  = {
(0, 1): 2,
(0, 2): 3,
(2, 3): 10,
(3, 4): 1,
(3, 5): 7
}
```

We can transform it to a sparse matrix:

```raw_weights = list(edges.values())
sources = [s for s, _ in edges.keys()]
targets = [t for _, t in edges.keys()]
n = max(sources + targets) + 1  # assume no isolated nodes
weights = csr_matrix((raw_weights, (sources, targets)), shape=(n, n))
```

This will result in our case to:

```[[ 0 2 3 0 0 0]
[ 0 0 0 0 0 0]
[ 0 0 0 10 0 0]
[ 0 0 0 0 1 7]
[ 0 0 0 0 0 0]
[ 0 0 0 0 0 0]]```

Remember that the zeros in the matrix are “implicit” so don’t let the current representation mislead you regarding the memory consumption.

### Computing reachable nodes

After representing the graph as a matrix, we can create a boolean adjacencies matrix. A boolean adjacencies matrix A is one where A[i, j] is true iff there is an edge from to j or i=j.

We will take advantage of the following propertyAk[i, j]=true iff exists a path from i to j of length k-1 or less. Let us look at an example: It is easy to create the adjacency matrix for this graph, and to follow its powers:

```# A
[[ True True False False]
[False True True False]
[False False True True]
[False False False True]]

# A2
[[ True True True False]
[False True True True]
[False False True True]
[False False False True]]

# A3
[[ True True True True]
[False True True True]
[False False True True]
[False False False True]]```

#### Graph diameter

According to the graph diameter definition, any simple path is as long as the diameter. This means that if we know the graph diameter, we can easily resolve the reachable components by computing Adiameter.

Another property which is easy to notice is that the minimal k that satisfies Ak=Ak+1 is the graph diameter. Also, if k>diameter, then Ak+m=Ak for any positive m (since no new node will become reachable after walking all paths in the length of the diameter). Looking at A2, A4, A8,…, we can see that if two consecutive powers of A equal, we have reached the matrix that represents the reachable components.

All the values in the reachable components are either true or false. Also, for any Ak[i, j] that is true for some kAk+m[i, j] will be true for any positive m. This means that when we compare two powers of A, it is enough to compare the number of true values in the matrix.

#### Computing with adjacency matrix powers

Given an initial weights matrix, we can create the adjacencies matrix:

```adjacency = (weights +
sparse.diags(np.ones(weights.shape), 0,
format='csr')).astype(bool)
```

The code takes the weights matrix and adds 1 to each value over the diagonal to ensure a positive value. Then it converts the matrix to a boolean one, so any cell with non-zero value turns into true.

Now, we can raise the matrix by powers of 2, until the number of true values does not change:

```components = adjacency.copy()
previous_nnz = 0
while previous_nnz != components.nnz:
previous_nnz = components.nnz
components **= 2
```

The code compares the property nnz, which is the number of non-zero values, to the previous iteration count. Oofficially, nnz it is the number of explicit values stored, but in our case, it equals the number of non-zeros). When it reaches a stable matrix, the components matrix contains the reachable components, meaning that components[i, j]=true iff exists a path from i to j.

### Querying nodes

Given a query with nodes i and j, two actions are required – check if i can reach j, and if so, compute the minimum cost of the path between them.

#### Reachability query

The format we’ve used for the components matrix is CSR, which is efficient for computing matrix powers, but, it does not allow constant time access for keys. Therefore, we would like to convert it into a more efficient representation, such as a set of pairs.

```components = components.tocoo()
reachable_pairs = set(zip(components.row, components.col))
```

Firstly, we convert the CSR matrix into a coordinates matrix, which is more iteration friendly. The row and col attributes are arrays containing the coordinates of the true stored values.

For example, if the matrix is:

```[[False True]
[True False]]```

Then:

```row = [0, 1]
col = [1, 0]```

This results with

`reachable_pairs = {(0, 1), (1, 0)}`

#### Path cost querying

Now that we have a quick way to check if a path exists, we can execute the dijkstra algorithm on the weights matrix:

```if (source, target) in reachable_pairs:
distances = dijkstra(weights, indices=[source])
distance = distances[0, target]
else:
distance = np.inf
```

The call to dijkstra returns a costs vector from source to all the nodes in the graph. In our case we expect the vast majority to be infinity.

## Summary

In this post, we’ve seen one approach to efficiently handle shortest path problem on very sparse graphs. We chose a representation based on sparse adjacency matrix and pre-computed all the connected components.

I will mention the benchmarks, ~500 times faster on 100K nodes with an average degree of 2, but please note that this is not enough to generalize. The main reason is that the problem introduced is too naive – no prior knowledge of the components and queries with a uniform distribution over the nodes. This is usually not the case in real life problems. Information about the components, or a distribution with a higher expectation of positive hits, might have changed the approach.

There are other approaches to optimize, yet, it is very encouraging to know how simple can solutions can be, ~20 lines of code, and very lightweight. It can be useful in many cases to translate problems into classic problems in graphs and take advantage of the existing theory and implementations. This is also true even when large amounts of data are involved.

# Python low-level AOP using AST rewriting – part II

In the previous post we started modifying functions body for AOP purposes. So far we’ve transformed a function to an AST node with extended code. In this post we will transform it to bytecode and replace the original method code with the new one.

As to this moment, we have these two:

1. original_function – the decorated function, without any modification.
2. modified_node – the modified node, with extended code and fixed locations.

## Compiling the node into a python module

Given the node, we can use Python built-in compile. Compile takes either string or AST node. The node is transformed into a module which we can later execute.

```def compile_node(modified_node):
compiled_method = compile(modified_node, '<string>', 'exec')
return compiled_method
```

The two arguments we send to compile are the source file name to which the method will be linked. In our case, we’re not going to keep the method compiled method, and only use its bytecode. Therefore, we can leave a default value ‘<string>’ which is commonly used. The second argument is ‘exec’, which signals the compiler that multiple statements may appear in the input code and that we don’t care about the returned value. This is a simplified explanation, but sufficient for this context.

## Getting replacement bytecode

Python code object does not allow setting the bytecode. Therefore we need to create a new code object, which is identical to the original code, other than the bytecode and stacksize:

```def create_replacement_code(original_function, compiled_function):
# these are the code object field names, ordered by the expected order of code object constructor
code_arg_names = ['co_argcount', 'co_nlocals', 'co_stacksize',
'co_flags', 'co_code', 'co_consts', 'co_names',
'co_varnames', 'co_filename', 'co_name',
'co_firstlineno', 'co_lnotab']
# fill all the args based on the original code
code_args = [getattr(original_function.func_code, key, None)
for key in code_arg_names]
# replace the bytecode and stacksize with the compiled bytecode and stacksize from the modified function
code = extract_code(compiled_function)
# replace bytecode and stacksize with the compiled method values
code_args[code_arg_names.index('co_code')] = code.co_code
code_args[code_arg_names.index('co_stacksize')] = code.co_stacksize
return CodeType(*code_args)
```

The instantiation of the code object is undocumented. Yet, some examples are available online. Please note that this signature fits python 2.7 and was changed in 3.0. Now we finished transforming a python code (which is the original function) into a new code object, containing the modified bytecode.

The last two methods calls can be extracted to:

```def create_func_code(original_function, modified_node):
compiled_method = compile_node(modified_node)
return create_replacement_code(original_function, compiled_method)
```

## Replacing the code

The last step required is to switch the function’s code into the modified one. Luckily, this step is very simple:

```def replace_code(original_function, code):
original_function.func_code = code
```

Now, when the decorated method is executed, it’ll contain the modified bytecode, which checks if variables used in comparisons are not None.

# The full code

```from _ast import BoolOp, And, Name, Compare, IsNot, Load, Num
from ast import NodeTransformer, fix_missing_locations
import ast
import inspect
from types import FunctionType, CodeType

class ComparisonTransformer(NodeTransformer):
def visit_Compare(self, node):
parts = [node.left] + node.comparators

# check if any constant number is involved in the comparison
if not any(isinstance(part, Num) for part in parts):
return node

# get all the "variables" involved in the comparison
names = [element for element in parts if isinstance(element, Name)]
if len(names) == 0:
return node

# create a reference to None
# create for each variable a node that represents 'var is not None'
node_verifiers = [Compare(left=name, ops=[IsNot()], comparators=[none]) for name in names]
# combine the None checks with the original comparison
# e.g. 'a &lt; b &lt; 1' --&gt; 'a is not None and b is not None and a &lt; b &lt; 1
return BoolOp(op=And(), values=node_verifiers + [node])

def rewrite_comparisons(original_function):
assert isinstance(original_function, FunctionType)

node = parse_method(original_function)
rewrite_method(node)
code = create_func_code(original_function, node)
replace_code(original_function, code)
return original_function

def replace_code(original_function, code):
original_function.func_code = code

def parse_method(original_function):
return ast.parse(inspect.getsource(original_function))

def rewrite_method(node):
# assuming the method has single decorator (which is the rewriter) - remove it
node.body.decorator_list.pop()
# we rename the method to ensure separation from the original one.
# this step has no real meaning and not really required.
node.body.name = 'internal_method'
# transform Compare nodes to fit the 'is not None' requirement
ComparisonTransformer().visit(node)
# let python try and fill code locations for the new elements
fix_missing_locations(node)

def create_func_code(original_function, modified_node):
compiled_method = compile_node(modified_node)
return create_replacement_code(original_function, compiled_method)

def compile_node(modified_node):
compiled_method = compile(modified_node, '&lt;string&gt;', 'exec')
return compiled_method

def extract_code(compiled_method):
exec compiled_method
generated_func = locals()['internal_method']
return generated_func.func_code

def create_replacement_code(original_function, compiled_function):
# these are the code object field names, ordered by the expected order of code object constructor
code_arg_names = ['co_argcount', 'co_nlocals', 'co_stacksize',
'co_flags', 'co_code', 'co_consts', 'co_names',
'co_varnames', 'co_filename', 'co_name',
'co_firstlineno', 'co_lnotab']
# fill all the args based on the original code
code_args = [getattr(original_function.func_code, key, None)
for key in code_arg_names]
# replace the bytecode and stacksize with the compiled bytecode and stacksize from the modified function
code = extract_code(compiled_function)
# replace bytecode and stacksize with the compiled method values
code_args
[code_arg_names.index('co_code')] = code.co_code
code_args[code_arg_names.index('co_stacksize')] = code.co_stacksize
return CodeType(*code_args)

@rewrite_comparisons
def foo(x):
return x < 1

print foo(None)
```

This code, will output the following, as expected:

`False`

To prove that this code is works in other context, the following code:

```@rewrite_comparisons
def bar(x, y, z, w):
if x < 1 or y < 2 or z < 3:
return 'Default behavior'
if w < 1:
return 'Expected behavior'
return 'Failure'

print bar(None, None, None, 0)
```

Prints the following result:

`Expected behavior`

# Summary

Python is amazingly flexible and using ~50 lines of code we can create a micro-framework to manipulate methods behavior. In addition, this flexibility allows code changes in many levels, including instructions level changes.

The vast majority of use cases are on the method level and not the instructions. Yet, this facilities can be useful in some cases an we should be able to take advantage of them.

# Python low-level AOP using AST rewriting – part I

This post and the next post will address AOP in Python. In general AOP in Python is very simple thanks to Python’s decorators. The aspects which we would like to apply in this post are low-level, meaning they’ll be applied on in-body instructions and not just on method level. The way in which we’re going to implement it will be using code weaving and rewriting.
I previously blogged about similar concept in .Net using Mono Cecil, where we tracked IL instructions.

The topic will be covered by two posts, where the first one will address rewriting code and the second one will deal with replacing the original code.

# Background

## Motivation

The general motivation for AOP is to separate the business logic from other functional logic, like logging, security or error handling. Most of the common examples fit the pattern of wrapping the function with new one. Then, perform logic before/after the method is executed. This is very useful, yet, limits our ability to change behavior of specific instructions inside the method which are relevant to the aspect.

## Example

During the post we will use a concrete simple example. Let us observe the following example (Python 2.7):

```def foo(x):
return x < 1

print foo(None)
```

As you probably know, this will print:

```True
```

This is a common Python (2.7) behavior but might not be intuitive. In general, assuming we had many variables and many comparisons, we’d like to change all to the pattern: VAR is not None and VAR < CONST

The goal of our process will be to transform the method to:

```def foo(x):
return x is not None and x &lt; 1
```

Where the aspect we’re applying is Update Comparison of None and Constants.

# The required steps

The steps required by this solution are the following:

1. Decorate the method – create an entry point for the mechanism which’ll apply the aspect.
2. Create an AST from the method –  prepare a modifiable syntax tree from the original method.
3. Rewrite the AST – find the instructions influenced by the aspect and modify them.
4. Create bytecode – create identical code to the original one other than newly generated bytecode.
5. Update the method – replace the original method code with the new one.

## Decorating the method

Like the common approach, we will use a decorator to modify the function. We will start from this simple decorator and build over it:

```def rewrite_comparisons(original_function):
assert isinstance(original_function, FunctionType)
return original_function

@rewrite_comparisons
def foo(x):
return x &lt; 1
```

This decorator does nothing, so far.

## Getting code from function

The first challenge is getting the method code from a function and make it modifiable. Since Python provides bytecode by default for a method, we will use built-in inspect to extract the original source code:

```function_source_code = inspect.getsource(original_function)
```

Inspect uses the code locations linked to the function and read them from the source file. The return value is a string with function. This is different from disassembling code from the method bytecode.

We can assume that for our functions the source code is available. Otherwise, this first step will fail, and the processing will need to be in bytecode level (which might be covered in other post). In addition, this constrains us to ensure decorator is called before any other decorator. Otherwise, previous decorators might be ignore since their effect is not reflected in the original source code.

## Building an AST (abstract syntax tree)

After the previous line of code extracted the source, we can parse it to an AST. The motivation for building an abstract syntax tree, is that it’s modifiable and we can compile is back to bytecode.

```function_source_code = inspect.getsource(original_function)
node = ast.parse(function_source_code)
```

The node we get is the root one of the parsed code. It links to all the elements in the hierarchy and represents a simplified module code.

Taking for example the foo function, the tree is:

```Module
# the method declaration (foo)
FunctionDef
# the arguments list (x)
arguments
Name
Param
# return instruction
Return
# comparison of two elements
Compare
Name
# comparison operator (<)
Lt
Num
```

The AST represents the function, while the decorator is omitted for simplicity. As can easily be seen, the tree represents all the content of the method, including declaration, other methods in context if there are and more. Given the AST, we’d like to modify it a fit the need that our aspect requires.

## Transforming the AST

### AST visitors

We will use the AST visitors as an introduction to syntax tree traversal. The node visitors follow a convention where callback names are of pattern visit_NODETYPE(self, node), where node type can be any these. For example, if we want a callback on method calls, we can define one for the Call node and name it visit_Call(self, node).

In our example, we can visit the compare nodes, and print all the operands:

```from ast import NodeVisitor

class ComparisonVisitor(NodeVisitor):
def visit_Compare(self, node):
operands = [node.left] + node.comparators
print '; '.join(type(operand).__name__ for operand in operands)
```

For every callback, we are assured the type of the node fits the Compare node type. Given the type, we can investigate it’s members. Comparison in Python is composed of operators (one or more) and operands (two or more). In the case of Compare node, the first operand is called left, and the rest are called comparators. One of the reason for the complicated structure is to support expressions like:

`0 &lt; x &lt; 100`

Using the visitor we can query the nodes, but not modify them. If we visit the the original foo function:

```&lt;/pre&gt;
node = ast.parse(inspect.getsource(foo))
ComparisonVisitor().visit(node)
```

The result we expect is:

`Name; Num`

Since comparison is x < 1, where x is Name load in the context and 1 is a Constant Number in the context.

### AST transformers

Python provides transformers, which are a special type of AST visitors. The transformers, in contrast to nodes visitors,  modify the nodes they visit. In our example, we’ll look for nodes that represent comparison between variables and numbers, and then extend them to comply with the aspect.

```from ast import NodeTransformer
from _ast import BoolOp, And, Name, Compare, IsNot, Load, Num

class ComparisonTransformer(NodeTransformer):
def visit_Compare(self, node):
parts = [node.left] + node.comparators

# check if any constant number is involved in the comparison
if not any(isinstance(part, Num) for part in parts):
return node

# get all the "variables" involved in the comparison
names = [element for element in parts if isinstance(element, Name)]
if len(names) == 0:
return node

# create a reference to None
# create for each variable a node that represents 'var is not None'
node_verifiers = [Compare(left=name, ops=[IsNot()], comparators=[none]) for name in names]
# combine the None checks with the original comparison
# e.g. 'a &lt; b &lt; 1' --&gt; 'a is not None and b is not None and a &lt; b &lt; 1
return BoolOp(op=And(), values=node_verifiers + [node])
```

This chunk of code is a simplified (relaxed type input checks no attempts to code location fixes) version of a transformer that visits all nodes of type Compare. The transformer methods names use the same convention as the visitors.

According to the original behavior a new node is being built. This node is a new Boolean expression, which requires all the variables in use to be not None and to satisfy the original comparison.

If we’d look at the output, the the AST will be modified and verify variables are not None before they’re compared to None. The out tree for the modified foo is:

```Module
# the method declaration (foo)
FunctionDef
# the arguments list (x)
arguments
Name
Param
# return instruction
Return
# the bool expression that combines with And:
# 1. the original comparison
# 2. the new check 'VAR is not None'
BoolOp
And
# the 'x is not None' comparison
Compare
Name
IsNot
Name
# the original comparison 'x < 1'
Compare
Name
Lt
Num
```

## Prepare the node forrecompilation

In the next phase, we’re going to import the new code as temporary module, which will case the declaration of the new method to be executed again. In order to do so, we’d like to remove the rewriter decorator, since we don’t want it to process the modified function. In addition, we rename the function for safety to avoid collisions between the declared function and other locals.  Lastly, we ask python to fix code locations for the new nodes so they can be compiled later on. This is done using fix_missing_locations.

```from ast fix_missing_locations

def rewrite_method(node):
# assuming the method has single decorator (which is the rewriter) - remove it
node.body.decorator_list.pop()
# we rename the method to ensure separation from the original one.
# this step has no real meaning and not really required.
node.body.name = 'internal_method'
# transform Compare nodes to fit the 'is not None' requirement
ComparisonTransformer().visit(node)
# let python try and fill code locations for the new elements
fix_missing_locations(node)
```

# Summary

During the first phase we got as an input a function (through a decorator), then modified it’s body by visiting it’s body using a syntactic level. Lastly, we modified it’s declaration and source locations so it can be safely imported as a new function.

As you probably notice, the only part in this code which is concerned by the aspect is the transformer. Meaning, if we’d like to apply a different aspect the only part which’ll change is the transformer. In our example the ComparisonTransformer is hard-coded for simplicity, but in real solution we’d provide it as an argument to the decorator.

## Next phase

In the next phase we’ll use the modified function to generate replacement bytecode.

# QuickSelect in CoffeeScript

QuickSelect is a known and simple algorithm for finding the kth smallest element in an array. The advantages of this algorithm are its linear average performance and the constant memory it requires, in addition to simple implementation.

The simplest solution is of course sorting the array using a built-in method of any language and then accessing the kth element in the sorted array. But, it may become an issue when working with large sets – O(n*log n) vs. O(n).

## Implementation

``class QuickSelect  constructor: (@arr) ->  kth: (k) ->    @_select(0, @arr.length - 1, k)  median: ->    @kth Math.floor(@arr.length / 2)  _swap: (i, j) ->    tmp = @arr[i]    @arr[i] = @arr[j]    @arr[j] = tmp  _partition: (left, right, pivotIndex) ->    pivotValue = @arr[pivotIndex]    @_swap pivotIndex, right    storeIndex = left    for i in [left..right]      if @arr[i] < pivotValue        @_swap storeIndex, i        storeIndex++    @_swap right, storeIndex    return storeIndex  _choose_random_pivot: (left, right) ->    left + Math.floor(Math.random() * (right - left + 1))  _select: (left, right, k) ->    if left == right      return @arr[left]    while true      pivotIndex = @_choose_random_pivot left, right      pivotIndex = @_partition(left, right, pivotIndex)      if k == pivotIndex        return @arr[k]      else if k < pivotIndex        right = pivotIndex - 1      else        left = pivotIndex + 1``

## Usage examples

``describe "QuickSelect", ->  describe "1 item", ->    selector = new QuickSelect     it "Should return the item", -> selector.kth(0).should.equal(1)  describe "2 items", ->    selector = new QuickSelect [2,1]    it "Min should be 1", -> selector.kth(0).should.equal(1)    it "Max should be 2", -> selector.kth(1).should.equal(2)  describe "3 items", ->    selector = new QuickSelect [2,1,3]    it "Min should be 1", -> selector.kth(0).should.equal(1)    it "Second should be 2", -> selector.kth(1).should.equal(2)    it "Max should be 3", -> selector.kth(2).should.equal(3)``

These simple tests explains how the access to kth element is done. The tests are written also in CoffeeScript.

## Performance comparison

Measuring time for 10,000 arrays, each accessed once at random location:
 Array size 5,000 10,000 20,000 QuickSelect 1758ms 3393ms 6912ms Sort 14302ms 32136ms 70566ms
As can be easily seen, the access to the kth smallest element is much faster. In cases where large arrays are used and repeated accesses are performed, this could be a bottleneck which can be easily solved.

# Improving performance using page guards

The problems we’re facing today is, a little bit, unique. Given:

• n contiguous arrays
• Each array has m cells
• Each cell is a Boolean flag

We receive a stream of signals, each signal is an absolute offset from the first array. For each signal we need to set the correct flag AND the first flag of the array. The motivation for setting the first flag is to enable quick filtering of arrays having some flags set.
For example, we have a usage tracking system for n websites and m users. If user i visited website j we’d like to signal that by setting the ith flag in the jth array. After some time, we’d like to query which sites had any visit and who visited them.

## The intuitive solution

Assuming you don’t care too much for the performance the solution is straight forward. Whenever setting a flag in an array set also the array in offset 0. If the input is index, then the array index is index / m and the item index is index % m. Pretty simple. For simplicity the source of indexes will be an array named items and the address of the first array will be baseAddress:

`for (int i = 0; i < numOfItems; ++i){    char* hitAddress = baseAddress + items[i];    *hitAddress = 1;    char* blockStartAddress = hitAddress - (hitAddress - baseAddress) % dwPageSize;    *blockStartAddress = 1;}`

It is clear that the first action, *hitAddress = 1, is impossible to avoid. But, what about the set of the signal at index 0? We can replace it with a condition but it is clear it won’t affect much the performance. So, how can we improve that part?

## Enabling page guards

Windows provides several memory protections, one of them is the page guard. When allocating a new memory scope we can declare it as protected. Defining it as protected means that each page (page is an arbitrary partition of the memory based on OS page size) will throw an exception on the first access to it. After throwing the exception the protection is removed. We would like to use this mechanism to avoid re-setting the flag at index 0.

In order to define such a scope, we will use the VirtualAlloc method:

`VirtualAlloc(NULL, TOTAL_SIZE,             MEM_RESERVE | MEM_COMMIT,              PAGE_READWRITE | PAGE_GUARD)`

It returns a pointer to the memory scope with size of TOTAL_SIZE in bytes. If a page size P then the new scope has TOTAL_SIZE / P pages.

## Tracking page hit

As mentioned, at the first time the memory inside a page is accessed an exception is being thrown. We would like to catch it. In order to do so in the fastest way, we will use windows SetUnhandledExceptionFilter API. The filter is a simple method receiving the exception information and deciding how to treat it. Treating it has three options:

1. Handling it
2. Handling it and continue the code execution
3. Pass the decision to other handler

As a simple filter we can request the runtime to ignore all page guards exceptions:

`LONG WINAPI SmartFilter(_EXCEPTION_POINTERS *ep){    if (ep->ExceptionRecord->ExceptionCode != STATUS_GUARD_PAGE_VIOLATION)    {        return EXCEPTION_CONTINUE_SEARCH;    }    return EXCEPTION_CONTINUE_EXECUTION;}`

So after setting it as the filter all page guards exceptions will be ignored:

`SetUnhandledExceptionFilter(&SmartFilter);`

## Extending the exceptions filter logic

Let’s assume that all arrays are smaller than the opration system page size and we’ll assume that we don’t care about reserving extra space to pad each array. We’ll denote the page size with dwPageSize.
Now, we can make our SmartFilter really smart. We will add to it the logic for setting the first flag on each array. Assuming baseAddress is some global variable:

`LONG WINAPI SmartFilter(_EXCEPTION_POINTERS *ep){    if (ep->ExceptionRecord->ExceptionCode != STATUS_GUARD_PAGE_VIOLATION)    {        return EXCEPTION_CONTINUE_SEARCH;    }    char* hitAddress = (char*)ep->ExceptionRecord->ExceptionInformation;    char* blockStartAddress = hitAddress - (hitAddress - baseAddress) % dwPageSize;    *blockStartAddress = 1;    return EXCEPTION_CONTINUE_EXECUTION;}`

We extract the exact address being touched by accessing ep->ExceptionRecord->ExceptionInformation. Through it it’s easy to get the start address of the page. When having this filter method registered we can be sure that whenever we set a flag in the array the first flag will be set too.
Now, we can alter the original code which was in charge of setting the first signal whenever a flag was set:

`for (int i = 0; i < numOfItems; ++i){    char* hitAddress = baseAddress + items[i];    *hitAddress = 1;}`

## Comparing the results

In order to make our comparison interesting let’s assume that we have 10000 arrays (websites in the tracking system) and each array has 25000 flags (users for example). In order to make it intense we’ll assume that during a short period 10% of the arrays were visited, for exmaple having 250000000 signals sent through the stream (repeating actions in a website by same users are allowed). On average, the time it took to run:

 Seconds Straight forward 1.744 Page guards 0.78

As can easily seen, the page guards solution saves ~50% of the runtime.

## Conclusion

The operating system provides a few very fast facilities which can be exploited. Even though most of those facilities are designed for different purposes they can still be useful in different cases, like the one here requiring single time signal for a scope. Since Windows puts a lot of focus in being backward compatible, those exploitations are not too risky. As usual – if it doesn’t require performance optimization, don’t do it. The price of maintaining the code might not worth it.

# Conditional attribute and arguments evaluation

## What is the conditional attribute?

The conditional attribute enables including/omitting methods calls during compilation depending on compilation symbols. For example, we can condition that a specific log method calls will be included only when we compile in debug. The compiler in this case will omit the calls to the method. Looking at the next code:

`public class Logger{    [Conditional("DEBUG")]    public void LogDebugMessage(string str)    {    }}`

And the code calling it:

`class MyClass{    private readonly Logger logger = new Logger();    public void Foo()    {        logger.LogDebugMessage("Foo");    }}`

We expect the compiler to omit the body of Foo(). As we can see with a disassembler this is exactly what happens:

`.method public hidebysig instance void  Foo() cil managed{  // Code size       1 (0x1)  .maxstack  8  IL_0000:  ret} // end of method MyClass::Foo`

## How method arguments are treated?

### Temp variable assignment optimization

Regardless the conditional attribute, in release mode the compiler performs many optimization which one of them is skipping local variable assignment. You’re most likely to notice it when you assign a value into a local variable and pass it to a method as an argument (while this is the only variable usage). For example:

`public void Foo(){    var foo = "Foo";    logger.LogDebugMessage(foo);}`

Translates into:

`0527C028  mov         edx,dword ptr ds:[2EE78B0h]  	// Load the string address0527C02E  mov         ecx,dword ptr [ecx+4]             // Load the logger instance0527C031  cmp         dword ptr [ecx],ecx  		// Null check0527C033  call        dword ptr ds:[50C5BD8h]  		// Call LogDebugMessage0527C039  ret  `

While:

`public void Foo(){    logger.LogDebugMessage("Foo");}`

Translates into:

`0552C028  mov         ecx,dword ptr [ecx+4]  		// Load the logger instance0552C02B  mov         edx,dword ptr ds:[31478B0h]  	// Load the string address0552C031  cmp         dword ptr [ecx],ecx  		// Null check0552C033  call        dword ptr ds:[5375C30h]  		// Call LogDebugMessage0552C039  ret  `

Which are basically the same. So in case we’re not using the conditional attribute we shouldn’t care about local assignment. We can expect to have no difference in runtime.

### Temp variable sent to omitted call optimization?

So an interesting question is what happens to an argument we’re about to send to a conditional method? If call to LogDebugMessage are omitted, what should we expect in this case:

`public void Foo(){    var method = MethodBase.GetCurrentMethod().Name;    logger.LogDebugMessage(method);}`

And in this case:

`public void Foo(){    logger.LogDebugMessage(MethodBase.GetCurrentMethod().Name);}`

The answer can be easily found by looking at the methods IL. The first version with temp assignment to a variable compiles into:

`.method public hidebysig instance void  Foo() cil managed{  // Code size       12 (0xc)  .maxstack  8  IL_0000:  call       class [mscorlib]System.Reflection.MethodBase [mscorlib]System.Reflection.MethodBase::GetCurrentMethod()  IL_0005:  callvirt   instance string [mscorlib]System.Reflection.MemberInfo::get_Name()  IL_000a:  pop  IL_000b:  ret} // end of method MyClass::Foo`

While the second version compiles into:

`.method public hidebysig instance void  Foo() cil managed{  // Code size       1 (0x1)  .maxstack  8  IL_0000:  ret} // end of method MyClass::Foo`

As we can see, in this case the argument was not even evaluated and the whole statement was omitted from the IL. Meaning that in this case, inlining the variable would have influence on the performance. It didn’t happen by chance, this is the defined behavior of the compiler as stated in the Conditional attribute documentation:

“If the symbol is defined, the call is included; otherwise, the call (including evaluation of the parameters of the call) is omitted.”

## Conclusion

The most common scenario in which the conditional attribute is involved is logging. Since the main advantage of omitting the logs is usually to avoid performance hit in production it is important to take into consideration the price of evaluating the arguments values. The simplest solution is to inline the variable. This can be done easily when the argument is string.Format() or similar. In case it is more complicated or unreadable it can always be solved by preprocessor directive such as #if.