Easy Wins Optimizations – HashSet

HashSet is a data structure introduced in .NET Framework 3.5. Its main advantage is that it adds and looks up items with very low complexity (amortized constant time). We'll try to see how much faster than List it can be. In the example we'll go over code that stores all historical lottery results and checks whether a given result appears among the historical ones.

Registering results

public abstract class ResultsRegistrarBase
{
    protected ICollection<long> codes;

    public void AddResult(int num1, int num2, int num3, int num4, int num5, int num6)
    {
        long code = ConvertToCode(num1, num2, num3, num4, num5, num6);

        codes.Add(code);
    }

    public bool ContainsResult(int num1, int num2, int num3, int num4, int num5, int num6)
    {
        long code = ConvertToCode(num1, num2, num3, num4, num5, num6);

        return codes.Contains(code);
    }

    // Packs the six (sorted) numbers into a single long, two decimal digits per number.
    private static long ConvertToCode(int num1, int num2, int num3, int num4, int num5, int num6)
    {
        return num1 * (long) Math.Pow(10, 0) + num2 * (long) Math.Pow(10, 2) +
               num3 * (long) Math.Pow(10, 4) + num4 * (long) Math.Pow(10, 6) +
               num5 * (long) Math.Pow(10, 8) + num6 * (long) Math.Pow(10, 10);
    }
}

This class is in charge of storing and finding the results. It assumes it gets the numbers sorted, and the ConvertToCode method maps all the numbers to one “big” number. For example, 1,5,20,30,45,49 will be mapped to 494530200501. The codes collection will be initialized in each derived class:
The naive code:

public class NaiveResultsRegistrar : ResultsRegistrarBase
{
    public NaiveResultsRegistrar()
    {
        codes = new List<long>();
    }
}

The code based on HashSet:

public class HashSetResultsRegistrar : ResultsRegistrarBase
{
    public HashSetResultsRegistrar()
    {
        codes = new HashSet<long>();
    }
}
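
Before measuring anything, here is a minimal usage sketch of the registrars (the draws below are arbitrary, and the snippet assumes top-level statements with the usual using directives):

// Either registrar works behind the shared base-class API.
ResultsRegistrarBase registrar = new HashSetResultsRegistrar();

registrar.AddResult(1, 5, 20, 30, 45, 49);
registrar.AddResult(3, 7, 11, 22, 38, 40);

Console.WriteLine(registrar.ContainsResult(1, 5, 20, 30, 45, 49)); // True
Console.WriteLine(registrar.ContainsResult(2, 6, 21, 31, 46, 48)); // False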

How much faster is it than the naive code?

The difference is huge. In a simple case where we have 250,000 historical results and we query the collection 50,000 times, the average running time of the naive code is 1:28 minutes, while the HashSet-based code takes only 0.2 seconds. The HashSet code ran ~360 times faster than the naive code. The reason for this major difference is that List has to iterate over all of its items in order to determine whether it contains the requested item. HashSet uses a hash function which, without getting into implementation details, lets the Contains method inspect only a handful of entries in order to determine whether or not an item is contained in the collection.
This major improvement was achieved with a minimal change to the code. So in cases where we repeatedly query a large collection for membership, using HashSet can give us a major running-time improvement with minimal change.
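
For reference, a rough Stopwatch-based harness along these lines can reproduce the comparison; the draw generation below is illustrative and is not the original benchmark code:

using System;
using System.Diagnostics;

var random = new Random(42);
var registrar = new HashSetResultsRegistrar(); // swap in NaiveResultsRegistrar to compare

for (int i = 0; i < 250_000; i++)
{
    int[] draw = RandomSortedDraw(random);
    registrar.AddResult(draw[0], draw[1], draw[2], draw[3], draw[4], draw[5]);
}

var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 50_000; i++)
{
    int[] query = RandomSortedDraw(random);
    registrar.ContainsResult(query[0], query[1], query[2], query[3], query[4], query[5]);
}
stopwatch.Stop();

Console.WriteLine($"50,000 lookups took {stopwatch.Elapsed}");

// Helper: six numbers in the 1-49 range, sorted (duplicates don't matter for timing purposes).
static int[] RandomSortedDraw(Random random)
{
    var draw = new int[6];
    for (int i = 0; i < draw.Length; i++)
    {
        draw[i] = random.Next(1, 50);
    }
    Array.Sort(draw);
    return draw;
}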

Easy Wins Optimizations – Sorted List

This is the first post in the easy-wins optimizations series. The need for these posts comes from the fact that most of us hardly ever need them 🙂 Sounds weird? It should. What it means is that most of the time we don't need optimization; the most naive code does the job for us. When we do encounter a case that requires optimization, we hardly remember the simplest ways of doing it.
In this post on sorted lists we'll go over an example of optimizing a simple case.

How to find a person in a collection whose age is closest to mine?

Assume we have a person class with age defined on it:

public class Person
{
    public double Age { get; private set; }

    public Person(double age)
    {
        Age = age;
    }
}

What we want is to collect many people into a list and, whenever we are asked about an age, find the person whose age is closest to it. We can think of it as an application for a dating service that matches people by age.

The naive solution

So what can be simpler than collecting all the people in a list? Probably nothing; this is how we do it:

public class NaivePersonFinder : IPersonFinder
{
    private readonly List<Person> persons = new List<Person>();

    public void Add(Person person)
    {
        persons.Add(person);
    }

    public Person FindPeopleInRange(double requiredAge)
    {
        Person closestPerson = null;
        double smallestDifference = double.MaxValue;
        foreach (var person in persons)
        {
            double difference = Math.Abs(person.Age - requiredAge);
            if (difference < smallestDifference)
            {
                smallestDifference = difference;
                closestPerson = person;
            }
        }

        return closestPerson;
    }
}

The code iterates over the collection in the most straightforward way there is and updates the closest person whenever it encounters a closer one. Simple, isn't it?

But I need faster results!

The sorted list solution
How does a sorted list help us here? A sorted list allows us to search items using binary search. Binary search saves a lot of time since it has to examine very few items in the collection in order to find the closest one. For example, to find the closest item in a collection of 1,000 items it only has to look at ~10 of them, saving the iteration over the whole 1,000.
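
As a rough sanity check on that claim, the number of comparisons binary search needs grows with the base-2 logarithm of the collection size:

// Binary search needs roughly log2(n) comparisons per lookup.
Console.WriteLine(Math.Ceiling(Math.Log(1_000, 2)));     // 10
Console.WriteLine(Math.Ceiling(Math.Log(100_000, 2)));   // 17
Console.WriteLine(Math.Ceiling(Math.Log(1_000_000, 2))); // 20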

Some code

public class SortedPersonFinder : IPersonFinder
{
    private readonly SortedList<double, Person> persons = new SortedList<double, Person>();

    public void Add(Person person)
    {
        // Note: SortedList.Add throws if two people share exactly the same age (duplicate key).
        persons.Add(person.Age, person);
    }

    public Person FindPeopleInRange(double requiredAge)
    {
        if (persons.ContainsKey(requiredAge))
        {
            return persons[requiredAge];
        }

        IList<double> sortedAges = persons.Keys;

        int index = BinarySearchIndex(requiredAge);

        // Compare the person at the found index with its immediate neighbors
        // and return the one whose age is closest to the requested age.
        var closestPersons = new List<Person>(3);

        if (index > 0)
        {
            closestPersons.Add(persons[sortedAges[index - 1]]);
        }

        if (index < sortedAges.Count - 1)
        {
            closestPersons.Add(persons[sortedAges[index + 1]]);
        }

        closestPersons.Add(persons[sortedAges[index]]);

        return closestPersons.OrderBy(person => Math.Abs(person.Age - requiredAge)).First();
    }

    private int BinarySearchIndex(double requiredAge)
    {
        var sortedAges = persons.Keys;

        var scopeStart = 0;
        var scopeEnd = sortedAges.Count - 1;

        int index = 0;
        while (scopeEnd > scopeStart)
        {
            index = (scopeStart + scopeEnd) / 2;
            double ageAtIndex = sortedAges[index];
            if (requiredAge > ageAtIndex)
            {
                scopeStart = index + 1;
            }
            else
            {
                scopeEnd = index;
            }
        }
        return index;
    }
}

The code is more complex than the naive version, but it finds the results much faster. It uses the .NET SortedList collection, which keeps a sorted collection of keys up to date whenever items are added or removed. Since the keys stay ordered, we can easily access them and binary-search them.
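
Here is a minimal usage sketch (assuming the IPersonFinder interface simply exposes Add and FindPeopleInRange; the ages are arbitrary, for illustration only):

var finder = new SortedPersonFinder();
finder.Add(new Person(23));
finder.Add(new Person(31.5));
finder.Add(new Person(42));

Person match = finder.FindPeopleInRange(30);
Console.WriteLine(match.Age); // 31.5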

How much faster is it?

With a collection of 100,000 people and 50,000 queries, the sorted list solution runs much faster than the naive one:
The naive solution took ~100 seconds on average.
The sorted list solution took ~6 seconds on average. This is a major improvement; it ran more than 15 times faster than the naive solution.

Conclusion

If we have a collection on which an order can be defined AND we also need quick answers to queries based on that order, a sorted list can easily improve running time. We must always remember that running time is usually negligible, even with the most naive code. If no optimization is required, don't optimize; more is gained from simple and maintainable code.