Avoidable Collection Mistakes in C#

With the dawn of technologies such as Linq, the importance of selecting the proper collection class has seemingly dwindled. I might even argue that these decisions are being neglected. It’s my hope, that upon reading this article you’ll more carefully contemplate the selection and usage of C# collection classes in your code.

Some examples I cover are obvious mistakes. Others, might be better classified as code smells. Either way, I suspect even seasoned developers such as yourself will find cases which apply to you or the code you maintain.

Case Studies Covered

Case 1: Indexes
Case 2: Misuse
Case 3: Normalization
Case 4: Intent
Case 5: Threading

Case 1: Indexes

Lets start with what I consider the most dubious case. This monster slips past the most experienced .NET developers out there.

Can you spot the problem?

Dictionary<int, string> source = new Dictionary<int, string>
{
  [0] = "A",
  [1] = "B",
  [2] = "C",
  [3] = "D",
  [4] = "E"
};

int expected = 0;
foreach (KeyValuePair<int, string> kvp in source)
{
  Assert.AreEqual(expected, kvp.Key);
  expected++;
}

Besides the fact that enumerating over a dictionary potentially defeats the purpose of using one, the Dictionary<> class neither maintains nor guarantees insert order. Sure, it’ll work as expected much of the time. But eventually it’ll bite you. Especially when elements are later removed/added.

There are various ways to fix the problem. One fix might look like:

Dictionary<int, string> source = new Dictionary<int, string>
{
  [0] = "A",
  [1] = "B",
  [2] = "C",
  [3] = "D",
  [4] = "E"
};

KeyValuePair<int, string>[] ordered = source
  .OrderBy(k => k.Key) // the fix
  .ToArray(); // insert order is maintained in an array

int expected = 0;
foreach (KeyValuePair<int, string> kvp in ordered)
{
  Assert.AreEqual(expected, kvp.Key);
  expected++;
}

A better solution might use a more appropriate collection class:

SortedDictionary<int, string> source = new SortedDictionary<int, string>
{
  [0] = "A",
  [1] = "B",
  [2] = "C",
  [3] = "D",
  [4] = "E"
};

int expected = 0;
foreach (KeyValuePair<int, string> kvp in source)
{
  Assert.AreEqual(expected, kvp.Key);
  expected++;
}
Problem Reviewed

Not all collection classes maintain the insert order. And it’s not always obvious without a deep understanding of the technology behind the class.

Lessons Learned

If insert order matters, make sure you’re using a collection type that maintains it. Arrays and Lists are examples of collections that maintain the insert order.

Case 2: Misuse

For most of you, this example is going to be painfully obvious. In practice, it shows up more than we might care to admit:

HashSet<int> source = new HashSet<int>
{
  0,
  1,
  2,
  3
};

int target = 1;
bool exists = source.Any(e => e == target);
Assert.True(exists);

By enumerating the HashSet, we’re completely bypassing its ~0(1) nature. I mentioned Linq in the opening sentence of this article for good reason. We’ve become so accustomed to leveraging Linq extensions that we apply them without thinking.

The obvious solution in this case is to rely on native HashSet functions.

HashSet<int> source = new HashSet<int>
{
  0,
  1,
  2,
  3
};

int target = 1;
bool exists = source.Contains(target);
Assert.True(exists);

How about something more subtle? Examine the following.

HashSet<int> source = new HashSet<int>
{
  0,
  1,
  2,
  3
};

int count = source.Count(); // why not use the native Count property?
Assert.AreEqual(4, count);

While debatable, I would prefer to make use of the native Count property instead of the Count() extension method. There is overhead when using the extension method version, and it executes slower as a result. Albeit probably not enough to be an actual problem in most cases (always use a profiler to be sure).

I used a HashSet to illustrate the problem, but this applies to all collection classes.

Problem Reviewed

Collection classes are often optimized for a specific purpose. Using them outside of that purpose can lead to issues. In the example provided above, Linq extension methods are not concerned with these optimizations and can be significantly slower than their native counterparts.

Lessons Learned

Where performance matters, prioritize native methods and properties over Linq.

Case 3: Normalization

Examine the following:

Dictionary<string, int> source = new Dictionary<string, int>
{
  ["A".ToUpperInvariant()] = 0,
  ["B".ToUpperInvariant()] = 1,
  ["C".ToUpperInvariant()] = 2,
  ["D".ToUpperInvariant()] = 3,
  ["E".ToUpperInvariant()] = 4
};

string key = "a";
bool exists = source.ContainsKey(key.ToUpperInvariant());
Assert.True(exists);

Imagine that the dictionary values are added and looked up using user input at runtime. The ToUpperInvariant() code (what I’m calling normalization) is prone to bugs. All it takes is one location in code where the key value isn’t normalized before adding or doing a lookup. Often, these bugs get created during maintenance or other bug fixes. It might not be clear to the person maintaining the code that they need to normalize. If external code modifies the dictionary (in a case where a class exposes the dictionary to be modified) the possibilities for error are further increased.

Instead, do this:

Dictionary<string, int> source = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase)
{
  ["A"] = 0,
  ["B"] = 1,
  ["C"] = 2,
  ["D"] = 3,
  ["E"] = 4
};

string key = "a";
bool exists = source.ContainsKey(key);
Assert.True(exists);

This eliminates all of those potential bugs. Note that this isn’t restrained to keys being strings. You can implement your own IEqualityComparer<> and pass that into a dictionary constructor to provide your own centralized behavior.

On a related note, .NET doesn’t yet provide a class like this out of the box. If you are passing such a dictionary between interfaces, it would be a good idea to implement your own case insensitive dictionary class and use that instead of the more generic IDictionary<> or IReadOnlyDictionary<>. The benefit is that it communicates intent. Consumers of the interface will know that case doesn’t matter and they won’t waste time trying to normalize the case.

For example:

public CaseInsensitiveDictionary<int> GetData();
public void SetData(CaseInsensitiveDictionary<int> data);

A working implementation:

public class CaseInsensitiveDictionary<T> : Dictionary<string, T>
{
  public CaseInsensitiveDictionary(IEqualityComparer<string> comparer = null)
    : base(comparer ?? StringComparer.InvariantCultureIgnoreCase)
  {
  }
}
Problem Reviewed

Disjointed normalization of key values leads to bugs.

Lessons Learned

Centralize normalization in a dictionary or hashset by using an IEqualityComparer<> or create specialized classes.

Case 4: Intent

In the last example I mentioned the use of a specialized interface to communicate intent. This should be applied where reasonable within your code base.

Examine the following:

public class MyService
{
  public MyService()
  {
    this.Data = new List<int>();
  }

  public List<int> Data { get; private set; }

  // Lots of other methods that manipulate and otherwise touch Data
  // ...
}

class Program
{
  static void Main()
  {
    MyService service = new MyService();
    service.Data.Add(-1);
  }
}

While uncommon for someone with modest experience, It’s not unheard of for a developer to satisfy themselves by saying “Hey, I made the setter for Data private, I’m covered.” From my perspective, I consider this one of few true cardinal sins.

The interface should make it explicitly clear what can and cannot be done to a property exposing a collection. And no, using IList instead of List doesn’t help communicate intent. That’s a separate discussion.

If Data is intended to be modified outside of the class, then fine, there’s the answer. But when it’s not intended to be modified make your intent perfectly clear by eliminating any ambiguity.

For example:

public class MyService
{
  private List<int> data;

  public MyService()
  {
    this.data= new List<int>();
  }

  public IReadOnlyCollection<int> Data
  {
    get
    {
      return this.data;
    }
  }

  // Lots of other methods that manipulate and otherwise touch data
  // ...
}

Any readonly collection type says “This can’t be added to or removed from, so don’t even think about it”. Some of you super smart developers have immediately spotted an “Aha!”

static void Main()
{
  MyService service = new MyService();
  List<int> cheater = (List<int>)service.Data;
  cheater.Add(-1);
}

My response would be “If you’ve got developers doing that, you have bigger problems to worry about”. The problem we’re solving here is communicating intent. Don’t waste your time worrying about miscreant activities unless there is real business value in doing so (commercial API’s might make sense)

The same communication of intent applies to function parameters:

public void SetValues(Dictionary<int, string> source)
// vs.
public void SetValues(IReadOnlyDictionary<int, string> source)
public Dictionary<int, string> GetValues()
// vs.
public IReadOnlyDictionary<int, string> GetValues()

In the first example, I might assume that SetValues will modify my collection and pass in a copy instead (thus demonstrating the importance of intent). With the second example, I’ll pass my dictionary right in without concern of modifications.

I’ll end with one more situation where communicating intent is useful:

private int[] GetValuesByPriority(IReadOnlyDictionary<int, Priority> source, Priority priority)
{
  List<int> result = source
    .Where(k => k.Value == priority)
    .Select(k => k.Key)
    .ToList();

  // a page or two of code...
  // ...
  // ...

  return result.ToArray();
}

Let’s assume that result gets passed in or out of various functions, and there are also direct references to it elsewhere in the function. As someone coming in to maintain the code, if I see “List<>” I am going to immediately assume that it might be modified before we hit the return at the end. So I’m going to spend my time studying all references to result to make sure I understand how it’s used.

If however, the function looked like this…

private int[] GetValuesByPriority(IReadOnlyDictionary<int, Priority> source, Priority priority)
{
  int[] result = source
    .Where(k => k.Value == priority)
    .Select(k => k.Key)
    .ToArray();

  // a page or two of code...
  // ...
  // ...

  return result;
}

I can immediately interpret that result is read only and it is most likely not modified anywhere else. I say most likely because it could always be new’d again but hopefully that’s not something people are doing.

Problem Reviewed

Implying that a collection can be modified causes confusion, leads to unnecessary work, is a common source of bugs, and masks design issues.

Lessons Learned

Make the intent clear by using read only collections within function parameters, function return values, and implementation within functions.

Case 5: Threading

Examine the following code:

static void Main()
{
  int totalThreads = 5;
  List<int> values = new List<int>();

  List<Task> tasks = new List<Task>();
  for (int x = 0; x < totalThreads; x++)
  { 
    tasks.Add(Task.Factory.StartNew(() =>
    {
      values.Add(x * y); // not thread safe
    }));
  }

  Task.WaitAll(tasks.ToArray());

  Assert.AreEqual(totalThreads, values.Count);
}

Many collection classes are not thread safe. But we commonly forget or are not aware of the complexities involved. Behind the scenes, the size of the list is allocated dynamically (based largely on how we constructed it). If two or more threads are causing the List size to expand, we end up with Add throwing an IndexOutOfRangeException. Assuming we’re lucky enough to avoid that exception, we might instead end up with fewer elements in the list than we expect.

The cheap fix is to just lock around the add, something like this:

lock(values)
{
  values.Add(x * y);
}

In practice, this is problematic. Especially if that list is accessed in multiple places, or even worse, exposed outside of the owning class. It’s too easy to get this wrong. Future changes to the class are prone to introduce concurrency bugs.

The better option is to use a thread safe class. This goes back to intent, but also forces a correct implementation. .NET comes with quite a few that might meet your requirements such as the ConcurrentQueue class:

static void Main()
{
  int totalThreads = 5;
  ConcurrentQueue<int> values = new ConcurrentQueue<int>();

  List tasks = new List();
  for (int x = 0; x < totalThreads; x++)
  { 
    tasks.Add(Task.Factory.StartNew(() =>
    {
      values.Enqueue(x * y); // thread safe
    }));
  }

  Task.WaitAll(tasks.ToArray());

  Assert.AreEqual(totalThreads, values.Count);
}
Problem Reviewed

Many of the .NET collection classes are not thread safe.

Lessons Learned

Don’t assume a class is thread safe. If your code is thread sensitive, use a collection class supporting threading. Trying to work around a collection that is not thread safe by using lock/mutex is prone to error. Try to avoid it if at all possible.

In Closing

I hope you picked up something useful from reading this article. For a lot of people, these things are intuitive. In some professional environments these issues show themselves daily. Always strive to prevent issues before they happen. Following the advice within aids intuitive code that’s less prone to being used incorrectly.

One thought on “Avoidable Collection Mistakes in C#”

Leave a Reply