There's a common programming interview question that asks you to find the single non-duplicated value in a list of duplicated integers. This can be done in O(n) time and O(1) space by XOR'ing the values, and doing so is almost always the wrong answer. A better solution when you can afford to do it1 is to use a hash table to count occurrences.

The problem

The XOR solution only works in one very specific instance, and fails silently if the assumptions are violated. Good luck debugging an algorithm that seems to be working.

Here's some examples:

No non-duplicates

What's the non-duplicate number in this list? [0, 0, 1, 1, 2, 2]

The XOR algorithm says "0".

Two non-duplicates

What's the non-duplicate number in this list? [1, 1, 2, 3]

The XOR algorithm says "1".

Better solution

The simplest solution is to have a hash table mapping a integer value to the count of times you've seen it2 3. You can populate the hash table by reading through the list, then read through the hash table looking for any value with an occurrence count of 1. This uses O(n) time and O(n) space, but in exchange for using more memory, it will always do something sane and is easy to debug.

Note that even if you're trying to find your single missing drone out of a million drones, and you're somehow absolutely certain that exactly one is missing, the hash table algorithm using 32-bit integers will only use 8 MB of memory. This solution is less impressive than XOR'ing every value, but you can afford the memory, and your coworkers will thank you when they need to debug it.

So, next time you're considering being overly fancy, consider if you actually can't afford to use a little more memory (and can afford to have a solution that's impossible to debug), and maybe consider just using a simpler algorithm that uses little bit more memory.

  1. I don't want to say that the XOR solution is never useful, but it would be very rare and essentially only include very specific low-level memory-constrained code and interviews at companies that don't know what they're doing.
  2. If your numbers are small, you might also be able to use an array where the index is the value and the contents are the occurrence count. This situation also feels contrived but at least it fails in a more obvious way if your assumptions are violated (an out-of-memory error).
  3. You could also use a hash set and add each value you see unless it's already in the set, in which case you remove it. This is better, but you'll silently ignore cases where a value occurs 3 or more times, which might make debugging annoying. It also has less predictable memory usage which might cause problems in production (i.e. [n, n+1, ..., n+k, n, n+1, ..., n+k] uses k times more memory to process than [n, n, n+1, n+1, ..., n+k, n+k]). This version might be worth it, but I'd still lean toward the debuggable version by default.