A reader emailed me asking how to quickly find pieces of data without having to fully read every data file they have. I thought the answer could be interesting to other people, so here it is, both the theory and what I would do in practice.
To make your data faster to look up, you can either store it in an order that makes it easier to search, or add one or more indexes. For practical work, you can let your file system do this for you, or use a pre-built database (relational or not). I'll describe the options from lowest level to highest so you can understand what I'm suggesting, but my real-world answer is that I would store most kinds of data in a relational database like PostgreSQL and put indexes on any column I want to do lookups by.
For the purposes of the article, we're going to assume your data is some information about employees, and you want to be able to look it up by employee ID.
If you put all of the employees in one file, sorted by Employee ID, and ensured that each employee data record was the same length, you could do a binary search through the file:
1. Put pointers at the beginning and end of your file.
2. If the "end of file" pointer is before the "beginning of file" pointer, stop; the employee you want doesn't exist.
3. Look at the middle employee between those two pointers.
4. If that employee's ID is the one you're looking for, you're done.
5. If the ID is higher than the one you're looking for, move your "end of file" pointer to the record before the current one and go back to step 2.
6. If it's lower, move your "start of file" pointer to the record after the current one and go back to step 2.
This is called a binary search because every time you repeat steps 2-6, you cut the number of records you need to search in half.
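The steps above can be sketched in a few lines of Python. This assumes fixed-size records with a 4-byte ID and a 20-byte name; the layout is made up for illustration:

```python
import struct

# Hypothetical fixed-size record layout: 4-byte employee ID + 20-byte name.
RECORD = struct.Struct("<I20s")

def find_employee(f, target_id):
    """Binary-search a file of fixed-size records sorted by employee ID."""
    f.seek(0, 2)                      # jump to the end to measure the file
    lo, hi = 0, f.tell() // RECORD.size - 1
    while lo <= hi:                   # step 2: stop when the pointers cross
        mid = (lo + hi) // 2
        f.seek(mid * RECORD.size)     # fixed-size records make seeking trivial
        emp_id, name = RECORD.unpack(f.read(RECORD.size))
        if emp_id == target_id:       # step 4: found it
            return name.rstrip(b"\x00").decode()
        elif emp_id > target_id:      # step 5: search the lower half
            hi = mid - 1
        else:                         # step 6: search the upper half
            lo = mid + 1
    return None                       # the employee doesn't exist
```

Note that the fixed record size is doing all the work here: it's what lets `seek` jump straight to the middle record.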
The advantages of this algorithm are that you only need to look at log2(number of employees) records at most (in short, if you double the number of employees, you only need to do one extra check). The logic is also relatively simple, and you don't need a separate index.
The downside is that the logic for writing this file becomes a lot more complicated: you can't insert new records into the middle without rewriting large portions of the file, and you need to ensure that every record in the file is precisely the same size.
As an alternative, you can create an index file that just tells you which file (and potentially which part of that file) a particular record is in.
For example, your index might consist of records, each containing an employee ID and a file number:

```
1:2 # employee 1 is in 2.dat
2:2 # employee 2 is in 2.dat
3:1 # employee 3 is in 1.dat
```
The advantage of this is that to find a particular employee, you only need to read the index and the file they're in. You also have the option of storing the index in order, which lets you perform a binary search over the index itself (but then you would need to rewrite the index in some cases). Another option is making the index a hash table, which is more complex but usually faster.
You could also store the byte offset for the record the index points to, which lets you jump directly to where they are in the file instead of having to read the entire data file (but this makes the logic a lot more complicated, and means you can't easily re-order the data file).
```
1:2:120 # employee 1 is in 2.dat starting at byte 120
2:2:0   # employee 2 is in 2.dat starting at byte 0
3:1:0   # employee 3 is in 1.dat starting at byte 0
```
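A sketch of how that lookup could work in Python, assuming the text index format above and data files named like `2.dat` (the record size and file names are illustrative):

```python
import os

def load_index(index_path):
    """Parse lines like "1:2:120" (employee ID : file number : byte offset)."""
    index = {}
    with open(index_path) as f:
        for line in f:
            line = line.split("#")[0].strip()  # drop comments and blank lines
            if not line:
                continue
            emp_id, file_no, offset = (int(x) for x in line.split(":"))
            index[emp_id] = (file_no, offset)
    return index

def find_employee(index, emp_id, data_dir=".", record_size=24):
    """Jump straight to the record's offset instead of scanning the data file."""
    entry = index.get(emp_id)
    if entry is None:
        return None
    file_no, offset = entry
    with open(os.path.join(data_dir, f"{file_no}.dat"), "rb") as f:
        f.seek(offset)
        return f.read(record_size)
```

In a real system you'd keep the index in binary and update it on every write, but the shape of the lookup is the same.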
Unlike sorting, you can also have multiple indexes. If you want to find employees by ID, and also by who their manager is, you could create one index by employee ID, and a second index by manager.
The downsides of this are that your code becomes a lot more complex, since you need to ensure every update, create, and delete operation keeps the index in sync. It also increases the number of possible race conditions if your code is allowed to run multi-threaded (you would need to keep the index in sync no matter what multiple threads are doing). This also uses more disk space (although probably not a significant amount), and lookups are not quite as fast as if the data itself were just sorted. All of these downsides grow as you add more indexes, although I find that over-indexing is rarely the problem when it comes to slow data access.
In some cases, it's useful to have something like an index without having to store it in a file. For these purposes, you can hash your data instead. For example, if you want to split your data across 100 files, and don't particularly care which file each record goes into, you could run your employee ID through a hash function that returns a number between 1 and 100, and then store the employee in whichever file it returns. If you want to look up the same employee, you run the ID through the hash function again to find which file to look in.
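A sketch of this scheme in Python; sha256 is used here only because it's a conveniently stable hash (Python's built-in `hash()` can change between runs), and the bin count is from the example above:

```python
import hashlib

NUM_BINS = 100  # changing this means moving every existing record to its new bin

def bin_for(emp_id):
    """Map an employee ID to a data file number between 1 and NUM_BINS."""
    digest = hashlib.sha256(str(emp_id).encode()).digest()
    # take the first 4 bytes of the digest as an integer, then fold into range
    return int.from_bytes(digest[:4], "big") % NUM_BINS + 1

# store employee 1234 in f"{bin_for(1234)}.dat", and look it up the same way
```

The key property is that the function is deterministic: the same ID always lands in the same file, with no index to maintain.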
The advantage of this is that the hashing part is extremely fast, and it doesn't matter how many records you're storing (no matter how many records you have, each store and lookup only requires hashing one ID). The downside is that you need to redistribute all of your data if you want to change the number of output "bins" (data files), and you also need to tune the output of the hash function (how many records do you want per file? how many empty files are you willing to deal with?).
In practice, this makes very little sense with files (see next point), but is a common way to split data across a network (deciding which database to put a record in; deciding which server to contact).
At this point we can stop being theoretical and remember that the problem you're trying to solve is something filesystems are designed to do ("how do I find an employee in a big list of files" is a very similar problem to "how do I find a file on a hard drive with 10 million files").
Instead of storing your employees in files with meaningless names, you could store them all in one directory and name them by the employee they contain. For example, employee 1 could be stored in employees/1.dat and employee 2 could be stored in employees/2.dat.
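In Python this is nearly trivial, which is the point: the filesystem's own directory lookup does the index work (the `employees/` directory name follows the example above):

```python
import os

def save_employee(emp_id, data, root="employees"):
    os.makedirs(root, exist_ok=True)
    with open(os.path.join(root, f"{emp_id}.dat"), "wb") as f:
        f.write(data)

def load_employee(emp_id, root="employees"):
    # the filesystem's directory index finds the right file for us
    try:
        with open(os.path.join(root, f"{emp_id}.dat"), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None
```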
The advantage of this is that the filesystem is basically doing the indexing for you, so you get all of those benefits with no effort.
The downside is that opening files is fairly slow compared to reading multiple pieces of data from one file (because of how read-ahead works, and because you'd need a lot more system calls). For code that isn't heavily optimized, you may not notice the difference, but the maximum performance of this approach is lower than using a few huge files.
Which brings us to why everyone uses databases. Databases do almost exactly what I described above, and they store all of the data in a small number of huge files so they can be heavily optimized.
In a database, storing the data in sorted order and using a binary search is called a clustered index. Using a separate index is called a nonclustered index. Databases also optimize things by keeping the index partially or fully in memory, depending on usage and the memory available. They also ensure that the indexes are always kept in sync even when you're using multiple threads (ACID databases make additional guarantees).
Note: For this question, it doesn't really matter if a database is relational or not. A relational database like PostgreSQL has the advantage of being much older, more heavily optimized, less buggy, and more powerful, but a non-relational database like Redis or CouchDB would also do what you want and may be easier to understand at first. I frequently use SQLite when I need a lightweight relational database.
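As a concrete example with SQLite (via Python's standard `sqlite3` module), both lookup patterns from this article are one `CREATE INDEX` away; the table layout here is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY makes `id` the table's rowid, so lookups by ID are
# effectively a clustered-index search in SQLite
conn.execute("CREATE TABLE employees"
             " (id INTEGER PRIMARY KEY, manager_id INTEGER, name TEXT)")
# a separate (nonclustered) index for the second lookup pattern
conn.execute("CREATE INDEX idx_manager ON employees (manager_id)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, 10, "alice"), (2, 10, "bob"), (3, 11, "carol")])
row = conn.execute("SELECT name FROM employees WHERE id = ?", (2,)).fetchone()
reports = conn.execute("SELECT name FROM employees WHERE manager_id = ?",
                       (10,)).fetchall()
```

The database keeps `idx_manager` in sync on every insert, update, and delete, which is exactly the bookkeeping you'd otherwise write yourself.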
Everything I said above also applies to in-memory data structures. When you create a hash table of integer IDs pointing to pointer/object records, the hash table is an in-memory index. If you store all of the records in a sorted list, you can use a binary search. You might find that you want to mix these, by keeping an index in memory and the data in files.
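Both in-memory versions fit in a few lines of Python (`bisect` is the standard library's binary search; the records are made up):

```python
import bisect

# sorted list + binary search
records = [(1, "alice"), (5, "bob"), (9, "carol")]   # kept sorted by ID
ids = [emp_id for emp_id, _ in records]

def find(emp_id):
    i = bisect.bisect_left(ids, emp_id)              # binary search over IDs
    if i < len(ids) and ids[i] == emp_id:
        return records[i][1]
    return None

# hash table: a dict of ID -> record is an in-memory index
by_id = dict(records)
```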
If you're looking for more, this is just a taste of what gets discussed in a data structures class. The ones I mentioned above are the most common, but there are a variety of structures similar to binary search (binary trees, B-trees) and different ways of handling the downsides of hash tables (chaining, etc.). For a lot of practical programming, the details don't matter very much (someone else thought about this when designing a database), but it can still be good to know, especially when dealing with complicated in-memory structures.