A few jobs ago, I worked at a company that collected data from disparate sources, then processed and deduplicated it into spreadsheets for ingestion by the data science and customer support teams. Some common questions the engineering team got were:

  • Why is the data in some input CSV missing in the output?
  • Why is data in the output CSV not matching what we expect?

To debug these problems, we would try to reverse-engineer where the data came from, then guess which path it took through the monolithic data processor.

This is the story of how we stopped doing that, and started storing references to all source data for every piece of output data.

(This is reconstructed from memory, so I can't claim 100% accuracy.)

Get Source Data Into Your Database

The good news is that before I even started at this company, it was well understood that we needed to keep our data somewhere durable, since we might need to reprocess it one day. This involved putting all of the data in an S3 bucket, organized by source and date.
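The layout was something like this (hypothetical paths; I don't remember the exact key scheme):

    s3://acme-raw-data/client-x/2024-07-01/export.csv
    s3://acme-raw-data/client-x/2024-08-01/export.csv
    s3://acme-raw-data/client-y/2024-07-15/contacts.csv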

To tie our outputs to our inputs, we first needed to get the inputs into the database. We previously had some intermediate formats, but we wanted to know exactly where our data came from. So we added tables for the sources and their raw rows:

csv_source

 id | filename
----+-------------------------
  1 | client-x-2024-07-01.csv
  2 | client-x-2024-08-01.csv

csv_data

 id | csv_source_id | first_name | last_name    | job
----+---------------+------------+--------------+----------------------
 10 | 1             | Brendan    | Long         | Basket Weaver
 11 | 2             | Brendan    | Long         | Senior Basket Weaver
 12 | 2             | Example    | McExampleton | Basket Weaver

(OK, we didn't literally upload the raw CSV; we only loaded it after mapping the columns to standardized names, but it looked sort of like this.)
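In Postgres terms, the schema looked roughly like this (a sketch reconstructed from memory; the real tables had more columns and constraints):

-- Where each file came from
CREATE TABLE csv_source (
    id       bigserial PRIMARY KEY,
    filename text NOT NULL UNIQUE
);

-- One row per row of the standardized input CSV
CREATE TABLE csv_data (
    id            bigserial PRIMARY KEY,
    csv_source_id bigint NOT NULL REFERENCES csv_source (id),
    first_name    text,
    last_name     text,
    job           text
);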

Tie Your Output Data to the Source Data

Now that we had our source data, the next step was to tie our output data to it, so we could answer questions like "Where did this Brendan Long guy come from?" and "Where did this Basket Weaver job come from?". Our output data frequently needed to deduplicate data from multiple sources, and the obvious choice would be to link to the source of the winning data, but for debugging we don't just want to know about the winning data; we want to know about all of it.

So we stored all of the sources for each piece of output data. We did this with join tables (sketched after these example tables), but the examples will show them as inline arrays to keep this readable.

jobs

 id | csv_data_ids | name
----+--------------+----------------------
 20 | {10,12}      | Basket Weaver
 21 | {11}         | Senior Basket Weaver

employees

 id | csv_data_ids | first_name | last_name | job_id
----+--------------+------------+-----------+--------
 30 | {10,11}      | Brendan    | Long      | 21
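For reference, the actual join-table representation looked something like this sketch (table and column names are illustrative; I don't remember the exact schema):

-- Which source rows fed each job
CREATE TABLE job_csv_data (
    job_id      bigint NOT NULL REFERENCES jobs (id),
    csv_data_id bigint NOT NULL REFERENCES csv_data (id),
    PRIMARY KEY (job_id, csv_data_id)
);

-- Which source rows fed each employee
CREATE TABLE employee_csv_data (
    employee_id bigint NOT NULL REFERENCES employees (id),
    csv_data_id bigint NOT NULL REFERENCES csv_data (id),
    PRIMARY KEY (employee_id, csv_data_id)
);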

Profit

Now that we have our data linked, we can answer the questions above trivially with database queries.

Why is the data in some input CSV missing in the output?

For example, why don't we see a row with Brendan the Basket Weaver from 2024-07-01 in our output?

SELECT employees.*, jobs.*
FROM csv_source
LEFT JOIN csv_data ON csv_source.id = csv_data.csv_source_id
LEFT JOIN employees ON csv_data.id = ANY(employees.csv_data_ids)
LEFT JOIN jobs ON employees.job_id = jobs.id
WHERE csv_source.filename LIKE '%2024-07-01%'
  AND csv_data.first_name = 'Brendan'
  AND csv_data.last_name = 'Long';
Result:

 id | csv_data_ids | first_name | last_name | job_id | id | csv_data_ids | name
----+--------------+------------+-----------+--------+----+--------------+----------------------
 30 | {10,11}      | Brendan    | Long      | 21     | 21 | {11}         | Senior Basket Weaver

We can see from this output that we do have data for Brendan Long, but we see it in multiple places (csv_data_id = 10 and 11), and the winning row shows that he's a Senior Basket Weaver now.

In cases like this, we left it implicit that the newest data always wins, but in cases where it was non-obvious, we would also include a column to show why we picked one source over another.
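For instance, "newest data wins" plus a reason column could come out of a query like this sketch (it leans on our filenames sorting chronologically, and win_reason is a made-up name; the real dedup logic was more involved):

SELECT DISTINCT ON (csv_data.first_name, csv_data.last_name)
    csv_data.first_name,
    csv_data.last_name,
    csv_data.job,
    'newest source: ' || csv_source.filename AS win_reason
FROM csv_data
JOIN csv_source ON csv_source.id = csv_data.csv_source_id
ORDER BY csv_data.first_name, csv_data.last_name, csv_source.filename DESC;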

Why is data in the output CSV not matching what we expect?

For example, where did the data for this Brendan Long the Senior Basket Weaver come from?

We can answer this one by starting from the employees table and joining back to the source data:

SELECT csv_source.*, csv_data.*
FROM employees
LEFT JOIN csv_data ON csv_data.id = ANY(employees.csv_data_ids)
LEFT JOIN csv_source ON csv_source.id = csv_data.csv_source_id
WHERE employees.first_name = 'Brendan' AND employees.last_name = 'Long';

Result:

 id | filename                | id | csv_source_id | first_name | last_name | job
----+-------------------------+----+---------------+------------+-----------+----------------------
  1 | client-x-2024-07-01.csv | 10 | 1             | Brendan    | Long      | Basket Weaver
  2 | client-x-2024-08-01.csv | 11 | 2             | Brendan    | Long      | Senior Basket Weaver

From this, we can easily see that Brendan Long the Senior Basket Weaver came from client-x-2024-08-01.csv, overwriting data from client-x-2024-07-01.csv.

But What About the Storage Costs?

Storage is insanely cheap compared to the time of 2+ employees, at least at startup scale. RDS storage costs around $0.115 per GB-month, so we probably paid the equivalent of a few hours of an engineer's salary to store this data for the entire time I was at the company.
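As a rough worked example (I no longer have the real numbers, so assume a generous 100 GB of source data kept for two years):

    100 GB × $0.115/GB-month × 24 months ≈ $276

That's a rounding error next to the engineer-hours a single one of these debugging sessions used to burn.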

Go Forth and Debug

With these changes, we saved a lot of effort debugging data quality issues. The data scientists and client support specialists could run queries to answer questions for themselves, and the engineers were able to get back to the important work of writing database clients in OCaml1.


  1. Claude thought this was a joke when I asked it for editing feedback, but this isn't a joke.