A few jobs ago, I worked at a company that collected data from disparate sources, then processed and deduplicated it into spreadsheets for ingestion by the data science and customer support teams. Some common questions the engineering team got were:
- Why is the data in some input CSV missing from the output?
- Why doesn't the data in the output CSV match what we expect?
To debug these problems, we would try to reverse-engineer where the data came from, then guess which path it took through the monolithic data processor.
This is the story of how we stopped doing that and started storing references to all source data for every piece of output data.
