Editorials

My Best Failure

Today I want to share the best failure I ever had. It was early in my career, and it marked me for decades to come. I was building a data import for a large data warehouse in the ’90s. We imported medical claims and eligibility information on a monthly basis. The import tool we bought eventually reached the point where it did not have enough resources to process even a single monthly extract in full.

Because the database was so large, and disk space was very expensive at the time, we had only one system capable of holding all the data and importing the monthly data at the same time. I requested development and QA environments where we could do development and testing. What I received was a powerful desktop. Actually, two.

The first desktop ran at the same clock speed (MHz) as the Pentium chip in our server. We added a bunch of memory. Then someone decided to put a PCI RAID controller and some SCSI drives in the computer, so I would have a development machine with enough horsepower and storage. The machine worked for two weeks before the motherboard melted, literally.

What did we do then? Well, we got another powerful desktop, moved the memory, SCSI controller, and drives into the new computer, and turned it on. So that we didn’t repeat the original mistake and melt down the computer, the side was removed from the case, and a 20-inch box fan blew continuously on the computer internals to keep them cool. We all knew this was a disaster. And we still did not have a QA environment.

In the midst of this project, we purchased a competitor because they used the same systems we did, and we could easily integrate their data into our warehouse. Again, we did not have enough space to hold all the data anywhere except the production server. I was assured that no data from the two separate systems would overlap. That assurance was wrong. But we didn’t find out until we had imported 20% of the new data from both companies. We found out in production. We immediately canceled the import, which began rolling back the batch in process. The rollback ran and ran; for two days it ran and never completed.

Much of the data import code lived only in the database and was not backed up. We couldn’t restore to a save point because we would lose the code. We couldn’t recover the existing database after two weeks of processing (rolling back transactions).

This was a great failure. I knew what to do, but couldn’t convince management it was necessary. Now I know not only that the situation was dangerous, but also how to convince management to adopt the basic tools and practices that mitigate this kind of disaster. The fault was completely mine. I take full responsibility for what occurred. I spent hours cleaning it up. I won’t do it again.

Cheers,

Ben