At my work, we process a lot of data from raw sources: data feeds, IIS log files, and the like. Getting accurate information out of these takes a lot of manual labor. Vendors change names, columns, data types, and so on all the time, causing packages to break. A package throwing an error is often fine, because it actually raised an error message and stopped; I can come in the morning, troubleshoot, and make sure the right data gets into the database. However, if the raw data has duplicates, chances are they won't break the package or throw any kind of error message, and we won't know about it. Finding this out later is even more of a pain in the rear, because you have to track down where the duplicate data is and make sure you delete only that. You almost have to re-invent the wheel….
Today, I had to design a package to handle log files from one of our vendors. We fetch the data using iMacro because they don't have an API or any other mechanism to send data to us. Basically, the macro runs, logs in to their website, chooses a date range, extracts the data as a CSV file, and saves it to one of our network locations. Where it gets tricky is that date range: if the macro runs at 11 p.m. and pulls the whole day, we miss the last hour of logs for that day, so a later run has to cover the same range again, which brings in duplicate rows. So I decided to use the Checksum transformation (available from http://www.konesans.com/products.aspx). I designed the package, which looks like this:
So when I created the table, I created it with IGNORE_DUP_KEY = ON on the index. I take the columns whose values can vary from record to record and pass them through the Checksum transformation, which generates a checksum (int or bigint). I use this checksum as my primary key. The index then silently ignores all duplicate records, which ensures that no duplicate data gets inserted into the table.
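Since the post doesn't show the actual DDL, here is a minimal sketch of what that destination table could look like. The table name and columns (dbo.VendorLog, LogDate, UserName, PageUrl) are made up for illustration; the RowChecksum column is the value produced by the Checksum transformation in the data flow.

```sql
-- Hypothetical destination table; column list would come from the vendor's CSV.
CREATE TABLE dbo.VendorLog
(
    RowChecksum bigint       NOT NULL,  -- populated by the Checksum transformation
    LogDate     datetime     NOT NULL,
    UserName    varchar(100) NOT NULL,
    PageUrl     varchar(500) NOT NULL,
    -- IGNORE_DUP_KEY makes the index skip duplicate keys instead of failing the load.
    CONSTRAINT PK_VendorLog PRIMARY KEY CLUSTERED (RowChecksum)
        WITH (IGNORE_DUP_KEY = ON)
);

-- Re-loading an overlapping date range simply skips rows that are already there.
INSERT INTO dbo.VendorLog (RowChecksum, LogDate, UserName, PageUrl)
VALUES (1234567890, '2011-05-01T22:15:00', 'jdoe', '/home'),
       (1234567890, '2011-05-01T22:15:00', 'jdoe', '/home');  -- duplicate key is ignored, not an error
```

With IGNORE_DUP_KEY = ON, a load that contains rows whose checksum already exists just raises the informational message "Duplicate key was ignored." and only the genuinely new rows land in the table.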
Hope this helps. Happy ETLing..