Tuesday, April 28, 2015

[ETL] A dynamic CSV ripper that's forwards/backwards compatible, in T-SQL

Had this problem coming, so wanted to get ahead of it:
Dev releases code, version 2.1, which has a CSV with 4 columns.
[2 days pass]
Dev releases code, version 2.1.1, which has a CSV with 5 columns.
[2 days pass]
Dev releases code, version 2.1.2, which has a CSV with 6 columns.

Since we relational DBAs have this fun thing called "schema", importing this gets hard.  SSIS will choke because it's different each time.  BULK INSERT will choke as the file type has changed.  Inserting via OPENROWSET and the text connector sucks because 64-bit servers require you to install the Excel pack. Modifying the import each time blows for obvious reasons.

So, time to use the DBAs secret weapon, Dynamic SQL.  (Which, yes, lots of potential problems, holes, etc.  See Erland's seminal paper "The Curse and Blessings of Dynamic SQL").

So what did I build?  Practically, it imports any delimited file with a header row, compares that to the end table, and inserts just those fields.  Need that field in your end table? Add it to the target table, and the next run will pick it up. 

Warning: it's SLOW. Not sure which part, but there's plenty of blame to go around on my part. Dynamic Shreds, Dynamic Pivots, Pivots, etc. You could replace the Pivot with a CLR Dynamic Pivot floating around out there (which is probably the majority of it), replace the comma parsing with Adam Machanic's, and that may get you to a happy spot.  For me, though, this is ideal. 

My next version of this process will probably be in Powershell, in case someone hasn't already done it.  That could be slick and really fast- import-csv, get the column list as a variable, match up to another variable with the columns from the final table, and use the resultant as a hash table for a select.  Next time, next time.

Dependency: right now, Jeff Moden's 8k splitter.

Order of operations for this code:
  • get server list (naturally, this needs the file from multiple servers) 
  • For each server...
  • import first row (header has the field names) into a table
  • split CSV into a list of columns.  
  • import file into wide table
  • split CSV to a keypair-style table, using Jeff Moden's awesome DelimitedSplit8K splitter (may need to change this to use Adam Machanic's CLR)
  • update the table so that blank fields are now NULL (since blank fields become '', which converts to 0 if bigint, but fails converting to decimal.)
  • build a pivot to go from the keypair into your target table
    • compare the columns (sys.columns or information_schema.columns) between the header and the target table
    • match the data and header to make a good insert statement
  • use the pivot to insert into your target table

No comments: