Tuesday, April 28, 2015

[ETL] A dynamic CSV ripper that's forwards/backwards compatible, in T-SQL

Saw this problem coming, so wanted to get ahead of it:
Dev releases code, version 2.1, which has a CSV with 4 columns.
[2 days pass]
Dev releases code, version 2.1.1, which has a CSV with 5 columns.
[2 days pass]
Dev releases code, version 2.1.2, which has a CSV with 6 columns.

Since we relational DBAs have this fun thing called "schema", importing this gets hard.  SSIS will choke because it's different each time.  BULK INSERT will choke as the file type has changed.  Inserting via OPENROWSET and the text connector sucks because 64-bit servers require you to install the Excel pack. Modifying the import each time blows for obvious reasons.

So, time to use the DBA's secret weapon, Dynamic SQL.  (Which, yes, has lots of potential problems, holes, etc.  See Erland Sommarskog's seminal paper "The Curse and Blessings of Dynamic SQL").

So what did I build?  Practically, it imports any delimited file with a header row, compares the header to the target table, and inserts just the matching fields.  Need another field? Add it to the target table, and the next run will pick it up.
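
To make that concrete, here's a hypothetical before/after (mytargettable matches the script below; the column names are made up):

CREATE TABLE dbo.mytargettable (server_name sysname, colA INT, colB INT, colC INT, colD INT,
    import_filename VARCHAR(500), import_datetime DATETIME)

--dev ships a file with a fifth column, colE: just widen the target table.
--The next run's header-vs-table comparison sees the new match and starts inserting it;
--rows from older, narrower files simply leave it NULL.
ALTER TABLE dbo.mytargettable ADD colE INT NULL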

Warning: it's SLOW. Not sure which part is the bottleneck, but there's plenty of blame to go around on my part: dynamic shreds, dynamic pivots, etc. You could replace the PIVOT (which is probably the majority of the runtime) with one of the CLR dynamic pivots floating around out there, replace the comma parsing with Adam Machanic's CLR splitter, and that may get you to a happy spot.  For me, though, this is ideal. 

My next version of this process will probably be in Powershell, in case someone hasn't already done it.  That could be slick and really fast: import-csv, get the column list as a variable, match it up to another variable with the columns from the final table, and use the resultant as a hash table for a select.  Next time, next time.

Dependency: right now, Jeff Moden's 8k splitter.
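
If you haven't seen it before, DelimitedSplit8K takes a VARCHAR(8000) and a delimiter and hands back (ItemNumber, Item) pairs, which is what the keypair step below leans on:

SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('colA,colB,colC', ',')
--returns (1, 'colA'), (2, 'colB'), (3, 'colC')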

Order of operations for this code:
  • get server list (naturally, this needs the file from multiple servers) 
  • For each server...
  • import first row (header has the field names) into a table
  • split the header CSV into a list of column names
  • import file into wide table
  • split CSV to a keypair-style table, using Jeff Moden's awesome DelimitedSplit8K splitter (may need to change this to use Adam Machanic's CLR)
  • update the table so that blank fields are now NULL (since blank fields become '', which converts to 0 if bigint, but fails converting to decimal.)
  • build a pivot to go from the keypair into your target table
    • compare the columns (sys.columns or information_schema.columns) between the header and the target table
    • match the data and header to make a good insert statement
  • use the pivot to insert into your target table
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
GO
/*
2015/04/22 - mdb - 2.00 - changed to pull raw CSV, shred via function, pivot into a virtual table, then insert into
actual table based on existing fields. Practically: forwards/backwards compatibility.
Add new fields to your target table, and as they show up in the new file they'll be added.
Practically, they'll have to rename the file when they do this, so that we know the
names of the new fields. But that's just a parameter change in the job.
2016/03/25 - mdb - 2.01 - found a bug where inserts may fail when the source data uses scientific notation. Added statement
that UPDATEs the staging table where the value is "like '%E-0%'". Converts to float(53), then dec(28,15).
Commented out by default because you might have E-0 in a text field, and THAT would break it.
requires Jeff Moden's 8k CSV parser, or something like it.
*/
--BULK INSERT can't take a variable for the filename, hence dynamic SQL
DECLARE @filename VARCHAR(500), @sql NVARCHAR(4000), @base_filename VARCHAR(500), @servername VARCHAR(255)
DECLARE @beginning_of_filename VARCHAR(30) = 'filenamestartswith'
DECLARE @Columns VARCHAR (MAX), @Columns_Short VARCHAR (MAX), @Columns_Insert VARCHAR(MAX)
DECLARE @ErrorMessage NVARCHAR(4000);
DECLARE @ErrorSeverity INT;
DECLARE @ErrorState INT;
DECLARE @serverlist_to_do TABLE (id INT IDENTITY, servername sysname)
INSERT INTO @serverlist_to_do
--uses OPENROWSET to pull a serverlist from my repository; can use a linked server or just hit locally.
SELECT server FROM OPENROWSET('SQLNCLI', 'Server=myrepo;Trusted_Connection=yes',
'select server from serverlist')
-------------------
--FOR EACH server--
-------------------
declare @min INT, @max INT
SELECT @min = MIN(id), @max = MAX(id) FROM @serverlist_to_do
while @min <= @max
BEGIN
--wipe out variables that are added each time!
SELECT @sql = '', @ErrorMessage = NULL, @ErrorSeverity = NULL, @ErrorState = NULL
,@Columns = NULL, @Columns_Short = NULL, @Columns_Insert = NULL
SELECT @servername = servername FROM @serverlist_to_do WHERE id = @min
SELECT @filename = '\\' + @servername + '\c$\temp\' + @beginning_of_filename + '_'
--note that the date format is screwy; using a few minutes ago so that it auto-rolls.
+ FORMAT( GETDATE()-.003, 'dd.MM.yyyy', 'en-US' ) + '.csv'
SELECT @base_filename = RIGHT(@filename, CHARINDEX('\',REVERSE(@filename))-1)
-----------------
--Import Header--
-----------------
PRINT CONVERT(VARCHAR(30),GETDATE(),120) + ' BULK INSERT of header begins for ' + @filename
TRUNCATE TABLE myimport_header_stage
SET @sql =
'BULK INSERT myimport_header_stage
FROM ''' + @filename + '''
WITH
(
LASTROW = 1,
FIELDTERMINATOR = ''\0'' --the \0 is a null terminator; needed so it does not try to parse the line
)'
BEGIN TRY
EXEC sp_executesql @sql
END TRY
BEGIN CATCH
PRINT 'File either locked, Does Not Exist, or format has changed; see error message for more details'
SELECT @ErrorMessage = ERROR_MESSAGE(),
@ErrorSeverity = ERROR_SEVERITY(),
@ErrorState = ERROR_STATE();
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION;
PRINT 'Error Severity: ' + CONVERT(VARCHAR(30), @ErrorSeverity)
PRINT 'Error State: ' + CONVERT(VARCHAR(30), @ErrorState)
PRINT 'Error Message: ' + @ErrorMessage
--commented this out so we can run it without files.
--RAISERROR (@ErrorMessage, -- Message text.
-- @ErrorSeverity, -- Severity.
-- @ErrorState -- State.
-- );
END CATCH
---------------
--Import Data--
---------------
PRINT CONVERT(VARCHAR(30),GETDATE(),120) + ' BULK INSERT of data begins for ' + @filename
--inserts into view in order to add the ID column so the PIVOT works. insertview is just a select *, minus the ID
TRUNCATE TABLE myimport_wide_stage
SET @sql =
'BULK INSERT myimport_wide_stage_insertview
FROM ''' + @filename + '''
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = ''\0'' --the \0 is a null terminator; needed so it does not try to parse the line
)'
BEGIN TRY
EXEC sp_executesql @sql
END TRY
BEGIN CATCH
PRINT 'File either locked, Does Not Exist, or format has changed; see error message for more details'
SELECT @ErrorMessage = ERROR_MESSAGE(),
@ErrorSeverity = ERROR_SEVERITY(),
@ErrorState = ERROR_STATE();
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION;
PRINT 'Error Severity: ' + CONVERT(VARCHAR(30), @ErrorSeverity)
PRINT 'Error State: ' + CONVERT(VARCHAR(30), @ErrorState)
PRINT 'Error Message: ' + @ErrorMessage
--commented this out so we can run it without files.
--RAISERROR (@ErrorMessage, -- Message text.
-- @ErrorSeverity, -- Severity.
-- @ErrorState -- State.
-- );
END CATCH
PRINT CONVERT(VARCHAR(30),GETDATE(),120) + ' BULK INSERT ends for ' + @filename
--===== Split the CSV column for the whole table using CROSS APPLY
--save into a staging table to make the pivot back easier.
TRUNCATE TABLE myimport_stage_split
INSERT INTO myimport_stage_split
(id, ItemNumber, Item)
SELECT stage.id, split.ItemNumber, split.item
FROM myimport_wide_stage stage
CROSS APPLY DelimitedSplit8K(resultant,',') split
--this is needed because blank values ('') don't convert to null; they come back with error converting varchar to decimal
UPDATE myimport_stage_split SET item = NULL WHERE item = ''
--2.01 optional fix for scientific notation.
--UPDATE myimport_stage_split
--SET item = CONVERT(DECIMAL(28,15),CONVERT(FLOAT(53),item))
--WHERE item LIKE '%E-0%'
----------------------
--Building the PIVOT--
----------------------
SELECT @Columns=COALESCE(@Columns + ',','') + QUOTENAME(ItemNumber) + ' as ' + QUOTENAME(item)
, @Columns_Short = COALESCE(@Columns_Short + ',','') + QUOTENAME(ItemNumber)
, @Columns_Insert = COALESCE(@Columns_Insert + ',','') + QUOTENAME(item)
FROM
(
SELECT DISTINCT split.ItemNumber, headers.item
From myimport_stage_split split
INNER JOIN
(
SELECT split.ItemNumber, split.item
FROM myimport_header_stage headers
CROSS APPLY DelimitedSplit8K(resultant,',') split
)
headers
ON headers.ItemNumber = split.ItemNumber
INNER JOIN INFORMATION_SCHEMA.COLUMNS
ON TABLE_NAME = 'mytargettable'
AND columns.COLUMN_NAME = headers.item
) AS B
ORDER BY B.ItemNumber
--We need the CTE so that we can calculate the import_datetime more easily.
SET @SQL='
;with CTE_Import as
(
SELECT ''' + @servername + ''' AS server_name, ' + @Columns
+ ', ''' + @base_filename + ''' as import_filename'
+ ' FROM
(
SELECT id, ItemNumber, item FROM
myimport_stage_split ) AS source
PIVOT
(max(item) FOR source.ItemNumber IN (' + @Columns_Short + ')
)AS pvt
)
insert into mytargettable (server_name, ' + @Columns_Insert + ', import_filename, import_datetime)
select server_name, ' + @Columns_Insert + ', import_filename,
CONVERT(DATETIME,SUBSTRING(RIGHT(''' + @base_filename + ''',14), 7,4) + SUBSTRING(RIGHT(''' + @base_filename + ''',14), 4,2) + SUBSTRING(RIGHT(''' + @base_filename + ''',14), 1,2)
+ CONVERT(datetime,minuteOfStats),120) as import_datetime
from CTE_Import
WHERE NOT EXISTS
(SELECT 1 FROM mytargettable main
WHERE main.server_name = ''' + @servername + '''
AND main.import_filename = ''' + @base_filename + '''
and CTE_Import.matchingfielda = main.matchingfielda
AND CTE_Import.matchingfieldb= main.matchingfieldb)
'
--make sure to handle nulls properly on the above matches!
PRINT @sql
--make sure the filename has a date in the valid format, otherwise the insert will blow up
--(ISDATE on the raw string rather than CONVERT, so a bad filename fails the check instead of throwing)
IF ISDATE(SUBSTRING(RIGHT(@base_filename,14), 7,4) + SUBSTRING(RIGHT(@base_filename,14), 4,2) + SUBSTRING(RIGHT(@base_filename,14), 1,2)) = 1
BEGIN
EXEC (@sql)
END
PRINT CONVERT(VARCHAR(30),GETDATE(),120) + ' insert into mytargettable ends for ' + @filename
SET @min = @min+1
END
GO

[Powershell] Getting uptime for a service, saved to a database

Needed this for someone in a different group. 

  • Get a list of servers
  • For every server on that list:
    • use get-wmiobject to get the list of open processes that match my search term.
    • Select relevant fields like computername, starttime, processid, and the exec path
    • Include a rounded up/down number of hours the process has been up
    • Write to a table.
Prereqs are the old standbys: out-datatable, write-datatable. 


Technical bits:
Powershell has some crazy syntax.  The $StartTime calculated property below is one example.
The most annoying part was having to get the times converted from the WMI format (DMTF) to something normal. 


. C:\powershell_scripts\out-datatable.ps1
. C:\powershell_scripts\write-datatable.ps1
$server_repository = 'myrepo'
$database_repository = 'repodb'
#here we define $StartTime as a calculated property; it gets its values when passed through the foreach. Could just do it inline, but it's a clever concept so leaving it here.
$StartTime= @{n='StartTime';e={$_.ConvertToDateTime($_.CreationDate)}}
#get list of servers we want to look at
$serverlist = invoke-sqlcmd -serverinstance $server_repository -database $database_repository `
-query "SELECT server FROM dbo.serverlist WHERE Connect = 1"
$serverlist|%{
#can't get it to select out otherwise; stupid datatables
$computer = $_.server
#use get-wmiobject to get details about "myprocess.exe", then pull in other details,
# round off the number of hours it's been running, etc., etc.
#The ManagementDateTimeConverter is needed for WMI DMTF dates, as per
# http://stackoverflow.com/questions/29838688/hours-between-two-times-within-a-select
$details_table = gwmi win32_process -cn $_.server -filter "Name='myprocess.exe' AND CreationDate IS NOT NULL" |
select @{Name = 'Servername'; Expression = {"$computer"}},$StartTime, processid, path `
, @{Name = 'HoursUptime'; Expression = {[math]::Round(([datetime](get-date) - [System.Management.ManagementDateTimeConverter]::ToDateTime($_.CreationDate)).TotalHours)}} |
out-datatable
Write-DataTable -ServerInstance $server_repository -Database $database_repository -TableName Scheduler_PID_Info -Data $details_table
}

Thursday, April 16, 2015

[Code] Finding the lowest ID (identity) for a date when there's no index on the date, like Zeno

Needed this today.  I remember seeing someone at a SQLSaturday (which for me this year, doesn't narrow it down much) with this idea, but couldn't figure out who it was, so I wound up implementing my own.

Say you have, like most of us, a table with an "id INT IDENTITY PRIMARY KEY," so your PK is the ID, but the table also has a date column.  Nobody cares about the date.  Until they do, months later.  So now you have a busy table that you can't easily add an index to, and an ADHOC OMGWTFBBQ project comes in where you need to query a date range, with no idea how to get there quickly.

*raises hand* Yup, that'd be me.  So I wrote this. 

What it does: cycle down through a table, using the ID column.  Start by decrementing the max ID from the table by 10 million.  Does that go below the date you need?  Yes?  Okay, try 1 million. Yes? 100k.  No? How about 200k? Yes. 110k? No. 120k? Yes. 111k? No. 112k? Yes. 111100? (and so on). 

It'll pop down even a terabyte table pretty quickly, since it only does a handful of queries, and they're all against the PK.  Granted, I could make it faster by doing a (min+max)/2, then (newmin+max)/2 etc, but this works.  I've nicknamed it "Zeno's Arrow", although technically my code doesn't go halfsies - but it is fast and direct.

Also, (importantly!) it may not cope with missing rows (which could happen due to stuff like, for instance, a failed transaction).  I don't have a good fix for that yet.  Maybe in V2.
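
Here's a condensed sketch of the idea (dbo.mytable, id, and insert_date are placeholder names; point it at your own table):

DECLARE @target_date DATETIME = '20150401'  --the date you're hunting for
DECLARE @id BIGINT, @step BIGINT = 10000000, @probe_date DATETIME

SELECT @id = MAX(id) FROM dbo.mytable

WHILE @step >= 1
BEGIN
    SET @probe_date = NULL
    --each probe is a single PK seek, so even a huge table takes a handful of reads
    SELECT @probe_date = insert_date FROM dbo.mytable WHERE id = @id - @step
    IF @probe_date >= @target_date
        SET @id = @id - @step   --still on/after the target date; keep dropping
    ELSE
        SET @step = @step / 10  --overshot the date (or hit a gap in the IDs); shrink the arrow
END

SELECT @id AS lowest_id_on_or_after_target_date

A gap in the IDs right at the boundary is exactly the missing-rows problem above, so treat the answer as approximate.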

Hope this helps.

Monday, April 13, 2015

[System_Health] 2 dynamic and full parsers for the system_health session

I needed to look at some locking info in the system_health session recently, and realized nobody had ever written a full parser for it - all the ones I've seen pull explicit fields. 
Everybody does stuff like '/event/data[3]', but there was no comprehensive shredding done.  I decided to fix that.  

While I wrote something for SP_SERVER_DIAGNOSTICS a while ago (which will handle 2012+), I needed something specifically for 2008/2008 R2.
At its core is a piece of code I picked up a while ago from Stack Overflow and a couple of other XQuery posts:

              SELECT  
                    A.B.value('@name[1]', 'varchar(128)') AS EventType,
                    A.B.value('./@timestamp[1]', 'datetime') AS UTC_Time,
                    X.N.value('local-name(.)', 'varchar(128)') AS NodeName,
                    X.N.value('../@name[1]', 'varchar(128)') AS ParsedName,
                    X.N.value('./text()[1]', 'varchar(max)') AS NodeValue
             FROM    cte_healthsession
             CROSS APPLY EventXML.nodes('/*') AS A (B)
             CROSS APPLY EventXML.nodes('//*') AS X (N)



All this really does is shred out every node at the various levels, including names and values, which for system_health seems to work pretty well (not perfectly; there are some values I need to filter in my pivot).

But once it's gotten all the potential details, the code automatically goes through each group and shreds it.  As a bonus, I realized that by doing it slightly differently, I could return all the details in one table.  The table does have a ton of fields, since many are specific to each type of event - error_reported doesn't have duration or wait_type, for instance, whereas wait_info doesn't include error or message.  However, it means that you can read them in time order quite easily, which makes it easier to correlate issues.

However, I also wound up with a version that returns multiple tables, one for each event type.  (This looks similar to the sp_server_diag parser I wrote)

Now, when you run this (if you're not already doing a ton of stuff with system_health) you may be wondering why you're not seeing anything recent.  There are two potential reasons:

1) events that are too large, most likely deadlocks.  There's no good way around it other than to modify the system_health session to also write to a file (which 2012 does do).  In order to make sure I'm not susceptible to it, I added deadlock monitoring via Event Notifications (add the event DEADLOCK_GRAPH).  I will probably modify the system_health session again and add a file.
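
For reference, the Event Notifications piece looks something like this (the queue/service/notification names are made up):

CREATE QUEUE dbo.DeadlockQueue

CREATE SERVICE DeadlockService
ON QUEUE dbo.DeadlockQueue
([http://schemas.microsoft.com/SQL/Notifications/PostEventNotification])

--deadlock graphs arrive in dbo.DeadlockQueue as XML; RECEIVE from it to process them
CREATE EVENT NOTIFICATION DeadlockNotification
ON SERVER
FOR DEADLOCK_GRAPH
TO SERVICE 'DeadlockService', 'current database'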

2) a really nasty bug in system_health, which Jonathan Kehayias mentioned in http://www.sqlskills.com/blogs/jonathan/incorrect-timestamp-on-events-in-extended-events/ . I can easily tell it's happening thanks to crappy monitoring software - I show a new error message every 5 minutes, although the time is several days ago (and not exactly X days ago, either).  So, I added the system time, as suggested in the post. 
Here's the code, holler if you have any questions!

One table:
if object_id('tempdb..#systemhealthsessiondata') is not null
DROP TABLE #systemhealthsessiondata
SELECT CAST(xet.target_data AS XML) AS XMLDATA
INTO #SystemHealthSessionData
FROM sys.dm_xe_session_targets xet
JOIN sys.dm_xe_sessions xe
ON (xe.address = xet.event_session_address)
WHERE xe.name = 'system_health'
if object_id('tempdb..#ParsedData') is not null
DROP TABLE #ParsedData
CREATE TABLE #ParsedData (id INT IDENTITY, Actual_Time DATETIME, EventType NVARCHAR(128), ParsedName NVARCHAR(128), NodeValue VARCHAR(MAX))
;WITH CTE_HealthSession (EventXML) AS
(
SELECT C.query('.') EventXML
FROM #SystemHealthSessionData a
CROSS APPLY a.XMLDATA.nodes('/RingBufferTarget/event') as T(C)
--WHERE C.query('.').value('(/event/@name)[1]', 'varchar(255)') in ('wait_info','wait_info_external')
)
INSERT INTO #ParsedData (Actual_Time, EventType, ParsedName, NodeValue)--(id, ParsedName, NodeValue)
SELECT --id,
DATEADD(MINUTE, DATEPART(TZoffset, SYSDATETIMEOFFSET()), UTC_Time) AS Actual_Time,
EventType,
ParsedName,
NodeValue
FROM (
SELECT
A.B.value('@name[1]', 'varchar(128)') AS EventType,
A.B.value('./@timestamp[1]', 'datetime') AS UTC_Time,
X.N.value('local-name(.)', 'varchar(128)') AS NodeName,
X.N.value('../@name[1]', 'varchar(128)') AS ParsedName,
X.N.value('./text()[1]', 'varchar(max)') AS NodeValue
FROM cte_healthsession
CROSS APPLY EventXML.nodes('/*') AS A (B)
CROSS APPLY EventXML.nodes('//*') AS X (N)
) T
--WHERE NodeName = 'value'
--AND event_type = 'wait_info'
--you can run this entire next batch separately
--And now use the standard dynamic pivot to shred.
-- Because of the way the pivot works, the fields are alphabetical; not a big deal, but not easily fixable when running multiserver
DECLARE @SQL AS VARCHAR (MAX)
DECLARE @Columns AS VARCHAR (MAX)
DECLARE @min INT, @max INT, @eventtype VARCHAR(128)
SET @Columns = NULL
SET @SQL = NULL
SELECT @Columns=
COALESCE(@Columns + ',','') + QUOTENAME(ParsedName)
FROM
(
SELECT DISTINCT ParsedName
FROM #ParsedData
--excluded it here, but the tsql_stack can be used to get the exact statement from the plan cache
--see http://blogs.msdn.com/b/extended_events/archive/2010/05/07/making-a-statement-how-to-retrieve-the-t-sql-statement-that-caused-an-event.aspx
WHERE ParsedName <> 'callstack'
) AS B
ORDER BY B.ParsedName
SET @SQL='
SELECT Actual_Time, EventType,' + @Columns + ' FROM
(
SELECT EventType, Actual_Time, ParsedName, NodeValue FROM
#ParsedData) AS source
PIVOT
(max(NodeValue) FOR source.ParsedName IN (' + @columns + ')
)AS pvt order by actual_time'--, attach_activity_id'
PRINT @sql
EXEC (@sql)




Multiple tables:
if object_id('tempdb..#systemhealthsessiondata') is not null
DROP TABLE #systemhealthsessiondata
SELECT CAST(xet.target_data AS XML) AS XMLDATA
INTO #SystemHealthSessionData
FROM sys.dm_xe_session_targets xet
JOIN sys.dm_xe_sessions xe
ON (xe.address = xet.event_session_address)
WHERE xe.name = 'system_health'
if object_id('tempdb..#ParsedData') is not null
DROP TABLE #ParsedData
CREATE TABLE #ParsedData (id INT IDENTITY, Actual_Time DATETIME, EventType NVARCHAR(128), ParsedName NVARCHAR(128), NodeValue VARCHAR(MAX))
;WITH CTE_HealthSession (EventXML) AS
(
SELECT C.query('.') EventXML
FROM #SystemHealthSessionData a
CROSS APPLY a.XMLDATA.nodes('/RingBufferTarget/event') as T(C)
--WHERE C.query('.').value('(/event/@name)[1]', 'varchar(255)') in ('wait_info','wait_info_external')
)
INSERT INTO #ParsedData (Actual_Time, EventType, ParsedName, NodeValue)--(id, ParsedName, NodeValue)
SELECT --id,
DATEADD(MINUTE, DATEPART(TZoffset, SYSDATETIMEOFFSET()), UTC_Time) AS Actual_Time,
EventType,
ParsedName,
NodeValue
FROM (
SELECT
A.B.value('@name[1]', 'varchar(128)') AS EventType,
A.B.value('./@timestamp[1]', 'datetime') AS UTC_Time,
X.N.value('local-name(.)', 'varchar(128)') AS NodeName,
X.N.value('../@name[1]', 'varchar(128)') AS ParsedName,
X.N.value('./text()[1]', 'varchar(max)') AS NodeValue
FROM cte_healthsession
CROSS APPLY EventXML.nodes('/*') AS A (B)
CROSS APPLY EventXML.nodes('//*') AS X (N)
) T
--WHERE NodeName = 'value'
--AND event_type = 'wait_info'
--And now use the standard dynamic pivot to shred.
-- Because of the way the pivot works, the fields are alphabetical; not a big deal, but not easily fixable when running multiserver
DECLARE @SQL AS VARCHAR (MAX)
DECLARE @Columns AS VARCHAR (MAX)
DECLARE @min INT, @max INT, @eventtype VARCHAR(128)
DECLARE @events_to_do TABLE (id INT IDENTITY, eventtype varchar(128))
INSERT INTO @events_to_do
SELECT DISTINCT eventtype FROM #ParsedData
SELECT @min = MIN(id), @max = MAX(id) FROM @events_to_do
WHILE @min <= @max
BEGIN
SET @Columns = NULL
SET @SQL = NULL
SELECT @eventtype = eventtype FROM @events_to_do WHERE id = @min
SELECT @Columns=
COALESCE(@Columns + ',','') + QUOTENAME(ParsedName)
FROM
(
SELECT DISTINCT ParsedName
FROM #ParsedData
--excluded it here, but the tsql_stack can be used to get the exact statement from the plan cache
--see http://blogs.msdn.com/b/extended_events/archive/2010/05/07/making-a-statement-how-to-retrieve-the-t-sql-statement-that-caused-an-event.aspx
WHERE ParsedName <> 'callstack'
AND EventType = @eventtype
) AS B
ORDER BY B.ParsedName
SET @SQL='
SELECT Actual_Time, EventType,' + @Columns + ' FROM
(
SELECT EventType, Actual_Time, ParsedName, NodeValue FROM
#ParsedData where eventtype = ''' + @eventtype + ''') AS source
PIVOT
(max(NodeValue) FOR source.ParsedName IN (' + @columns + ')
)AS pvt order by actual_time'--, attach_activity_id'
PRINT @sql
EXEC (@sql)
set @min = @min+1
END
-- SELECT * FROM #ParsedData