Monday, April 13, 2015

[System_Health] 2 dynamic and full parsers for the system_health session

I needed to look at some locking info in the system_health session recently, and realized nobody ever wrote one - all the ones I've seen are explicit calls. 
Everybody does stuff like '/event/data[3]', but there was no comprehensive shredding done.  I decided to fix that.  

While I've written something for SP_SERVER_DIAGNOSTICS a while ago (which will handle 2012+), I needed something specifically for 2008/2008R2, . 
 At it's core is a piece of code I picked up a while ago from stack_overflow and a couple other Xquery posts:

                    A.B.value('@name[1]', 'varchar(128)') AS EventType,
                    A.B.value('./@timestamp[1]', 'datetime') AS UTC_Time,
                    X.N.value('local-name(.)', 'varchar(128)') AS NodeName,
                    X.N.value('../@name[1]', 'varchar(128)') AS ParsedName,
                    X.N.value('./text()[1]', 'varchar(max)') AS NodeValue
             FROM    cte_healthsession
             CROSS APPLY EventXML.nodes('/*') AS A (B)
             CROSS APPLY EventXML.nodes('//*') AS X (N)

All this really does is shred out every node at the various levels, including names and values.  Which for system_health seems to work pretty well (not perfectly; there are some values I need to filter in my pivot)

But once it's gotten all the potential details, the code automatically goes through each group and shreds it.  As a bonus, I realized that by doing it slightly differently, I could return all the details in one table.  The table does have a ton of fields, since many are specific to each type of event - error_reported doesn't have duration or wait_type, for instance, whereas waitinfo doesn't include error or messsage.  However, it means that you can read them in time order quite easily, which makes it easier to correlate issues.

However, I also wound up with a version that returns multiple tables, one for each event type.  (This looks similar to the sp_server_diag parser I wrote)


Now, when you run this (if you're not already doing a ton of stuff with system_health) you may be wondering why you're not seeing anything recent.  There are two potential reasons:

1) events that are too large, most likely deadlocks.  There's no good way around it other than to modify the system_health session to also write to a file (which 2012 does do).  In order to make sure I'm not susceptible to it, I added deadlock monitoring via Event Notifications (add the event DEADLOCK_GRAPH).  I will probably modify the system_health session again and add a file.

2) a really nasty bug in system_health, which Jonathon Kehayias mentioned in . I can easily tell it's happening due to crappy monitoring software - I show a new error message every 5 minutes, although the time is several days ago (and not exactly X days ago, either).  So, I added the system time, as suggested in the post. 
Here's the code, holler if you have any questions!

One table:

Multiple tables:

No comments: