Thursday, February 23, 2023

Severity 10 - using Extended Events to get code errors

I wish I knew whom to thank for this.  An absolute genius idea. Probably Ozar, possibly Bertrand, could be White or a bunch of other #SQLFamily.  Warning: once you use this, you'll want it on all your servers. 

Premise: run an Extended Event looking for all errors over Severity 10, saving in a 10mb memory buffer within the SQL Server instance. Keep the rolling last errors so you have a decent amount of history. I believe the original was Severity 10 and above, but that one was too chatty for me, whereas 11+ seems to be ideal. But I refer to it as my "Sev10" code, so here we are. This includes all sorts of stuff you won't normally see - errors from stored procedures, errors in replication, errors in SSMS, etc. 

Testing it is easy: create session & start it, run "select 1/0", then run the query.

I do have a version that I run that saves logs out, but honestly, that's a different post. This Xevent uses 10mb from the Ring Buffer, constantly filling and keeping the last 10mb of errors. Performance hit seems minimal, and we're doing millions of transactions a day.

Performance has always been the kicker on this - using Xquery directly against the ring buffer is slow as heck, and even after solving that via temp tables, I still ran into problems with one particular server. That made me realize I don't appear to have posted this. By splitting it via NODE to multiple rows to a temp table, then querying the temp table to shred the XML, the performance is vastly improved.  In my case, from 8 minutes to 16 seconds. And on most servers, under 5. 

Other thoughts - you might consider adding another WHERE clause to the ADD EVENT; maybe filter end-users' machines or SSMS. I like having it, but you do you. Turns out SSMS is damn chatty sometimes, also. "View server state permission was denied " and "Cannot drop the table '#SVer'" abound.

Want to run this in Azure SQL DB? Sure, just do it against each database, replacing the "ON SERVER" with "ON DATABASE". The table names you query against change slightly (it's commented), and the time zone calculation doesn't work yet. Booooooooo.

I'm working on a collector, just haven't finished yet.

Bonus! Frames! Take the Frames provided in the Event_Data, and go use their amazing code to tell you exactly what piece of code in what function/stored procedure, and what line, caused the issue. Is it in a stored procedure calling a stored procedure calling a function? This will tell you where and what. Genius, and kudos to Tom.  https://straightforwardsql.com/posts/investigating-errors-with-extended-events/

Cheers!


No comments: