I wish I knew whom to thank for this. An absolute genius idea. Probably Ozar, possibly Bertrand, could be White or a bunch of other #SQLFamily. Warning: once you use this, you'll want it on all your servers.
Premise: run an Extended Event looking for all errors over Severity 10, saving in a 10mb memory buffer within the SQL Server instance. Keep the rolling last errors so you have a decent amount of history. I believe the original was Severity 10 and above, but that one was too chatty for me, whereas 11+ seems to be ideal. But I refer to it as my "Sev10" code, so here we are. This includes all sorts of stuff you won't normally see - errors from stored procedures, errors in replication, errors in SSMS, etc.
Testing it is easy: create session & start it, run "select 1/0", then run the query.
I do have a version that I run that saves logs out, but honestly, that's a different post. This Xevent uses 10mb from the Ring Buffer, constantly filling and keeping the last 10mb of errors. Performance hit seems minimal, and we're doing millions of transactions a day.
Performance has always been the kicker on this - using Xquery directly against the ring buffer is slow as heck, and even after solving that via temp tables, I still ran into problems with one particular server. That made me realize I don't appear to have posted this. By splitting it via NODE to multiple rows to a temp table, then querying the temp table to shred the XML, the performance is vastly improved. In my case, from 8 minutes to 16 seconds. And on most servers, under 5.
Other thoughts - you might consider adding another WHERE clause to the ADD EVENT; maybe filter end-users' machines or SSMS. I like having it, but you do you. Turns out SSMS is damn chatty sometimes, also. "View server state permission was denied " and "Cannot drop the table '#SVer'" abound.
Want to run this in Azure SQL DB? Sure, just do it against each database, replacing the "ON SERVER" with "ON DATABASE". The table names you query against change slightly (it's commented), and the time zone calculation doesn't work yet. Booooooooo.
I'm working on a collector, just haven't finished yet.
Bonus! Frames! Take the Frames provided in the Event_Data, and go use their amazing code to tell you exactly what piece of code in what function/stored procedure, and what line, caused the issue. Is it in a stored procedure calling a stored procedure calling a function? This will tell you where and what. Genius, and kudos to Tom. https://straightforwardsql.com/posts/investigating-errors-with-extended-events/
Cheers!
--1.1, 20191210 first version I know of | |
--1.2, 20230223 major changes to speed it up, after dealing with a particular server, added Azure SQL DB details. | |
--1.21 20230403 adding event_sequence | |
--1.22 20230427 adding CPID Client PID. Find app with posh: gwmi win32_process -cn theclientnamehere -filter "ProcessId=2900" | |
--thebakingdba.blogspot.com | |
/* | |
CREATE EVENT SESSION | |
severity_10plus_errors_XE_memory | |
ON server --use "on database" for Azure SQL DB | |
ADD EVENT sqlserver.error_reported | |
( | |
ACTION(package0.event_sequence,sqlserver.client_app_name,sqlserver.client_hostname,sqlserver.database_id,sqlserver.sql_text,sqlserver.tsql_stack,sqlserver.username,sqlserver.client_pid) | |
--ACTION (sqlserver.sql_text, sqlserver.tsql_stack, sqlserver.database_id, sqlserver.username) | |
WHERE ([severity]> 10) | |
) | |
ADD TARGET | |
package0.ring_buffer | |
(SET | |
max_memory = 10000 ) -- Units of KB. | |
WITH (MAX_DISPATCH_LATENCY = 60SECONDS,STARTUP_STATE = on) | |
GO | |
ALTER EVENT SESSION severity_10plus_errors_XE_memory | |
ON server --use "on database" for Azure SQL DB | |
STATE = START; | |
GO | |
*/ | |
if object_id('tempdb..#xevent') is not null | |
DROP TABLE #xevent | |
CREATE TABLE #xevent (target_data XML) | |
INSERT INTO #xevent (target_data) | |
SELECT target_data | |
FROM sys.dm_xe_session_targets AS st --For Azure SQL DB, use dm_xe_database_session_targets | |
INNER JOIN sys.dm_xe_sessions AS se --For Azure SQL DB, use dm_xe_database_sessions | |
ON CAST(se.address AS BINARY(8)) = CAST(st.event_session_address AS BINARY(8)) | |
WHERE | |
se.name = 'severity_10plus_errors_XE_memory' AND st.target_name = 'ring_buffer' | |
DECLARE @count VARCHAR(10) | |
SELECT @count = target_data.value('/RingBufferTarget[1]/@eventCount','varchar(10)') | |
FROM #xevent --null means it's not on or not "installed" | |
PRINT 'eventCount = ' + @count | |
--need two temp tables to speed shredding; this one will hold one row per event. | |
if object_id('tempdb..#xevent2') is not null | |
DROP TABLE #xevent2 | |
CREATE TABLE #xevent2 (err_timestamp DATETIME2, target_data XML) | |
CREATE INDEX ncidx__xevent2__err_timestamp ON #xevent2(err_timestamp) | |
--shredded-to-node for speed and to filter | |
INSERT INTO #xevent2 | |
SELECT DATEADD(MINUTE, DATEPART(TZoffset, SYSDATETIMEOFFSET()), | |
ed.c.value('(@timestamp)[1]', 'nvarchar(max)')) AS err_timestamp, ed.c.query('.') AS event_data | |
FROM #xevent a | |
cross apply (select CAST(target_data as XML) as event_data) as xevents | |
CROSS APPLY xevents.event_data.nodes('RingBufferTarget/event') ed(c) | |
--final select out. Put filters in the SELECT * WHERE clause | |
;with events_cte as( | |
select err_timestamp, | |
target_data.value('(/event/data[@name="severity"]/value)[1]', 'bigint') AS [err_severity], | |
target_data.value('(/event/data[@name="error_number"]/value)[1]', 'bigint') AS [err_number], | |
target_data.value('(/event/data[@name="message"]/value)[1]', 'nvarchar(512)') AS [err_message], | |
target_data.value('(/event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS [sql_text], | |
target_data.value('(/event/action[@name="client_app_name"]/value)[1]', 'nvarchar(max)') AS client_app_name, | |
target_data.value('(/event/action[@name="client_hostname"]/value)[1]', 'nvarchar(max)') AS client_hostname, | |
target_data.value('(/event/action[@name="database_id"]/value)[1]', 'int') AS database_id, | |
target_data.value('(/event/action[@name="nt_username"]/value)[1]', 'nvarchar(max)') AS nt_username, | |
target_data.value('(/event/action[@name="username"]/value)[1]', 'nvarchar(max)') AS username, | |
target_data.value('(/event/action[@name="session_nt_username"]/value)[1]', 'nvarchar(max)') AS session_nt_username, | |
target_data.value('(/event/action[@name="event_sequence"]/value)[1]', 'nvarchar(max)') AS event_sequence, | |
target_data.value('(/event/action[@name="client_pid"]/value)[1]', 'int') AS client_pid | |
, a.target_data AS event_data --we can leave this off since it makes this run SO MUCH LONGER. | |
FROM #xevent2 a | |
) | |
SELECT events_cte.[err_timestamp], events_cte.[err_severity], events_cte.[err_number], events_cte.[err_message], events_cte.[sql_text], events_cte.client_app_name, events_cte.client_hostname | |
, client_pid, DB_NAME(events_cte.database_id) as database_name, events_cte.nt_username, username, events_cte.session_nt_username, events_cte.event_sequence | |
, events_cte.event_data | |
from events_cte | |
WHERE | |
NOT (events_cte.err_number = 17830 AND events_cte.err_severity = 20) --that network error disconnect message | |
--and (events_cte.client_hostname IS NULL OR events_cte.client_hostname not IN ('thebakingdba')) | |
--and err_timestamp >='20210930' | |
order by err_timestamp desc; | |