Friday, November 9, 2012

[SQL Server 2012 SSIS] Getting a useful email on job failure

(updated 2013/02/07 - stupid blogspot ate my HTML tags. Not helpful. fixed.)
(updated 2013/02/14 - making email more bulletproof)
(updated 2013/07/31 - adding info about SSIS "cancelled")
(update 2013/09/05 - adding info about KB 2829948) 

So... SQL Server 2012 has horribly useless emails for SSIS job failures.  If someone is at PASS, please find out whose idea this was, walk up to them, and punch them for me.
If your job fails, here's the error message you get in job history (and therefore to your blackberry or equivalent at 3am):

Executed as user: mydomain\myuser. Microsoft (R) SQL Server Execute Package Utility  Version 11.0.2100.60 for 64-bit  Copyright (C) Microsoft Corporation. All rights reserved.    Started:  5:00:25 PM  Package execution on IS Server failed. Execution ID: 701, Execution Status:4.  To view the details for the execution, right-click on the Integration Services Catalog, and open the [All Executions] report  Started:  5:00:25 PM  Finished: 5:01:43 PM  Elapsed:  77.938 seconds.  The package execution failed.  The step failed.

As my daughter would say (with amazing derision for someone so young):

So my job has failed, and in order to figure out why (and if it's worth getting up for) I have to:
  1. Start up my desktop
  2. VPN to work (since with my luck it's 3am)
  3. RDP into my work machine
  4. Open SSMS 
  5. Connect to that server
  6. Open the SSISDB dashboard report. (oh, and it only holds data for the last 24 hours)
Punching's too good for them.  (And is part of why I disdain SSIS, which seems to actively hate Production DBAs)

Fortunately, the details are stored in a table: [ssisdb].[catalog].[event_messages].

According to SQL Sentry (who we use for a lot of jobs), we could also do this:
I don't use it, because I don't like the changes necessary to put it in place.
This appears fairly minor, I think - it changes each job step to invoke SSIS via the DTS Execute Package Utility directly, changes the job to "Log To Table" the job step history, and other stuff.  However, it also means we can't easily change some of the configurations, among other things.

(However, I did make a version that works with SQL Sentry - instead of adding job steps on SSIS jobs, add a setting to SQL Sentry to run sql code on failure:

So as a proof of concept fix that I use in production now, I wrote a stored procedure that uses the event_messages table and Job Tokens to send an email with the failure info.  Sounds imposing, but fairly easy... and should be scriptable (we're also looking at ways to do it without having to add a job step).

At its core, here's the code we're running.  Run this to get the most recent failure info.

  FROM ssisdb.[catalog].[event_messages]
WHERE event_name = 'OnError'
       AND operation_id IN (SELECT MAX(operation_id)
  FROM ssisdb.[catalog].[event_messages]
  WHERE event_name = 'OnError')
ORDER BY message_time ASC

1. Run the below stored procedure in your "DBA tools" database, whatever it's called, remembering to change the email settings.
2. IN SSMS 2012, create a job step with the following code, and make both the On Success and On Failure be "Quite the job reporting failure", so your other methods will pick up on it.

EXEC DBA_tools.dbo.ssis_job_failure_info @job_id = $(ESCAPE_SQUOTE(JOBID)), @current_step_id = $(ESCAPE_SQUOTE(STEPID))

While it looks complex, all it's doing is using Token Replacement (which is enabled by default) to send the job_id and the step ID.  The SP then uses that to find the most recently failed job step that ISN'T itself. (though on writing this I realize the job step doesn't actually fail until the code has run - doh!)

3. Change all your job steps' "On Failure" from "Quit the job reporting failure" to "run step X", where X is the job step created in step 2 above.

4. Change the next-to-last-step's "On Success"

5. Accept the "The On Success action of the last Step will be changed... to Quit With Success" 

6. Test!

One note: if the job fails because of an SSIS timeout (see:, no email is sent. That's because, while the job failed, technically the package doesn't know it failed, it thinks it was cancelled... or something like that.  Read the Connect item for more details.  Supposedly fixed in SP1, but haven't had it happen up to now. but if it happens to you, there are a handful of fixes available.

  1. Add the 5 indexes as per
  2. Change Retention period as per: 
    1. Don't just set it to 90 if you have a lot of history! You need to iterate through days, running the maint job between each.
  3. Change DB size settings as per same post
  4. Ensure Database is in SIMPLE mode.
  5. (possibly turn on snapshot isolation)
  6. (I also turn on Async Statistics Update) 
  7. CHECK THE EVENT LOGS on the server for details.  It won't be in the SQL Server's ERRORLOG.
  8. - There is a fix from MS!   SP1 CU4 has a fix.  It says it just adds indexes; I suspect it changes one of the delete SPs as well, since adding the indexes did NOT fix it for me.

Enjoy - TBD

CREATE PROCEDURE ssis_job_failure_info 
@job_id UNIQUEIDENTIFIER, @current_step_id INT AS DECLARE @full_ssis_command VARCHAR(4000) , @job_step_id INT , @package_name VARCHAR(4000) , @tableHTML NVARCHAR(MAX) , @MailSubject VARCHAR(200) , @job_name VARCHAR(100) --token replacement happens in the job. --select @job_id = $(ESCAPE_SQUOTE(JOBID)), @current_step_id = $(ESCAPE_SQUOTE(STEPID)) SELECT @job_name = name FROM msdb.dbo.sysjobs WHERE job_id = @job_id --determine which job_step failed. SELECT top 1 @job_step_id = step_id FROM msdb.dbo.sysjobhistory WHERE run_status <> 1 AND step_id > 0 AND job_id = @job_id AND step_id <> @current_step_id ORDER BY instance_id DESC --now find the package name SELECT @full_ssis_command = command FROM msdb.dbo.sysjobsteps WHERE job_id = @job_id AND step_id = @job_step_id IF @full_ssis_command LIKE '%.dtsx%' BEGIN SELECT @package_name = RIGHT(LEFT(@full_ssis_command,CHARINDEX('.dtsx',@full_ssis_command)-1),CHARINDEX('\',REVERSE(LEFT(@full_ssis_command,CHARINDEX('.dtsx',@full_ssis_command)-1)))-1)+'.dtsx' END --goes in the error log, if you have one SELECT [message_time] ,[extended_info_id] ,[package_name] ,[message_source_name] ,[subcomponent_name] ,[package_path] ,[execution_path] ,left([message],400) FROM ssisdb.[catalog].[event_messages] WHERE [package_name] = @package_name AND event_name = 'OnError' AND operation_id IN (SELECT MAX(operation_id) FROM ssisdb.[catalog].[event_messages] WHERE [package_name] = @package_name) ORDER BY message_time ASC SELECT @MailSubject = 'Job Failure on ' + @@servername + ': ' + @job_name FROM msdb.dbo.sysjobs WHERE job_id = @job_id SET @tableHTML = N'<H3>Error for job ' + @job_name + '</H3>' + N'<table border="1">' + N'<th>Message_Time</th>' + N'<th>Extended_info_id</th>' + N'<th>Package_Name</th>' + N'<th>Message_Source_Name</th>' + N'<th>subcomponent_name</th>' + N'<th>package_path</th>' + N'<th>execution_path</th>' + N'<th>message</th>' + CAST(( SELECT td = CONVERT(VARCHAR(24),message_time,121) , '' , td = CONVERT(VARCHAR(10), ISNULL(extended_info_id,'')) , '' , td = CONVERT(VARCHAR(50), RTRIM(LTRIM(ISNULL(package_name,'')))) , '' , td = CONVERT(VARCHAR(50),RTRIM(LTRIM(ISNULL([message_source_name],'')))) , '' , td = CONVERT(VARCHAR(50), RTRIM(LTRIM(ISNULL([subcomponent_name],'')))) , '' , td = CONVERT(VARCHAR(50), RTRIM(LTRIM(ISNULL([package_path],'')))) , '' , td = CONVERT(VARCHAR(50),RTRIM(LTRIM(ISNULL([execution_path],'')))) , '' , td = CONVERT(VARCHAR(400),RTRIM(LTRIM(left(ISNULL([message],''),400)))) FROM ssisdb.[catalog].[event_messages] WHERE [package_name] = @package_name AND event_name = 'OnError' AND operation_id IN (SELECT MAX(operation_id) FROM ssisdb.[catalog].[event_messages] WHERE [package_name] = @package_name AND event_name = 'OnError') ORDER BY event_messages.message_time FOR XML PATH('tr') , TYPE ) AS NVARCHAR(MAX)) + N'</table>' ; --PRINT @tableHTML EXEC msdb.dbo.sp_send_dbmail @profile_name = 'the_baking_dba', @recipients = '', @subject = @MailSubject, @body = @tableHTML, @body_format = 'HTML' ; GO


Anonymous said...
This comment has been removed by a blog administrator.
Anonymous said...
This comment has been removed by a blog administrator.
magnifictraining said...
This comment has been removed by a blog administrator.
Unknown said...

I agree 100%, and this is why I'm still using file system deployment. The biggest problem I see is that you have no way to identify which errors were reported by a specific job. In an environment where you have hundreds of jobs, and some are calling the same packages, SQL Server is no longer a viable solution.

bourgon said...

Oh! I honestly hadn't thought of that. Okay, I think I know a way to do it. Give me a week and I'ok see what I can do. it will require a new blog post... Blogger doesn't okay well with embedded scripts, which is how I link the gists.