Friday, July 8, 2016

[EN] Troubleshooting and fun with differential backups

Came in this morning to one of my EN alerts going off.  No new messages added in the past hour.  (I have enough servers that this is a really good failsafe)

Queue appears up.
SELECT * FROM sys.service_queues

No records in sys.transmission_queue.  Not a total failsafe, but a good start.

Sys.databases is... interesting.  "ACTIVE_BACKUP_OR_RESTORE" for my EN database.  We'll come back to that.

Any backups running?  NO.

Check the actual queue itself
sp_spaceused '[dbo].[ENAudit_SBQueue]'

Well, it's growing, which is actually a good sign (means, IMHO, that the MSDBs on other servers shouldn't be growing, since the messages are making it over here.

How about the activated stored procedure, then?  What's it doing?
SELECT * FROM sys.dm_broker_activated_tasks 
No records, which means that the activated SP isn't firing.  What the hey?  

Means that for some reason it's not activating properly.

Select * from sys.dm_broker_queue_monitors; 
The state for that database's monitor is "NOTIFIED". Huh, that seems odd.

As per http://rusanu.com/2008/08/03/understanding-queue-monitors/ , this means that it's waiting for the activated stored procedure to do a RECEIVE on the database, but won't fire it until that occurs.  Fortunately for us, possibly (I'm writing this as I diagnose the problem), ANY receive on the queue will work.  We may have to do that.

Okay, let's check what we should've started with - event logs:
The queue 517576882 in database 9 has activation enabled and contains unlocked messages but no RECEIVE has been executed for 26964 seconds.

Well, well, well.  Crap.

And then we see these from earlier:

The transaction log for database 'EventNotificationRec' is full due to 'ACTIVE_BACKUP_OR_RESTORE'.

and

The activated proc '[dbo].[Parse_EN_Messages]' running on queue 'EventNotificationRec.dbo.ENAudit_SBQueue' output the following:  'The transaction log for database 'EventNotificationRec' is full due to 'ACTIVE_BACKUP_OR_RESTORE'. '

Looks like we're onto something.

Aha!

BACKUP failed to complete the command BACKUP DATABASE EventNotificationRec. Check the backup application log for detailed messages.

Okay, so we have several things.  Let's look at job status.  Ahhhh.  The DIFF job is "cancelled". 

I think here's what happened - someone saw a failed backup, and ran the wrong job.  The DIFF backup job is disabled - it's the standard Ola Hallengren job.  Which means it tried to do a DIFF of every DB.  Which... would take a while, since we don't do them, so it could take longer than doing full backups.  They panicked and cancelled it.  And for whatever reason, that left the stored procedure/queue in a weird state where it had been notified but wasn't running.

UPDATE: While I thought the DIFF was the culprit, I was wrong.  The FULL backup on that database started at 00:40:00, so it looks like the TLOG (which is capped) filled up somehow while the backup was running.  No idea why - the cap was 3gb which has been totally fine until now.  Yes, the TLOG can fill during a backup, but it typically doesn't have enough going on for that to matter.

Fix?  Easy, and in Rusanu's post:

ALTER QUEUE ENAudit_SBQueue WITH ACTIVATION (STATUS = OFF);
ALTER QUEUE ENAudit_SBQueue WITH ACTIVATION (STATUS = ON);


Now?  Look in sys.dm_broker_queue_monitors.  All we care about is database_id 9, the other ones are the InternalMailQueue/ExternalMailQueue/syspolicy_event_queue in msdb (everybody has those).


And in sys.broker_activated_tasks.  Wow, it's using all 5 SPIDs.  I must've changed it after the last time the other team broke EN.  And I see new messages showing in my ENAudit_Events table.  Yay!






No comments: