Thursday, January 27, 2011

[Replication] Orphaned agents after Agent crash.

SQL Server Agent crashed on me, and afterwards my home-grown replication monitoring emailed me the following error (interestingly enough, the Alerts we had created didn't fire):


Agent 'MYSERVERNAME-MYDatabase-MyPublication-Servername-525' is
retrying after an error. 14 retries attempted. See agent
job history in the Jobs folder for more details.


Since that's the same error you'd see in Replication Monitor (Start -> Run -> "sqlmonitor"), I went and looked at the agent, which by now showed 16 retries:

[...]
2011-01-27 22:46:17.596 Agent message code 21036. Another distribution agent for the subscription or subscriptions is running, or the server is working on a previous request by the same agent.


Interesting. My guess is that when SQL Server Agent crashed, it left behind agents that are still connected and carrying out their original orders, which blocks the new agents from starting. Well, let's look at what's connected.


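-- Sessions left over from before today whose program name starts with this server's name
SELECT * FROM master..sysprocesses
WHERE program_name LIKE @@servername + '%' -- agent sessions are named server-database-publication-...
AND login_time < CONVERT(CHAR(8),GETDATE(),112) -- logged in before today (midnight)
AND hostname = @@servername -- connected from this server itself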


Bingo. Two SPIDs, both with program names like 'MYSERVERNAME-MYDatabase-MyPublication-Servername', which match the agent named in the email above.

Kill the two SPIDs, and on its next retry the distribution agent spins up successfully.
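
If you'd rather not type the SPIDs by hand, here is a rough sketch that reuses the filter from the query above and only prints a KILL statement for each leftover session, so you can eyeball the list before running anything. The column alias kill_statement and the variable @today are my own names, and the extra check against @@SPID is an added safety assumption, not something I needed at the time.

```sql
-- Sketch only: generates KILL statements for review, it does not execute them.
-- Assumes the orphaned agent sessions logged in before today, as in the query above.
DECLARE @today DATETIME;
SET @today = CONVERT(CHAR(8), GETDATE(), 112);  -- midnight at the start of today

SELECT 'KILL ' + CAST(spid AS VARCHAR(10)) + ';' AS kill_statement,  -- hypothetical alias
       spid,
       program_name,
       login_time
FROM master..sysprocesses
WHERE program_name LIKE @@SERVERNAME + '%'   -- agent sessions start with the server name
  AND login_time < @today                    -- left over from before today
  AND hostname = @@SERVERNAME                -- connected from this server itself
  AND spid <> @@SPID;                        -- never list our own session
```

Copy the kill_statement values you are sure about into a new query window and run them there. KILL does not accept a variable, which is why this builds the statement as text instead of looping over the SPIDs directly.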
