Tuesday, May 12, 2015

[Powershell] EVENTLOG_TRACKING - Capturing multiple servers' Event Logs to a database

Wow, that's what I call timing.  T-SQL Tuesday happens to be covering something I just finished building!
I'm usually not much for the explanatory posts.  I typically try to just solve a problem, then post it here.

(yes, there's code, glorious already-in-production code)

TL;DR - store classic Application/System event logs, minus the crap, in a database. Fast - 2 minutes for 90 servers.


Problem: we had no insight into our servers' event logs. Our in-house application would throw errors, and we needed a process that could rip through the server logs and send emails to the systems team when there were problems. 

Solution, 5 years ago: I banged something together in a hurry, using Microsoft's LogParser app, to capture event logs to a database.  Why?  Yes, I'm the DBA, but I've always been the "special teams" - need to take hex dumps from a Linux app's log and convert them to a format we can query off of?  I'm your guy.  Sed, tcsh, shell scripts, Perl (so very rusty these days), DOS batch files, Un*x, Powershell, etc.  And nobody else had time, so I wrote one using the quickest path I could.  And it worked!  Worked well, even.  And so it ran in the background for 5 years...

Fast forward to now.  The fact that I'd been saving them saved the day (both at the time, and recently), but I saw that it needed improving.  Looking at what I wrote... I don't shudder, but it needed a rewrite.  Yes, the systems folk now have tools... that only retain a week's worth of logs.  So yet again... DBA to the rescue!

(Of course, now that I write this - I wonder if I should've just built a process to pull from that tool's table...)

So how should we overhaul this?  LogParser hasn't seen any love in years, every time I log on to the box I need to clear a bunch of "LogParser has crashed" messages, and while event logs have changed some, I still only need the "classic" logs.  And it's slow.  Ye gods, it's slow.  4 hours.  Partly because it runs in serial (it's a loop that executes against one server, then goes to the next), partly because LogParser is dated, and partly because... um, it's slow, no idea why, it shouldn't be.

So, I rewrote it, using PowerShell!  (and it only takes 2 minutes for 90 servers!)

Several days later I've got it working, and have learned some important things I'll share with you. 

  1. Every version has peccadillos.  Get-WinEvent doesn't work with -FilterHashtable in 2008 (but works in 2008 R2). Get-EventLog only does classic logs. Win32_NTLogEvent is faster, but requires WMI permissions. Fields differ.  One of the event type codes differs between the different methods (so the same event may come back as a 3 from one method and a 4 from another).
  2. Filtering is weird.  Everything I read says that server-side filtering over slow pipes should be faster (makes sense), and that only Get-WinEvent does server-side filtering... but in every test I've done, over multiple environments, the oldest implementation (gwmi -Query against Win32_NTLogEvent) is the fastest.  If everything's on the same network it's close, but over the WAN the difference is apparent.
  3. That being said, if you want the newer style of logs, you must use Get-WinEvent - neither Get-EventLog nor gwmi captures them.  *Sad Trombone*
  4. WMI uses its own datetime format (DMTF), which means you need to pay attention when setting dates (see the sketch after this list).
  5. -FilterXml is a beating, but it's required for decent filtering on 2008 with Get-WinEvent.
  6. Windows Remoting is awesome - if your systems people will set it up.  Then you can run the commands remotely and ensure the filtering occurs on the server.  Setting it up (more so on older boxes) is a pain.
  7. The permissions required for WMI are NOT the same as those required for Get-EventLog *sigh*.
  8. Being able to say (Get-Date).Second and get back the current second is pretty cool.
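
To make #1 and #4 concrete, here's a minimal sketch of the two query styles and the date handling.  The server name is a placeholder; the class, properties, and converter are standard:

# WMI wants dates in its own DMTF format, so convert before building the WQL query.
$since = [System.Management.ManagementDateTimeConverter]::ToDmtfDateTime((Get-Date).AddDays(-1))
$query = "SELECT * FROM Win32_NTLogEvent WHERE Logfile = 'Application' AND TimeGenerated > '$since'"
Get-WmiObject -ComputerName 'SERVER01' -Query $query

# The Get-WinEvent equivalent takes a plain [datetime] in -FilterHashtable
# (which, per #1, didn't work for me on plain 2008):
Get-WinEvent -ComputerName 'SERVER01' -FilterHashtable @{ LogName = 'Application'; StartTime = (Get-Date).AddDays(-1) }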

One catch with my code, and I have no idea why: sometimes, even running in serial, a server won't get inserted.  I suspect it's some weird interaction between BULK INSERT, Write-DataTable and FULLTEXT - maybe the SP runs before the data has finished committing.  So I have an additional job step that runs over everything one last time, once it's all done.

One other oddity - because of the way logs work, clusters have issues.  The logs show up on both nodes, and the computername shows up as the listener name.  And then you see the computername of the physical boxes... it's a pain.  I'll have to fix it.  Later.




Lower-level overview:

  • Grab a list of servers from our central repository and combine that with our EventLog view (each month gets its own table, combined into a view) to figure out what we need to get - that is, what time and record number we care about for each log (App and Sys); there's a sketch of that query just after this list.  Alas, the Security log has... issues, not the least of which is the amount of growth/size, but also the way it's generated means that querying it is beyond slow.

  • Truncate our staging table.  At the end of the process, there may be leftover records.  If we can't get them into the database we need to look at why, but we'll always have some due to our cluster. (Long story)

  • Now we run the next part in parallel.  I've got it running 8 threads at once, so it doesn't overwhelm the database or the network.  The way it's structured, the "master" job will stop everything after a set number of seconds (defined at the end of the block).  When our systems people got the permissions working properly, you would see the job run for the full time, then forcibly end - which means it didn't finish all of the threads.  So when it started up again, it would try again, get as far as it could, then stop again.  Fortunately, that's pretty far, and after a half-dozen of the 30-minute runs it was caught up.  Now it takes 1.5 minutes for 70+ servers.  Not too shabby.
 
  • And for each server...

  • Get the version of the server we're querying.  I did that originally because everything online says Get-WinEvent is sooo much quicker, server-side filtering, etc., etc.  LIES.  In my testing the closest I ever got was "not any slower", but most of the time gwmi was 4 times faster (and I've seen 40 times) than either Get-EventLog or Get-WinEvent.  So it all uses the same path now, but I left the version check in there in case we need it.  Theoretically we could get it down even faster using remoting... but setting that up on servers is decidedly Non-Trivial and increases your Windows surface area.  (However, you are MORE than welcome to do so - expect a possible order-of-magnitude speed boost.)
  • Now, use gwmi (Get-WmiObject) with a query, built off the logs, to get new records.  It uses the -Query parameter and includes a TimeGenerated/Logfile/RecordNumber filter.  Now that I think about it, it's possible that one event gets written before a different event from a different subsystem, so we could conceivably lose events.  I'll have to look at that.
  • Next, take the datatable results and insert them into a view.  We use a view because adding stuff like ID (an identity column) and insert_datetime rapidly becomes Non-Trivial, but inserting through the view handles that.  (Which is a different set of jobs: one that creates a new monthly table and modifies the view.)
  • Finally, run the EventLog_Blacklist_Removal stored procedure.  That one deserves its own little writeup, but practically, it looks at a blacklist table and deletes anything that matches.  It's fast because of fulltext indexing.  Why do all that?  There's far too much stuff we don't care about.
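
As promised above, here's roughly what that "what do we still need?" query looks like.  The object names (Server_List, EventLog_View) and columns are illustrative stand-ins, not the exact schema; EventLogs_To_Get is the driver table shown at the bottom of the post:

-- For each server and each classic log, find the newest record we already have,
-- falling back to the configured defaults, so the collector only asks for newer rows.
SELECT  s.server_name,
        g.EventLog,
        COALESCE(MAX(e.RecordNumber), g.minimum_recordnumber)                        AS last_recordnumber,
        COALESCE(MAX(e.TimeGenerated), DATEADD(DAY, -g.days_back_to_get, GETDATE())) AS last_timegenerated
FROM    dbo.Server_List AS s
CROSS JOIN dbo.EventLogs_To_Get AS g
LEFT JOIN dbo.EventLog_View AS e
       ON e.ComputerName = s.server_name
      AND e.Logfile      = g.EventLog
GROUP BY s.server_name, g.EventLog, g.minimum_recordnumber, g.days_back_to_get;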

 Hope this helps you!

######============CODE=============#####

Creating the monthly table and views (set this up as a monthly job, probably on the 15th or so, so that you have plenty of time to fix it if it borks):
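
A sketch of what that monthly job does, with illustrative names and columns (EventLog_yyyyMM tables UNION ALLed into an EventLog_View).  The real script generates the full list of monthly tables; only two are shown here:

-- Build next month's table name, e.g. EventLog_201506.
DECLARE @suffix char(6) = CONVERT(char(6), DATEADD(MONTH, 1, GETDATE()), 112),
        @sql    nvarchar(max);

SET @sql = N'
CREATE TABLE dbo.EventLog_' + @suffix + N' (
    ID              bigint IDENTITY(1,1) PRIMARY KEY,
    ComputerName    nvarchar(256),
    Logfile         nvarchar(64),
    RecordNumber    bigint,
    TimeGenerated   datetime,
    EventCode       int,
    EventType       int,
    SourceName      nvarchar(256),
    Message         nvarchar(max),
    insert_datetime datetime NOT NULL DEFAULT GETDATE()
);';
EXEC (@sql);

-- Rebuild the view to include the new month.  (For INSERTs through the view to behave
-- as a true partitioned view, each monthly table also needs a CHECK constraint on the
-- partitioning column; omitted here for brevity.)
SET @sql = N'
ALTER VIEW dbo.EventLog_View AS
    SELECT * FROM dbo.EventLog_' + @suffix + N'
    UNION ALL
    SELECT * FROM dbo.EventLog_201505;';
EXEC (@sql);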




The PowerShell:
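
A stripped-down sketch of the per-server pull (the real version wraps this in jobs throttled to 8 at a time and adds the version check described above).  It assumes $work holds the rows from the watermark query, that Out-DataTable / Write-DataTable (the widely shared SqlBulkCopy wrapper functions) are already dot-sourced, and that CENTRALSQL / ServerLogs / EventLog_Staging are placeholder names:

foreach ($row in $work) {
    # WMI wants its own DMTF datetime format.
    $since = [System.Management.ManagementDateTimeConverter]::ToDmtfDateTime($row.last_timegenerated)

    $query = "SELECT ComputerName, Logfile, RecordNumber, TimeGenerated, EventCode, EventType, SourceName, Message " +
             "FROM Win32_NTLogEvent " +
             "WHERE Logfile = '$($row.EventLog)' " +
             "AND RecordNumber > $($row.last_recordnumber) " +
             "AND TimeGenerated > '$since'"

    try {
        $events = Get-WmiObject -ComputerName $row.server_name -Query $query -ErrorAction Stop

        if ($events) {
            # Convert WMI's DMTF timestamps back to real datetimes before loading.
            $dt = $events |
                  Select-Object ComputerName, Logfile, RecordNumber,
                                @{ n = 'TimeGenerated'; e = { $_.ConvertToDateTime($_.TimeGenerated) } },
                                EventCode, EventType, SourceName, Message |
                  Out-DataTable

            Write-DataTable -ServerInstance 'CENTRALSQL' -Database 'ServerLogs' `
                            -TableName 'dbo.EventLog_Staging' -Data $dt
        }
    }
    catch {
        Write-Warning "Failed to pull the $($row.EventLog) log from $($row.server_name): $_"
    }
}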



Stored procedures (probably have to run these after the tables are all built):
Eventlog_Blacklist_Removal - this runs after the insert, and its job is to delete the crap records (as per the blacklist table) and insert into the partitioned view.
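
Roughly, the shape of it - object names (EventLog_Staging, EventLog_Blacklist, EventLog_View) and the match_phrase column are illustrative stand-ins, but the CONTAINS-against-a-fulltext-index trick is the point:

CREATE PROCEDURE dbo.EventLog_Blacklist_Removal
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @phrase nvarchar(400), @search nvarchar(500);

    -- For each blacklist entry, delete matching rows from staging.  CONTAINS with a
    -- variable keeps this simple, and the fulltext index on Message keeps it fast.
    DECLARE blacklist CURSOR LOCAL FAST_FORWARD FOR
        SELECT match_phrase FROM dbo.EventLog_Blacklist;

    OPEN blacklist;
    FETCH NEXT FROM blacklist INTO @phrase;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @search = N'"' + @phrase + N'"';   -- exact-phrase match

        DELETE FROM dbo.EventLog_Staging
        WHERE CONTAINS(Message, @search);

        FETCH NEXT FROM blacklist INTO @phrase;
    END

    CLOSE blacklist;
    DEALLOCATE blacklist;

    -- Whatever survived goes into the partitioned view (which routes it to the
    -- current month's table); staging is then emptied for the next run.
    INSERT INTO dbo.EventLog_View (ComputerName, Logfile, RecordNumber, TimeGenerated,
                                   EventCode, EventType, SourceName, Message)
    SELECT ComputerName, Logfile, RecordNumber, TimeGenerated,
           EventCode, EventType, SourceName, Message
    FROM dbo.EventLog_Staging;

    DELETE FROM dbo.EventLog_Staging;
END
GO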



EventLog_Blacklist_Stragglers - for some reason, sometimes a server's logs aren't cleared.  As per the job, this step runs second, and its whole purpose is to clean up any stragglers that weren't properly processed originally.
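
The simplest shape for this - and a reasonable sketch if you're building your own - is just: if anything is still sitting in staging once the threads are done, run the same removal again:

CREATE PROCEDURE dbo.EventLog_Blacklist_Stragglers
AS
BEGIN
    SET NOCOUNT ON;

    -- Anything left behind by the parallel runs gets one more pass.
    IF EXISTS (SELECT 1 FROM dbo.EventLog_Staging)
        EXEC dbo.EventLog_Blacklist_Removal;
END
GO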



Job:
The job that runs all this:
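
If you're building the job from scratch, the skeleton looks something like this (paths, database names, and the schedule are placeholders; tweak retry/notification settings to taste):

USE msdb;
GO
EXEC dbo.sp_add_job @job_name = N'EventLog_Tracking';

-- Step 1: the PowerShell collector (which calls EventLog_Blacklist_Removal as it goes).
EXEC dbo.sp_add_jobstep
    @job_name          = N'EventLog_Tracking',
    @step_name         = N'Collect event logs',
    @subsystem         = N'CmdExec',
    @command           = N'powershell.exe -ExecutionPolicy Bypass -File "D:\Scripts\Get-EventLogs.ps1"',
    @on_success_action = 3;   -- go to the next step

-- Step 2: sweep up anything the parallel runs left behind.
EXEC dbo.sp_add_jobstep
    @job_name      = N'EventLog_Tracking',
    @step_name     = N'Process stragglers',
    @subsystem     = N'TSQL',
    @database_name = N'ServerLogs',
    @command       = N'EXEC dbo.EventLog_Blacklist_Stragglers;';

EXEC dbo.sp_add_jobschedule
    @job_name             = N'EventLog_Tracking',
    @name                 = N'Every 30 minutes',
    @freq_type            = 4,   -- daily
    @freq_interval        = 1,
    @freq_subday_type     = 4,   -- units of minutes
    @freq_subday_interval = 30;

EXEC dbo.sp_add_jobserver @job_name = N'EventLog_Tracking';
GO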


Finally, because it's so long: the tables & views involved. 
Don't forget: you have to run "sp_fulltext_database 'enable'" first, and make sure fulltext is enabled.  Otherwise it's SLOW.
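
Here's a sketch of the supporting objects, again with illustrative names and columns (the monthly tables and view are covered above).  The fulltext index on the staging table's Message column is what makes the blacklist delete fast:

EXEC sp_fulltext_database 'enable';   -- as noted above; harmless no-op on newer versions
GO
CREATE TABLE dbo.EventLog_Staging (
    staging_id      bigint IDENTITY(1,1) NOT NULL,
    ComputerName    nvarchar(256),
    Logfile         nvarchar(64),
    RecordNumber    bigint,
    TimeGenerated   datetime,
    EventCode       int,
    EventType       int,
    SourceName      nvarchar(256),
    Message         nvarchar(max),
    CONSTRAINT PK_EventLog_Staging PRIMARY KEY (staging_id)
);
GO
CREATE FULLTEXT CATALOG EventLogFTC AS DEFAULT;
GO
CREATE FULLTEXT INDEX ON dbo.EventLog_Staging (Message)
    KEY INDEX PK_EventLog_Staging
    WITH CHANGE_TRACKING AUTO;
GO
CREATE TABLE dbo.EventLog_Blacklist (
    blacklist_id    int IDENTITY(1,1) PRIMARY KEY,
    match_phrase    nvarchar(400) NOT NULL,
    insert_datetime datetime      NOT NULL DEFAULT GETDATE()
);
GO
CREATE TABLE dbo.EventLogs_To_Get (
    id                   tinyint      NOT NULL,
    EventLog             varchar(50)  NOT NULL,
    days_back_to_get     int          NOT NULL,
    minimum_recordnumber bigint       NOT NULL,
    insert_datetime      datetime     NOT NULL DEFAULT GETDATE()
);
GO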





Finally, some default records for the tables:
EventLogs_To_Get:

INSERT INTO dbo.EventLogs_To_Get
SELECT * FROM (
    SELECT N'1' AS [tinyint], N'Application' AS [EventLog], N'1' AS [days_back_to_get], N'1' AS [minimum_recordnumber], N'2015-04-28 16:36:34.470' AS [insert_datetime] UNION ALL
    SELECT N'2' AS [tinyint], N'System' AS [EventLog], N'1' AS [days_back_to_get], N'1' AS [minimum_recordnumber], N'2015-04-28 16:36:40.277' AS [insert_datetime]
) t;






Some Blacklist filter samples: remember this uses fulltext indexing.  You could probably get fancier on the match.  Not doing that yet. 

The WinHTTP Web Proxy Auto-Discovery Service service
WinHttpAutoProxySvc
Database backed up. Database
The system uptime is
Configuration option % changed from 1 to 1. Run the RECONFIGURE statement to install.
CHECKDB for database % finished without errors
Package % finished successfully.
Package % started.
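
And if you want to seed them straight into the (illustrative) blacklist table from the sketches above:

INSERT INTO dbo.EventLog_Blacklist (match_phrase)
VALUES (N'The WinHTTP Web Proxy Auto-Discovery Service service'),
       (N'WinHttpAutoProxySvc'),
       (N'Database backed up. Database'),
       (N'The system uptime is'),
       (N'Configuration option % changed from 1 to 1. Run the RECONFIGURE statement to install.'),
       (N'CHECKDB for database % finished without errors'),
       (N'Package % finished successfully.'),
       (N'Package % started.');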
