Monitoring Availability Groups Part 3 – Extended Events

This entry is part 3 of 3 in the series Monitoring Availability Groups

In last part of the series, we will touch on how to monitor availability groups using extended events.  I prefer this method over using SQL Agent alerts just because I don’t like relying on the individual servers to send me emails.  I like to have our central monitoring server send out alerts based on queries we run against the servers.

Microsoft creates for by default when we create an Availability Group an AlwaysOn_Health extended events session for us.  The code for it listed below:

CREATE EVENT SESSION [AlwaysOn_health]
    ON SERVER
    ADD EVENT sqlserver.alwayson_ddl_executed ,
    ADD EVENT sqlserver.availability_group_lease_expired ,
    ADD EVENT sqlserver.availability_replica_automatic_failover_validation ,
    ADD EVENT sqlserver.availability_replica_manager_state_change ,
    ADD EVENT sqlserver.availability_replica_state ,
    ADD EVENT sqlserver.availability_replica_state_change ,
    ADD EVENT sqlserver.error_reported
        ( WHERE (   [error_number] = ( 9691 )11
                    OR [error_number] = ( 35204 )
                    OR [error_number] = ( 9693 )
                    OR [error_number] = ( 26024 )
                    OR [error_number] = ( 28047 )
                    OR [error_number] = ( 26023 )
                    OR [error_number] = ( 9692 )
                    OR [error_number] = ( 28034 )
                    OR [error_number] = ( 28036 )
                    OR [error_number] = ( 28048 )
                    OR [error_number] = ( 28080 )
                    OR [error_number] = ( 28091 )
                    OR [error_number] = ( 26022 )
                    OR [error_number] = ( 9642 )
                    OR [error_number] = ( 35201 )
                    OR [error_number] = ( 35202 )
                    OR [error_number] = ( 35206 )
                    OR [error_number] = ( 35207 )
                    OR [error_number] = ( 26069 )
                    OR [error_number] = ( 26070 )
                    OR [error_number] > ( 41047 )
                       AND [error_number] < ( 41056 )
                    OR [error_number] = ( 41142 )
                    OR [error_number] = ( 41144 )
                    OR [error_number] = ( 1480 )
                    OR [error_number] = ( 823 )
                    OR [error_number] = ( 824 )
                    OR [error_number] = ( 829 )
                    OR [error_number] = ( 35264 )
                    OR [error_number] = ( 35265 )
                )
        ) ,
    ADD EVENT sqlserver.hadr_db_partner_set_sync_state ,
    ADD EVENT sqlserver.lock_redo_blocked
    ADD TARGET package0.event_file
        ( SET FILENAME = N'AlwaysOn_health.xel', max_file_size = ( 5 ), MAX_ROLLOVER_FILES = ( 4 ))
    WITH (   MAX_MEMORY = 4096KB ,
             EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS ,
             MAX_DISPATCH_LATENCY = 30 SECONDS ,
             MAX_EVENT_SIZE = 0KB ,
             MEMORY_PARTITION_MODE = NONE ,
             TRACK_CAUSALITY = OFF ,
             STARTUP_STATE = ON
         );
GO

Now that the extended events session is setup we can use some queries to query information about our AGs and error messages happening on our servers.  First to query when the server failover and becomes your primary server query the event availability_replica_state_change.  The first 10 lines just read in the files for the extended event session so you don’t have to identify the exact filenames.  Then we parse the xml event for get the timestamp, previous state, current state, repliace name, and group name for the event FROM the filenames we collected before.  In the WHERE clause were are looking for when the state has changed to PRIMARY_NORMAL indicating a failover to the server.

DECLARE @FileName NVARCHAR(4000)
SELECT @FileName = target_data.value('(EventFileTarget/File/@name)[1]','nvarchar(4000)')
FROM (
		SELECT
			CAST(target_data AS XML) target_data
			FROM sys.dm_xe_sessions s
			JOIN sys.dm_xe_session_targets t
				ON s.address = t.event_session_address
			WHERE s.name = N'AlwaysOn_health'
	) ft
 
SELECT 
    XEData.value('(event/@timestamp)[1]','datetime2(3)') AS event_timestamp,
    XEData.value('(event/data[@name="previous_state"]/text)[1]', 'varchar(255)') AS previous_state,
    XEData.value('(event/data[@name="current_state"]/text)[1]', 'varchar(255)') AS current_state,
    XEData.value('(event/data[@name="availability_replica_name"]/value)[1]', 'varchar(255)') AS availability_replica_name,
	XEData.value('(event/data[@name="availability_group_name"]/value)[1]', 'varchar(255)') AS availability_group_name
FROM (
		SELECT CAST(event_data AS XML) XEData, *
        FROM sys.fn_xe_file_target_read_file(@FileName, NULL, NULL, NULL)
        WHERE object_name = 'availability_replica_state_change'
     ) event_data
WHERE XEData.value('(event/data[@name="current_state"]/text)[1]', 'varchar(255)')  = 'PRIMARY_NORMAL'
ORDER BY event_timestamp DESC;

There are a number of errors it tracks that are specific to AGs that I talked about in part two of this series, then we see they track errors numbers 823 and 824 and 829 common corruptions numbers.  Which tells me we can customize this to capture any errors that we want to be query-able from another system.  So if we replace the definition with code below we can get all our SQL Agent alerts in one spot via querying the extended event session by adding to the WHERE clause the severity >= (19).  Highlighted below is the change to the extended event session.  You will need to drop the old extended event session in order to create this new one.

CREATE EVENT SESSION [AlwaysOn_health]
    ON SERVER
    ADD EVENT sqlserver.alwayson_ddl_executed ,
    ADD EVENT sqlserver.availability_group_lease_expired ,
    ADD EVENT sqlserver.availability_replica_automatic_failover_validation ,
    ADD EVENT sqlserver.availability_replica_manager_state_change ,
    ADD EVENT sqlserver.availability_replica_state ,
    ADD EVENT sqlserver.availability_replica_state_change ,
    ADD EVENT sqlserver.error_reported
        ( WHERE	[severity] >= ( 19 )
			OR (   
			[error_number] = ( 9691 )
                    OR [error_number] = ( 35204 )
                    OR [error_number] = ( 9693 )
                    OR [error_number] = ( 26024 )
                    OR [error_number] = ( 28047 )
                    OR [error_number] = ( 26023 )
                    OR [error_number] = ( 9692 )
                    OR [error_number] = ( 28034 )
                    OR [error_number] = ( 28036 )
                    OR [error_number] = ( 28048 )
                    OR [error_number] = ( 28080 )
                    OR [error_number] = ( 28091 )
                    OR [error_number] = ( 26022 )
                    OR [error_number] = ( 9642 )
                    OR [error_number] = ( 35201 )
                    OR [error_number] = ( 35202 )
                    OR [error_number] = ( 35206 )
                    OR [error_number] = ( 35207 )
                    OR [error_number] = ( 26069 )
                    OR [error_number] = ( 26070 )
                    OR [error_number] > ( 41047 )
                       AND [error_number] < ( 41056 )
                    OR [error_number] = ( 41142 )
                    OR [error_number] = ( 41144 )
                    OR [error_number] = ( 1480 )
                    OR [error_number] = ( 823 )
                    OR [error_number] = ( 824 )
                    OR [error_number] = ( 829 )
                    OR [error_number] = ( 35264 )
                    OR [error_number] = ( 35265 )
                )
        ) ,
	ADD EVENT sqlserver.
    ADD EVENT sqlserver.hadr_db_partner_set_sync_state ,
    ADD EVENT sqlserver.lock_redo_blocked
    ADD TARGET package0.event_file
        ( SET FILENAME = N'AlwaysOn_health.xel', max_file_size = ( 5 ), MAX_ROLLOVER_FILES = ( 4 ))
    WITH (   MAX_MEMORY = 4096KB ,
             EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS ,
             MAX_DISPATCH_LATENCY = 30 SECONDS ,
             MAX_EVENT_SIZE = 0KB ,
             MEMORY_PARTITION_MODE = NONE ,
             TRACK_CAUSALITY = OFF ,
             STARTUP_STATE = ON
         );
GO

To query for error messages above sev 19 and error numbers included for AGs (1480 – Role Change, 35264 – AG Data Movement – Suspended, 35265 – AG Data Movement – Resumed, 41404 – 41404 – AG is offline, and 41405 AG is not ready for automatic failover).  In this query we are parsing out the timestamp, error number, severity, and message.

DECLARE @FileName NVARCHAR(4000)
SELECT @FileName = target_data.value('(EventFileTarget/File/@name)[1]','nvarchar(4000)')
FROM (
		SELECT
			CAST(target_data AS XML) target_data
			FROM sys.dm_xe_sessions s
			JOIN sys.dm_xe_session_targets t
				ON s.address = t.event_session_address
			WHERE s.name = N'AlwaysOn_health'
	) ft

SELECT
    XEData.value('(event/@timestamp)[1]','datetime2(3)') AS event_timestamp,
	XEData.value('(event/data[@name="error_number"]/value)[1]', 'int') AS error_number,
	XEData.value('(event/data[@name="severity"]/value)[1]', 'int') AS severity,
	XEData.value('(event/data[@name="message"]/value)[1]', 'varchar(max)') AS message
FROM (
		SELECT CAST(event_data AS XML) XEData, *
        FROM sys.fn_xe_file_target_read_file(@FileName, NULL, NULL, NULL)
        WHERE object_name = 'error_reported'
     ) event_data
WHERE XEData.value('(event/data[@name="error_number"]/value)[1]', 'int')  IN (1480, 35264, 35265, 41404, 41405)
	OR XEData.value('(event/data[@name="severity"]/value)[1]', 'int') >= 19
ORDER BY event_timestamp DESC;
Series Navigation<< Monitoring Availability Groups Part 2 – SQL Agent Alerts

Related Posts

One thought on “Monitoring Availability Groups Part 3 – Extended Events

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.