On Tuesday, the SCOM 2007 R2 infrastructure of one of my customers showed the following symptoms:
- Newly installed agents display as "Not Monitored" in the Operations Console.
- Agents show as being in maintenance mode in the Operations Console, yet the workflows are not actually unloaded by the System Center Management service on the monitored computer.
- Configuration changes, new rules or monitors, or overrides are not applied to some agents.
- The Operations Manager event log on one or more agents will display event 21026, indicating that the current configuration is still valid, even though the configuration for these agents should have been updated.
- The file "OpsMgrConnector.Config.xml" in the management group folder under "Health Service State\Connector Configuration Cache" does not update for long periods of time relative to the rest of the management group on one or more agents.
- SCOM email alerts are not triggered.
- The RMS event log is flooded with event 21042: Operations Manager has discarded 1 items in management group MGTGROUP, which came from $$ROOT$$. These items have been discarded because no valid route exists at this time. This can happen when new devices are added to the topology but the complete topology has not been distributed yet. The discarded items will be regenerated.
- The RMS event log contains event 29106: The request to synchronize state for OpsMgr Health Service identified by "a340d2a9-ab1b-2e53-ca78-d303510c831d" failed due to the following exception "Microsoft.EnterpriseManagement.Common.DataItemDoesNotExistException: An instance was deleted before its properties could be read."
- On the RMS, if you delete the Health Service State folder and restart the 3 SCOM services, the file "OpsMgrConnector.Config.xml" is not regenerated, and on the MS you get event 20070: The OpsMgr Connector connected to RMSFQDN, but the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration. Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.
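The stale "OpsMgrConnector.Config.xml" symptom from the list above can be checked with a small script rather than by eye. The following is only a minimal sketch: the function names and the 24-hour threshold are assumptions made for illustration, and the right threshold depends on how often configuration changes in your management group.

```python
import os
import time
from pathlib import Path


def config_age_hours(config_path, now=None):
    """Return the hours since the file was last modified, or None if it is missing."""
    path = Path(config_path)
    if not path.is_file():
        return None
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) / 3600.0


def is_stale(config_path, max_age_hours=24.0, now=None):
    """Treat a missing file, or one older than max_age_hours, as stale.

    max_age_hours=24.0 is an arbitrary example value, not a SCOM default.
    """
    age = config_age_hours(config_path, now)
    return age is None or age > max_age_hours
```

Called with the full path to "OpsMgrConnector.Config.xml" on an agent or on the RMS, `is_stale` returns True when the file is absent or has not been rewritten within the chosen window.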
Concerning the resolution, the first step is to make sure you have a good backup of all your Operations Manager databases. /!\ The actions below are not supported; perform them at your own risk. /!\
Connect to the OperationsManager database and run the following query:
select MT.BaseManagedEntityId, BME.BaseManagedEntityId
from BaseManagedEntity BME
left outer join MT_Computer MT on MT.BaseManagedEntityId = BME.BaseManagedEntityId
where BME.BaseManagedTypeId = (select ManagedTypeId from ManagedType where TypeName = N'Microsoft.Windows.Computer')
and MT.BaseManagedEntityId IS NULL
This query looks for objects that have not been completely and correctly deleted from the database. Normally it returns nothing, but in our case it returned a BaseManagedEntityId. This GUID corresponds to an object that has been deleted but still has some references in the database.
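To make the logic of that query easier to follow, here is a toy reproduction using an in-memory SQLite database. This is only an illustration: the three tables are drastically simplified stand-ins for the real OperationsManager schema, the IDs are made-up strings instead of GUIDs, and the T-SQL N'' prefix is dropped. The left outer join keeps every computer-typed row of BaseManagedEntity, and the IS NULL filter keeps only those with no matching row in MT_Computer.

```python
import sqlite3

# Same shape as the query above, reduced to the column we care about.
ORPHAN_QUERY = """
SELECT BME.BaseManagedEntityId
FROM BaseManagedEntity BME
LEFT OUTER JOIN MT_Computer MT
    ON MT.BaseManagedEntityId = BME.BaseManagedEntityId
WHERE BME.BaseManagedTypeId = (
        SELECT ManagedTypeId FROM ManagedType
        WHERE TypeName = 'Microsoft.Windows.Computer')
  AND MT.BaseManagedEntityId IS NULL
"""


def build_toy_db():
    """Create hypothetical, simplified tables with one orphaned computer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ManagedType (ManagedTypeId TEXT PRIMARY KEY, TypeName TEXT);
        CREATE TABLE BaseManagedEntity (
            BaseManagedEntityId TEXT PRIMARY KEY, BaseManagedTypeId TEXT);
        CREATE TABLE MT_Computer (BaseManagedEntityId TEXT PRIMARY KEY);
        INSERT INTO ManagedType VALUES ('type-computer', 'Microsoft.Windows.Computer');
        INSERT INTO ManagedType VALUES ('type-other', 'Some.Other.Type');
        INSERT INTO BaseManagedEntity VALUES ('healthy-computer', 'type-computer');
        INSERT INTO BaseManagedEntity VALUES ('orphaned-computer', 'type-computer');
        INSERT INTO BaseManagedEntity VALUES ('other-object', 'type-other');
        INSERT INTO MT_Computer VALUES ('healthy-computer');
    """)
    return conn


def find_orphans(conn):
    """Return IDs of computer entities that have no MT_Computer row."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY)]
```

In the toy data, only the computer without an MT_Computer row is reported; the object of another type is ignored even though it has no MT_Computer row either, because of the type filter.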
We now have to identify which computer is behind this ID. To do so, run the following query:
select * from basemanagedentity where basemanagedentityid in ('IDFROMTHEPREVIOUSQUERY')
In our case, it returned an Exchange server. I did a quick check of the server itself: the SCOM agent was installed and there was nothing strange in the log. Back in the SCOM console, however, the computer was nowhere to be found in the Agent Managed view. The object had indeed been deleted, but not all of its references.
As this server seems to be the cause of our trouble, we will remove its remaining references from the database by running the following query, which flags the object as deleted.
Begin TRAN
DECLARE @NetworkDeviceID as UniqueIdentifier
DECLARE @Name as nVarChar(30)
Set @NetworkDeviceID = 'IDFROMTHEPREVIOUSQUERY'
update basemanagedentity set isdeleted = 1 where basemanagedentityid = @NetworkDeviceID
COMMIT TRAN
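The statement above commits unconditionally. Since this is an unsupported change on a production database, one cautious variation is to confirm that exactly one row was touched before committing, and to roll back otherwise. The sketch below shows that pattern on a toy SQLite table; the table layout, IDs, and function names are hypothetical and are not the real OperationsManager schema. In T-SQL, the equivalent check would rely on @@ROWCOUNT.

```python
import sqlite3


def build_toy_table():
    """Create a hypothetical, simplified BaseManagedEntity table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE BaseManagedEntity (
            BaseManagedEntityId TEXT PRIMARY KEY,
            IsDeleted INTEGER NOT NULL DEFAULT 0);
        INSERT INTO BaseManagedEntity (BaseManagedEntityId) VALUES ('orphaned-computer');
        INSERT INTO BaseManagedEntity (BaseManagedEntityId) VALUES ('healthy-computer');
    """)
    return conn


def soft_delete(conn, entity_id):
    """Flag one entity as deleted; commit only if exactly one row changed."""
    # sqlite3 opens a transaction implicitly before the UPDATE.
    cursor = conn.execute(
        "UPDATE BaseManagedEntity SET IsDeleted = 1 "
        "WHERE BaseManagedEntityId = ?",
        (entity_id,),
    )
    if cursor.rowcount != 1:
        # Wrong or mistyped ID: undo and refuse to continue.
        conn.rollback()
        raise ValueError(
            f"expected to update 1 row, updated {cursor.rowcount}; rolled back")
    conn.commit()


def is_deleted(conn, entity_id):
    """Return True if the entity is flagged as deleted."""
    row = conn.execute(
        "SELECT IsDeleted FROM BaseManagedEntity WHERE BaseManagedEntityId = ?",
        (entity_id,),
    ).fetchone()
    return bool(row[0])
```

The point of the guard is simply that a mistyped GUID results in a rollback and an error instead of a silent no-op commit.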
Now, go back to the RMS, stop the 3 SCOM services, delete the ‘Health Service State’ folder and restart the 3 SCOM services.
Normally, after a few seconds, the OpsMgrConnector.Config.xml file will be created in “%ProgramFiles%\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\MGTGROUP” and everything will start to work correctly.
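Instead of refreshing the folder by hand after restarting the services, you can poll for the file. This is a minimal sketch; the function name, the 120-second timeout, and the 2-second polling interval are arbitrary choices for illustration, not values taken from SCOM.

```python
import time
from pathlib import Path


def wait_for_file(path, timeout_seconds=120.0, poll_seconds=2.0):
    """Poll until the file exists; return False if the timeout expires first."""
    target = Path(path)
    deadline = time.monotonic() + timeout_seconds
    while True:
        if target.is_file():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Pass it the full path to OpsMgrConnector.Config.xml in the management group folder. A False result after the timeout suggests the RMS is still not generating its configuration and the earlier steps should be rechecked.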
As for the root cause itself, I don’t have an explanation. Why were this server and all its references not deleted successfully the first time? How could the references of a single server cause so much trouble for the infrastructure?
I would like to thank my MVP friends Silvio Di Benedetto, Marnix Wolf, Bob Cornelissen, and also Mihai Buia from Microsoft Premier Support, for their help in resolving this problem.