Tag Archives: #Troubleshooting

SCOM: Bug with SQL MP 6.4.1.0

With version 6.4.1.0 of the SQL management pack, the SQL 2012 DB Engine group does not contain all SQL 2012 servers. The group is populated from a SQL registry key and looks for a version value of 11.0.xxxx.x; however, updating SQL 2012 to SP1 changes the version to 11.1.xxx.x, so those servers no longer match.

Kevin Holman has written a nice blog entry about this particular issue: here. He also provides an addendum management pack, available for download at the bottom of his article, that contains a new group population discovery set to “11.*” along with an override to disable the built-in group.
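
To make the mismatch concrete, here is a minimal Python sketch of the two patterns. It is only an illustration: the real group population is done by a management pack discovery, not by Python, and the sample version strings and the wildcard form of the built-in pattern are my own assumptions.

```python
from fnmatch import fnmatchcase

# Assumption: the built-in group effectively matches "11.0.*" and the
# addendum discovery matches "11.*", as described above.
BUILT_IN_PATTERN = "11.0.*"
ADDENDUM_PATTERN = "11.*"

def in_group(version: str, pattern: str) -> bool:
    """Return True if the registry version string matches the wildcard pattern."""
    return fnmatchcase(version, pattern)

# Illustrative version strings, not taken from a real server.
sample_versions = {
    "SQL 2012 RTM": "11.0.2100.60",
    "SQL 2012 SP1": "11.1.3000.0",
}

for name, version in sample_versions.items():
    print(name, version,
          "built-in:", in_group(version, BUILT_IN_PATTERN),
          "addendum:", in_group(version, ADDENDUM_PATTERN))
```

The SP1 value fails the narrower pattern but passes the broader one, which is why the addendum group picks up the missing servers.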


SCOM 2012: Agent on Windows 2012 R2 servers can stop responding

There is a known issue where SCOM 2012 agents stop responding on Windows 2012 R2 Domain Controllers; it can affect other Windows 2012 R2 servers as well. Kevin Holman has posted an article with the resolution to this issue: Here

“This is caused by an issue in the Server OS (Windows Server 2012 R2), which is outlined at http://support.microsoft.com/kb/2923126
There is a hotfix, which addresses the issue, which is included in the Feb 2014 update rollup hotfix: http://support.microsoft.com/kb/2919394”


SCOM: Upgrade to Operations Manager 2012 R2 may result in Data Warehouse synchronization failures

Brian McDermott highlighted an issue to watch out for when upgrading to SCOM 2012 R2 where you may get Data Warehouse synchronization failure errors after the upgrade.

The article, which gives solid reasoning as to the cause and solution, can be found here.

Please note that Event ID 31565, shown below, is a very generic error; you should only run the SQL below if the description identifies the problem as the TfsWorkItemId column.

Error below:

Log Name:      Operations Manager
Source:        Health Service Modules
Date:
Event ID:      31565
Task Category: Data Warehouse
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      OMMS.domain.com
Description:
Failed to deploy Data Warehouse component. The operation will be retried.
Exception 'DeploymentException': Failed to perform Data Warehouse component deployment operation: Install; Component: DataSet, Id: '0d698dff-9b7e-24d1-8a74-4657b86a59f8', Management Pack Version-dependent Id: '29a3dd22-8645-bae5-e255-9b56bf0b12a8'; Target: DataSet, Id: '23ee52b1-51fb-469b-ab18-e6b4be37ab35'. Batch ordinal: 3; Exception: Sql execution failed. Error 207, Level 16, State 1, Procedure vAlertDetail, Line 18, Message: Invalid column name 'TfsWorkItemId'.
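
Because 31565 is so generic, it is worth checking the description text before touching the database. The following is a rough Python sketch of that check, assuming you have already copied the event ID and description out of the event log; the function name is hypothetical.

```python
def is_tfs_column_error(event_id: int, description: str) -> bool:
    """Return True only for a 31565 event that complains about TfsWorkItemId.

    The comparison is case-insensitive and ignores the surrounding quote
    characters, since these are often mangled when text is copied.
    """
    text = description.lower()
    return (
        event_id == 31565
        and "invalid column name" in text
        and "tfsworkitemid" in text
    )

# Shortened sample descriptions for illustration.
matching = ("Failed to deploy Data Warehouse component. ... "
            "Message: Invalid column name 'TfsWorkItemId'.")
unrelated = "Failed to deploy Data Warehouse component. ... Message: Timeout expired."

print(is_tfs_column_error(31565, matching))    # this one is the TFS column problem
print(is_tfs_column_error(31565, unrelated))   # same event ID, different cause
```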

This issue can be fixed with the SQL query below. As always, BACK UP your databases and proceed at your own risk:
USE OperationsManagerDW

DECLARE @GuidString NVARCHAR(50)
SELECT @GuidString = DatasetId FROM StandardDataset
WHERE SchemaName = 'Alert'

-- update all tables that were already created
DECLARE
   @StandardDatasetTableMapRowId int
  ,@Statement nvarchar(max)
  ,@SchemaName sysname
  ,@TableNameSuffix sysname
  ,@BaseTableName sysname
  ,@FullTableName sysname

SET @StandardDatasetTableMapRowId = 0

WHILE EXISTS (SELECT *
              FROM StandardDatasetTableMap tm
              WHERE (tm.StandardDatasetTableMapRowId > @StandardDatasetTableMapRowId)
                AND (tm.DatasetId = @GuidString)
             )
BEGIN
  SELECT TOP 1
     @StandardDatasetTableMapRowId = tm.StandardDatasetTableMapRowId
    ,@SchemaName = sd.SchemaName
    ,@TableNameSuffix = tm.TableNameSuffix
    ,@BaseTableName = sdas.BaseTableName
  FROM StandardDatasetTableMap tm
          JOIN StandardDataset sd ON (tm.DatasetId = sd.DatasetId)
          JOIN StandardDatasetAggregationStorage sdas ON (sdas.DatasetId = tm.DatasetId) AND (sdas.AggregationTypeId = tm.AggregationTypeId)
  WHERE (tm.StandardDatasetTableMapRowId > @StandardDatasetTableMapRowId)
    AND (tm.DatasetId = @GuidString)
    AND (sdas.TableTag = 'detail')
    AND (sdas.DependentTableInd = 1)
  ORDER BY tm.StandardDatasetTableMapRowId

  SET @FullTableName = @BaseTableName + '_' + @TableNameSuffix

  -- add the TfsWorkItemId column if the table does not have it yet
  IF NOT EXISTS (SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @FullTableName AND TABLE_SCHEMA = @SchemaName
    AND COLUMN_NAME = N'TfsWorkItemId')
  BEGIN
    SET @Statement = 'ALTER TABLE ' + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@FullTableName) + ' ADD TfsWorkItemId nvarchar(256) NULL'
    EXECUTE (@Statement)
  END

  -- add the TfsWorkItemOwner column if the table does not have it yet
  IF NOT EXISTS (SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @FullTableName AND TABLE_SCHEMA = @SchemaName
    AND COLUMN_NAME = N'TfsWorkItemOwner')
  BEGIN
    SET @Statement = 'ALTER TABLE ' + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@FullTableName) + ' ADD TfsWorkItemOwner nvarchar(256) NULL'
    EXECUTE (@Statement)
  END
END

-- alter cover views
EXEC StandardDatasetBuildCoverView @GuidString, 0
GO

 


SCOM: When updating the IBM Storage Management Pack to 2.1.0

When updating the IBM Storage Management Pack to 2.1.0 there are a few things to be aware of, most of which are covered in the documentation.

After completing the installation, running the upgrade configuration, removing the old management packs and importing the new ones, we still weren’t able to re-discover the IBM SAN.

The configuration section of the documentation recommends: “The IBM storage configuration must be synchronized with Management Server manually if there is storage configuration left after upgrading from previous version to version 2.1.0. The IBM storage configuration also should be synchronized with the Management Server manually after the management pack is deleted and re-imported.”

However, the synchronization command skipped all of our SANs.


Checking the SCOM configuration revealed that something that shouldn’t have happened had happened: the SCOM configuration had been lost during the upgrade.


Using the --sc-set command to redo the configuration was successful, which allowed the migration to complete, and in short order the SANs were discovered and monitored.



SCOM 2012: The System Center Management service stops responding after an instance of SQL Server goes offline

Update: this issue also applies to SCOM 2012 R2, as confirmed in this article by Kevin Holman.

Microsoft released a KB article on the 5th of December on how to deal with a particular issue in SCOM 2012 SP1. It may be worth applying this fix preemptively if you are still running SP1, in order to avoid unnecessary downtime.

“After an instance of Microsoft SQL Server that hosts the OperationsManager database goes offline, the System Center Management service of the Microsoft System Center 2012 Operations Manager Service Pack 1 (SP1) management server stops responding.

For example, the System Center Management service stops responding after the instance of SQL Server disconnects, restarts, or fails. To recover from this issue after the instance of SQL Server is available again, you must restart the System Center Management service.”

KB Article

NB: As always, back up your registry before making any changes.

To resolve this issue, you can enable the automatic recovery feature in System Center 2012 Operations Manager SP1. By default, this automatic recovery feature is disabled. 

To enable the automatic recovery feature on the management server, follow these steps:

  1. Start Registry Editor.
  2. Locate and then click the following registry subkey:
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\System Center\2010\Common\DAL
  3. Create the following two registry entries:
    • DALInitiateClearPool
      Type: DWORD
      Decimal value: 1
    • DALInitiateClearPoolSeconds
      Type: DWORD
      Decimal value: 60
    Note: The DALInitiateClearPoolSeconds setting controls when the management server drops the current connection pool and when the management server tries to reestablish an SQL connection. We recommend that you set this setting to 60 seconds or more to avoid performance issues.
  4. Restart the System Center Management service on the management server.
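
If you have several management servers to update, the two values from step 3 can be captured in a .reg file instead of being typed by hand. The Python sketch below only builds the file text; the helper name is my own, and you should review the output and test the import on a single server first.

```python
# Registry subkey and values taken from the steps above.
DAL_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\System Center\2010\Common\DAL"
DAL_VALUES = {
    "DALInitiateClearPool": 1,
    "DALInitiateClearPoolSeconds": 60,
}

def build_reg_file(key: str, dword_values: dict) -> str:
    """Build .reg file text that sets the given DWORD values under one key.

    DWORD data in a .reg file is written as eight hexadecimal digits,
    so decimal 60 becomes dword:0000003c.
    """
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, value in dword_values.items():
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"

reg_text = build_reg_file(DAL_KEY, DAL_VALUES)
print(reg_text)
```

Save the printed text as a .reg file and import it on the management server, then restart the System Center Management service as in step 4.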


SCOM: Windows 2008 also running monitors for Windows 2003, Orphaned class

Now here’s something you certainly don’t see every day. It all started when I was asked to investigate a flood of memory alerts for a particular server at one of my customers. When I opened Health Explorer I noticed the following:

The server was running the monitors for Windows 2008 and Windows 2003.

As it turns out, the server had recently been re-installed from 2003 to 2008 with the same name, without the agent being uninstalled or removed from the console. This caused a bit of confusion in the back end. A quick look at the Windows Server 2003 Operating System inventory showed another server which was “upgraded” in the same fashion.

What’s happened here is that the class for Windows Server 2003 Operating System is still being loaded by the agent, and this is causing all of the related rules and monitors to load as well. In the past when I’ve come across this particular issue I’ve been able to solve it with the remove-disabledmonitoringobject PowerShell cmdlet.
All that you need to do is override the discovery rule in question to false for your object (in this case “Discover Windows Server 2003 Operating System”), then open the OpsMgr Shell and run remove-disabledmonitoringobject. After a short delay the offending objects are removed.

However, in this case the above did not work. Eventually I deleted the agent from the console, waited for grooming to run (you can force it if you are in a hurry), cleared the local agent cache and then approved the agent. Now only the correct objects are being discovered.


SCOM / SCVMM: PRO group names contain Chinese characters in System Center Operations Manager

If you have integrated SCOM 2012 and SCVMM 2012 and are using the PRO management packs you may have noticed some groups containing Chinese characters.

In an English-language version of Microsoft System Center 2012, you connect Virtual Machine Manager (VMM) to Operations Manager. However, group names in the Performance and Resource Optimization (PRO) management packs contain Chinese characters.

Cause: This behavior occurs because the LanguageCode option for the management group is not set in the Operations Manager database. Therefore, when a management pack contains multiple languages, its display names appear in the last language that is included.

Microsoft has published a KB article with the solution to this issue.


SCOM: Using Authenticated SMTP for Notifications

Sometimes in your SCOM environments you might come across an email relay that does not allow anonymous authentication. When this happens your notification subscriptions will not be able to send email.

All you need to do is create a Windows Run As account using credentials which have access to the mail relay, and then assign that account to the Notification Account Run As profile.

Note: The AD account which you use will need an email address, which will be the address your notifications are sent from.


SCOM: Patches to be Careful of

Having recently come across another patch which can cause critical issues with SCOM, I’ve decided to create a page to record the KB numbers along with any relevant additional information.

1. KB2585542
2. KB2775511

1. KB2585542 – This patch will break Unix monitoring by causing WS-Management connections to UNIX/Linux agents to fail. If this patch is installed on your management servers you can either uninstall it or perform one of the following:

  • Edit the registry to add this 32-bit DWORD value: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\SendExtraRecord = 2
  • Or use the “FixIt” package, available in the KB article under the Known Issues section, to disable the security update

2. KB2775511 – Marnix Wolf has a great article on this issue. “After installing KB2775511 on Operations Manager Management Servers, agents or servers may be affected by a deadlock.
Once in deadlock, Management Servers will generate Heart Beat failures and will go into a “greyed out” state. As a result, devices managed by these Management Servers will also go into a “greyed out” or “not monitored” state.”

This patch is a combination of 89 hotfixes, so ideally you want to avoid installing it. Even though the issue doesn’t occur on all SCOM systems, it would be advisable to wait for an updated bulletin from the MS System Center team before installing it.

Note: Microsoft has released a hotfix to address this issue, but I’d still recommend approaching with caution. Link: “SCOM 2012 or SCOM 2007 R2 throws a “Heartbeat Failure” message and then goes into a greyed out state in Windows Server 2008 R2 SP1”


SCOM: Not all objects on a server are discovered / managed

In the case of an agent that is managing a large number of objects, you may find that not all of them are discovered, or that some of them remain in a Not Monitored state. This can be caused by a couple of things.

If you find this error in your OpsMgr event log: “The health service has removed some items from the send queue for management group since it exceeded the maximum allowed size of 15 megabytes”

Then the registry keys below need to be adjusted:

  • Set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\HealthService\Parameters\Persistence Version Store Maximum to 80 MB (5120). Default = 60 MB
  • Set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\HealthService\Parameters\Management Groups\<MG Name>\maximumQueueSizeKb to 100 MB. Default = 15 MB
  • Set HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Modules\Global\PowerShell\ScriptLimit\QueueMinutes to 120 mins
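
The units in that list are easy to mix up, so here is a small Python sketch of the arithmetic. It rests on one assumption that is not stated above: that Persistence Version Store Maximum is counted in 16 KB pages, which is what makes 5120 equal to 80 MB. The maximumQueueSizeKb value is in kilobytes, as its name suggests.

```python
# Assumption: the version store value is a count of 16 KB pages.
PAGE_SIZE_KB = 16

def pages_to_mb(pages: int) -> float:
    """Convert a version store page count to megabytes."""
    return pages * PAGE_SIZE_KB / 1024

def mb_to_kb(megabytes: int) -> int:
    """Convert megabytes to the kilobyte value expected by maximumQueueSizeKb."""
    return megabytes * 1024

print(pages_to_mb(5120))  # size in MB for a version store value of 5120
print(mb_to_kb(100))      # value to enter for a 100 MB send queue
print(mb_to_kb(15))       # value corresponding to the 15 MB default
```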

However if you find this error: In memory container (hash table System.Health.EntityStateChangeData) had to drop data because it reached max limit. Possible data loss.

Then the following registry key needs to be adjusted:

  • HKLM\System\CurrentControlSet\Services\HealthService\Parameters: “State Queue Items”. The default value for this key is 1024; depending on the server load, double this to 2048 or, if the error continues to occur, to 4096

I have come across instances where both of these errors occur. After the adjustments were made and the health service restarted, all objects were discovered and monitored correctly.
