Category Archives: Troubleshooting

SCOM: Using Authenticated SMTP for Notifications

In some SCOM environments you might come across an email relay that does not allow anonymous authentication. When this happens, your notification subscriptions will not be able to send email.

All you need to do is create a Windows Run As account using credentials that have access to the mail relay, and then assign that account to the Notification Account Run As profile.

Note: The AD account you use needs an email address, which is the address your notifications will be sent from.


SCOM: Patches to be Careful of

Having recently come across another patch that can cause critical issues with SCOM, I’ve decided to create a page to record the KB numbers along with any relevant additional information.

1. KB2585542
2. KB2775511

1. KB2585542 – This patch breaks UNIX/Linux monitoring because it causes WS-Management connections to UNIX/Linux agents to fail. If this patch is installed on your management servers, you can either uninstall it or do one of the following:

  • Edit the registry to add this 32-bit DWORD value: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\SendExtraRecord = 2
  • Or use the “FixIt” package available in the KB article under the Known Issues section, which disables the security update

2. KB2775511 – Marnix Wolf has a great article on this issue. “After installing KB2775511 on Operations Manager Management Servers, agents or servers may be affected by a deadlock. Once in deadlock, Management Servers will generate Heart Beat failures and will go into a “greyed out” state. As a result, devices managed by these Management Servers will also go into a “greyed out” or “not monitored state.””

This patch is a rollup of 89 hotfixes, so ideally you want to avoid installing it. Even though the issue doesn’t occur on all SCOM systems, it would be advisable to wait for an updated bulletin from the MS System Center team before installing it.

Note: Microsoft has released a hotfix to address this issue, but I’d still recommend approaching with caution. Link – “SCOM 2012 or SCOM 2007 R2 throws a “Heartbeat Failure” message and then goes into a greyed out state in Windows Server 2008 R2 SP1”


SCOM: Not all objects on a server are discovered / managed

When an agent is managing a large number of objects, you may find that not all of them are discovered, or that some of those that are discovered remain in a Not Monitored state. This can be caused by a couple of things.

If you find this error in your OpsMgr event log: “The health service has removed some items from the send queue for management group since it exceeded the maximum allowed size of 15 megabytes”

Then the registry values below need to be adjusted:

  • Set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\HealthService\Parameters\Persistence Version Store Maximum to 80 MB (5120). Default = 60 MB
  • Set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\HealthService\Parameters\Management Groups\<MG Name>\maximumQueueSizeKb to 100 MB. Default = 15 MB
  • Set HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Modules\Global\PowerShell\ScriptLimit\QueueMinutes to 120 mins

However if you find this error: In memory container (hash table System.Health.EntityStateChangeData) had to drop data because it reached max limit. Possible data loss.

Then the following registry value needs to be adjusted:

  • HKLM\System\CurrentControlSet\Services\HealthService\Parameters: “State Queue Items”. The default value is 1024; depending on the server load, double it to 2048 or, if the error continues to occur, raise it to 4096

I have come across instances where both of these errors occur. After the adjustments were made and the health service restarted, all objects were discovered and monitored correctly.
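The size values in the first list above can be sanity-checked with a little arithmetic. The sketch below is only an illustration. It assumes the version store value is counted in 16 KB pages (inferred from the “80 MB (5120)” pairing above, not confirmed elsewhere in this post) and that maximumQueueSizeKb is in kilobytes, as its name suggests. The function names are my own.

```python
# Convert the sizes quoted above into the numbers typed into the registry.
# Assumption: the version store is counted in 16 KB pages, inferred from the
# "80 MB (5120)" pairing in the list above.
VERSION_STORE_PAGE_KB = 16


def version_store_pages(size_mb: int) -> int:
    """Return the page count for a version store of size_mb megabytes."""
    return size_mb * 1024 // VERSION_STORE_PAGE_KB


def queue_size_kb(size_mb: int) -> int:
    """Return the kilobyte value for a send queue of size_mb megabytes."""
    return size_mb * 1024


if __name__ == "__main__":
    print(version_store_pages(80))  # 5120
    print(queue_size_kb(100))       # 102400
    print(queue_size_kb(15))        # 15360
```

Under those assumptions, 100 MB for maximumQueueSizeKb would be entered as 102400, and the 15 MB default mentioned in the error message corresponds to 15360.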


SCOM – The System Center Management service terminated with service-specific error 5 (0x5).

Not a common error by any means, and there are several blog posts out there pertaining to other error codes.
Marnix Wolf has a great article about Error 2147500037.

If you get error 5 (0x5, the Windows “access is denied” code), however, it means that SCOM is unable to create its self-signed certificate.

In our case, Local System did not have full permissions on the server’s C:\ProgramData\Microsoft\Crypto\RSA\S-1-5-18 directory. Once we added them, the service started right up.


SCOM 2012 – SharePoint monitored servers return to unidentified state

Something to be aware of when monitoring SharePoint, especially if you upgraded your environment from SCOM 2007 to SCOM 2012, is that you might find that your previously discovered SharePoint servers suddenly return to being Unidentified.


What appears to be missing from the SharePoint Management pack configuration guide is that in a SCOM 2012 deployment the SharePointMP.Config file needs to be present in the same location on all management servers. What seems to happen is that if the management server that executes the workflows for the configuration task does not have the config file, then the workflows undiscover the SharePoint servers, which makes them come up as unidentified again. This is due to the task being executed against the “All Management Servers” resource pool.

A symptom of the above is that when running the configuration task you find that you get this error:

Failed to create process due to error ‘0x8007010b : The directory name is invalid.’, this workflow will be unloaded

Note: This applies to monitoring SharePoint 2010 and 2013


SCOM Rebuilding performance counters

A colleague of mine came across a situation where SCOM performance reports were not working due to corrupt performance counters on a particular server. This situation can be easily resolved by rebuilding the performance counters using the lengthy procedure outlined at http://support.microsoft.com/kb/300956/en-us

Symptom:

Event 10102:

In PerfDataSource, could not resolve counter LogicalDisk, Free Megabytes, C:. Module will be unloaded.

  • One or more workflows were affected by this.     

 

The short version:

On the affected server open an elevated command prompt

Navigate to c:\windows\system32

Type lodctr /R and press enter

 

Shortly you should see the following message:

Info: Successfully rebuilt performance counter setting from system backup store.


SCOM Old alerts reoccurring when an agent exits maintenance mode

This is something I came across with SCOM 2007, but it still seems to be an issue with SCOM 2012. In rare cases, when an agent exits maintenance mode it will reprocess all events in the event log and generate alerts for old events.

Adding the following registry key will correct this issue:

HKLM\Software\Microsoft\Microsoft Operations Manager\3.0\Modules\Global\NT Event Log DS

Create a DWord value named MaxEventBufferSize and set it to a decimal value of 500000.

You can also clear the event log on the affected machine, although the issue may reoccur.
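If you need to apply the registry change above to more than one agent, it may help to generate the command line rather than click through regedit each time. The following is a minimal sketch, not a tested deployment script: reg_add_dword is a helper name I made up, and it only builds a reg.exe command string for you to review and run from an elevated prompt.

```python
def reg_add_dword(key: str, name: str, value: int) -> str:
    """Build a reg.exe command line that creates or overwrites a DWORD value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    # /d takes the decimal value; /f overwrites without prompting.
    return f'reg add "{key}" /v "{name}" /t REG_DWORD /d {value} /f'


# Key path and value taken from the steps above.
KEY = (
    r"HKLM\Software\Microsoft\Microsoft Operations Manager"
    r"\3.0\Modules\Global\NT Event Log DS"
)

if __name__ == "__main__":
    print(reg_add_dword(KEY, "MaxEventBufferSize", 500000))
```

The helper only formats text, so nothing is changed on the machine until you run the printed command yourself.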


SCOM 2012 Availability report bug – Bars display as dark grey Up (Monitoring unavailable)

Recently I came across an issue with SCOM 2012 availability reports which causes the bars at the top level to display incorrectly.


This is caused by a bug that creates a duplicate entry in the HealthServiceOutage table. The duplicate has an outage start time but no outage end time, which results in an incorrect availability calculation for the affected objects.

The following SQL query will allow you to identify if you are affected by this issue:

Step 1:

SELECT * FROM HealthServiceOutage HS1 JOIN HealthServiceOutage HS2
ON HS1.StartDateTime = HS2.StartDateTime
AND HS1.ManagedEntityRowId = HS2.ManagedEntityRowId
WHERE HS2.EndDateTime IS NULL AND HS1.HealthServiceOutageRowId <> HS2.HealthServiceOutageRowId

If this query returns any records, make a note of the StartDateTime values in the duplicate rows. These dates will be used again later to correct the problem.
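To see what the Step 1 self-join is actually matching, here is a small, self-contained sketch that runs the same join logic against a toy table. It uses SQLite in memory rather than SQL Server, models only the four columns the query touches, and the three rows are made-up sample data, so treat it as an illustration of the logic and not as something to run against your data warehouse.

```python
import sqlite3

# Toy stand-in for the HealthServiceOutage table; only the columns used by
# the Step 1 query are modelled, and the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE HealthServiceOutage (
           HealthServiceOutageRowId INTEGER PRIMARY KEY,
           ManagedEntityRowId INTEGER,
           StartDateTime TEXT,
           EndDateTime TEXT)"""
)
conn.executemany(
    "INSERT INTO HealthServiceOutage VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2012-01-21 08:00:00", "2012-01-21 09:00:00"),  # closed outage
        (2, 10, "2012-01-21 08:00:00", None),                   # open duplicate
        (3, 11, "2012-02-01 12:00:00", "2012-02-01 12:30:00"),  # unaffected
    ],
)

# Same join logic as Step 1: pair rows that share an entity and start time,
# keeping pairs where the second row has no end time. Unlike Step 1, this
# selects three named columns instead of * to keep the output short.
DUPLICATE_QUERY = """
    SELECT HS2.HealthServiceOutageRowId, HS2.StartDateTime, HS1.EndDateTime
    FROM HealthServiceOutage HS1
    JOIN HealthServiceOutage HS2
      ON HS1.StartDateTime = HS2.StartDateTime
     AND HS1.ManagedEntityRowId = HS2.ManagedEntityRowId
    WHERE HS2.EndDateTime IS NULL
      AND HS1.HealthServiceOutageRowId <> HS2.HealthServiceOutageRowId
"""
duplicates = conn.execute(DUPLICATE_QUERY).fetchall()
print(duplicates)
```

With this sample data only row 2 comes back, paired with the end time of row 1, its properly closed twin.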

This issue is addressed in UR3 for SCOM 2012 SP1, but if you are not planning to roll that out in the near future, there is a private fix available from Microsoft which corrects the relevant stored procedure. As this is an acknowledged known issue, Microsoft will not charge for a case raised to address it.

Once you have applied the fix you will need to use the following queries to add an outage end time to the duplicate entries and then re-aggregate the affected data.

As always, before performing any database update operations, make a full backup of the OperationsManager and OperationsManagerDW databases.

Step 2:

This query will update the EndDateTime value from NULL to a valid timestamp.

UPDATE HS2
SET HS2.EndDateTime = HS1.EndDateTime
FROM HealthServiceOutage HS1 JOIN HealthServiceOutage HS2
ON HS1.StartDateTime = HS2.StartDateTime
AND HS1.ManagedEntityRowId = HS2.ManagedEntityRowId
WHERE HS2.EndDateTime IS NULL AND HS1.HealthServiceOutageRowId <> HS2.HealthServiceOutageRowId

Once Step 2 has finished running you should re-run the query in Step 1 to make sure that there are no additional affected rows.

Step 3:

This query sets the DirtyInd value from 0 to 1 for all rows in the given time range, making them eligible for re-aggregation. The start date is the StartDateTime value noted in Step 1 and the end date should be today’s date. The dates shown are examples (21 January and 13 March 2012), written as YYYYMMDD so they are read the same way under any language setting.

update StandardDatasetAggregationHistory
set DirtyInd = 1
where DatasetId = (Select Datasetid from Standarddataset where Schemaname = 'state')
and AggregationDateTime >= '20120121 00:00:00'
and AggregationDateTime < '20120313 00:00:00'
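Date strings with separators can be read differently by SQL Server depending on the session’s language and DATEFORMAT settings, whereas the unseparated YYYYMMDD form is interpreted the same way everywhere. As a minimal sketch, the helper below builds the two boundary literals for the Step 3 query in that form. The function names and the sample dates are hypothetical; substitute the StartDateTime you noted in Step 1.

```python
from datetime import date, datetime
from typing import Tuple


def sql_datetime_literal(moment: datetime) -> str:
    """Format a datetime as 'YYYYMMDD HH:MM:SS' for use in a T-SQL string."""
    return moment.strftime("%Y%m%d %H:%M:%S")


def aggregation_window(first_outage: datetime, today: date) -> Tuple[str, str]:
    """Return (start, end) literals: midnight of the outage day up to today."""
    start = datetime.combine(first_outage.date(), datetime.min.time())
    end = datetime.combine(today, datetime.min.time())
    return sql_datetime_literal(start), sql_datetime_literal(end)


if __name__ == "__main__":
    # Hypothetical values: an outage noted in Step 1 and the day the fix is run.
    print(aggregation_window(datetime(2012, 1, 21, 8, 15), date(2012, 3, 13)))
```

The start is rounded down to midnight so the whole day containing the first duplicate outage is marked for re-aggregation.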

Step 4:

Disable the Standard Data set Maintenance rule for the State data set ONLY, then run the below query to manually re-aggregate the State Data.

-- Look up the State data set once, then run maintenance against it repeatedly.
DECLARE @DataSet uniqueidentifier
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')

declare @i int
set @i=1
while(@i<=500)
begin
    EXEC standarddatasetmaintenance @DataSet
    set @i=@i+1
    -- Pause for five seconds between runs.
    Waitfor delay '00:00:05'
End

Note: This query may need to be run multiple times depending upon the amount of data that needs to be aggregated.

Step 5:

Once this query returns a count of fewer than 5, Step 4 can be stopped and the Standard Data set Maintenance rule can be re-enabled.

Select count(*) from StandardDatasetAggregationHistory
where Datasetid = (Select Datasetid from Standarddataset where Schemaname = 'state')
AND DirtyInd=1

 

In my case there were 741 rows that needed to be re-aggregated. On average each row takes between 5 and 10 minutes, which came to about 105 hours in total, although your mileage may vary depending on the power of your SQL server and how busy your environment is.
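If you want a rough idea of how long your own re-aggregation might take, the estimate is simple multiplication. The sketch below just reproduces the arithmetic from my figures; the function name is made up, and the per-row timings are the ones I observed, not a guarantee for your environment.

```python
def reaggregation_hours(rows: int, minutes_per_row: float) -> float:
    """Estimate total re-aggregation time in hours."""
    return rows * minutes_per_row / 60


rows = 741  # dirty rows reported by the Step 5 query in my case
best = reaggregation_hours(rows, 5)    # lower bound at 5 minutes per row
worst = reaggregation_hours(rows, 10)  # upper bound at 10 minutes per row

if __name__ == "__main__":
    print(best, worst)
```

For 741 rows this gives a range of roughly 62 to 124 hours, and the 105 hours I saw works out to about 8.5 minutes per row.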


SCVMM 2012 Issue when trying to remove a VMware vCenter

Recently I experienced an issue with SCVMM 2012 where I needed to remove a vCenter which exists at a remote site.

Trying to remove the server from the console caused VMM to attempt an inventory update before removing it. Because the inventory job took quite some time to complete, the removal failed with this error:

Error : Unable to perform the job because one or more of the selected objects are locked by another job.

To find out which job is locking the object, in the Jobs view, group by Status, and find the running or canceling job for the object. When the job is complete, try again.

After trying several methods with no success, including PowerShell, I made a last-ditch attempt before resigning myself to Microsoft’s solution http://blogs.technet.com/b/scvmm/archive/2012/07/16/kb-attempting-to-remove-vmware-vcenter-from-system-center-2012-virtual-machine-manager-fails-with-error-0x8007274d.aspx which is rather extreme.

I ended up asking the VMware admin to remove access for my SCVMM Run As account, which caused the inventory to fail immediately and not lock the objects. This enabled me to remove the vCenter without resorting to drastic measures.

I did later come across another solution which is unsupported http://digitaljive.wordpress.com/2012/07/06/scvmm-2012-force-remove-vcenter-server/

Hopefully Microsoft will provide a better solution for this issue in the future.


Forcibly removing a SCOM agent that cannot be uninstalled by normal means

During our SCOM 2012 upgrade I came across some 2007 agents that would not upgrade to 2012 because the agent installer was unable to complete its uninstall portion.

Errors we experienced included corrupt MSIEXEC packages and a rollback of the 2012 agent upgrade with the message “unable to install performance counters.”

After trying a manual uninstall from Add / Remove Programs, as well as the SCOM 2007 removal tool, with no success, we came across a tool called MSIZAP. (Thanks to Jonathan Almquist for his great blog post pointing us in the right direction)

The following process will allow you to remove the SCOM agent from your servers, which will in turn allow you to install your new 2012 agent:
As always, back up your registry before attempting any process that makes changes to it.

  1. Download MSIZAP and copy to a location on the affected computer.
  2. Find the product code, which is a GUID that is required for the MSIZAP product code switch.  This can be found by opening the registry and navigating to: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall

With the Uninstall key highlighted, click Edit > Find and search for the string System Center Operations Manager. Open the UninstallString string value and copy the GUID, including the curly brackets.


3. Open an elevated command prompt and run the program as follows:

msizap.exe t {product code}

Examples:

SCOM 2007 product code 25097770-2B1F-49F6-AB9D-1C708B96262A

SCOM 2012 product code 5155DCF6-A1B5-4882-A670-60BF9FCFD688

Wait until this process has completed.

4. Delete the SCOM program files, usually located under “%ProgramFiles%\System Center Operations Manager 2007”. Some files may be locked; those can be ignored.

5. Open the registry, search for the Management Group name

6. Delete the Microsoft Operations Manager key that the management group name is part of


7. Open the registry and navigate to:

HKLM\System\CurrentControlSet\Services

Delete the following registry entries:

healthservice
opsmgr*
MOMConnector
System Center Management APM (2012 only)

8. Reboot the server

You will now be able to install your agent manually or with your console.
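If you have a number of agents to clean up, copying the GUID out of the UninstallString by hand (step 2) gets tedious and is easy to get wrong. The sketch below shows one way to pull the product code out of such a string and build the command from step 3. The helper names are my own, and the sample UninstallString is a hypothetical value built around the SCOM 2007 product code listed above; check what your own registry actually contains.

```python
import re

# Matches a brace-wrapped GUID such as {25097770-2B1F-49F6-AB9D-1C708B96262A}.
GUID_PATTERN = re.compile(
    r"\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}"
)


def product_code(uninstall_string: str) -> str:
    """Return the first brace-wrapped GUID found in an UninstallString value."""
    match = GUID_PATTERN.search(uninstall_string)
    if match is None:
        raise ValueError("no product code found in: " + uninstall_string)
    return match.group(0)


def msizap_command(uninstall_string: str) -> str:
    """Build the msizap command line in the form shown in step 3."""
    return "msizap.exe t " + product_code(uninstall_string)


# Hypothetical UninstallString value using the SCOM 2007 GUID listed above.
sample = "MsiExec.exe /X{25097770-2B1F-49F6-AB9D-1C708B96262A}"

if __name__ == "__main__":
    print(msizap_command(sample))
```

The helper raises an error rather than guessing when no GUID is present, since running MSIZAP against the wrong product code is not something you want to do by accident.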
