Monthly Archives: April 2013

SCOM 2012 Availability report bug – Bars display as dark grey Up (Monitoring unavailable)

Recently I came across an issue with SCOM 2012 availability reports which causes the bars at the top level to display incorrectly.

[Image: availability drill down 2]

This is caused by an error that creates a duplicate entry in the HealthServiceOutage table. The duplicate has an outage start time but no outage end time, which results in an incorrect availability calculation for the affected objects.

The following SQL query will allow you to identify if you are affected by this issue:

Step 1:

SELECT * FROM HealthServiceOutage HS1 JOIN HealthServiceOutage HS2

ON HS1.StartDateTime = HS2.StartDateTime

AND HS1.ManagedEntityRowId = HS2.ManagedEntityRowId

WHERE HS2.EndDateTime IS NULL AND HS1.HealthServiceOutageRowId <> HS2.HealthServiceOutageRowId

If this query returns any records, make a note of the StartDateTime values in the duplicate rows. These dates will be used again later to correct the problem.

This issue is addressed in UR3 for SCOM 2012 SP1, but if you are not planning to roll that out in the near future, a private fix is available from Microsoft that corrects the relevant stored procedure. As this is an acknowledged known issue, Microsoft will not charge for a support case raised to address it.

Once you have applied the fix you will need to use the following queries to add an outage end time to the duplicate entries and then re-aggregate the affected data.

As always, before performing any database update operations, make a full backup of the OperationsManager and OperationsManagerDW databases.

Step 2:

This query will update the EndDateTime value from NULL to a valid timestamp.

UPDATE HS2

SET HS2.EndDateTime = HS1.EndDateTime

FROM HealthServiceOutage HS1 JOIN HealthServiceOutage HS2

ON HS1.StartDateTime = HS2.StartDateTime

AND HS1.ManagedEntityRowId = HS2.ManagedEntityRowId

WHERE HS2.EndDateTime IS NULL AND HS1.HealthServiceOutageRowId <> HS2.HealthServiceOutageRowId

Once Step 2 has finished running you should re-run the query in Step 1 to make sure that there are no additional affected rows.
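If you want to see what Steps 1 and 2 do before touching a real data warehouse, the following is a toy reproduction using Python's built-in sqlite3 module with an in-memory table. Only the table and column names come from the queries above; the sample rows are invented, and the repair is rewritten as a correlated subquery because SQLite does not accept the T-SQL "UPDATE alias ... FROM ... JOIN" form. Treat it as an illustration of the logic, not as something to run against SCOM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE HealthServiceOutage (
        HealthServiceOutageRowId INTEGER PRIMARY KEY,
        ManagedEntityRowId       INTEGER,
        StartDateTime            TEXT,
        EndDateTime              TEXT
    )
""")
# Invented sample data: row 2 duplicates row 1 but has no end time.
# Row 4 is an open outage with no duplicate and must be left alone.
conn.executemany(
    "INSERT INTO HealthServiceOutage VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2012-01-21T08:00:00", "2012-01-21T09:30:00"),
        (2, 100, "2012-01-21T08:00:00", None),
        (3, 200, "2012-02-01T10:00:00", "2012-02-01T10:15:00"),
        (4, 300, "2012-03-01T12:00:00", None),
    ],
)

# Same self-join as Step 1, returning only the broken duplicate rows.
FIND_DUPLICATES = """
    SELECT HS2.HealthServiceOutageRowId, HS2.StartDateTime
    FROM HealthServiceOutage HS1
    JOIN HealthServiceOutage HS2
      ON HS1.StartDateTime = HS2.StartDateTime
     AND HS1.ManagedEntityRowId = HS2.ManagedEntityRowId
    WHERE HS2.EndDateTime IS NULL
      AND HS1.HealthServiceOutageRowId <> HS2.HealthServiceOutageRowId
"""

# Step 2 equivalent: copy the end time from the matching twin row.
REPAIR = """
    UPDATE HealthServiceOutage
    SET EndDateTime = (
        SELECT HS1.EndDateTime
        FROM HealthServiceOutage HS1
        WHERE HS1.StartDateTime = HealthServiceOutage.StartDateTime
          AND HS1.ManagedEntityRowId = HealthServiceOutage.ManagedEntityRowId
          AND HS1.HealthServiceOutageRowId <> HealthServiceOutage.HealthServiceOutageRowId
    )
    WHERE EndDateTime IS NULL
      AND EXISTS (
        SELECT 1
        FROM HealthServiceOutage HS1
        WHERE HS1.StartDateTime = HealthServiceOutage.StartDateTime
          AND HS1.ManagedEntityRowId = HealthServiceOutage.ManagedEntityRowId
          AND HS1.HealthServiceOutageRowId <> HealthServiceOutage.HealthServiceOutageRowId
      )
"""

before = conn.execute(FIND_DUPLICATES).fetchall()
conn.execute(REPAIR)
after = conn.execute(FIND_DUPLICATES).fetchall()

print("duplicates before repair:", before)
print("duplicates after repair: ", after)
```

The point to take away is that the repair only touches rows that have a twin with the same start time and managed entity, so a genuinely open outage (row 4 in the sample) keeps its NULL end time.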

Step 3:

This query sets the DirtyInd value from 0 to 1 for all rows in the specified time range, making them eligible for re-aggregation. The start date is the StartDateTime value noted in Step 1; the end date should be today’s date.

update StandardDatasetAggregationHistory

set DirtyInd = 1

where DatasetId = (Select Datasetid from Standarddataset where Schemaname = 'state')

and AggregationDateTime >= '2012-01-21T00:00:00'

and AggregationDateTime < '2012-03-13T00:00:00'
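Date literals are an easy place to slip up here, because a string such as day-before-month is read differently depending on the server's language settings. SQL Server reads the ISO 8601 form YYYY-MM-DDThh:mm:ss the same way regardless of those settings, so one way to build the two boundary strings is the small Python helper below. The function name and the choice to round the start down to midnight are my own assumptions, not part of the Microsoft guidance.

```python
from datetime import date, datetime, time
from typing import Tuple

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def aggregation_window(first_outage_start: datetime, today: date) -> Tuple[str, str]:
    """Return (start, end) strings for the Step 3 WHERE clause.

    Hypothetical helper: the start is rounded down to midnight of the day
    noted in Step 1, and the end is midnight of today's date.
    """
    start = datetime.combine(first_outage_start.date(), time.min)
    end = datetime.combine(today, time.min)
    return start.strftime(ISO_FORMAT), end.strftime(ISO_FORMAT)


# Example values only: an outage noted on 21 January 2012, fix run on 13 March 2012.
window_start, window_end = aggregation_window(
    datetime(2012, 1, 21, 8, 30), date(2012, 3, 13)
)
print(window_start, window_end)
```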

Step 4:

Disable the Standard Data set Maintenance rule for the State data set ONLY, then run the query below to manually re-aggregate the State data.

declare @i int

set @i=1

while(@i<=500)

begin

DECLARE @DataSet uniqueidentifier

SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')

EXEC standarddatasetmaintenance @DataSet

set @i=@i+1

Waitfor delay '00:00:05'

End

Note: This query may need to be run multiple times, depending on the amount of data that needs to be aggregated.

Step 5:

Once this query returns a count of fewer than 5, Step 4 can be stopped and the Standard Data set Maintenance rule can be re-enabled.

Select count(*) from StandardDatasetAggregationHistory

where Datasetid = (Select Datasetid from Standarddataset where Schemaname = 'state')

AND DirtyInd=1

 

In my case there were 741 rows that needed to be re-aggregated. On average each row takes between 5 and 10 minutes, which came to about 105 hours in total, although your mileage may vary depending on the power of your SQL server and how busy your environment is.
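To plan your own maintenance window, the same arithmetic can be sketched in a few lines of Python. The row count and the 5 to 10 minute range are the figures from my environment; the function name is just for illustration, and your per-row time may be quite different.

```python
def reaggregation_hours(dirty_rows, minutes_per_row):
    """Estimated hours to clear `dirty_rows` at a fixed per-row cost."""
    return dirty_rows * minutes_per_row / 60


DIRTY_ROWS = 741  # count returned by the Step 5 query in my environment

best_case = reaggregation_hours(DIRTY_ROWS, 5)    # 61.75 hours
worst_case = reaggregation_hours(DIRTY_ROWS, 10)  # 123.5 hours

# Working backwards from the observed 105 hours gives the effective rate,
# which comes out at roughly 8.5 minutes per row.
implied_minutes_per_row = 105 * 60 / DIRTY_ROWS

print(f"between {best_case} and {worst_case} hours")
print(f"observed rate: about {implied_minutes_per_row:.1f} minutes per row")
```

Substitute your own count from Step 5 for DIRTY_ROWS to get a rough lower and upper bound before you start.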

SCVMM 2012 Issue when trying to remove a VMware vCenter

Recently I experienced an issue with SCVMM 2012 where I needed to remove a vCenter which exists at a remote site.

Trying to remove the server from the console caused VMM to attempt an inventory update before removing it. Because the inventory job took quite some time to complete, the removal failed with this error:

Error : Unable to perform the job because one or more of the selected objects are locked by another job.

To find out which job is locking the object, in the Jobs view, group by Status, and find the running or canceling job for the object. When the job is complete, try again.

After trying several methods with no success, including PowerShell, I made a last-ditch attempt before resigning myself to Microsoft’s solution, which is rather extreme: http://blogs.technet.com/b/scvmm/archive/2012/07/16/kb-attempting-to-remove-vmware-vcenter-from-system-center-2012-virtual-machine-manager-fails-with-error-0x8007274d.aspx

I ended up asking the VMware admin to remove access for my SCVMM Run As account, which caused the inventory to fail immediately without locking the objects. This enabled me to remove the vCenter without resorting to drastic measures.

I did later come across another solution, which is unsupported: http://digitaljive.wordpress.com/2012/07/06/scvmm-2012-force-remove-vcenter-server/

Hopefully Microsoft will provide a better solution for this issue in the future.

SCOM 2012 Improving console performance when creating dashboard views

The following registry change will improve your SCOM 2012 console performance overall, but I’ve noticed a particular improvement when creating dashboard views.

HKCU\Software\Microsoft\Microsoft Operations Manager\3.0\console\CacheParameters\PollingInterval

The value can be from 0 to 10. A value of 0 disables automatic refresh, so you have to press F5 to refresh manually. Each increment increases the console refresh period by 15 seconds, up to a maximum of 2 minutes and 30 seconds.
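As a quick reference, the mapping from the registry value to the refresh period can be written out as a small Python function. This only encodes the 15-seconds-per-step rule described above; the function and constant names are made up for the example, and it does not read or write the registry.

```python
from typing import Optional

SECONDS_PER_STEP = 15       # each increment adds 15 seconds
MAX_POLLING_INTERVAL = 10   # highest value described above


def refresh_period_seconds(polling_interval: int) -> Optional[int]:
    """Return the console refresh period in seconds for a PollingInterval
    value, or None when automatic refresh is disabled (value 0)."""
    if not 0 <= polling_interval <= MAX_POLLING_INTERVAL:
        raise ValueError("PollingInterval must be between 0 and 10")
    if polling_interval == 0:
        return None  # manual refresh with F5 only
    return polling_interval * SECONDS_PER_STEP


for value in (0, 1, 4, 10):
    print(value, refresh_period_seconds(value))
```

A value of 4, for example, works out to a one-minute refresh, and 10 gives the 150-second maximum.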

 

SCOM 2012 Specified cast is not valid when opening run as account properties

Earlier today I received the following error in SCOM 2012 SP1 when trying to open a Run As Account in order to add a new server for distribution:

“Specified cast is not valid”

Clicking OK on the error brings up the Run As properties, but the distribution list is empty.

This error seems to occur when an agent is deleted from the console and doesn’t get removed properly from the Run As distribution. If you are able to, re-approving the agent fixes the error, allowing you to remove that device from distribution before deleting it again.

However, if the server is no longer available (due to being decommissioned or just plain crashing), the following steps, courtesy of Blake Wilson, will help:

  1. Create a list of your servers which use the affected Run As Account.
  2. Rename the current Run As Account to “Run as Account – Old” or similar.
  3. Create a new Run As Account with the name and credentials of the original, then distribute it to all of the servers from step 1.
  4. Assign the new Run As Account to all of the Run As Profiles associated with the original one. The easiest way is to edit the existing entry and select the new Run As Account.
  5. Once you’ve confirmed everything is working you can safely remove the old Run As Account.

 

SCOM 2012: Approving agents that don’t appear under pending management.

Well, the old “Manually installed agent that doesn’t appear under pending management” situation still exists in SCOM 2012. What is different is that the PowerShell cmdlet Get-AgentPendingAction no longer works.

Instead we have the Get-SCOMPendingManagement cmdlet, which lists all agents that are in pending management, and Approve-SCOMPendingManagement, which approves the agents you pass to it.

For a specific agent open the Operations Manager Shell and enter:
Get-SCOMPendingManagement | where {$_.AgentName -eq "ServernameFQDN"} | Approve-SCOMPendingManagement

Default override preventing heartbeat failure alert.

We had a server that went down and didn’t generate alerts for Heartbeat Failure or Could not Connect to Computer.

What I found is that there is a group in SCOM called “Managed Computer Client Health Service Watcher Group”, and there is a default override against this group that disables alert generation for Heartbeat Failure and Could not Connect to Computer.

This group is apparently intended for workstations being monitored by SCOM and is dynamically populated, but sometimes servers also end up in there.

If you don’t monitor workstations, the easiest solution is to create a second override that enables those alerts and enforce it.
