An issue to be aware of when you package your SCOM agent with your server build image is that when the server is built a certificate is generated for the agent to use, this certificate resides in the Operation Manager Certificate Store. If the server is then renamed due to it having a temporary build name you will see the below error in your Operations Manager event log.
Event: 7022 Source: HealthService
The Health Service has downloaded secure configuration for management group <MG Name>, and processing the configuration failed with error code Keyset does not exist(0x80090016).
Re-installing the agent will fix this issue but there is a simpler solution by Gerrie Louw, open your certificate MMC, navigate to the Operation Manager Store and delete the certificate, then restart your Healthservice.
The symptoms can occur with all versions of the SCOM / MMA agent under the agent packaged with a server image scenario.
Kevin Holman recently published a great article about the inherent pitfalls of importing a management pack into an environment without understanding the intended scope, scalability, and any known/common issues.
Specifically he discusses the Dell Hardware Management Pack (Detailed Edition) which has a small scalability limitation of 300 agents.
The lesson to learn here is – be careful when importing MP’s. A badly written MP, or an MP designed for small environments, might wreak havoc in larger ones. Sometimes the recovery from this can be long and quite painful. An MP that tests out fine in your Dev SCOM environment might have issues that wont be seen until it moves into production. You should always monitor for changes to a production SCOM deployment after a new MP is brought in, to ensure that you don’t see a negative impact. Check the management server event logs, MS CPU performance, database size, and disk/CPU performance to see if there is a big change from your established baselines.
Go here for the full article, definitely worth the read.
All Linux monitored servers in a critical state is not an ideal way to start a Monday morning. Especially when none of the servers are actually experiencing an issue.
The issue at hand:
All of the Linux servers generated a heartbeat failure at the same time. Looking through the health explorer revealed the following error:
The WinRM client cannot process the request because the server name cannot be resolved.
Testing WinRM with the following command also yielded the same result, and testing with DNS resolved the server name successfully.
WinRM uses the windows proxy to resolve host names, I checked the windows proxy settings on the Management Server using the following command.
netsh winhttp show proxy
and discovered that my proxy was set correctly but the bypass list for excluded servers had been replaced with a single server, using the below command I was able to amend the bypass list to include all of the local domain servers.
netsh winhttp set proxy proxy-server=”http=<proxy FQDN” bypass-list=”*<Domain Suffix>”
One that was completed the WinRM test returned the correct data and the servers started to turn green again.
I came across a great article on TechNet which is essentially a compilation of useful SCOM information, it includes everything from the basics and key concepts, to deployment guidelines and information on how to configure and use different features.
Here is an article from the OpsMgr Engineering Blog detailing an issue with SCOM discovering Ckuster Resources
“If a Cluster has orphaned object entries in ClusterHive registry key, the System Center Operations Manager agent may not discover some or all Cluster resources. This can occur with System Center Operations Manager 2007 (OpsMgr 2007) or System Center 2012 Operations Manager (OpsMgr 2012).”
The article details a fix which envolves locating and then removing the orphaned objects:
“First, use the Failover Cluster PowerShell commands Get-ClusterResource and Get-ClusterGroup to get the list of Resources and Groups. Then, using the output, check for Resources/Groups that appear as Offline and verify if these can be seen in the Failover Cluster Console. Verify with the Cluster administrator whether these are still valid, then assuming they are not and you’ve identified which ones that are orphaned, delete them using these commands from an elevated CMD Prompt (Run As Administrator):
For orphaned resources: Cluster RES “<RESOURCE_NAME>” /DELETE
For orphaned groups: Cluster GROUP “<GROUP_NAME>” /DELETE
Once this is done the missing cluster resources should now be discovered.”
A while ago I had an issue at a customer where their SCOM console would experience huge performance degradation at random intervals.
This is a slow and complex situation to troubleshoot which is why I am glad to see a comprehensive and well laid out Troubleshooting plan from Marnix Wolf. This is a must read to better understand the areas that can impact your SCOM environment performance and more importantly your user experience, as people don’t want to use a console that’s slow.
Stefan Roth has written a great article on how to overcome some of the issues evolved with configuring SNMP monitoring in SCOM 2012. He covers the requirements as well as includes a step-by-step in order to avoid making the mistake of choosing the incorrect SNMP version.
Sometimes you might have a situation where all of your agents are showing as healthy in the console but when you try and draw a performance report data is missing.
The below SQL query which has been developed by my colleague Gerrie Louw will identify any agent that has not submitted performance data in the past 4 hours. It does so by checking the following performance counters:
Processor > % Processor Time
LogicalDisk > % Free Space > C:
Memory > Available MBytes
Note: You will probably have to change the DisplayName_ and IsVirtualNode for your OperationsManager database.
if object_id(‘tempdb..#temptable’) IS NOT NULL
DROP TABLE #temptable
SELECT distinct bmetarget.Name into #temptable
FROM OperationsManager.dbo.BaseManagedEntity AS BMESource WITH (nolock) INNER JOIN
OperationsManager.dbo.Relationship AS R WITH (nolock) ON
R.SourceEntityId = BMESource.BaseManagedEntityId INNER JOIN
OperationsManager.dbo.BaseManagedEntity AS BMETarget WITH (nolock) ON
R.TargetEntityId = BMETarget.BaseManagedEntityId inner join mtv_computer d on bmetarget.name=d.[DisplayName_55270A70_AC47_C853_C617_236B0CFF9B4C]
and d.IsVirtualNode_E817D034_02E8_294C_3509_01CA25481689 is null
WHERE (bmetarget.fullname like ‘Microsoft.Windows.Computer%’)
if object_id(‘tempdb..#healthstate’) IS NOT NULL
DROP TABLE #healthstate
select megv.path, megv.ismanaged, megv.isavailable, megv.healthstate into #healthstate
from managedentitygenericview as megv with (nolock) inner join managedtypeview as mtv with (nolock)
on megv.monitoringclassid=mtv.id
where mtv.name =’microsoft.systemcenter.agent’
if object_id(‘tempdb..#perfcpudata’) IS NOT NULL
DROP TABLE #perfcpudata
select Path, ‘CPU’ as ‘Cat’ into #perfcpudata
from PerformanceDataAllView pdv with (NOLOCK)
inner join PerformanceCounterView pcv on pdv.performancesourceinternalid = pcv.performancesourceinternalid
inner join BaseManagedEntity bme on pcv.ManagedEntityId = bme.BaseManagedEntityId
where (TimeSampled < GETUTCDATE() AND TimeSampled > DATEADD(MINUTE,-240, GETUTCDATE()))
and objectname =’Processor’ and countername=’% Processor Time’
if object_id(‘tempdb..#perfmemdata’) IS NOT NULL
DROP TABLE #perfmemdata
select Path,’Memory’ as ‘Cat’ into #perfmemdata
from PerformanceDataAllView pdv with (NOLOCK)
inner join PerformanceCounterView pcv on pdv.performancesourceinternalid = pcv.performancesourceinternalid
inner join BaseManagedEntity bme on pcv.ManagedEntityId = bme.BaseManagedEntityId
where (TimeSampled < GETUTCDATE() AND TimeSampled > DATEADD(MINUTE,-240, GETUTCDATE()))
and objectname =’Memory’ and countername=’Available MBytes’
if object_id(‘tempdb..#perfdiskdata’) IS NOT NULL
DROP TABLE #perfdiskdata
select Path,’Disk’ as ‘Cat’ into #perfdiskdata
from PerformanceDataAllView pdv with (NOLOCK)
inner join PerformanceCounterView pcv on pdv.performancesourceinternalid = pcv.performancesourceinternalid
inner join BaseManagedEntity bme on pcv.ManagedEntityId = bme.BaseManagedEntityId
where (TimeSampled < GETUTCDATE() AND TimeSampled > DATEADD(MINUTE,-240, GETUTCDATE()))
and objectname =’LogicalDisk’ and countername=’% Free Space’ and instancename=’C:’
if object_id(‘tempdb..#temptable1′) IS NOT NULL
DROP TABLE #temptable1
create table #temptable1 (
name nvarchar(250),
cat nvarchar(20),
val nvarchar(2)
)
insert into #temptable1
select name, ‘CPU’ as ‘cat’, ’1′ as ‘val’
from #temptable where name not in
(select path from #perfcpudata)
insert into #temptable1
select name, ‘Memory’ as ‘cat’, ’1′ as ‘val’
from #temptable where name not in
(select path from #perfmemdata)
insert into #temptable1
select name, ‘Disk’ as ‘cat’, ’1′ as ‘val’
from #temptable where name not in
(select path from #perfdiskdata)
if object_id(‘tempdb..#output’) IS NOT NULL
DROP TABLE #output
create table #output (
name nvarchar(250),
cpu nvarchar(2),
memory nvarchar(2),
disk nvarchar(2)
)
insert into #output
select distinct tt.name ,’0′,’0′,’0′
from #temptable1 as tt, #healthstate as hs
where tt.name=hs.path collate SQL_Latin1_General_CP1_CI_AS
and hs.isavailable=1
and hs.ismanaged=1
and hs.healthstate is not null
update #output set cpu=1 where #output.name in (select name from #temptable1 where #temptable1.name=#output.name and #temptable1.cat=’CPU’)
update #output set memory=1 where #output.name in (select name from #temptable1 where #temptable1.name=#output.name and #temptable1.cat=’Memory’)
update #output set disk=1 where #output.name in (select name from #temptable1 where #temptable1.name=#output.name and #temptable1.cat=’Disk’)
select * from #output
You can use this query to build a report such as the one sampled below:
Here is a fantastic technet blog with the top Microsoft Support solutions for the most common issues experienced when using System Center 2012 Operations Manager (updated quarterly).
Definitely one to have a look through occasional to see what the top issues are, that are being experienced with SCOM.