Category Archives: Troubleshooting

SCOM 2016: Error while connecting to management server: The client has been disconnected from the server

When trying to install a second management server in a SCOM 2016 management group, the wizard would return to the SQL details screen a few moments after I entered the OpsDB details.

A brief investigation revealed the following error in the installation log file (OpsMgrSetupWizard.txt, your best friend for troubleshooting SCOM installations):

Info:Error while connecting to management server: The client has been disconnected from the server. Please call ManagementGroup.Reconnect() to reestablish the connection.
Error: :Couldn’t connect to mgt server stack: : Threw Exception.Type: Microsoft.EnterpriseManagement.Common.ServerDisconnectedException, Exception Error Code: 0x80131500, Exception.Message: The client has been disconnected from the server. Please call ManagementGroup.Reconnect() to reestablish the connection.

I first confirmed that the Data Access service was in fact running on the original management server and that a connection could be made to the SDK.

It turns out that the two servers’ clocks were out of sync by more than 5 minutes, causing a Kerberos time skew. After correcting the server time, the second management server installed with no issues.
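As a rough illustration of the check described above, here is a minimal Python sketch that compares two clock readings against a 5-minute tolerance. The function name and the sample timestamps are hypothetical, and the tolerance is simply the figure mentioned in this post; how you collect the two readings from your servers is up to you.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerance: the 5-minute figure mentioned in the post.
MAX_SKEW = timedelta(minutes=5)


def exceeds_skew(time_a: datetime, time_b: datetime,
                 max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True when the two clock readings differ by more than max_skew."""
    return abs(time_a - time_b) > max_skew


if __name__ == "__main__":
    # Hypothetical readings taken from the two management servers.
    existing_ms = datetime(2017, 1, 10, 9, 0, 0, tzinfo=timezone.utc)
    new_ms = existing_ms + timedelta(minutes=7)
    print(exceeds_skew(existing_ms, new_ms))
```

Comparing timezone-aware UTC values avoids false positives when the two servers sit in different time zones.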

SCOM and Deprecating SHA1 Certificates: what you need to know!

With the recent deprecation of SHA1 certificates in favour of the more secure SHA2, it’s important to know that SCOM uses SHA1 to manage workloads for cross-platform monitoring (Unix / Linux).

Have no fear: the MS SCOM team has released an article on how to replace your existing SHA1 certificates with the newer SHA256 certificates.

It is important to note that you will need to update your 2012 R2 environments to UR12 and your 2016 environments to UR2 in order to use the new SHA256 certificates by default.

The article is available here

Warning! October Windows updates causing SCOM console crash on all Windows versions

There is an issue with the October cumulative updates (KB3194798, KB3192392, KB3185330 & KB3185331) for all Windows versions which causes the SCOM console to crash.

Thank you to Dirk Brinkman for blogging about this issue.

The Product Group is aware of this issue and is working on a fix. Unfortunately I do not have an ETA for it. You will find an announcement on the SCOM Team blog (https://blogs.technet.microsoft.com/momteam/) once the fix for this issue is available.
All credits and thanks to my colleague Mihai Sarbulescu for finding this issue!

Update from Dirk Brinkman: The product group released a hotfix for this issue: https://support.microsoft.com/en-us/kb/3200006.

SCOM: Version 2.0 of Tim Culham’s SCOM health check script

Tim Culham has been promising version 2 of his health check script for a while now and, let me tell you, it was worth the wait. It offers a great overview of the health of a SCOM management group on a single page. Get it here.

Great stuff Tim.

Features:

  • A Data Volume graph where you can instantly see the amount of Alerts, Events, Performance Data and State Changes over the last 7 days
  • The Health State of your SCOM Agents
  • A Graph of your Alert Statistics – how many Open or Closed Alerts
  • Your Management Server Health, Versions, Server Uptime & the number of Workflows they are running. Now updated to include Gateways! Also CPU, Disk and Memory Graphs for each!
  • Any Open Warning or Critical Alerts for your Gateways and Management Servers.
  • The Top 5 Alerts by Repeat Count (use this to identify recurring problems)
  • The Top 5 Events by Computer (see which computers are the noisiest)
  • Your Operational Database & Data Warehouse Servers…how much space they are using, free space, file sizes and locations.
  • Your Operational Database and Data Warehouse Backups. Make sure they are running.
  • Are the Databases being groomed? You’ll be able to tell instantly!
  • Find out what is using all of the space in your Operational Database & Data Warehouse Databases.
  • The Reporting Server and Web Console Server URLs and if they are OK.
  • The Status of your Scheduled Reports
  • If there are any Overrides in the ‘Default Management Pack’
  • What Discoveries ran in the last 24 Hours and what Properties were changed?
  • Identify agents that are lower than the highest version installed. Use this to see which agents should be upgraded to the most current version
  • Identify Agents that are not Remoteable.
  • Is Agent Proxying Enabled?
  • What Objects are in Maintenance?
  • And it’s available in 2 different colors!

SCOM: SQL Dashboards workaround for slow performance

Microsoft has finally officially recommended a workaround that some of us have been using for some time to keep the SQL dashboards in a usable state.

Dashboards may work slowly if used rarely

Issue: When used rarely or after a long break, the dashboards may work rather slowly because a large amount of collected data has to be processed; this especially affects large environments (2000+ objects).

Resolution: Below is a “warming up” script, which may be used to create an SQL job that runs on a schedule. Before scheduling it as an SQL job, please test how long these queries take to execute (scheduling the job too often, or an execution time that is too long, may kill the performance). If you have dashboards with thousands of objects to load, then the time to load the content will be 10+ seconds anyway. It was tested with 600 000 objects, and the dashboard loading time was 1-2 minutes.

USE [OperationsManagerDW]

EXECUTE [sdk].[Microsoft_SQLServer_Visualization_Library_UpdateLastValues]

EXECUTE [sdk].[Microsoft_SQLServer_Visualization_Library_UpdateHierarchy]

It is also worth noting that the following versions of SQL Server Management Pack are considered as deprecated and suspended:

  • 1.314.35
  • 1.400.0
  • 3.173.0
  • 3.173.1
  • 4.0.0
  • 4.1.0
  • 5.1.0
  • 5.4.0
  • 6.0.0
  • 6.2.0
  • 6.3.0

XPost: Warning Base OS MP version 6.0.7303.0!!!

Kevin Holman updated his MP post with the following warning:

***WARNING***  There are some significant issues in this release of the Base OS MP, I do not recommend applying this one until an updated version comes out.

Issues:

  • Cluster Disks on Server 2008R2 clusters are no longer discovered as cluster disks.
  • Cluster Disks on Server 2008 clusters are not discovered as logical disks.
  • Quorum (or small size) disks on clusters that ARE discovered as Cluster disks, do not monitor for free space correctly.
  • Cluster shared volumes are discovered twice, once as a Cluster Shared Volume instance, and once as a Logical disk instance, with the latter likely caused by enabling mounted disk discovery.
  • On Hyper-V servers, I discover an extra disk, which has no properties:

So best to hold off on this one, folks. This of course comes back to some big questions about MP quality control, as we’ve had many issues with the recent SQL MP releases and now this.

SCOM: Monitoring a Fortigate Firewall

A while ago I had a request from one of my clients to monitor their new Fortigate firewalls. As there is no existing management pack for this, it required a bit of custom work.

First, on the firewall, you’ll need to configure SNMP, as well as which trap notifications will be sent.

Then discover the Fortigate using the standard network monitoring discovery.

This is the address for the Fortigate MIB file contents which you will need in order to map OIDs for the next part.

In SCOM, create an SNMP trap alerting rule targeting the Node class.

For now, leave the OID properties filter empty.

This rule will be used to identify any OIDs in the future that may be missing from your specific alerting rules.

Now, using the MIB list provided earlier, each alert ticked in the Fortigate configuration needs to be mapped to the relevant OID, and a specific alerting rule created for it. For example, 1.3.6.1.4.1.12356.101.4.4.2.1.2 is the OID for High Processor Usage, so to generate an alert for high CPU on the Fortigate you will need a rule with this specific OID in its filter.

Repeat for each OID that you need to monitor, and use the catch-all rule to identify anything you may have missed.
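To make the "specific rules plus catch-all" idea concrete, here is a small Python sketch of the same logic outside SCOM. It is only an illustration: the dictionary and function names are hypothetical, and the only OID in the table is the one quoted in this post. The rest would come from the Fortigate MIB.

```python
# Hypothetical lookup table of trap OIDs to alert names. Only the first
# entry comes from this post; the rest would be filled in from the MIB.
SPECIFIC_RULES = {
    "1.3.6.1.4.1.12356.101.4.4.2.1.2": "High Processor Usage",
}


def classify_trap(oid: str, rules: dict = SPECIFIC_RULES) -> str:
    """Return the alert name for a known OID, or a catch-all label."""
    # Some tools print OIDs with a leading dot, so normalise first.
    normalized = oid.strip().lstrip(".")
    return rules.get(normalized, f"Unmapped trap (catch-all): {normalized}")


if __name__ == "__main__":
    print(classify_trap(".1.3.6.1.4.1.12356.101.4.4.2.1.2"))
    print(classify_trap("1.3.6.1.4.1.12356.999"))  # made-up OID for the demo
```

Anything that falls through to the catch-all label is a candidate for its own specific rule.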

XPost: Event 18054 errors in the SQL application log

Here is a great post by Kevin Holman addressing an issue you would come across if you have had to move your SCOM databases or recover them to a new SQL server.

Sample error:

Log Name:      Application
Source:        MSSQL$I01
Date:          10/23/2010 5:40:14 PM
Event ID:      18054
Task Category: Server
Level:         Error
Keywords:      Classic
User:          OPSMGR\msaa
Computer:      SQLDB1.opsmgr.net
Description:
Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

In essence, as Kevin explains, this happens because the messages are created in sys.messages in the master database at installation. These messages will then be missing after a database move or a recovery to a new SQL server.
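If you are collecting these events in bulk, a short Python sketch like the one below can pull the error number out of the event description and flag whether it falls in the user-defined range (above 50000) that the event text itself refers to. The function name and the returned dictionary layout are hypothetical, and the pattern assumes the description wording shown in the sample error above.

```python
import re

# Matches the wording of the sample event description shown above.
ERROR_PATTERN = re.compile(
    r"Error (\d+), severity (\d+), state (\d+) was raised"
)


def parse_missing_message(description: str):
    """Extract error number, severity and state from an 18054 description.

    Returns None when the description does not match the expected wording.
    """
    match = ERROR_PATTERN.search(description)
    if match is None:
        return None
    error, severity, state = (int(group) for group in match.groups())
    return {
        "error": error,
        "severity": severity,
        "state": state,
        # Per the event text, numbers above 50000 are user-defined messages.
        "user_defined": error > 50000,
    }


if __name__ == "__main__":
    sample = ("Error 777980007, severity 16, state 1 was raised, but no "
              "message with that error number was found in sys.messages.")
    print(parse_missing_message(sample))
```

The distinct error numbers this yields give you a list of the messages that are missing on the new SQL server.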

Kevin has also provided a script to re-add these messages for you here – CAUTION for SCOM 2012 R2 only

SCOM: Alerting on Exchange 2013 Database Failover

As it turns out, the Exchange 2013 MP will not alert you should your Exchange 2013 databases fail over. This is by design, as Microsoft does not consider this condition to be an issue.

There is a great article by Scott at flobee.net which addresses this issue for Exchange 2010, and it is quite simple to apply the same thinking to Exchange 2013. The event is the same; the target just needs to be the Exchange 2013 server.

SCOM: The System Center Management service terminated with service-specific error %%-2130771964

Just a quick issue which bears noting.

A colleague of mine had an issue where the health service on one of his management servers would not start. The error displayed was “The System Center Management service terminated with service-specific error %%-2130771964”.

The resolution is simple: rename the Health Service State folder and then start the service.

This issue is caused by corruption in the health service cache, which prevents the service from starting.
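If you need to do this on more than one server, the rename step can be scripted. Below is a minimal Python sketch under a few assumptions: the helper name is hypothetical, the folder path is passed in by you (it depends on where the server components are installed), and the service is already stopped when the function runs. Renaming rather than deleting keeps the old cache around in case you want to inspect it.

```python
from datetime import datetime
from pathlib import Path


def rename_state_folder(state_dir: Path) -> Path:
    """Rename state_dir to '<name>.old-<timestamp>' and return the new path.

    Assumes the service that owns the folder is stopped.
    """
    if not state_dir.is_dir():
        raise FileNotFoundError(f"{state_dir} is not a directory")
    # A timestamp suffix avoids clashing with an earlier renamed copy.
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    backup = state_dir.with_name(f"{state_dir.name}.old-{stamp}")
    state_dir.rename(backup)
    return backup
```

Call it with the full path to the Health Service State folder on the affected server, then start the service so that it rebuilds the cache.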