Friday, 17 May 2013

The most common reasons for failures with Dynamic Logical Partitioning


I need to add or remove processors, memory, or I/O devices from an LPAR using the HMC. It's not working, and I would like to know why.


Most likely, the cause is a broken RMC connection between the HMC and the LPAR.


Starting point for troubleshooting problems with Dynamic Logical Partitioning.
The procedures listed below apply to POWER4, POWER5, and POWER6 HMCs.

The most common cause is a broken RMC connection between the HMC and the LPAR.

The first place to check is the HMC, using queries from the HMC restricted shell command prompt:
# lspartition -dlpar
If you get no output at all, there is an RMC problem affecting all LPARs attached to this particular HMC. In that case, close all Serviceable Events (under Service Focal Point) and reboot the HMC:
# hmcshutdown -r -t now
Once the HMC reboots, wait about 15 minutes and re-run:
# lspartition -dlpar
If there is still no output, you should probably open a call with technical support.

In order for RMC to work, port 657 UDP/TCP must be open in both directions between the HMC public interface and the LPAR.
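A quick way to probe the TCP side of this from the LPAR is a sketch like the one below. It assumes bash (or ksh93, both of which support the /dev/tcp pseudo-device) and the coreutils timeout command are available; the hostname hmc.example.com is only a placeholder for your HMC's public interface. Note that this only tests TCP — a UDP 657 block cannot be detected this way.

```shell
# Probe TCP port 657 toward the HMC. hmc.example.com is a placeholder;
# substitute the hostname or IP of your HMC's public interface.
check_rmc_port() {
    local host="$1"
    # /dev/tcp/<host>/<port> is a bash/ksh93 feature, not a real file.
    if timeout 3 bash -c "echo > /dev/tcp/${host}/657" 2>/dev/null; then
        echo "TCP 657 to ${host}: open"
    else
        echo "TCP 657 to ${host}: blocked or closed"
    fi
}

check_rmc_port hmc.example.com
```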

Look for the partition in question. For DLPAR to function, the partition must appear in the output, and it must appear with the correct IP address of the LPAR. In addition, the Active value must be greater than zero, and the DCaps value must be greater than 0x0.
Example of a working lpar

<#1> Partition:<11*9117-570*10XXXX, correct_hostname.domain, correct_ip>
Active:<1>, OS:<AIX, 5.3, 5.3>, DCaps:<0x3f>, CmdCaps:<0xb, 0xb>, PinnedMem:<146>

Example of non-working lpar

<#9> Partition:<10*9117-570*10XXXX, hostname, ip>
Active:<0>, OS:<, , >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<0>
There are too many reasons why an RMC connection breaks to list in one document. However, if you see the condition in the second example (and DLPAR is working for other LPARs on this HMC), the next step is to check the RMC status from the LPAR (AIX root access will be needed):
# lssrc -a | grep rsct
ctcas rsct inoperative
ctrmc rsct inoperative
IBM.ERRM rsct_rm inoperative
IBM.HostRM rsct_rm inoperative
IBM.ServiceRM rsct_rm inoperative
IBM.CSMAgentRM rsct_rm inoperative
IBM.DRM rsct_rm inoperative
IBM.AuditRM rsct_rm inoperative
IBM.LPRM rsct_rm inoperative
Here we see that all of the RSCT daemons are inoperative. In many cases, you will see some active and some missing. The key subsystem for dynamic logical partitioning is IBM.DRM; it must be active.

This is the correct method to stop and start RMC without erasing the configuration.
# /usr/sbin/rsct/bin/rmcctrl -z
# /usr/sbin/rsct/bin/rmcctrl -A
# /usr/sbin/rsct/bin/rmcctrl -p
Now repeat
# lssrc -a | grep rsct
Is IBM.DRM active now? If so, there is a good chance the problem is resolved.
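That check can be scripted with a tiny helper (hypothetical, not an IBM tool) that looks for an active IBM.DRM line in the lssrc output:

```shell
# Succeeds (exit 0) when the piped-in lssrc output shows IBM.DRM active.
drm_active() {
    grep "IBM.DRM" | grep -q "active"
}

# Demo against a captured sample line; the real check on the LPAR is:
#   lssrc -a | drm_active
sample="IBM.DRM          rsct_rm          1234    active"
if echo "$sample" | drm_active; then
    echo "IBM.DRM is active - DLPAR should work"
else
    echo "IBM.DRM is not active"
fi
```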

Go back to the HMC restricted shell command prompt
# lspartition -dlpar
Now the partition shows the correct hostname and IP, Active:<1>, and a DCaps value of 0x3f.

These values mean the partition is capable of a DLPAR operation.
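When an HMC has many LPARs, scanning lspartition -dlpar output by eye is error-prone. The sketch below flags any entry with Active:<0> or DCaps:<0x0>; it is fed a sample mirroring the two examples above (the hostnames and IPs in the sample are placeholders). On the HMC you would pipe the real output in: lspartition -dlpar | flag_broken

```shell
# Flag LPARs whose lspartition -dlpar entry shows Active:<0> or DCaps:<0x0>.
flag_broken() {
    awk '
        /^<#/ { part = $0 }            # remember the Partition:<...> line
        /Active:/ {
            if ($0 ~ /Active:<0>/ || $0 ~ /DCaps:<0x0>/)
                print "DLPAR broken: " part
        }'
}

flag_broken <<'EOF'
<#1> Partition:<11*9117-570*10XXXX, host1.example.com, 10.0.0.11>
Active:<1>, OS:<AIX, 5.3, 5.3>, DCaps:<0x3f>, CmdCaps:<0xb, 0xb>, PinnedMem:<146>
<#9> Partition:<10*9117-570*10XXXX, host9, 10.0.0.19>
Active:<0>, OS:<, , >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<0>
EOF
```

Only the second entry is reported, since the first has Active:<1> and DCaps:<0x3f>.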

*** Other notes *** An LPAR cloned from a mksysb may still carry the RMC configuration of the mksysb source. In this case you may see IBM.DRM listed as active, but if DLPAR does not work after running the three rmcctrl commands listed above, the daemons were never really active for this LPAR.

**** Using the recfgct command ****

recfgct deletes the RMC database, does a discovery, and recreates the RMC configuration.

In nearly all cases, recfgct is safe to run on a production system. There are just a few cases where you would not use recfgct: if the LPAR is a CSM Management Server, or if the LPAR has RMC Virtual Shared Disks (VSDs). VSDs are usually only found in very large GPFS clusters. If you are using VSDs, these filesets will be installed on your AIX system: rsct.vsd.cmds, rsct.vsd.rvsd, rsct.vsd.vsdd, and rsct.vsd.vsdrm.
# lslpp -L | grep vsd
If there is no output, you are not using VSDs.

The other rarely used application that recfgct can interrupt, though without significant consequences, is CSM: the node may be a CSM management node or a CSM client node. All AIX LPARs should have these filesets:
# lslpp -L | grep csm
csm.client C F Cluster Systems Management
csm.core C F Cluster Systems Management
csm.deploy C F Cluster Systems Management
csm.diagnostics C F Cluster Systems Management
csm.dsh C F Cluster Systems Management Dsh
csm.gui.dcem C F Distributed Command Execution
If you have additional filesets that start with csm, such as csm.server, csm.hpsnm, csm.ll, or csm.gpfs, then you may have an LPAR that is part of a larger CSM cluster. The csm.server fileset should only be installed on a CSM Management Server. The following are a few additional checks you can perform to see whether you have a Management Server configured.
# csmconfig -L
If this returns "csmconfig not found", this is not a CSM server.
# lsrsrc IBM.ManagementServer
This lists the resources that manage the LPAR, including the HMC and/or a CSM server. Look at the ManagerType field: ManagerType = "CSM" means this is a CSM node.

So if it turns out your node is a CSM management server, you would have to re-add all of the nodes afterwards. If the system is a CSM client node, you would need to get onto the management server and re-add the node.
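The ManagerType check can also be scripted. The helper below (a sketch, not an IBM tool) scans piped-in lsrsrc IBM.ManagementServer output; on the LPAR you would run: lsrsrc IBM.ManagementServer | csm_managed. The sample entry and its hostname are placeholders.

```shell
# Warn if the lsrsrc IBM.ManagementServer output shows a CSM manager.
csm_managed() {
    if grep -q 'ManagerType *= *"CSM"'; then
        echo "This LPAR is a CSM node - re-add it to the cluster after recfgct"
    else
        echo "No CSM manager found - recfgct should be safe"
    fi
}

# Demo with a sample HMC-managed entry (hostname is a placeholder)
csm_managed <<'EOF'
resource 1:
        ManagerType = "HMC"
        Hostname    = "hmc1.example.com"
EOF
```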

That's it for the warnings on recfgct. If you think you might be using VSDs and/or a CSM cluster but are not sure, please open a PMR and support can assist you in determining this.
Assuming you have no reason to be concerned about the warnings discussed above, proceed:

# /usr/sbin/rsct/install/bin/recfgct

Wait several minutes

# lssrc -a | grep rsct

If you see IBM.DRM active, then you have probably resolved the issue.

# lsrsrc IBM.ManagementServer

The output should now list the HMC as a management server for this LPAR.
Try the DLPAR operation again. If it fails, you will likely need to open a software PMR.

The other main reason for a DLPAR failure is that the LPAR has reached its minimum or maximum (for processors or memory).
Note: The partition profile does not give a true picture of the current running configuration. If the profile was edited, but the partition was not shut down to a "Not Activated" state and then reactivated, the profile edits have not been read.

To check the current running configuration, check the Partition Properties instead of the profile properties. There you will see the minimum, maximum, and current values. You cannot remove or add processors or memory outside these boundaries. The commands to check the running properties from the HMC restricted shell are listed here:
# lssyscfg -r sys -F name
(you need the value of name for use with the -m flag on many HMC commands)

# lshwres -r proc -m <server_name> --level lpar
(this lists just the LPARs' processor settings)
# lshwres -r proc -m <server_name> --level sys
(this lists the entire server's processor settings)

If you are checking for memory, replace "proc" in the above commands with "mem"
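To see at a glance how far each LPAR is from its boundaries, the lshwres output can be post-processed. The sketch below assumes the HMC's -F flag is used to emit colon-separated fields; the field names curr_procs, curr_min_procs, and curr_max_procs are assumptions from memory — confirm them with lshwres --help on your HMC level. On the HMC you would pipe in the real output, e.g.:
lshwres -r proc -m <server_name> --level lpar -F lpar_name:curr_procs:curr_min_procs:curr_max_procs | headroom

```shell
# Given lines of the form name:current:min:max (colon-separated, as the
# HMC's -F flag produces), report how much can still be removed or added.
headroom() {
    awk -F: '{
        printf "%s: can remove %d, can add %d\n", $1, $2 - $3, $4 - $2
    }'
}

# Demo with sample values (lpar names and numbers are made up)
headroom <<'EOF'
lpar1:4:2:8
lpar2:2:2:2
EOF
# -> lpar1: can remove 2, can add 4
# -> lpar2: can remove 0, can add 0
```

An LPAR reporting "can add 0" (like lpar2) is at its maximum, and a DLPAR add will fail until the profile is changed and the partition is reactivated.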

DLPAR can fail for many reasons, and it may be necessary to contact Remote Technical Support. However, the steps above may solve your problem.


  1. One of the best documents on the DLPAR failure topic. Nicely written. Keep up the good work.

  2. Agreed. This works on POWER7 HMCs too.