Showing posts with label AIX. Show all posts

Saturday, 2 May 2015

Where has my space gone in an AIX/Linux filesystem?


A friend of mine ran into a situation where one of her filesystems, with 9 GB allocated, showed as 100% utilized, yet the "du" command reported only 4 GB of actual usage.
# df -k /mytest
Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
/dev/mytestlv  9216 9216 100% 48804 12% /mytest 
She wondered where the other 5 GB had gone.

Reason:

This happens when a process has a file open and is writing data to it, and the file is removed while the process still holds it open. The directory entry is gone, so "du" no longer counts the file, but the filesystem cannot release the blocks until the process closes the file or exits, so "df" still reports them as used.

How to rectify?

First, check which processes are using the filesystem with the "fuser" command.
# fuser -c /mytest
/mytest:  2567 4006c 6548c 8657
You need to kill the above processes if you want to free up the space.

Note: Inform the respective application owner/support team and arrange application downtime first if this filesystem is used by any application.

How to kill the processes?

# fuser -kc /mytest
This kills all the processes using the filesystem, and the space is freed.

Check the space now:
# df -k /mytest
Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
/dev/mytestlv  9216 4096 45% 48804 12% /mytest 
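
The effect is easy to reproduce on any POSIX system. The sketch below opens a file on descriptor 3, deletes it, and shows that the process can still write to it even though the directory is now empty. The function name demo_deleted_but_open and the temporary directory are illustrative and not part of the original case. To find such files on a real system, "fuser -dV /mytest" on AIX or "lsof +L1" on Linux should list open-but-unlinked files; check the man page on your release, as flag support varies.

```shell
# Demonstrates that a deleted file stays alive while a process holds it open.
demo_deleted_but_open() {
    demo_dir=$(mktemp -d) || return 1
    exec 3>"$demo_dir/big.log"        # open the file on descriptor 3
    echo "first write" >&3
    rm "$demo_dir/big.log"            # remove the directory entry only
    if echo "second write" >&3; then  # the descriptor is still valid
        echo "write after rm: ok"
    fi
    echo "entries left in directory: $(ls -A "$demo_dir" | wc -l | tr -d ' ')"
    exec 3>&-                         # closing the descriptor releases the space
    rmdir "$demo_dir"
}

demo_deleted_but_open
```

Until descriptor 3 is closed, the blocks of big.log count against the filesystem in "df" but are invisible to "du".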

Sunday, 11 January 2015

Not enough free space to shrink the file system issue in AIX


Recently I hit an issue while reducing a JFS2 filesystem on AIX 6.1, even though the filesystem had enough free space for the reduction.
root@umaix /tmp>df -g /orafs1
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/oralv1   100.00    75.00   25%      555     1%  /orafs1

root@umaix /tmp>chfs -a size=-15G /orafs1
chfs: There is not enough free space to shrink the file system.
This error occurs when you try to shrink the filesystem by a large amount (15 GB in this case) and the free space is not contiguous because files are scattered all over the filesystem.

Try the following methods one by one until the issue is fixed.

1. Try to defrag the FS:

# defragfs -s /orafs1     (reports the fragmentation only)
# defragfs /orafs1        (defragments the filesystem)

2. Reduce in smaller chunks:

If you still can't reduce it after defragmenting, try reducing the filesystem in smaller chunks. Instead of 15 GB at a time, reduce by 1 or 2 GB, then repeat the operation.
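
Repeating the small reduction by hand gets tedious, so here is a minimal sketch of a loop that does it. The function name shrink_in_steps and the CHFS override variable are made up for illustration; CHFS only exists so the loop can be dry-run with a stand-in command before pointing it at the real chfs.

```shell
# shrink_in_steps FILESYSTEM TOTAL_GB STEP_GB
# Calls "chfs -a size=-<N>G" repeatedly until TOTAL_GB has been removed,
# stopping at the first failure.
shrink_in_steps() {
    fs=$1 total=$2 step=$3
    [ "$step" -gt 0 ] || return 2          # avoid an endless loop
    shrunk=0
    while [ "$shrunk" -lt "$total" ]; do
        chunk=$step
        if [ $(( total - shrunk )) -lt "$step" ]; then
            chunk=$(( total - shrunk ))    # last, smaller chunk
        fi
        ${CHFS:-chfs} -a size="-${chunk}G" "$fs" || return 1
        shrunk=$(( shrunk + chunk ))
    done
}

# Example (not executed here): shrink /orafs1 by 15 GB, 2 GB at a time
#   shrink_in_steps /orafs1 15 2
```

If a step fails partway, the filesystem simply stays at the size reached so far, and you can move on to the next method.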

3. Check the processes:

Sometimes processes hold big files open and use a lot of temporary space in the filesystem.
Check which processes/applications are running against the filesystem and stop them temporarily, if you can.
# fuser -cu[x] <filesystem>

4. Move the large files and try to shrink

Look for large files with the find command and move them out temporarily, to see whether the filesystem can be shrunk without them. Here -size +2048 matches files larger than 2048 512-byte blocks (1 MB), and the sort orders them by size in bytes (column 7 of the -ls output), largest first:
# find /<filesystem> -xdev -size +2048 -ls | sort -rn -k7,7 | pg

If none of the above methods works, the last resort is to recreate the filesystem.

==> Be very careful: take a backup of the filesystem and consult the application team before removing it.

5) Recreate filesystem:

  • Take a backup of the filesystem data (very important, don't skip this).
    You can either use your backup tools, such as TSM or NetBackup, or move the data to a temporary directory.

  • Unmount and remove the filesystem:
       # umount /orafs1
       # rmfs /orafs1
  • Create the filesystem again:
       # mklv -y oralv1 -t jfs2 oravg 600  (we need 75 GB and the PP size is 128 MB, so 600 PPs)
       # crfs -v jfs2 -d oralv1 -m /orafs1 -A yes  (create the orafs1 filesystem)

  • Mount the filesystem (# mount /orafs1) and restore the data to it
  • Verify the filesystem size:

    root@umaix /tmp>df -g /orafs1
    Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
    /dev/oralv1   75.00    50.00   33%      555     1%  /orafs1
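
The figure 600 passed to mklv is the number of physical partitions: 75 GB divided by the 128 MB PP size. A small helper for that arithmetic is sketched below. The function name pps_needed is made up for illustration, and the PP size of your volume group should be read from "lsvg <VGname>" rather than assumed.

```shell
# pps_needed SIZE_GB PP_SIZE_MB -> number of physical partitions, rounded up
pps_needed() {
    size_mb=$(( $1 * 1024 ))
    echo $(( (size_mb + $2 - 1) / $2 ))
}

pps_needed 75 128    # the case in this post
```

For 75 GB and a 128 MB PP size this prints 600, which matches the mklv command used here.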

Wednesday, 7 January 2015

How to mirror VIOS Boot Disk?

Here is the procedure to mirror the VIOS boot disk.
# lspv
NAME             PVID                 VG               STATUS
hdisk0           00c122d4341c6e62     rootvg           active
hdisk1           00cd55a4fg6b676f     None
hdisk2           00c5524409a99b77     None
Here hdisk0 is the rootvg disk. Now we need to find a free disk.
You can use the lspv -free command to list the free, un-mapped disks.
$ lspv -free
NAME            PVID                                SIZE(megabytes)
hdisk1         00cd55a4fg6b676f                     256000
hdisk2         00c5524409a99b77                     256000
In this case, hdisk1 is free and un-mapped, so we are going to use hdisk1 as the mirror of hdisk0.

Add hdisk1 into rootvg:
# extendvg rootvg hdisk1
0516-1254 extendvg: Changing the PVID in the ODM.
Now mirror the disk but defer the automatic reboot:
$ mirrorios -defer hdisk1
Now check the boot list:
$ bootlist -mode normal -ls
hdisk0 blv=hd5 pathid=0
We only have hdisk0 in the boot list at the moment, so we need to add hdisk1 to it:
$ bootlist -mode normal hdisk0 hdisk1
Check that worked:
$ bootlist -mode normal -ls
hdisk0 blv=hd5 pathid=0
hdisk1 blv=hd5 pathid=0
You now have a mirrored rootvg. 
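
The boot list check above can also be done mechanically. The helper below is only a sketch: the function name missing_boot_disks is made up for illustration, and it parses text in the format shown above, one disk per line with the disk name in the first column.

```shell
# Reads a boot list listing on stdin and prints every expected disk
# (given as arguments) that does not appear in it.
missing_boot_disks() {
    listing=$(cat)
    for disk in "$@"; do
        echo "$listing" |
            awk -v d="$disk" '$1 == d { found = 1 } END { exit !found }' ||
            echo "$disk"
    done
}

# Example (not executed here), from a root shell after oem_setup_env:
#   bootlist -m normal -o | missing_boot_disks hdisk0 hdisk1
```

No output means that every disk you named is in the boot list.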

Saturday, 11 October 2014

Run VIO commands from the HMC using "viosvrcmd" without VIOs Passwords

Recently we got into a situation where we did not know the padmin or root password of a VIOS but needed to run commands on it.

I found an interesting HMC command called "viosvrcmd", which enables us to run commands on the VIOS through the HMC.
viosvrcmd -m managed-system {-p partition-name | --id partition-ID} -c "command" [--help]
Description: viosvrcmd issues an I/O server command line interface (ioscli) command to a virtual I/O server partition.

The ioscli commands are passed from the Hardware Management Console (HMC) to the virtual I/O server partition over an RMC session.

RMC does not allow interactive execution of ioscli commands.
-m    Name of the managed system that hosts the VIOS

-p    VIOS partition name

--id  The partition ID of the VIOS

Note: You must either use this option to specify the ID of the partition, or use the -p option to specify the partition's name. The --id and the -p options are mutually exclusive.

-c    The I/O server command line interface (ioscli) command to issue to the virtual I/O server partition.

Note: The command must be enclosed in double quotes. Also, the command cannot contain the semicolon (;), greater than (>), or vertical bar (|) characters.

--help  Display the help text for this command and exit.
Here is an example:
hscroot@umhmc:~> viosvrcmd -m umfrm570 -p umvio1 -c "ioslevel"
2.2.0.0
Since the command cannot contain ; or > or |, if you need to process the output with filters, apply them after the closing quote, so that they run on the HMC.
hscroot@umhmc:~> viosvrcmd -m umfrm570 -p umvio1 -c "lsdev -virtual" | grep vfchost0
vfchost0         Available   Virtual FC Server Adapter

What if you want to run a command as root (oem_setup_env)?

Here is a method I found on the internet:
hscroot@umhmc:~> viosvrcmd -m umfrm570 -p umvio1 -c "oem_setup_env
> whoami"
root

You can also run it in one shot, like below:

hscroot@umhmc:~> viosvrcmd -m umfrm570 -p umvio1 -c "oem_setup_env\n whoami"
root
If you need to run multiple commands, assign them to a variable and pass the variable as the command parameter.
hscroot@umhmc:~>command=`printf  "oem_setup_env\nchsec -f /etc/security/lastlog -a unsuccessful_login_count=0 -s padmin"`

hscroot@umhmc:~>viosvrcmd -m umfrm570 -p umvio1 -c "$command"
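
The printf trick above generalises to any number of root commands. The sketch below wraps it in a small function. The name build_root_cmd is made up for illustration, and I have not verified that the HMC restricted shell accepts function definitions, so treat it as something to build the string on a workstation if the HMC refuses it.

```shell
# build_root_cmd CMD...
# Prints "oem_setup_env" followed by each argument on its own line,
# which is the multi-line form passed to viosvrcmd -c above.
build_root_cmd() {
    printf 'oem_setup_env'
    for c in "$@"; do
        printf '\n%s' "$c"
    done
}

# Example (not executed here):
#   command=$(build_root_cmd "whoami" "lsdev -Cc disk")
#   viosvrcmd -m umfrm570 -p umvio1 -c "$command"
```

Each argument must still avoid the ; > and | characters, since the restriction applies to the whole command string.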

Friday, 5 September 2014

Getting "Server refused to allocate pty" upon login attempt

Problem(Abstract)

You are unable to log into AIX because the maximum number of pseudo-terminals have already been allocated.

Symptom

An attempt to log into AIX via telnet or ssh results in this error:

"Server refused to allocate pty"

- You have increased the maximum number of ptys but you still see the problem.
- Each time you log in, the pty number increases and the pty numbers are not getting released and re-used.

Diagnosing the problem

The symptoms may indicate that there is an application that is holding on to ptys and not releasing them.

Try using the 'fuser' command to find the culprit application, like this:
# cd /dev/pts
# fuser *
The 'fuser' command will list all PIDs associated with each pty device.

If there is a process that is not releasing its ptys, you will see its PID occur many times in the fuser output above.
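
Spotting the repeated PID by eye is hard when hundreds of ptys are allocated, so a per-PID count helps. The filter below is a sketch under the assumption that fuser writes the PIDs to standard output and the device names to standard error, which is the usual behaviour; the function name count_pids is made up for illustration.

```shell
# Reads fuser output on stdin and prints "count PID", highest count first.
# Every run of digits is treated as a PID; suffix letters such as "c" are dropped.
count_pids() {
    tr -cs '0-9' '\n' | grep . | sort | uniq -c | sort -rn
}

# Example (not executed here):
#   cd /dev/pts && fuser * 2>/dev/null | count_pids | head
```

The PID at the top of the list is the first candidate to investigate.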

Resolving the problem

Restarting the application that you diagnosed above should release all the ptys held by that application. Contact the application vendor support to see if there is a patch or configuration for the problem.

Saturday, 26 July 2014

PowerHA/HACMP: Moving a Resource Group (RG) from One Node to Another

This post covers moving a resource group (RG) from one node to another in PowerHA.
Here are the steps:

1) Extending the PATH variable with cluster paths

Sometimes the cluster paths are not included in the default PATH. Run the command below if you are not able to run the cluster commands directly.
export PATH=$PATH:/usr/es/sbin/cluster:/usr/es/sbin/cluster/utilities:/usr/es/sbin/cluster/sbin:/usr/es/sbin/cluster/cspoc

2) Check whether the cluster services are up on the destination node

#clshowsrv -v
Status of the RSCT subsystems used by HACMP:
Subsystem         Group            PID          Status
 topsvcs          topsvcs          278684       active
 grpsvcs          grpsvcs          332026       active
 grpglsm          grpsvcs                       inoperative
 emsvcs           emsvcs           446712       active
 emaixos          emsvcs           294942       active
 ctrmc            rsct             131212       active

Status of the HACMP subsystems:
Subsystem         Group            PID          Status
 clcomdES         clcomdES         204984       active
 clstrmgrES       cluster          86080        active

Status of the optional HACMP subsystems:
Subsystem         Group            PID          Status
 clinfoES         cluster          360702       active

3) Check the availability of resource group

# clRGinfo
-----------------------------------------------------------------------------
Group Name     Type           State      Location
-----------------------------------------------------------------------------
UMRG1            non-concurrent OFFLINE    umhaserv1
                                ONLINE     umhaserv2
#

4) Move the resource group using the command below

==>  clRGmove -g <RG> -n  <node> -m

# clRGmove -g UMRG1 -n umhaserv1 -m
Attempting to move group UMRG1 to node umhaserv1.
Waiting for cluster to process the resource group movement request....
Waiting for the cluster to stabilize..................
Resource group movement successful.
Resource group UMRG1 is online on node umhaserv1.

You can also use the smitty path:

smitty cl_admin => HACMP Resource Group and Application Management => Move a Resource Group to Another Node / Site

5) Verify the RG movement

# clRGinfo
-----------------------------------------------------------------------------
Group Name     Type           State      Location
-----------------------------------------------------------------------------
UMRG1          non-concurrent   ONLINE     umhaserv1
                                OFFLINE    umhaserv2
#
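
If you want to verify the move from a script rather than by reading the table, the node holding the group can be extracted from the clRGinfo output. This is a sketch that assumes the column layout shown above, with the state and node name as the last two columns; the function name rg_online_node is made up for illustration. With several resource groups it prints one node per ONLINE line.

```shell
# Reads clRGinfo output on stdin and prints the node of every ONLINE entry.
rg_online_node() {
    awk 'NF >= 2 && $(NF-1) == "ONLINE" { print $NF }'
}

# Example (not executed here):
#   clRGinfo | rg_online_node
```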

Thursday, 24 July 2014

Editing the /etc/inittab File in Maintenance Mode

Problem(Abstract)

This technote describes a technique for creating a minimal /etc/inittab file if no other tools are available.

Symptom

System hangs or crashes at boot time.

Cause

A bad entry in the /etc/inittab is keeping the system from booting properly.

Resolving the problem

Ordinarily if there is a problem with one or more entries in the /etc/inittab the preferred method of editing it is:

1. Boot into Maintenance Mode off AIX install CDs, mksysb, or NIM
2. Access the rootvg and start a shell with the filesystems mounted.
3. Edit /etc/inittab down to a minimum of 3 lines:
init:2:initdefault:
brc::sysinit:/sbin/rc.boot 3 >/dev/console 2>&1 # Phase 3 of system boot
cons:0123456789:respawn:/usr/sbin/getty /dev/console
In cases where the rootvg filesystems cannot be mounted automatically (for example the CD media is a different Technology Level than what exists on hard disk; or the filesystems for some reason won't automatically mount), commands such as the "vi" editor won't be available to edit the inittab.

In this case a hard-luck method can be used to create a minimal inittab.
1. Boot into Maintenance Mode and choose Option 2 "Access rootvg and start a shell before mounting filesystems".

2. Once in Maintenance Mode, fsck all rootvg filesystems necessary:
# fsck /dev/hd1
# fsck /dev/hd2
# fsck /dev/hd3
# fsck /dev/hd4
# fsck /dev/hd9var

3. Mount root on a temporary mount point:
# mount /dev/hd4 /mnt

4. Copy the bad inittab to a backup:
# cd /mnt/etc
# mv inittab inittab.bad

5. Use grep to create a minimal new inittab:
# grep "init:" inittab.bad > inittab (adds both the init: and brc: entries)
# grep "^cons:" inittab.bad >> inittab (adds the cons: entry)

6. Reboot using the new inittab:
# sync; sync; sync
# cd /
# umount /mnt

Power cycle the system from the front panel or HMC.
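
To see what the two grep commands in step 5 produce, they can be tried on a scratch copy first. The sketch below builds a sample inittab.bad in a temporary directory and rebuilds a minimal inittab from it. The function name make_minimal_inittab and the "myapp" entry are hypothetical, and the sample file is far shorter than a real /etc/inittab.

```shell
# make_minimal_inittab BAD_FILE NEW_FILE
# Same two greps as in step 5: "init:" matches both the init: entry and the
# brc::sysinit: entry, and "^cons:" adds the console getty.
make_minimal_inittab() {
    grep "init:" "$1" > "$2"
    grep "^cons:" "$1" >> "$2"
}

work=$(mktemp -d)
cat > "$work/inittab.bad" <<'EOF'
init:2:initdefault:
brc::sysinit:/sbin/rc.boot 3 >/dev/console 2>&1 # Phase 3 of system boot
rc:23456789:wait:/etc/rc 2>&1 | alog -tboot > /dev/console
myapp:2:wait:/opt/myapp/bin/hangs_at_boot
cons:0123456789:respawn:/usr/sbin/getty /dev/console
EOF

make_minimal_inittab "$work/inittab.bad" "$work/inittab"
cat "$work/inittab"
```

The result holds only the init:, brc: and cons: lines. Check the output on your own file before rebooting, because any other entry whose text contains "init:" would be kept as well.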

How to enable the Name Service cache Daemon (NSCD)

Question

How do you enable NSCD to improve the performance of the hostname, password, name and group lookup which is frequently being done by IBM Rational ClearCase?

Cause

By enabling the Name Service cache Daemon (NSCD) of the operating system, a significant performance improvement can be achieved when using naming services like DNS, NIS, NIS+, LDAP.

Answer

Benefit of name service cache daemon (NSCD) for ClearCase

Example:

Without NSCD:
[user@host]$ time cleartool co -nc "/var/tmp/file"
Checked out "/var/tmp/file" from version "/main/10".
real    0m3.355s
user    0m0.020s
sys     0m0.018s
With NSCD:
[user@host]$ time cleartool co -nc "/var/tmp/file"
Checked out "/var/tmp/file" from version "/main/11".
real    0m0.556s
user    0m0.021s
sys     0m0.016s
Enabling NSCD
Solaris:
/etc/init.d/nscd start

Linux:
service nscd start

AIX:
startsrc -s netcd
Note: In addition to having nscd started it is mandatory to be sure this service will be started after a reboot. For instance on Red Hat and SuSE you can run:
chkconfig nscd  on
For more details on how to configure and or enable NSCD refer to your respective operating system vendor's manpage.

Note that this service is not yet available on HP-UX platforms.

Tuesday, 24 June 2014

AIX RC Scripts

Some applications need to be stopped and started gracefully, without manual intervention, during reboots. To serve this purpose, we use rc scripts in all Unix flavors, including AIX.

So, how do rc scripts work:
  1. Write a single script, put it into /etc/rc.d/init.d, make sure the script accepts a single parameter of start or stop and does the right thing.
  2. In /etc/rc.d/rc2.d create a link (ln -s) to the script in init.d called Sxxname where xx is a number that dictates where in comparison to other scripts in the directory your script will execute (lower number first).
  3. In /etc/rc.d/rc2.d create a link to the script in init.d called Kxxname where xx is a number which dictates when the script is run to stop your app in comparison to other scripts in the directory (lower number first).
Note: It is just a convention to place scripts in /etc/rc.d/init.d and create soft links in /etc/rc.d/rc2.d. It is not mandatory to keep the scripts in /etc/rc.d/init.d.

Example RC Script:

#!/usr/bin/ksh

ulimit -c 0

case "$1" in
start )
        ps -ef | grep -v grep | grep myengine > /dev/null
        ret=$?
        if [ $ret -gt 0 ]; then
                /var/myengine/bin/startup.sh
        fi
        ;;
stop )
        PID=$$
        for i in myengine-app1 myengine-app2 myengine-app3 myengine-app4; do
                ps -ef | grep $i | grep -v grep | awk '{print $2}' >> /tmp/myengine.$PID
        done
        while read line; do
                kill $line
        done < /tmp/myengine.$PID
        rm /tmp/myengine.$PID
        ;;
* )
        echo "Usage: $0 (start | stop)"
        exit 1
esac

Example Creating Symbolic Links

This is an example of creating symbolic links for automatic startup of Tivoli. Tivoli should start first (meaning a low Sxx) and stop last (meaning a high Kxx):
umadmin@umserve1:/etc/rc.d/rc2.d>sudo ln -s /etc/rc.d/init.d/rc.tivoli S20tivoli
umadmin@umserve1:/etc/rc.d/rc2.d>sudo ln -s /etc/rc.d/init.d/rc.tivoli K70tivoli
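
The effect of the numbers in the link names can be tried safely in a scratch directory. The sketch below mimics the init.d/rc2.d layout under a temporary directory with two dummy scripts; all names in it (rc.myapp, rc.otherapp, rc_root) are made up for illustration. Scripts run in the sorted order of the link names, which is what the shell glob returns.

```shell
rc_root=$(mktemp -d)
mkdir "$rc_root/init.d" "$rc_root/rc2.d"

# Two dummy rc scripts that just report how they were called.
for app in myapp otherapp; do
    printf '#!/bin/sh\necho "rc.%s: $1"\n' "$app" > "$rc_root/init.d/rc.$app"
    chmod +x "$rc_root/init.d/rc.$app"
done

# myapp starts first (low S number) and stops last (high K number).
ln -s "$rc_root/init.d/rc.myapp"    "$rc_root/rc2.d/S20myapp"
ln -s "$rc_root/init.d/rc.myapp"    "$rc_root/rc2.d/K70myapp"
ln -s "$rc_root/init.d/rc.otherapp" "$rc_root/rc2.d/S90otherapp"
ln -s "$rc_root/init.d/rc.otherapp" "$rc_root/rc2.d/K10otherapp"

start_order=$(cd "$rc_root/rc2.d" && echo S*)
stop_order=$(cd "$rc_root/rc2.d" && echo K*)
echo "start order: $start_order"
echo "stop order:  $stop_order"
```

Here the start order is S20myapp then S90otherapp, and the stop order is K10otherapp then K70myapp.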

Thursday, 19 June 2014

How to Convert OpenSSH to SSH2 and vice versa

The program SSH (Secure Shell) provides an encrypted channel for logging into another computer over a network, executing commands on a remote computer, and moving files from one computer to another. SSH provides strong host-to-host and user authentication as well as secure encrypted communications over the Internet.

SSH2 is a more secure, efficient, and portable version of SSH.

Connecting two servers that run different types of SSH can be a daunting task if you do not know how to convert the keys. In this article, we are going to learn how to convert keys between SSH (OpenSSH) and SSH2.

How to generate an OpenSSH key:

umadm@umixserv1 [/home/umadm/.ssh]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/umadm/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/umadm/.ssh/id_rsa.
Your public key has been saved in /home/umadm/.ssh/id_rsa.pub.
The key fingerprint is:
5b:ac:ea:c3:25:cf:2d:31:a2:aa:83:76:4b:a2:c9:eb umadm@umixserv1
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|                 |
|         .       |
|        S o      |
|. o   . .+       |
|+o o + oo        |
|Bo.   =.         |
|#Eo..oo.         |
+-----------------+
umadm@umixserv1 [/home/umadm/.ssh]$
Here we get two key files under the $HOME/.ssh directory: the private key (id_rsa) and the public key (id_rsa.pub).

You can generate a DSA key by using the command below.
# ssh-keygen -t dsa

Convert SSH2 to  OpenSSH(SSH):


The command below can be used to convert an SSH2 private key into the OpenSSH format:
ssh-keygen -i -f path/to/private.key > path/to/new/opensshprivate.key
The command below can be used to convert an SSH2 public key into the OpenSSH format:
ssh-keygen -i -f path/to/publicsshkey.pub > path/to/publickey.pub
Here, -i tells ssh-keygen to read an SSH2 key and convert it into the OpenSSH format.

Convert OpenSSH(SSH) to SSH2:

The reverse process converts an OpenSSH key into the SSH2 format, in case a client application requires that format. This can be done with the following commands. Note that ssh-keygen -e always prints the public key, even when it is given a private key file.

From an OpenSSH private key file (the output is the SSH2 public key):
ssh-keygen -e -f path/to/opensshprivate.key > path/to/ssh2publickey.pub
OpenSSH to SSH2 public key conversion:
ssh-keygen -e -f path/to/publickey.pub > path/to/ssh2publickey.pub
Here, -e tells ssh-keygen to read an OpenSSH key file and export the key in SSH2 format.

Note:If you need passwordless authentication  b/w two different hosts , you need to convert the publickey as per the destination server SSH version and  append the public key to   ~/.ssh/authorized_keys or  ~/.ssh2/authorized_keys at destination server.
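
A quick way to gain confidence in the conversion is a round trip: export a public key to SSH2 format with -e, import it back with -i, and compare it with the original. The sketch below does that with a throwaway key in a temporary directory. The function name ssh2_roundtrip is made up for illustration, and it assumes an OpenSSH ssh-keygen is on the PATH.

```shell
# Generates a throwaway RSA key, converts the public key OpenSSH -> SSH2 -> OpenSSH,
# and prints "match" if the key type and key data survive the round trip.
ssh2_roundtrip() {
    tmp=$(mktemp -d) || return 1
    ssh-keygen -q -t rsa -b 2048 -N '' -C demo -f "$tmp/id_rsa" || return 1
    ssh-keygen -e -f "$tmp/id_rsa.pub" > "$tmp/id_rsa_ssh2.pub"      # OpenSSH -> SSH2
    ssh-keygen -i -f "$tmp/id_rsa_ssh2.pub" > "$tmp/roundtrip.pub"   # SSH2 -> OpenSSH
    # Compare key type and base64 data only; the comment field is not preserved.
    original=$(awk '{ print $1, $2 }' "$tmp/id_rsa.pub")
    converted=$(awk '{ print $1, $2 }' "$tmp/roundtrip.pub")
    if [ "$original" = "$converted" ]; then echo match; else echo differ; fi
    rm -rf "$tmp"
}

# Example (not executed here):
#   ssh2_roundtrip
```

The intermediate file id_rsa_ssh2.pub is what you would hand to a server that expects the SSH2 format.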

Sunday, 8 June 2014

How to Remove a Virtual SCSI Disk

This document describes the procedure to remove a virtual disk in a volume group on a Virtual I/O Client, to map the virtual scsi disk to its corresponding backing device, and to remove the backing device from the Virtual I/O Server.  Please, read the entire document before proceeding.

This document applies to AIX version 5.3 and above.

In a Virtual I/O environment, the physical devices are allocated to the VIO server.  When there is a hardware failure (disk or adapter may go bad) on the VIO server, unless the VIO server has some type of redundancy, that will have an impact on the VIO client whose virtual disks are being served by the failing device.  The impact may be loss of connectivity to the virtual scsi disks, unless there is some type of redundancy (MPIO or LVM mirroring) on the client partition. 

This document does NOT apply to any of the following environments:
1. If the virtual disk is in a shared volume group (i.e HACMP, etc)
2. If the virtual disk is part of rootvg volume group.

 Removing a Physical Volume from a Volume Group

 The following steps are needed to remove a virtual disk from the VIO client, and they are later discussed in more detail:

1. Deallocate all the physical partitions associated with the physical volume in the volume group.
2. Remove the physical volume from the volume group
3. Map the virtual scsi disk on the VIO client partition to the backing device on the VIO server.
4. Remove the virtual scsi disk definition from the device configuration database.
5. Remove the backing device.

At this point, a new virtual scsi can be added to the VIO client in place of the virtual disk that was removed in the case where this procedure was done as a result of a hardware failure on the VIO server partition.

 1. Deallocating the physical partitions

 In the following procedure, we will be using hdisk4 in the example, as the virtual scsi disk wanting to be removed from the VIO client.

First, we need to determine the logical volumes defined on the physical volume we want to remove. This can be done by running:

# lspv -l hdisk#            
where hdisk# is the virtual scsi disk to be removed.

Example:

# lspv -l hdisk4
hdisk4:
LV NAME          LPs      PPs      DISTRIBUTION       MOUNT POINT
fslv00               2          2          00..02..00..00..00    /test
loglv00             1          1          00..01..00..00..00    N/A
rawlv                 30        30         00..30..00..00..00    N/A

If the hdisk name no longer exists, and the disk is identifiable only by its 16-digit PVID (you might see this from the output of lsvg -p <VGname>), substitute the PVID for the disk name. For example:

# lspv -l 00c2b06ef8a9f98a

You may receive the following error:
     0516-320 : Physical volume 00c2b06ef8a9f98a is not assigned to
     a volume group.
If so, run the following command:
# putlvodm -p `getlvodm -v <VGname>` <PVID>
VGname refers to your volume group, PVID refers to the 16-digit physical volume identifier, and the characters around the getlvodm command are grave marks, the backward single quote mark. The lspv -l <PVID> command should now run successfully.  To determine the VGname associated with that physical volume use lspv hdisk#.
If another disk in the volume group has space to contain the partitions on this disk, and the virtual scsi disk to be replaced has not completely failed, the migratepv command may be used to move the used PPs on this disk. See the man page for the migratepv command on the steps to do this.
If the partitions cannot be migrated, they must be removed. The output of the lspv -l <hdisk#>, or lspv -l <PVID>, command indicates what logical volumes will be affected. Run the following command on each LV:
# lslv <LVname>
The COPIES field shows if the LV is mirrored. If so, remove the failed copy with:

# rmlvcopy <LVname> 1 <hdisk#>
hdisk# refers to all the disks in the copy that contain the failed disk. A list of drives can be specified with a space between each. Use the lslv -m <LVname> command to see what other disks may need to be listed in the rmlvcopy command. If the disk PVID was previously used with the lspv command, specify that PVID in the list of disks given to the rmlvcopy command.  The unmirrorvg command may be used in lieu of the rmlvcopy command. See the man pages for rmlvcopy and unmirrorvg, for additional information.
If the logical volume is not mirrored, the entire logical volume must be removed, even if just one physical partition resides on the drive to be replaced and cannot be migrated to another disk. If the unmirrored logical volume is a JFS or JFS2 file system, unmount the file system and remove it. Enter:
# umount /<FSname>
# rmfs /<FSname>

If the unmirrored logical volume is a paging space, see if it is active. Enter:
# lsps -a

If it is active, set it to be inactive on the next reboot.  Enter:
# chps -a n <LVname>

Then deactivate it and remove it by entering:
# swapoff /dev/<LVname>
# rmps <LVname>

Remove any other unmirrored logical volume with the following command:
# rmlv <LVname> 

2. Remove the physical volume from the volume group.

 In the case where the virtual scsi disk to be replaced is the only physical volume in the volume group, then remove the volume group, via:

# exportvg <VGname>

This will deallocate the physical partitions and will free up the virtual disk.  Then, remove the disk definition, as noted in step 4.

In the case where there is more than one physical volume, use either the PVID or the hdisk name, depending on which was used when running lspv -l in the preceding discussion, and run one of the following:

# reducevg <VGname> <hdisk#>
# reducevg <VGname> <PVID>

If you used the PVID value and if the reducevg command complains that the PVID is not in the device configuration database, run the following command to see if the disk was indeed successfully removed:

# lsvg -p <VGname>

If the PVID or disk is not listed at this point, then ignore the errors from the reducevg command.

3. How to map the virtual scsi disk (on the client partition) to the physical disk (on the server partition)

 In the following example, we are going to determine the mapping of virtual scsi disk, hdisk4

On the VIO client:

The following command shows the location of hdisk4:

# lscfg -vl hdisk4
  hdisk4           U9117.570.102B06E-V1-C7-T1-L810000000000  Virtual SCSI Disk Drive

where V1 is the LPAR ID (in this case 1), C7 is the slot# (in this case 7), and L81 is the LUN ID. 
Take note of these values.

Next, determine the client SCSI adapter name, by ‘grep’ing for the location of hdisk4's parent adapter, in this case, V1-C7-T1:

# lscfg -v|grep V1-C7-T1
  vscsi4           U9117.570.102B06E-V1-C7-T1                Virtual SCSI Client Adapter
        Device Specific.(YL)........U9117.570.102B06E-V1-C7-T1
  hdisk4           U9117.570.102B06E-V1-C7-T1-L810000000000  Virtual SCSI Disk Drive

where vscsi4 is the client SCSI adapter.
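
The three values to note can be pulled out of the location code with a small filter. This is a sketch that assumes the location code has the -V<lpar>-C<slot>-T<port>-L<lun> shape shown above; the function name parse_vscsi_loc is made up for illustration.

```shell
# parse_vscsi_loc LOCATION_CODE
# Prints the LPAR ID, slot number and LUN contained in a virtual SCSI disk
# location code, or nothing if the code does not have the expected shape.
parse_vscsi_loc() {
    echo "$1" | sed -n \
        's/.*-V\([0-9][0-9]*\)-C\([0-9][0-9]*\)-T[0-9][0-9]*-L\([0-9A-Fa-f][0-9A-Fa-f]*\)$/lpar_id=\1 slot=\2 lun=\3/p'
}

parse_vscsi_loc U9117.570.102B06E-V1-C7-T1-L810000000000
```

For hdisk4 this prints lpar_id=1 slot=7 lun=810000000000.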

On the HMC:

Run the following command to obtain the LPAR name associated with the LPAR ID

# lshwres -r virtualio --rsubtype scsi -m <Managed System Name> --level lpar

To get the managed system name, run
# lssyscfg -r sys -F name

Then, look for the "lpar_id" and "slot_num" noted earlier.  In our case, the VIO client lpar id is 1 and the slot # is 7.

In the following example, the managed system name is Ops-Kern-570.  The VIO client partition name is kern1.
The VIO Server partition name is reg33_test_vios.

# lshwres -r virtualio --rsubtype scsi -m Ops-Kern-570 --level lpar
...
lpar_name=kern1,lpar_id=1,slot_num=7,state=1,is_required=0,adapter_type=client,
remote_lpar_id=11,remote_lpar_name=reg33_test_vios,remote_slot_num=23,backing_devices=none
...
Take note of the remote_lpar_id (11) and the remote_slot_num (23).  Then, in the same output, look for a line that corresponds to lpar_id 11, slot # 23:
...
lpar_name=reg33_test_vios,lpar_id=11,slot_num=23,state=1,is_required=0,adapter_type=server,
remote_lpar_id=any,remote_lpar_name=,remote_slot_num=any,backing_devices=none
...
So in this case, VIO server reg33_test_vios is serving virtual scsi disk, hdisk4, on the VIO client, kern1.
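
On a system with many partitions, the matching line is easier to find with a filter than by eye. The sketch below assumes that lshwres prints one comma-separated record per adapter on a single line (the records above are wrapped for display); the function name find_server_adapter is made up for illustration. It prints the server side in the V<lpar>-C<slot> form that is used with lsmap on the VIO server.

```shell
# find_server_adapter CLIENT_LPAR_ID CLIENT_SLOT
# Reads "lshwres -r virtualio --rsubtype scsi ... --level lpar" output on stdin
# and prints V<remote_lpar_id>-C<remote_slot_num> for the matching client adapter.
find_server_adapter() {
    awk -F, -v id="$1" -v slot="$2" '
    {
        split("", kv)                       # clear the key/value map
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n) kv[substr($i, 1, n - 1)] = substr($i, n + 1)
        }
        if (kv["lpar_id"] == id && kv["slot_num"] == slot &&
            kv["adapter_type"] == "client")
            print "V" kv["remote_lpar_id"] "-C" kv["remote_slot_num"]
    }'
}

# Example (not executed here), on the HMC or on saved output:
#   lshwres -r virtualio --rsubtype scsi -m Ops-Kern-570 --level lpar | find_server_adapter 1 7
```

For the example in this document the result is V11-C23.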
            
On the VIO Server:

Go to the VIO Server associated with the LPAR ID obtained in the previous step, in our case reg33_test_vios.
As padmin, run the following command to display the mapping, which should match the mapping obtained from the HMC obtained above.

$ lsmap -all|grep <VIO server lpar ID>-<VIOS slot#>

For example,
$ lsmap -all|grep V11-C23
where V11 is the VIO server lpar_id and C23 is the slot #

The cmd will return something similar to

vhost21         U9117.570.102B06E-V11-C23                    0x00000001

In this case, vhost21 is the server SCSI adapter mapped to our VIO client lpar id 1 (0x00000001).

Next, list the mapping for the vhost# obtained previously.

$ lsmap -vadapter vhost21
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost21         U9117.570.102B06E-V11-C23                    0x00000001

VTD                   virdisk01
LUN                   0x8100000000000000
Backing device        clientlv01
Physloc

Take note of the VTD and Backing device name.  In this case, the backing device mapped to virtual scsi disk, hdisk4, is logical volume, clientlv01, and it is associated with Virtual Target Device, virdisk01.

4. Remove the virtual scsi disk definition from the device configuration database on the VIO client

 To remove the vscsi definition, run

# rmdev -dl hdisk#

Ensure you know the backing device associated with the virtual scsi disk being removed prior to issuing the rmdev command.  That information will be needed in order to do clean up on the server partition.  Refer to the section "How to map the virtual scsi disk (on the client partition) to the physical disk (on the server partitions)".

 5. Remove the backing device on the VIO server

 The peripheral device types or backing devices currently supported are
· logical volume
· physical volume
· optical device starting at v1.2.0.0-FP7 (but not currently supported on System i)

Prior to removing the backing device, the virtual target device must be removed first. To do so, run the following as padmin:

$ rmdev -dev <VTD name>
$ rmlv <LVname>

or you can remove both the VTD and logical volume in one command by running:

$ rmvdev -vtd <VTD name> -rmlv

In the case where the backing device is a physical volume, then, removing the virtual target device completes this document.

If you need to determine the physical device and volume group that the logical volume belongs to, you can issue the following commands prior to running rmlv or rmvdev.
$ lslv -pv <LVname>    List the physical volume that the logical volume specified resides on.
$ lslv <LVname>          Shows the characteristics of the logical volume, including the volume group name, # of mirrored copies, etc.

In our example, the backing device is a logical volume, clientlv01, and it resides on the physical device, hdisk3:

$ lslv -pv clientlv01
clientlv01:N/A
PV                COPIES        IN BAND       DISTRIBUTION 
hdisk3            080:000:000   100%          000:080:000:000:000

$ rmdev -dev virdisk01
virdisk01 deleted

$ rmlv clientlv01
Warning, all data contained on logical volume clientlv01 will be destroyed.
rmlv: Do you wish to continue? y(es) n(o)? y
rmlv: Logical volume clientlv01 is removed.

Related Documentation

Virtual I/O Server Website
http://www14.software.ibm.com/webapp/set2/sas/f/vios/home.html

Relevant Links in Documentation Tab:
http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/home.html
· IBM System p Advanced POWER Virtualization Best Practices Redbook
· IBM System Hardware Information Center
· VIOS Commands Reference

Saturday, 7 June 2014

AIX NFS Error - RPC: 1832-010 Authentication error fixing



AIX NFS Error and Solution - RPC: 1832-010 Authentication error
[root-umserv1][/]> mount umserv2:/repos /mymnt
mount: 1831-008 giving up on:
umserv2:/repos
vmount: The file access permissions do not allow the specified action.
NFS fsinfo failed for server umserv2: error 7 (RPC: 1832-010 Authentication error)
To fix this issue, check the "nfs_use_reserved_ports" value; if it is 0, set it to 1. The example below also enables "portcheck".
[root-umserv1][/]> nfso -a | grep port
portcheck = 0
nfs_use_reserved_ports = 0

[root-umserv1][/]> nfso -po portcheck=1
Setting portcheck to 1
Setting portcheck to 1 in nextboot file

[root-umserv1][/]> nfso -po nfs_use_reserved_ports=1
Setting nfs_use_reserved_ports to 1
Setting nfs_use_reserved_ports to 1 in nextboot file

[root-umserv1][/]> mount umserv2:/repos /mymnt
[root-umserv1][/]>
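
The check can also be scripted so that it only reports the tunables that still need changing. This is a rough sketch, assuming the "name = value" layout of the nfso -a output shown above; the function name nfs_port_tunables_off is my own and not part of AIX.

```shell
#!/bin/sh
# nfs_port_tunables_off (hypothetical helper): read "nfso -a" output on stdin
# and print the names of the two port-related tunables from this post that
# are still set to 0.
nfs_port_tunables_off() {
    awk -F'=' '
        {
            name = $1; value = $2
            gsub(/[ \t]/, "", name)     # strip padding around the name
            gsub(/[ \t]/, "", value)    # strip padding around the value
        }
        (name == "portcheck" || name == "nfs_use_reserved_ports") && value == "0" {
            print name
        }'
}
```

Usage would be "nfso -a | nfs_port_tunables_off"; an empty result means both values are already 1.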

Monday, 19 May 2014

Simple Script to Document LVM info in AIX

Friends, here is a small script that pulls the AIX LVM information of a server. If you want the output in a file, redirect the output of the script to a file. This is most useful when you are doing system reboots and upgrades.

Script:

#!/bin/ksh
#
# Simple script to document LVM configurations.
#
exec 2>&1
printf "AIX DISK AND LVM INFORMATION\n"
printf "*********************************************************\n"

printf "\nDF\n"
printf "==========================\n"
df -k

printf "\nVOLUME GROUPS:\n"
printf "==========================\n"
lsvg

printf "\n\nPHYSICAL VOLUMES:\n"
printf "==========================\n"
lspv

printf "\n\nPVs BY VOLUME GROUP\n"
printf "==========================\n"
lsvg | while read VG; do
    VGLIST="$VGLIST $VG"
    printf "\n$VG\n"
    printf "--------------------------\n"
    lspv | grep $VG
done

printf "\n\nPV INFORMATION:\n"
printf "==========================\n"
# Read only the disk name; the rest of each lspv line (PVID, VG, state) goes to JUNK
lspv | while read PV JUNK; do
    printf "\n$PV\n"
    printf "--------------------------\n"
    lspv $PV
done

printf "\n\nVG INFORMATION\n"
printf "==========================\n"
for VG in $VGLIST; do
    printf "\n$VG\n"
    printf "--------------------------\n"
    lsvg $VG
    lsvg -l $VG
done

printf "\n\nLV INFORMATION\n"
printf "==========================\n"
for VG in $VGLIST; do
    printf "\nVolume Group: $VG\n"
    printf "--------------------------\n"
    lsvg -l $VG | egrep -v "^$VG:" | egrep -v "^LV NAME" | while read LV JUNK; do
        printf "\nLogical Volume: $LV\n"
        printf "--------------------------\n"
        lslv $LV
    done
done

Thursday, 8 May 2014

Tivoli System Automation (TSA) Overview

Introduction

The purpose of this guide is to introduce Tivoli® System Automation for Multiplatforms and provide a quick-start, purpose-driven approach to users that need to use the software, but have little or no past experience with it.

This guide describes the role that TSA plays within IBM’s Smart Analytics System solution and the commands that can be used to manipulate the application. Further, some basic problem diagnosis techniques will be discussed, which may help with minor issues that could be experienced during regular use.
When the Smart Analytics system is built with High Availability, TSA is automatically installed and configured by the ATK. Therefore, this guide will not describe how to install or configure a TSA cluster (domain) from scratch, but rather how to manipulate and work with an existing environment. To learn to define a cluster of servers, please refer to the References appendix for IBM courses that are available.

Terminology

It is advisable to become familiar with the following terms, since they are used throughout this guide. It will also help you become familiar with the scopes of the different components within TSA.
Table 1. Terminology
Term: Definition
Peer Domain: A cluster of servers, or nodes, for which TSA is responsible
Resource: Hardware or software that can be monitored or controlled. These can be fixed or floating. Floating resources can move between nodes.
Resource group: A virtual group or collection of resources
Relationships: Describe how resources work together. A start-stop relationship creates a dependency (see below) on another resource. A location relationship applies when resources should be started on the same or different nodes.
Dependency: A limitation on a resource that restricts operation. For example, if resource A depends on resource B, then resource B must be online for resource A to be started.
Equivalency: A set of fixed resources of the same resource class that provide the same functionality
Quorum: A cluster is said to have quorum when it has the capability to form a majority within its nodes. The cluster can lose quorum when there is a communication failure, and sub-clusters form with an even number of nodes.
Nominal State: This can be online or offline. It is the desired state of a resource, and can be changed so that TSA will bring a resource online or shut it down.
Tie Breaker: Used to maintain quorum, even in a split-brain situation (as mentioned in the definition of quorum). A tie-breaker allows sub-clusters to determine which set of nodes will take control of the domain.
Failover: When a failure occurs (typically hardware), which causes resources to be moved from one machine to another machine, the resources are said to have “failed over”

Getting Started

The purpose of TSA in the Smart Analytics system is to manage software and hardware resources, so that in the event of a failure, they can be restarted or moved to a backup system. TSA uses background scripts to check the status of processes and ensure that everything is working correctly. It also uses “heart-beating” between all the nodes in the domain to ensure that every server is reachable. Should a process fail the status check, or a node fail to respond to a heartbeat, TSA will take appropriate action to bring the system back to its nominal state.
Let’s start with the basics. In a Smart Analytics System, the TSA domain includes the DB2 Admin node, the Data nodes, and any Standby/backup nodes. The management server is not part of the domain and TSA commands will not work there. Further, all TSA commands are run as the root user.
The first thing you want to do is check the status of the domain, and start it if required:
    # lsrpdomain
    Name      OpState RSCTActiveVersion MixedVersions TSPort GSPort 
    bcudomain Online  2.5.3.3           No            12347  12348
In this case it’s already started, but if OpState showed “Offline”, the command to start the domain would be:
startrpdomain bcudomain
Notice that the domain name is bcudomain, and it is required for the start command. Likewise, if you want to stop the domain, the command is,
stoprpdomain bcudomain
If TSA is in an unstable state, you can also forcefully shut down the domain using the -f parameter in the stoprpdomain command. However, this is typically not recommended:
stoprpdomain -f bcudomain
You should not stop a domain until all your resources have been properly shut down. If your system uses GPFS to manage the /db2home mount, then you need to manually unmount the GPFS filesystems before you can stop the TSA domain using the following command,
/usr/lpp/mmfs/bin/mmunmount /db2home
Next, you’ll want to check the status of the nodes in the domain. The following command will do this:
        # lsrpnode
        Name      OpState RSCTVersion 
        beluga006 Online  2.5.3.3     
        beluga008 Online  2.5.3.3     
        beluga007 Online  2.5.3.3
You can see that we have 3 nodes in this domain: beluga006, beluga007, and beluga008. This also shows their state. If they are Online, then TSA can work with them. If they are Offline, they are either turned off or TSA cannot communicate with them (and thus unavailable). Nodes don’t always appear in the order that you would expect, so be sure to scan the whole output (in this case, beluga008 shows up before beluga007).
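
Because the nodes are not listed in any particular order, it can help to filter the output instead of scanning it by eye. A minimal sketch, assuming the three-column lsrpnode layout shown above (the function name nodes_not_online is hypothetical):

```shell
#!/bin/sh
# nodes_not_online (hypothetical helper): read "lsrpnode" output on stdin and
# print the name and state of every node whose OpState is not "Online".
# Assumption: columns are Name, OpState, RSCTVersion, as in the example above.
nodes_not_online() {
    awk '$1 != "Name" && NF >= 2 && $2 != "Online" { print $1, $2 }'
}
```

Run as root on a domain node with "lsrpnode | nodes_not_online"; no output means every node is Online.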

Resource Groups

After you have verified that the Domain is started, and all your nodes are Online, you will want to check the status of your resources. TSA manages all resources through resource groups. You cannot start a resource individually through TSA. When you start a resource group however, it will start all resources that belong to that group.
To check the status of your DB2 resources, use the hals command. This gives you a summary of all DB2 partitions in the peer domain, including their primary and backup locations, current location, and failover state.
+===============+===============+===============+==================+==================+===========+
|  PARTITIONS   |    PRIMARY    |   SECONDARY   | CURRENT LOCATION | RESOURCE OPSTATE | HA STATUS |
+===============+===============+===============+==================+==================+===========+
| 0             | dwadmp1x      | dwhap1x       | dwadmp1x         | Online           | Normal    |
| 1,2,3,4       | dwdmp1x       | dwhap1x       | dwdmp1x          | Online           | Normal    |
| 5,6,7,8       | dwdmp2x       | dwhap1x       | dwdmp2x          | Online           | Normal    |
| 9,10,11,12    | dwdmp3x       | dwhap1x       | dwhap1x          | Online           | Failover  |
| 13,14,15,16   | dwdmp4x       | dwhap1x       | dwdmp4x          | Online           | Normal    |
+===============+===============+===============+==================+==================+===========+
In this example, we see that the admin node is dwadmp1x since it holds partition 0. There are 4 data nodes in this system, and all are in Normal state except for data node 3. We can see that data node 3 is in Failover state and its current location is dwhap1x, the backup server.
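
On a larger system the same check can be done with a small filter over the hals table. This is only a sketch under the assumption that the table keeps the six "|"-separated columns shown above; ha_not_normal is a name I made up.

```shell
#!/bin/sh
# ha_not_normal (hypothetical helper): read "hals" output on stdin and print
# partitions, primary, current location and HA status for every row whose
# HA STATUS column is not "Normal".
ha_not_normal() {
    awk -F'|' '
        NF >= 7 {
            # trim the padding inside each table cell
            for (i = 2; i <= 7; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($2 != "PARTITIONS" && $7 != "Normal")
                print $2, $3, $5, $7
        }'
}
```

With the table above, "hals | ha_not_normal" would report the row for partitions 9,10,11,12, which is currently running on dwhap1x.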
The hals command is actually a summary of the complete output. For more detailed information about each resource, use the lssam command. The following output is an example of a cluster with the following nodes:
Admin node:   beluga006
Data node:    beluga007
Standby node: beluga008

# lssam | grep Nominal
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga007-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_1-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_2-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_3-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_4-rg Nominal=Online
Notice that the full output was grepped to “Nominal”. This is a trick to shorten the output so that we only see the Nominal states, and soon you will see that it can get quite long otherwise.
Let’s step through the above output:
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
This first line tells us that we have a resource group named SA-nfsserver-rg and it is Online. The Nominal state is also Online, so it is working as expected. By the name, we can tell that this resource group manages the NFS server resources. Typically, this should always be online.
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
Next we have a resource group called db2_bculinux_NLG_beluga006-rg. This is the resource group belonging to the Admin node. We know that because beluga006 is the hostname for the Admin node. Here, we have 1 DB2 partition (the coordinator partition). For every partition, we define a resource group. You’ll see why shortly. The resource group for the admin partition, partition 0, is called db2_bculinux_0-rg.
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga007-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_1-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_2-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_3-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_4-rg Nominal=Online
Lastly, we have the resource group for our data node, db2_bculinux_NLG_beluga007-rg. Every data node in a Balanced Warehouse has 4 partitions, and they can be easily seen here.
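
The comparison between the observed state and the Nominal state can also be automated. The following is a sketch only: it assumes each resource group line has the form "<state> IBM.ResourceGroup:<name> ... Nominal=<state>" as in the output above, and the function name rg_state_mismatch is hypothetical.

```shell
#!/bin/sh
# rg_state_mismatch (hypothetical helper): read "lssam" output on stdin and
# print every resource group whose observed state differs from its Nominal
# state.
rg_state_mismatch() {
    awk '
        /IBM\.ResourceGroup:/ && /Nominal=/ {
            tag = " IBM.ResourceGroup:"
            p = index($0, tag)
            if (p == 0) next
            observed = substr($0, 1, p - 1)
            sub(/^[^A-Za-z]*/, "", observed)      # drop tree-drawing characters
            rest = substr($0, p + length(tag))
            split(rest, word, " ")                # word[1] is the group name
            match($0, /Nominal=[A-Za-z]+/)
            nominal = substr($0, RSTART + 8, RLENGTH - 8)
            if (tolower(observed) != tolower(nominal))
                print word[1] ": observed=" observed " nominal=" nominal
        }'
}
```

Usage would be "lssam | rg_state_mismatch"; on a healthy system like the one shown here it prints nothing.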
Now, let us examine the full lssam output. Try to find each of the lines from the grepped output in the full output:
# lssam
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
        |- Online IBM.AgFileSystem:shared_db2home
                |- Online IBM.AgFileSystem:shared_db2home:beluga006
                '- Offline IBM.AgFileSystem:shared_db2home:beluga008
        |- Online IBM.AgFileSystem:varlibnfs
                |- Online IBM.AgFileSystem:varlibnfs:beluga006
                '- Offline IBM.AgFileSystem:varlibnfs:beluga008
        |- Online IBM.Application:SA-nfsserver-server
                |- Online IBM.Application:SA-nfsserver-server:beluga006
                '- Offline IBM.Application:SA-nfsserver-server:beluga008
        '- Online IBM.ServiceIP:SA-nfsserver-ip-1
                |- Online IBM.ServiceIP:SA-nfsserver-ip-1:beluga006
                '- Offline IBM.ServiceIP:SA-nfsserver-ip-1:beluga008
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_0-rs
                   |- Online IBM.Application:db2_bculinux_0-rs:beluga006
                   '- Offline IBM.Application:db2_bculinux_0-rs:beluga008
                |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0000-rs
                    |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0000-rs:beluga006
                    '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0000-rs:beluga008
                '- Online IBM.ServiceIP:db2ip_172_16_10_228-rs
                    |- Online IBM.ServiceIP:db2ip_172_16_10_228-rs:beluga006
                    '- Offline IBM.ServiceIP:db2ip_172_16_10_228-rs:beluga008
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga007-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_1-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_1-rs
                    |- Online IBM.Application:db2_bculinux_1-rs:beluga007
                    '- Offline IBM.Application:db2_bculinux_1-rs:beluga008
                '- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0001-rs
                    |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0001-rs:beluga007
                    '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0001-rs:beluga008
        |- Online IBM.ResourceGroup:db2_bculinux_2-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_2-rs
                    |- Online IBM.Application:db2_bculinux_2-rs:beluga007
                    '- Offline IBM.Application:db2_bculinux_2-rs:beluga008
                '- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0002-rs
                    |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0002-rs:beluga007
                    '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0002-rs:beluga008
        |- Online IBM.ResourceGroup:db2_bculinux_3-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_3-rs
                    |- Online IBM.Application:db2_bculinux_3-rs:beluga007
                    '- Offline IBM.Application:db2_bculinux_3-rs:beluga008
                '- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0003-rs
                    |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0003-rs:beluga007
                    '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0003-rs:beluga008
        '- Online IBM.ResourceGroup:db2_bculinux_4-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_4-rs
                    |- Online IBM.Application:db2_bculinux_4-rs:beluga007
                    '- Offline IBM.Application:db2_bculinux_4-rs:beluga008
                '- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0004-rs
                    |- Online IBM.Application:db2mnt-db2fs_bculinux_NODE0004-rs:beluga007
                    '- Offline IBM.Application:db2mnt-db2fs_bculinux_NODE0004-rs:beluga008

Let us take a look at the NFS resource group:


Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
        |- Online IBM.AgFileSystem:shared_db2home
             |- Online IBM.AgFileSystem:shared_db2home:beluga006
             '- Offline IBM.AgFileSystem:shared_db2home:beluga008
The first line was what we had seen before (lssam | grep Nom). Now, we can see what resources actually form the resource group. This first resource is of type AgFileSystem and represents the db2home mount. We can see that it can exist on beluga006 and beluga008, and that it is Online in beluga006 and Offline in beluga008.
Similarly, for the admin node, we can now see the individual resources:
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        '- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
                |- Online IBM.Application:db2_bculinux_0-rs
                   |- Online IBM.Application:db2_bculinux_0-rs:beluga006
                   '- Offline IBM.Application:db2_bculinux_0-rs:beluga008
The first two lines were part of the previous grepped output, but now we can see an Application resource. You can see similar results for the data node and each of its 4 data partitions. The reason that each of these resources exists on two nodes (beluga006 and beluga008) is for high availability. If beluga006 were to fail, TSA would move all the resources that are currently Online there to beluga008. Then, you would see that they are Offline on beluga006, and Online on beluga008. You can see how this output is useful to determine on which nodes the resources exist.
The lssam command also shows Equivalencies as part of the output. I will include it for the sake of completeness, but we will discuss this later on:
Online IBM.Equivalency:SA-nfsserver-nieq-1
        |- Online IBM.NetworkInterface:bond0:beluga006
        '- Online IBM.NetworkInterface:bond0:beluga008
Online IBM.Equivalency:db2_FCM_network
        |- Online IBM.NetworkInterface:bond0:beluga006
        |- Online IBM.NetworkInterface:bond0:beluga007
        '- Online IBM.NetworkInterface:bond0:beluga008
Online IBM.Equivalency:db2_bculinux_0-rg_group-equ
        |- Online IBM.PeerNode:beluga006:beluga006
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_1-rg_group-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_2-rg_group-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_3-rg_group-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_4-rg_group-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_NLG_beluga006-equ
        |- Online IBM.PeerNode:beluga006:beluga006
        '- Online IBM.PeerNode:beluga008:beluga008
Online IBM.Equivalency:db2_bculinux_NLG_beluga007-equ
        |- Online IBM.PeerNode:beluga007:beluga007
        '- Online IBM.PeerNode:beluga008:beluga008
The lssam command also lets you limit the output to a particular resource group, with the -g option:
# lssam -g SA-nfsserver-rg
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
        |- Online IBM.AgFileSystem:shared_db2home
                |- Online IBM.AgFileSystem:shared_db2home:beluga006
                '- Offline IBM.AgFileSystem:shared_db2home:beluga008
        |- Online IBM.AgFileSystem:varlibnfs
                |- Online IBM.AgFileSystem:varlibnfs:beluga006
                '- Offline IBM.AgFileSystem:varlibnfs:beluga008
        |- Online IBM.Application:SA-nfsserver-server
                |- Online IBM.Application:SA-nfsserver-server:beluga006
                '- Offline IBM.Application:SA-nfsserver-server:beluga008
        '- Online IBM.ServiceIP:SA-nfsserver-ip-1
                |- Online IBM.ServiceIP:SA-nfsserver-ip-1:beluga006
                '- Offline IBM.ServiceIP:SA-nfsserver-ip-1:beluga008
With the Smart Analytics System, some new commands were introduced to make it easier to monitor and use TSA with DB2:
Table 2. Useful Commands
Command: Definition
hals: Shows HA status summary for all DB2 partitions
hachknode: Shows the status of the node in the domain and details about the private and public networks
hastartdb2: Starts DB2 partition resources
hastopdb2: Stops DB2 partition resources
hafailback: Moves partitions back to the primary machine specified in the primary_machine argument
hafailover: Moves partitions off of the primary machine specified in the primary_machine argument to its standby
hareset: Attempts to reset pending, failed, or stuck resource states

Stopping and Starting Resources

If you want to stop or start the DB2 service, you need to stop or start the respective DB2 resource groups using TSA commands. TSA will then stop or start DB2.
The command to do this is chrg. To stop a resource group named db2_bculinux_NLG_beluga007, issue the command:
chrg -o offline -s "Name == 'db2_bculinux_NLG_beluga007'"
Similarly, to start the resource group:
chrg -o online -s "Name == 'db2_bculinux_NLG_beluga007'"
You can also start all resource groups at the same time (use -o offline to stop them all):
chrg -o online -s "1=1"
The Smart Analytics System also has some pre-configured commands:
hastartdb2 and hastopdb2
These two commands, however, are specific to DB2 and if there has been customization to TSA, they may not stop/start all resources.
If TSA has pre-configured rules/dependencies, they will ensure that resources are stopped and started in the correct order. For example, DB2 resources that depend on NFS will not start if the NFS share is Offline.
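
Since the quoting of the chrg selection string is easy to get wrong, a small wrapper that only prints the command can be handy for review before anything is run. This is a sketch, not part of TSA: rg_nominal_cmd is a hypothetical name, and it deliberately does not execute chrg.

```shell
#!/bin/sh
# rg_nominal_cmd NAME STATE (hypothetical helper): print, but do not run, the
# chrg command that would set the nominal state of resource group NAME to
# STATE (online or offline).
rg_nominal_cmd() {
    name=$1
    state=$2
    case $state in
        online|offline) ;;
        *) echo "usage: rg_nominal_cmd <resource_group> online|offline" >&2
           return 1 ;;
    esac
    # The single quotes around the name are part of the selection string
    printf 'chrg -o %s -s "Name == %s%s%s"\n' "$state" "'" "$name" "'"
}
```

For example, "rg_nominal_cmd db2_bculinux_NLG_beluga007 offline" prints the stop command from this section, which you can then copy and run as root.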

TSA Components

Now that you understand the basics of Tivoli System Automation, we can discuss some of the other components that it can manage.

Service IP

A service IP is a virtual, floating resource attached to a network device. Essentially, it is an IP address that can move from one machine to another, in the event of a failover. Service IPs play a key role in a highly available environment. Because they move from a failed machine to a standby, they allow an application to reconnect to the new machine using the same IP address – as if the original server had simply restarted.
The following command will allow you to view what service IPs have been configured for your system.
# lsrsrc -Ab IBM.ServiceIP
    Resource Persistent and Dynamic Attributes for IBM.ServiceIP
    resource 1:
     Name              = "db2ip_10_160_20_210-rs"
     ResourceType      = 0
     AggregateResource = "0x2029 0xffff 0x414c690c 0x7cc2abfa 0x919b42d5 0xbf62ab75"
     IPAddress         = "10.160.20.210"
     NetMask           = "255.255.255.0"
     ProtectionMode    = 1
     NetPrefix         = 0
     ActivePeerDomain  = "bcudomain"
     NodeNameList      = {"t6udb3a"}
     OpState           = 2
     ConfigChanged     = 0
     ChangedAttributes = {}
    resource 2:
     Name              = "db2ip_10_160_20_210-rs"
     ResourceType      = 0
     AggregateResource = "0x2029 0xffff 0x414c690c 0x7cc2abfa 0x919b42d5 0xbf62ab75"
     IPAddress         = "10.160.20.210"
     NetMask           = "255.255.255.0"
     ProtectionMode    = 1
     NetPrefix         = 0
     ActivePeerDomain  = "bcudomain"
     NodeNameList      = {"t6udb1a"}
     OpState           = 1
     ConfigChanged     = 0
     ChangedAttributes = {}
    resource 3:
     Name              = "db2ip_10_160_20_210-rs"
     ResourceType      = 1
     AggregateResource = "0x3fff 0xffff 0x00000000 0x00000000 0x00000000 0x00000000"
     IPAddress         = "10.160.20.210"
     NetMask           = "255.255.255.0"
     ProtectionMode    = 1
     NetPrefix         = 0
     ActivePeerDomain  = "bcudomain"
     NodeNameList      = {"t6udb1a","t6udb3a"}
     OpState           = 1
     ConfigChanged     = 0
     ChangedAttributes = {}
The above example shows three resources with the same name, db2ip_10_160_20_210-rs. The NodeNameList parameter tells us which node(s) each resource refers to. The first resource (node t6udb3a) has OpState 2 (Offline), so the service IP is not currently active there. The second resource (node t6udb1a) has OpState 1 (Online), which tells us that this is where the service IP is currently active. The third resource has ResourceType 1 and contains both nodes in its NodeNameList, which tells TSA that this is a floating resource that can move between those two nodes. The OpState values are listed in Table 3.
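
To find out quickly on which node a service IP is currently active, the lsrsrc output can be reduced to the fixed resources (ResourceType 0) that report OpState 1 (Online). A minimal sketch, assuming the "Attribute = value" layout shown above and that ResourceType and NodeNameList appear before OpState within each resource; the function name serviceip_online_node is hypothetical.

```shell
#!/bin/sh
# serviceip_online_node (hypothetical helper): read "lsrsrc -Ab IBM.ServiceIP"
# output on stdin and print "<resource name> <node>" for each fixed resource
# (ResourceType = 0) whose OpState is 1 (Online).
serviceip_online_node() {
    awk -F' = ' '
        {
            key = $1; gsub(/[ \t]/, "", key)
            val = $2; gsub(/[ \t{}"]/, "", val)   # drop padding, braces, quotes
        }
        key == "Name"         { name = val }
        key == "ResourceType" { type = val }
        key == "NodeNameList" { nodes = val }
        key == "OpState" && type == "0" && val == "1" { print name, nodes }'
}
```

With the output above, "lsrsrc -Ab IBM.ServiceIP | serviceip_online_node" would print db2ip_10_160_20_210-rs t6udb1a.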

Application Resources

TSA manages resources using scripts. Some scripts are built in (and part of TSA), such as those for controlling DB2. These scripts are responsible for starting, stopping and monitoring the application. Sometimes it can be useful to understand these scripts, or even edit them for problem diagnosis. To find out where they are located, we use the lsrsrc command, which provides us with the complete configuration of a particular resource.
Following is an example:
# lsrsrc -Ab IBM.Application
resource 12:
  Name                  = "db2_dbedw1da_8-rs" 
  ResourceType          = 1
  AggregateResource     = "0x3fff 0xffff 0x00000000 0x00000000 0x00000000 0x00000000"
  StartCommand          = "/usr/sbin/rsct/sapolicies/db2/db2V97_start.ksh dbedw1da 8"
  StopCommand           = "/usr/sbin/rsct/sapolicies/db2/db2V97_stop.ksh dbedw1da 8"
  MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/db2V97_monitor.ksh dbedw1da 8"
  MonitorCommandPeriod  = 60
  MonitorCommandTimeout = 180
  StartCommandTimeout   = 330
  StopCommandTimeout    = 140
  UserName              = "root"
  RunCommandsSync       = 1
  ProtectionMode        = 1
  HealthCommand         = ""
  HealthCommandPeriod   = 10
  HealthCommandTimeout  = 5
  InstanceName          = ""
  InstanceLocation      = ""
  SetHealthState        = 0
  MovePrepareCommand    = ""
  MoveCompleteCommand   = ""
  MoveCancelCommand     = ""
  CleanupList           = {}
  CleanupCommand        = ""
  CleanupCommandTimeout = 10
  ProcessCommandString  = ""
  ResetState            = 0
  ReRegistrationPeriod  = 0
  CleanupNodeList       = {}
  MonitorUserName       = ""
  ActivePeerDomain      = "bcudomain"
  NodeNameList          = {"d8udb11a","d8udb3a"}
  OpState               = 1
  ConfigChanged         = 0
  ChangedAttributes     = {}
  HealthState           = 0
  HealthMessage         = ""
  MoveState             = [32768,{}]
  RegisteredPID         = 0
Some of the more common and useful attributes are described in Table 3.
Table 3. Resource Attributes
Attribute: Definition
ResourceType: Indicates whether the resource is allowed to run on multiple nodes, or a single node. A fixed resource is identified with a ResourceType value of 0, and a floating resource has a value of 1.
StartCommand: Specifies the command to be run when the resource is started
StopCommand: Specifies the command to be run when the resource is stopped
MonitorCommand: Specifies the command to be run when the resource is being monitored. This happens at a regular interval, and you will likely see this command often when you run the "ps -ef" command.
UserName: The userid that TSA will use to start this resource
NodeNameList: Indicates on which nodes the resource is allowed to run. This is an attribute of an RSCT resource.
OpState: Specifies the operational state of a resource or a resource group. The valid states are,
0 - UNKNOWN
1 - ONLINE
2 - OFFLINE
3 - FAILED_OFFLINE
4 - STUCK_ONLINE
5 - PENDING_ONLINE
6 - PENDING_OFFLINE
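
When reading lsrsrc output in scripts, it is convenient to translate the numeric OpState into the names listed above. A small sketch (opstate_name is a hypothetical helper; the INVALID label for unlisted values is my own convention, not a TSA state):

```shell
#!/bin/sh
# opstate_name N (hypothetical helper): print the OpState name for the
# numeric value N, using the values listed in Table 3.
opstate_name() {
    case $1 in
        0) echo UNKNOWN ;;
        1) echo ONLINE ;;
        2) echo OFFLINE ;;
        3) echo FAILED_OFFLINE ;;
        4) echo STUCK_ONLINE ;;
        5) echo PENDING_ONLINE ;;
        6) echo PENDING_OFFLINE ;;
        *) echo "INVALID($1)"; return 1 ;;   # not a value from Table 3
    esac
}
```

For example, "opstate_name 2" prints OFFLINE.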

Network Resources

Every machine typically has an Ethernet adaptor, with a configured network address. TSA is aware of this and you can see how they have been configured with the lsrsrc command. For example,
# lsrsrc -Ab IBM.NetworkInterface
    resource 1:
        Name             = "en0"
        DeviceName       = ""
        IPAddress        = "172.22.1.217"
        SubnetMask       = "255.255.252.0"
        Subnet           = "172.22.0.0"
        CommGroup        = "CG1"
        HeartbeatActive  = 1
        Aliases          = {}
        DeviceSubType    = 6
        LogicalID        = 0
        NetworkID        = 0
        NetworkID64      = 0
        PortID           = 0
        HardwareAddress  = "00:21:5e:a3:be:60"
        DevicePathName   = ""
        IPVersion        = 4
        Role             = 0
        ActivePeerDomain = "bcudomain"

Log Files

It is important to be aware of the log files that TSA actively writes to:
  1. History file – this logs the commands that were sent to TSA
    /var/ct/IBM.RecoveryRM.log2
  2. Error and monitor logs – these logs are simply the AIX and Linux system logs. They will show you the output of the start, stop, and monitor scripts as well as any diagnostic information coming from TSA. Although the system administrator can configure the location for these logs, they are typically located in the following locations,
    AIX: /tmp/syslog.out
    Linux: /var/log/messages
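
If you work on both AIX and Linux nodes, a tiny helper can pick the typical system log location for you. This is a sketch that only encodes the two default paths mentioned above; tsa_syslog_path is a hypothetical name, and your administrator may have configured a different location.

```shell
#!/bin/sh
# tsa_syslog_path [OS] (hypothetical helper): print the typical system log
# path for the given OS name, defaulting to the output of "uname -s".
# Assumption: the default locations listed above are in use.
tsa_syslog_path() {
    case ${1:-$(uname -s)} in
        AIX)   echo /tmp/syslog.out ;;
        Linux) echo /var/log/messages ;;
        *)     return 1 ;;
    esac
}
```

A typical use would be: tail -f "$(tsa_syslog_path)"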

Command Reference

Table 4 describes the most common commands that a TSA administrator will use.
Table 4. Common TSA Commands
Command: Definition
hals: Display HA configuration summary
hastopdb2: Stop DB2 using TSA
hastartdb2: Start DB2 using TSA
mkequ: Makes an equivalency resource
chequ: Changes a resource equivalency
lsequ: Lists equivalencies and their attributes
rmequ: Removes one or more resource equivalencies
mkrg: Makes a resource group
chrg: Changes persistent attribute values of a resource group (including starting and stopping a resource group)
lsrg: Lists persistent attribute values of a resource group or its resource group members
rmrg: Removes a resource group
mkrel: Makes a managed relationship between resources
chrel: Changes one or more managed relationships between resources
lsrel: Lists managed relationships
rmrel: Removes a managed relationship between resources
samctrl: Sets the IBM TSA control parameters
lssamctrl: Lists the IBM TSA controls
addrgmbr: Adds one or more resources to a resource group
chrgmbr: Changes the persistent attribute value(s) of a managed resource in a resource group
rmrgmbr: Removes one or more resources from the resource group
lsrgreq: Lists outstanding requests applied against resource groups or managed resources
rgmbrreq: Requests a managed resource to be started or stopped, or cancels the request
rgreq: Requests a resource group to be started, stopped, or moved, or cancels the request
lssam: Lists the defined resource groups and their members in a tree format

Command Tips

Following are some useful commands with examples.
Show relationships/dependencies:
lsrel | sort
Show details for a specific relationship:
# lsrel -A b -s "Name = 'db2_bculinux_0-rs_DependOn_db2_bculinux_qp-rel'"
Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_bculinux_qp
        Class:Resource:Node[Target] = {IBM.Application:db2_bculinux_0-rs}
        Relationship                = DependsOn
        Conditional                 = NoCondition
        Name                        = db2_bculinux_0-rs_DependOn_db2_bculinux_qp-rel
        ActivePeerDomain            = bcudomain
        ConfigValidity              =
Delete/remove a relationship:
rmrel -s "Name like 'db2_bculinux_%-rs_DependsOn_db2_bculinux_0-rs-rel'"
Change a resource attribute:
chrsrc -s "Name=='"  attribute=value
Example:
chrsrc -s "Name=='db2ip_10_160_10_27-rs'" IBM.ServiceIP NetMask='255.255.255.0'
To save the current SAMP policy information:
sampolicy -s /tmp/sampolicy.current.xml
To check if the policy in the input file is valid:
sampolicy -c /tmp/sampolicy.current.xml
To activate it:
sampolicy -a /tmp/sampolicy.current.xml

Troubleshooting

This section describes methods that can be used to determine the cause of a particular problem or failure. Though techniques vary depending on the type of problem, the following should be a good starting point for most issues.
Resolving FAILED OFFLINE status
A FAILED OFFLINE status will prevent you from setting the nominal state to ONLINE, so any such resources must first be reset to OFFLINE before they can be brought back ONLINE. Make sure that the nominal state is OFFLINE before resetting the resource.
To resolve the Failed offline messages, use the resetrsrc command.
resetrsrc -s 'Name = "db2whse_appinstance_01.abxplatform_server1"' IBM.Application
resetrsrc -s 'Name = "db2whse_appinstance_01.adminconsole_server1"' IBM.Application
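
When many resources are affected, the resetrsrc commands can be generated from the lssam output instead of typed by hand. The sketch below only prints the commands for review and does not run them. It assumes that lssam labels such resources with the text "Failed offline" and uses the "<state> <class>:<name>[:<node>]" layout shown earlier; failed_offline_reset_cmds is a hypothetical name.

```shell
#!/bin/sh
# failed_offline_reset_cmds (hypothetical helper): read "lssam" output on
# stdin and print one resetrsrc command per resource shown as
# "Failed offline". Resource group lines are skipped; only resources are reset.
failed_offline_reset_cmds() {
    awk -v q="'" '
        /Failed offline IBM\./ && !/IBM\.ResourceGroup:/ {
            spec = substr($0, index($0, "IBM."))   # Class:Name[:Node] ...
            split(spec, word, " ")
            n = split(word[1], part, ":")
            if (n < 2) next
            key = part[1] ":" part[2]
            if (!(key in seen)) {                  # one command per resource
                seen[key] = 1
                printf "resetrsrc -s \"Name = %s%s%s\" %s\n", q, part[2], q, part[1]
            }
        }'
}
```

Usage would be "lssam | failed_offline_reset_cmds"; check each printed command before running it.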
Recovery from a failed failover attempt
Take all TSA resources offline. The lssam output should reflect “Offline” for all resources before you attempt to bring them back online. To reset NFS resources, use:
resetrsrc -s "Name like 'SA-nfsserver-%'" IBM.Application (if necessary)
resetrsrc -s "Name like 'SA-nfsserver-%'" IBM.ServiceIP (if necessary)
When testing goes wrong, you are often left with resources in various states such as online, offline, and unknown. When the state of a resource is unknown, before attempting to restart it, you must issue resetrsrc for that particular resource.
When you are restarting DB2, you must verify that all the resources are offline before attempting to bring them online again. You must also correct the db2nodes.cfg file. Make sure you have backup copies of db2nodes.cfg and db2ha.sys.
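The "everything must be offline first" rule is easy to check by script before a restart. Below is a small sketch, assuming plain-text lssam output in which each resource line carries a state word (such as Offline, Online, or Failed offline) in front of a Class:Name token. The function name all_offline is hypothetical and the line format is an assumption, so test it against your own lssam output.

```shell
# Hypothetical check: succeed only if every resource line on stdin
# reports exactly "Offline". Lines in any other state are printed.
all_offline() {
    awk '
        /IBM\.[A-Za-z]+:/ {
            state = ""
            # collect the state words that precede the Class:Name token
            for (i = 1; i <= NF && $i !~ /^IBM\./; i++)
                if ($i ~ /^[A-Za-z]+$/)
                    state = (state == "" ? $i : state " " $i)
            seen++
            if (tolower(state) != "offline") { bad++; print $0 }
        }
        # fail when nothing was recognised or any line was not offline
        END { exit ((seen == 0 || bad > 0) ? 1 : 0) }'
}
```

Usage: lssam | all_offline && echo "safe to bring resources online"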
NFS mounts stop functioning
In testing the NFS failover, we were able to move the server over successfully, but the existing NFS client mounts stopped functioning. We solved this problem by unmounting and remounting the NFS volume.
Resolving Binding=Sacrificed
To resolve this problem, look at the overall cluster and how it is set up and defined. The usual causes are issues that have a cluster-wide impact rather than issues that affect one specific resource.
  1. Check for failed relationships by listing them with the command "lsrel -Ab", and then determine whether one or more of the relationships relating to the failed resource group have not been satisfied.
  2. Check for failed equivalencies by listing them with the command "lsequ -Ab", and then determine whether one or more of the equivalencies have not been satisfied.
  3. Check your resource group attributes and look for anything that may be set incorrectly. Some of the commands to use are listed as follows:
    lsrg -Ab -g <resource_group_name>
    lsrsrc -s 'Name="failed_resource"' -Ab IBM.<resource_class>
    lsrg -m -g <resource_group_name>
    samdiag -g <resource_group_name>
  4. Check for anything specific to your configuration that all of the sacrificed resources share in common, like a mount point, a database instance, a virtual IP.
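Since the checklist above involves comparing several command outputs, it can be handy to capture them all in one directory first. A minimal sketch, assuming the commands are on the PATH and are run by a user allowed to query the domain; the function name collect_sacrificed_diag and the output file names are invented for illustration.

```shell
# Hypothetical helper: save the output of the diagnostic commands from the
# checklist into one directory so they can be reviewed side by side.
#   $1 = resource group name, $2 = output directory
collect_sacrificed_diag() (
    rg_name=$1
    out_dir=$2
    mkdir -p "$out_dir" || exit 1
    lsrel -Ab              > "$out_dir/lsrel.txt"        2>&1   # relationships
    lsequ -Ab              > "$out_dir/lsequ.txt"        2>&1   # equivalencies
    lsrg -Ab -g "$rg_name" > "$out_dir/lsrg_attrs.txt"   2>&1   # group attributes
    lsrg -m -g "$rg_name"  > "$out_dir/lsrg_members.txt" 2>&1   # group members
    samdiag -g "$rg_name"  > "$out_dir/samdiag.txt"      2>&1
    echo "$out_dir"
)
```

Usage: collect_sacrificed_diag <resource_group_name> /tmp/sacrificed_diag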
Check hardware configuration:
dmesg – check initialization errors
date – check server synchronization
ifconfig – to check network adapters
netstat -I – to check network configuration
ps -ef | grep inetd – will provide a list of the running processes, including group and PID
Resource state is unknown
Try resetting the resource using the resetrsrc command:
resetrsrc -s "Name like 'db2_db2inst2_%'" IBM.Application
resetrsrc -s "Name like 'db2_db2inst2_%'" IBM.ServiceIP
Timeout values for resources
For the health query interval of each resource, use:
chrsrc -s 'Name like "db2_db2inst2%"' IBM.Application MonitorCommandPeriod=300
For the health query timeout, use:
chrsrc -s 'Name like "db2_db2inst2%"' IBM.Application MonitorCommandTimeout=290
For the resource startup script timeout, use:
chrsrc -s 'Name like "db2_db2inst2%"' IBM.Application StartCommandTimeout=300
For the resource stop script timeout, use:
chrsrc -s 'Name like "db2_db2inst2%"' IBM.Application StopCommandTimeout=720
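If you tune these four values often, a small generator keeps them consistent. The sketch below only prints the chrsrc commands for review. The function name print_timeout_cmds is hypothetical, and the sanity check (monitor timeout not larger than monitor period) is my assumption based on the 290/300 pairing used above, not a rule quoted from the product documentation.

```shell
# Hypothetical helper: print the four chrsrc commands for one name pattern.
#   $1 = name pattern, $2 = MonitorCommandPeriod, $3 = MonitorCommandTimeout,
#   $4 = StartCommandTimeout, $5 = StopCommandTimeout
print_timeout_cmds() (
    pattern=$1 period=$2 mon_timeout=$3 start_timeout=$4 stop_timeout=$5
    for n in "$period" "$mon_timeout" "$start_timeout" "$stop_timeout"; do
        case $n in
            ''|*[!0-9]*) echo "all four values must be whole numbers" >&2; exit 2 ;;
        esac
    done
    # Assumption: the health query should time out within one query interval.
    if [ "$mon_timeout" -gt "$period" ]; then
        echo "MonitorCommandTimeout ($mon_timeout) exceeds MonitorCommandPeriod ($period)" >&2
        exit 1
    fi
    for setting in \
        "MonitorCommandPeriod=$period" \
        "MonitorCommandTimeout=$mon_timeout" \
        "StartCommandTimeout=$start_timeout" \
        "StopCommandTimeout=$stop_timeout"
    do
        printf "chrsrc -s 'Name like \"%s\"' IBM.Application %s\n" "$pattern" "$setting"
    done
)
```

Usage: print_timeout_cmds 'db2_db2inst2%' 300 290 300 720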
Recycling the automation manager
If the problem is most likely related to the automation manager, you should try recycling the automation manager (IBM.RecoveryRM) before contacting IBM support. This can be done using the following commands:
Find out on which node the RecoveryRM master daemon is running:
# lssrc -ls IBM.RecoveryRM | grep Master
On the node running the master, retrieve the PID and kill the automation manager:
# lssrc -ls IBM.RecoveryRM | grep PID
# kill -9 <PID>
As a result, an automation manager on another node in the domain takes over the master role and proceeds with making automation decisions. The subsystem restarts the killed automation manager immediately.
Resolving lssam hangs
http://www-01.ibm.com/support/docview.wss?uid=swg21293701
Move to another node in the same HA group and see if you can run the lssam command. If you can, go back to the original node and see if you can now run the lssam command there. If this still does not work, then run the following commands:
lssrc -ls IBM.RecoveryRM | grep -i master 
lssrc -ls IBM.GblResRM | grep -i leader
Make sure that neither of the above command outputs returns the “hanging” node. If that is the case, reboot just that node and see if the issue is resolved.
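The master/leader check can be wrapped in a small function so that the answer is a simple yes or no. A rough sketch, assuming the node name appears in the matching lines exactly as you type it (short name versus fully qualified name may differ on your cluster); the function name node_is_master_or_leader is made up.

```shell
# Hypothetical helper: print the "master" and "leader" lines that mention
# the given node. Exit status 0 means the node appears in at least one of
# them, 1 means it appears in neither.
node_is_master_or_leader() (
    node=$1
    {
        lssrc -ls IBM.RecoveryRM | grep -i master
        lssrc -ls IBM.GblResRM | grep -i leader
    } | grep -F -w -- "$node"
)
```

Usage: node_is_master_or_leader <hanging_node> || echo "node is neither master nor leader"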
AVOID the following (DON’Ts)
  • Do not use rpower -a, or rpower on more than one node in the same HA group when SAMP HA is up and running.
  • Do not offline HA-NFS using a sudo command while logged in as the instance owner and while in the /db2home directory. HA-NFS will get stuck online, and the RecoveryRM daemon has to be killed on the master. If RecoveryRM will not start, reboot may be required.
  • Do not use ifdown to bring down a network interface. This deletes the eth (or en) device from the equivalency membership, and you will then have to add the "eth" device (in Linux) or "en" device (in AIX) back into the network equivalency using the chequ command.
  • Do not manipulate any BW resources that are under active SAMP control.
    Turn automation off (samctrl -M T) before manipulating these BW resources.
  • Do not implement changes to the SA MP policy unless exhaustive testing of the HA test cases is completed.
Check the following frequently (DOs)
  • Ensure the /home and /db2home directories are always mounted before starting up a node.
  • Check for process ids that may be blocking stop, start and monitor commands.
  • Save backup copies of the db2nodes.cfg and db2ha.sys files.
  • Save the backup copies of the current SAMP policy before and after every SAMP change. Compare the current SAMP policy to the backup SAMP policy every time there is an HA incident.
  • Save backup copies of db2pd -ha output before and after every SAMP change. Compare the current db2pd outputs to the backup db2pd outputs every time there is an HA incident.
  • Save backup copies of the samdiag outputs. 
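The "save before and after, then compare" habit from the list above can be scripted. Here is a minimal sketch, assuming it is run by a user who may execute both sampolicy and db2pd (db2pd normally needs the instance environment); the function names snapshot_ha_state and compare_snapshots, the directory layout, and the file names are all hypothetical.

```shell
# Hypothetical helper: store the current SAMP policy and db2pd -ha output
# in a new timestamped directory under $1 and print that directory name.
snapshot_ha_state() (
    snap_root=$1
    snap_dir="$snap_root/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$snap_dir" || exit 1
    sampolicy -s "$snap_dir/sampolicy.xml"
    db2pd -ha > "$snap_dir/db2pd_ha.txt" 2>&1
    echo "$snap_dir"
)

# Compare two snapshot directories; exit status 0 means no differences.
compare_snapshots() {
    diff -r "$1" "$2"
}
```

Usage: take one snapshot before a SAMP change and one after it, then run compare_snapshots on the two directories. After an HA incident, compare a fresh snapshot against the last known good one.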
Source