Sunday, 31 March 2013

AIX System Recovery Tips and Techniques

An AIX recovery can be necessary as a result of a number of events: the loss of some system files, an unexplained system crash, a site environmental problem, or simply a request for a system recovery test. Whatever the cause, be prepared to hit the ground running and get the recovery done, or be ready to pack your bags and say goodbye.

AIX recovery is a basic skill; there are no excuses for not having it or not being prepared to use it as part of a disaster recovery (DR) plan. AIX system recovery isn’t rocket science, but you need to have your wits about you. This article will help you prepare to perform a recovery quickly and with confidence.

Prepare, Prepare, Prepare

Key requirements for a successful recovery are an up-to-date configuration listing of the target machine, a current system backup, and application backups or re-installation media. Whether you’re dealing with a full or partial restore, or a simulated or real disaster, the processes involved are the same. If you’re prepared with these prerequisites, your recovery will go smoothly; if not, you’re in for a difficult time.

The best way to ensure that you’re prepared is to routinely (at least weekly) create a system-bootable backup of your AIX servers to capture the sort of periodic changes that occur on a regular basis, such as PTFs and minor file changes. Also, track the status of applications and data being backed up daily, because these components are much more volatile than the OS itself. Typically, application backup is the responsibility of an operational team, but as an AIX systems admin, you should be informed that the data is being backed up successfully; after all, the applications do reside on your machine. You should also take a configuration report for each server. At minimum, this should include the output of the following commands:
  • lspv
  • lsvg -l <vgname>, lsvg -p <vgname>, lsvg <vgname> (for all volume groups)
  • lsslot -c slot
  • lscfg -vp
  • lsdev
  • lsattr -El sys0
A script can collect this information for you automatically and archive it off machine by, for example, emailing its output file to you. With the information these commands provide, you’ll be on a good footing for a confident recovery.
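As a rough illustration of such a collection script, here is a minimal sketch in POSIX shell. The function names (run_cmd, collect_config), the report file name, and the output directory are my own assumptions, not part of any AIX tooling; the commands it runs are the ones listed above.

```shell
#!/bin/sh
# Sketch only: gather the configuration report into a single text file.
# The function names and the report naming scheme are illustrative assumptions.

# run_cmd FILE CMD [ARGS...]: append a header line and the command output to FILE.
run_cmd() {
    out=$1
    shift
    printf '\n===== %s =====\n' "$*" >> "$out"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >> "$out" 2>&1 || printf '(command exited with an error)\n' >> "$out"
    else
        printf '(command not found on this host)\n' >> "$out"
    fi
}

# collect_config DIR: write DIR/<nodename>_config_<date>.txt and print its path.
collect_config() {
    dir=$1
    report="$dir/$(uname -n)_config_$(date +%Y%m%d).txt"
    : > "$report"
    run_cmd "$report" lspv
    # Per-volume-group detail for every volume group the system reports.
    if command -v lsvg >/dev/null 2>&1; then
        for vg in $(lsvg); do
            run_cmd "$report" lsvg "$vg"
            run_cmd "$report" lsvg -l "$vg"
            run_cmd "$report" lsvg -p "$vg"
        done
    fi
    run_cmd "$report" lsslot -c slot
    run_cmd "$report" lscfg -vp
    run_cmd "$report" lsdev
    run_cmd "$report" lsattr -El sys0
    printf '%s\n' "$report"
}
```

Calling collect_config /tmp from root’s crontab and then mailing the resulting file to yourself is one way to get the report off the machine. Whichever way you choose, make sure a copy lives somewhere other than the server it describes.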

Expect the Unexpected

Recovering a system to a new server at a remote site typically involves restoring the OS from a tape or DVD bootable backup. You can perform a boot restore via the network if you’ve taken remote network system saves with netboots (e.g., Storix or NIM), but this process is much slower than restoring from a tape or DVD, and only the largest “hot site” facilities have netboot host capabilities. The rest of us must make do with bare-bones recovery from the trenches.

The restore-from-bootable-media process is straightforward. First, because it’s best to start up without a network attachment, make sure all Ethernet and other network cables (other than storage) are unplugged. Next, insert the bootable media—tape or DVD—into a boot-capable drive and start the system. It’s best if the server you’re restoring to closely matches the specs of the failed server, but some differences can be accommodated. For example, the root volume group (rootvg) disk(s) might not be the same size, but as long as they’re larger, not smaller, the restore will complete. You should be prepared to alter some of the logical volume copies or re-size the logical volumes during the AIX recovery process if your restore product allows.

Confirm Settings in New Environment

Confirm with the network team or DR manager the settings you’ll be using at the recovery site, and have the following to hand:
  • Host and gateway IP addresses (IPv4 and IPv6)
  • Subnet mask
  • DNS servers
  • DNS entries (forward and reverse for all addresses owned by the host)
  • Firewall, ipfilter, and/or tcpwrapper rules
  • Printed copies of all customized directories showing ownership and permission settings
  • Mail relay host (if your machine forwards mail)
  • xntpd server
You might be on a different LAN or VLAN for the duration of the disaster, so be sure to document the IP environment for the recovery site so that you’re not fighting network issues during recovery operations. And, of course, if your system interacts with other servers or services, ensure that those are accessible from the recovery site.

Review Operational Parameters

Remember that all Ethernet cables should be disconnected at startup. If the machine comes up with the network interface disabled, that’s good; if it comes up enabled, you forgot to take out the Ethernet cables, which can complicate startup troubleshooting. (You don’t want some automated application process kicking off uncontrolled sessions.) When the AIX recovery boot-up completes, it’s time to check all the operational parameters, and then check them again. Review the /etc/inittab file, comment out any non-required services you don’t want started, then refresh the inittab with telinit -q. Check out root’s crontab and review any non-required periodic jobs that might start. Once you’re satisfied that all application processes and undesired mail sending processes are commented out, stop or kill any processes that might have been kicked off before you reviewed /etc/inittab and crontabs. You might want to delete any outbound queued email files held in /var/spool/mqueue because the mail system might try to send those messages, which you might not want until you’re ready for full production operation.
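To make the /etc/inittab review quicker, a small filter can show only the entries that are still active. This is a sketch under a couple of assumptions: the function name list_active_inittab is my own, and it relies on the AIX convention that an inittab entry is commented out by a colon at the start of the line.

```shell
#!/bin/sh
# Sketch only: list the inittab entries that are NOT commented out.
# list_active_inittab is a hypothetical helper name.

# list_active_inittab [FILE]: print "identifier<TAB>action<TAB>command"
# for each active entry. FILE defaults to /etc/inittab.
list_active_inittab() {
    awk -F: '
        /^:/ || $0 ~ /^[ \t]*$/ { next }   # skip commented-out and blank lines
        {
            cmd = $4
            # The command field may itself contain colons, so rejoin it.
            for (i = 5; i <= NF; i++) cmd = cmd ":" $i
            printf "%s\t%s\t%s\n", $1, $3, cmd
        }
    ' "${1:-/etc/inittab}"
}
```

Run it once before you start commenting entries out and once after, and compare the two listings to confirm that only the services you want are left. Root’s crontab can be reviewed in the same spirit with crontab -l.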

Next, stop and restart sendmail so you have a clean mail agent running. Review any firewall, ipfilter, and tcpwrapper rules you have; these will undoubtedly have to be amended now that you’re in recovery mode and in a geographically different environment. If your machine’s database applications use raw devices, be sure to check the ownerships of these devices in /dev, because these likely would have been changed on a system restore. Most databases use asynchronous I/O (AIO); check that the AIO server processes are running using pstat -a. If your machine is on AIX 6.1 or later, the AIO subsystem is started automatically when it is needed. On AIX 5.3, you’ll probably need to start it up yourself.

On the Network

Bring the machine onto the network by connecting the Ethernet cables (you should have already configured the net interfaces). Verify that you can ping the network gateway (both IPv4 and IPv6 if you use it), your DNS server, and any necessary collaborative servers. Validate that your configured DNS correctly resolves local and global names, and give special attention to reverse name resolution for the IP addresses owned by the AIX system you’re recovering. One of the most common root causes of startup failure is missing DNS entries for the new network environment.
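These checks are easy to run as one pass so that nothing gets skipped under pressure. The sketch below assumes hypothetical helper names (check, verify_network) and that you supply the gateway, DNS server, hostname, and IP address for the recovery site; it uses only ping -c and host.

```shell
#!/bin/sh
# Sketch only: run the basic reachability and DNS checks in one pass.
# check and verify_network are hypothetical helper names.

# check LABEL CMD [ARGS...]: run CMD quietly, print OK or FAIL with the label.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'OK    %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        return 1
    fi
}

# verify_network GATEWAY DNS_SERVER HOSTNAME IPADDR
# Returns non-zero if any single check failed.
verify_network() {
    rc=0
    check "ping gateway $1"    ping -c 3 "$1" || rc=1
    check "ping DNS server $2" ping -c 3 "$2" || rc=1
    check "forward lookup $3"  host "$3"      || rc=1
    check "reverse lookup $4"  host "$4"      || rc=1
    return $rc
}
```

Call verify_network with the values the network team gave you for the recovery site, and repeat the reverse lookup for every address the host owns. Any FAIL line is worth chasing down before you start the applications.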

If static routes are required to reach any internal or WAN networks other than through the default gateway, use the netstat -rn command to verify that the routes exist, and add them if needed. Stop and start the sshd service if it’s present (do this from the console, or you’ll cut off your command-line session). Test a remote connection, such as Telnet or ssh, to ensure you have remote access capabilities. Next, start the xntpd service so the machine time begins to sync, and verify it with the date command. You should now be able to send a test email to make sure sendmail forwarding works:
# echo "test mail" | mail <your-email-address>
Now you're ready to configure your data volumes.
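Before moving on, the static route check from this section can be scripted rather than read by eye. This is a minimal sketch: has_route is a hypothetical helper name, and it assumes the destination is passed exactly as netstat -rn prints it in the first column (AIX may abbreviate network destinations).

```shell
#!/bin/sh
# Sketch only: test whether a destination is present in a routing table listing.
# has_route is a hypothetical helper name.

# has_route DEST: read "netstat -rn" style output on stdin and succeed
# if DEST matches the first column of any line exactly.
has_route() {
    awk -v dest="$1" '$1 == dest { found = 1 } END { exit !found }'
}

# Intended use on the recovered host (not executed here):
#   netstat -rn | has_route default || echo "default route is missing"
```

If the check fails, add the route by hand and record it in your DR notes so it isn’t forgotten on the next test.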

Bring In the Disks

Internal data volumes won’t typically be saved with the system bootable backup. You must restore them separately, so be sure your DR plan includes the instructions for this step. If you use a Storage Area Network (SAN), the SAN volumes might reside at a remote site. If so, be sure to get the iSCSI configuration or FC zoning correct (there’s no time to mess around), then run cfgmgr to bring the disks in. The same goes for locally attached disks. Be sure to create your disk RAID configuration, if required. If you’re only going to be at DR for a few days, you can generally forgo RAID altogether: over so short a period, the risk of a disk failure is small enough that the extra complexity isn’t worth it.

Create the volume groups and file systems based on the configuration reports you captured previously. It can pay to generate a rebuild script at the same time you gather the host configuration reports; running it at the recovery site creates the file systems automatically and saves you a lot of time, as I’ve learned from experience.
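One way to approach such a rebuild script is to turn the saved lsvg -l output into a list of commands that you review before running. The sketch below is illustrative only: gen_rebuild is a hypothetical name, it assumes the usual seven-column lsvg -l layout, and it only prints mklv and crfs commands. It does not create the volume group itself, ignores mirror copies, and assumes the new volume group uses the same physical partition size as the original.

```shell
#!/bin/sh
# Sketch only: read saved "lsvg -l <vgname>" output on stdin and PRINT
# the mklv/crfs commands that would recreate it. Nothing is executed.
# gen_rebuild is a hypothetical helper name.
gen_rebuild() {
    awk '
        NF == 1 && $1 ~ /:$/ {                  # "datavg:" header line
            vg = substr($1, 1, length($1) - 1)
            next
        }
        $1 == "LV" && $2 == "NAME" { next }     # column titles
        NF >= 7 && vg != "" {
            # Columns: name, type, LPs, PPs, PVs, state, mount point
            printf "mklv -y %s -t %s %s %s\n", $1, $2, vg, $3
            if (($2 == "jfs" || $2 == "jfs2") && $7 != "N/A")
                printf "crfs -v %s -d %s -m %s -A yes\n", $2, $1, $7
        }
    '
}
```

Redirect the output to a file, read it through, and adjust sizes or add the mkvg step for the disks you actually have at the recovery site before executing anything.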

Restore the Application Data

As I noted earlier, your application data must be backed up separately from the bootable OS media, and thus must be restored separately. If you’re using a third-party product for your application backups, check that the client is running and talking back to the remote backup server. Next, restore the applications and the data (if you do incremental backups, ensure the operational team has the full list of tapes required). This is typically the operational team’s responsibility, so be sure to hurry them along. When all the data is recovered, review the permissions of the base directories or file systems, then review them again. Once you’re satisfied, prepare to start up the services in a controlled manner, one by one. If you have databases to restore, make sure you have the latest dumps before restoring them. Review the processes running and consult with the applications’ support teams so that there are no issues. If everything looks good, stop all applications.

A Reboot with Pause

Now’s the time to test that the machine can reboot. You might be thinking, “Why do this; let’s just get the machine recovered?” Well, if the machine goes down at a working DR site, it doesn’t reflect well on you or your team, so run this test now before you release the machine to the users. There are many factors that could stop an automatic boot, and because your initial boot was closely attended, you might not have encountered or noticed them. Simple things such as an incorrectly seated Ethernet cable or an IP address conflict can cause a reboot to stop and wait for manual intervention, so a trial reboot is essential.

First, clear the error log with errclear 0 so that you start with a clean slate. Issue the bosboot command and then the reboot command. You should always issue a bosboot before any reboot or shutdown: it rebuilds the boot image so the machine boots from an image that matches the current system configuration, and it’s a good habit to have. If for some reason the boot hangs, count your lucky stars that you discovered the problem now.
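The sequence above can be wrapped so that you see exactly what will run before committing to it. This is a sketch with assumptions: run and trial_reboot are hypothetical names, the DRY_RUN switch is my own addition, and the bosboot flags shown (-ad /dev/ipldevice) are the commonly used form rather than something the steps above specify.

```shell
#!/bin/sh
# Sketch only: the trial reboot sequence with a dry-run safety switch.
# run, trial_reboot and DRY_RUN are illustrative assumptions.

# run CMD [ARGS...]: print the command when DRY_RUN is 1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

# trial_reboot: clear the error log, rebuild the boot image, then reboot.
# Each step runs only if the previous one succeeded.
trial_reboot() {
    run errclear 0 &&
    run bosboot -ad /dev/ipldevice &&
    run reboot
}
```

With DRY_RUN left at its default, trial_reboot only prints the three commands. Set DRY_RUN=0 once all applications are stopped and you’re at the console, ready to watch the boot.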

A Final Cross-Check, Please

Once the machine comes back up, check that all services are up. Get the support team to connect to the applications. Then relax and wait for the phone calls to come in about some other tinkering that needs to be done. This is inevitable, I’m afraid; however, the bulk of your work is now done.
