Friday 22 May 2009

A business on demand solution for backups using Dlpar, SSH & Sudo

Managing system-level backups has been an uphill task for administrators of P5-based Lpars. In most cases, system administrators are asked to take system-level backups of multiple Lpars with a single tape drive or DVD-RAM drive. This tape drive, DVD-RAM drive or CD-RW drive then has to be moved across all these Lpars using IBM DLPAR technology. Some system administrators use the Sysback tool from IBM, which allows AIX-level backups (mksysb, vg backups, filesystem-level backups, etc.) to be taken to a tape library remotely. It therefore eliminates the need to perform Dlpar operations, because it takes remote backups over TCP/IP. In environments where TSM is available, Sysback can be integrated with TSM, so that system administrators can use TSM policies to manage versions of system-level backups.

In a scenario where neither Sysback nor TSM is available, system administrators have to rely on the AIX built-in tools such as mksysb and savevg to protect their system-level configurations and data. These AIX operating-system-level backup tools can only take backups to devices that are locally available to the Lpar (like tape drives or DVD drives). When performing any DLPAR operation to move these devices, two important things should be kept in mind:

1. First, these devices should be present in the system as child resources of a “movable” physical adapter. By “movable” I mean that this physical adapter ideally should not contain any resource that is required by the system; only resources that are declared as desired resources in the Lpar profile should be present.

2. If any other resource exists as a child of that physical adapter (the one that has the tape drive or DVD-RAM drive as a child resource), then deletion of the physical adapter's PCI slot, and the recursive deletion of the tape drive resource, should be done very carefully.
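One way to see what hangs off an adapter before moving it is to walk the ODM, as the scripts later in this article do. The following is a minimal sketch that only parses text in the stanza format printed by odmget for the CuDv class; the function names parent_of and device_names are illustrative and are not part of AIX.

```shell
# parent_of: read CuDv stanza text on stdin and print the parent device.
parent_of() {
    awk '$1 == "parent" && $2 == "=" { gsub(/"/, "", $3); print $3; exit }'
}

# device_names: read CuDv stanza text on stdin and print every device name.
device_names() {
    awk '$1 == "name" && $2 == "=" { gsub(/"/, "", $3); print $3 }'
}

# Example on an AIX Lpar (not executed here; pci3 is a made-up device name):
#   odmget -q name=cd0 CuDv | parent_of
#   odmget -q parent=pci3 CuDv | device_names
```

If device_names lists anything other than the backup device and its controller, the adapter is probably not a good candidate for regular Dlpar moves.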

As the number of Lpars within a physical server increases while only a few backup devices are available, system administrators may find it hard to manage these Dlpar operations, especially when backups have to be done on a daily basis.

In this article, I will guide readers step by step through a fully automated solution for moving backup devices (like a DVD-RAM device) between Lpars. The solution uses UNIX shell scripting with the basic tools SSH and Sudo to automate the whole job for system administrators.

Basic Features of Solution

The solution comprises the following features:

1. It automatically detects how many Lpars are available on a given physical P5 or P4 system. It then detects and shows the Lpar which holds the device in the available state.
2. It asks you to identify the target Lpar (to which the device has to be moved) and then, after confirmation, deletes the parent adapter and its child devices from the source Lpar and performs the required Dlpar operation from the source Lpar to the target Lpar.
3. After the device has been made available on the target Lpar, it can be used for system, vg or file system level backups.

To perform all these tasks automatically, the solution uses SSH to access the HMC and perform DLPAR operations. As a security requirement, the HMC should not be accessed remotely as the root user, while deleting devices and running cfgmgr on the AIX operating system must be done by a root-equivalent user. To meet both requirements, I also used Sudo as a major component of this automated solution.

Following is a pictorial representation of the whole solution:











Building the whole solution step by step
My scenario consisted of two P570 servers and two P550 servers managed by a single HMC. There are six Lpars on P570-1, four on P570-2 and three Lpars on each P550. Each of these P5 servers, of course, has a single DVD-RAM drive for backup purposes. As a prerequisite for this solution, DLPAR operations should be working on all the servers, and all Lpars should be capable of acquiring processor, memory and IO resources through Dlpar operations.
A) Host Name resolution setup:

First of all, I identified one Lpar on P570-2 (aqbtest) as the management Lpar for this whole solution. I set up name resolution on this Lpar so that all other Lpars on the P570-1, P570-2 and P570-3 servers can be pinged by name and IP address from this management Lpar. The HMC should also be resolvable by hostname from this management Lpar.
One important thing to consider while implementing this solution is that Lpar names (as they appear on the HMC interface) should also be resolvable from the management Lpar. In most cases hostnames and Lpar names are the same; if they differ, you can still use the /etc/hosts file or DNS to make the Lpar names resolvable from the management Lpar. This requirement arises from the shell scripting done for this solution (cdlpar.sh connects to Lpars over SSH using the Lpar names reported by the HMC) and is explained further in Section D.
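The name-resolution requirement can be checked before anything else is set up. The following is a minimal sketch, assuming the Lpar names are available one per line (for example in the lparsonsys file that cdlpar.sh writes). The function name check_names and the RESOLVE_CMD variable are hypothetical; the default lookup command "host" is an assumption and can be replaced by any command that fails for an unknown name.

```shell
# check_names: read one name per line on stdin and report every name that
# the lookup command cannot resolve. Returns 1 if any name is missing.
check_names() {
    resolve=${RESOLVE_CMD:-host}    # assumption: "host" is installed
    failed=0
    while read -r name; do
        [ -z "$name" ] && continue
        if $resolve "$name" </dev/null >/dev/null 2>&1; then
            echo "OK      $name"
        else
            echo "MISSING $name"
            failed=1
        fi
    done
    return $failed
}

# Example (not executed here):
#   check_names < lparsonsys
```

Any name reported as MISSING needs an entry in /etc/hosts or DNS before cdlpar.sh can reach that Lpar.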

B) SSH access setup:

The second step is establishing SSH access from this management Lpar to every other Lpar as well as to the HMC.
SSH access to the HMC is established in a slightly different way from SSH access to Lpars. I used OpenSSH for AIX from the Bull web site (freeware.openssh.rte 3.8.1.0) and installed it along with the openssl library (openssl 0.9.6.7). I created a user named hscadmin on the aqbtest Lpar and also on the HMC, HMC1. I assigned the "Managed system profile" to the hscadmin user on the HMC. I also enabled remote command execution (so that the HMC allows remote SSH connections to be established with it).
On the AIX Lpar "aqbtest", I generated an RSA key pair with the following commands:
/home/root> su - hscadmin
/home/hscadmin> ssh-keygen -t rsa ( accept default values with blank passphrase )
/home/hscadmin> export hscadminkey=`cat .ssh/id_rsa.pub`
/home/hscadmin> ssh hscadmin@HMC1 mkauthkeys -a \"$hscadminkey\"
The above command copies the public key from the AIX Lpar aqbtest to HMC1. Once it is copied, you can also log in directly to the HMC as hscadmin using ssh and verify whether the key has been copied successfully by executing the "cat .ssh/authorized_keys2" command.
You should now be able to log in to the HMC from the AIX management Lpar without any password prompt. You can verify this by executing
/home/hscadmin> ssh HMC1 lsusers

which will show all users present on the HMC.

If you face any problem while logging in to the HMC using ssh, you can always empty the authorized_keys2 file and then try the above procedure again. To empty this file, you can use the following command sequence on the AIX management Lpar:
/home/hscadmin> touch /tmp/mykeyfile ( an empty file )
/home/hscadmin> scp /tmp/mykeyfile hscadmin@HMC1:.ssh/authorized_keys2

Now the same management Lpar should also be able to execute commands remotely on all other Lpars without any password prompt. For this, I again decided to use SSH, this time with DSA authentication, so that the management Lpar can log in and execute commands on all Lpars remotely without any password prompt.

I created the hscadmin user (which may be an ordinary user) on another Lpar on the P570-1 server (named aqbdb) and then installed OpenSSH on this Lpar. I then generated a DSA key pair on the management Lpar "aqbtest":

/home/hscadmin> ssh-keygen -t dsa -b 2048

/home/hscadmin> scp .ssh/id_dsa.pub hscadmin@aqbdb:/home/hscadmin/.ssh/dsa_aqbtest.pub

On the aqbdb Lpar:

/home/hscadmin> touch .ssh/authorized_keys
/home/hscadmin> cat .ssh/dsa_aqbtest.pub >> .ssh/authorized_keys

Now you should be able to log in from the management Lpar aqbtest to the Lpar aqbdb, as the hscadmin user, using ssh, without any password prompt.
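Because the scripts in Section D run unattended, it helps to confirm that key-based login really works for every Lpar and never falls back to a password prompt. The sketch below relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of asking for a password. The function name check_ssh and the SSH_CMD variable are hypothetical. The remote command "true" is meant for AIX Lpars; the HMC runs a restricted shell, so a permitted HMC command would have to be substituted there.

```shell
# check_ssh: try a no-op remote command on each target given as an argument.
# With BatchMode=yes ssh never prompts, so a failed key login shows up as a
# non-zero exit status. Returns 1 if any target fails.
check_ssh() {
    ssh_cmd=${SSH_CMD:-ssh}
    rc=0
    for target in "$@"; do
        if $ssh_cmd -o BatchMode=yes "$target" true </dev/null >/dev/null 2>&1; then
            echo "key login OK: $target"
        else
            echo "key login FAILED: $target"
            rc=1
        fi
    done
    return $rc
}

# Example (not executed here):
#   check_ssh hscadmin@aqbdb
```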

C) Sudo Setup:

The real security challenge in this solution was resolved by using sudo. On AIX systems and Lpars, the majority of system-related commands like cfgmgr and rmdev can only be executed by the root user, while using the root user on the HMC for remote command execution is a real security hazard. I therefore decided to use hscadmin as the main user for this solution. This hscadmin user is an ordinary user on the AIX Lpars; however, with the help of sudo, it is allowed to execute commands like cfgmgr and rmdev. On the HMC, the same user is given the "Managed system profile".

I used the Sudo tool from the Bull website and installed it in the usual way for software installation on AIX. I then configured sudo to allow the hscadmin user on AIX Lpars to execute rmdev, cfgmgr and other commands.
The contents of the sudoers file on AIX Lpars are as follows. This file has to be created on every Lpar which participates in this solution (including the management Lpar) and ideally should have the same contents.
-----------------------------------------------------------------------------------------------
# Same thing without a password
# %wheel ALL=(ALL) NOPASSWD: ALL

# Samples
# %users ALL=/sbin/mount /cdrom,/sbin/umount /cdrom
# %users localhost=/sbin/shutdown -h now

User_Alias HSC = hscadmin
Host_Alias SERVERS = aqbdb,abqcomm,abqapp,aqbapp,abqdb,bkmecomm,aubcomm,kmeapp,bkmedb,kfhapp,kmecomm,abqapp,bkmeapp,kmedb,kfhdb
Runas_Alias HSC = root, hscadmin
Cmnd_Alias RMDEV = /usr/sbin/rmdev
Cmnd_Alias FND = /usr/bin/find
Cmnd_Alias ODM = /usr/bin/odmget
Cmnd_Alias LSLOT = /usr/sbin/lsslot
Cmnd_Alias CFG = /usr/sbin/cfgmgr
Defaults@SERVERS log_year, logfile=/var/log/sudo.log, !authenticate
hscadmin SERVERS = (HSC) FND , ODM , LSLOT , RMDEV , CFG
----------------------------------------------------------------------------------------------
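
Since the same sudoers file goes onto every Lpar, the host list is the part most likely to drift or to be broken by a bad line wrap. The following is a minimal sketch that builds the Host_Alias line from a list of Lpar names, one per line. The function name host_alias_line is hypothetical, and the alias name SERVERS is simply taken from the listing above.

```shell
# host_alias_line: read one Lpar name per line on stdin and print a single
# Host_Alias line for the alias named in $1, with duplicates removed.
host_alias_line() {
    alias_name=${1:-SERVERS}
    sort -u | awk -v a="$alias_name" '
        NF { list = list (list == "" ? "" : ",") $1 }
        END { if (list != "") print "Host_Alias " a " = " list }'
}

# Example (not executed here):
#   host_alias_line SERVERS < lparsonsys
```

After editing, the syntax of the finished file can be checked with "visudo -c -f" followed by the file name before it is copied to the other Lpars.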


D) Shell Scripts Creation:

Two shell scripts do most of the work in this solution: cdlpar.sh and cdmov.sh. cdlpar.sh is the main script and is executed as the hscadmin user from the management Lpar.
Some static configuration parameters have to be defined once in this shell script. The most important ones are the Unit ID and Bus ID (of the bus containing the IO adapter that has the CD drive as its child device). You can gather this information easily from the HMC and put it into cdlpar.sh once.
When you execute cdlpar.sh, it first logs in to the HMC automatically and displays all P5 servers which are attached to and controlled by this HMC. When you enter the P5 system you want, it sets the values of Unit ID and Bus ID accordingly.
It then shows all Lpars on the selected P5 system and detects the Lpar that currently holds the CD device.
This Lpar is the source Lpar for the DLPAR operation (held in the slpar variable in the script). Now you have to enter the target Lpar (the Lpar to which you want to move the CD device).
After getting all the necessary information, cdlpar.sh performs the actual Dlpar operation. Before that, however, cdlpar.sh calls cdmov.sh on the source Lpar, which deletes all child device definitions from the operating system and then returns control to cdlpar.sh.

If all child devices are deleted successfully by cdmov.sh, cdlpar.sh performs the Dlpar operation and finally executes cfgmgr on the target Lpar to make the CD device available for use.
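
The detection step described above comes down to picking one line out of the comma-separated slot list that the HMC returns. The following is a minimal sketch of that selection in a single awk pass. It assumes input lines of the form drc_index,description,lpar_name with no commas inside the description; the function name find_cd_slot is hypothetical.

```shell
# find_cd_slot: read "drc_index,description,lpar_name" lines on stdin and
# print "drc_index lpar_name" for the slot that holds the CD drive.
# A slot described as "Other Mass Storage Controller" is preferred; any
# other "storage controller" slot is used as a fallback.
# Returns non-zero when no such slot is found.
find_cd_slot() {
    awk -F, '
        $2 == "Other Mass Storage Controller" && best == "" { best = $1 " " $3 }
        tolower($2) ~ /storage controller/ && fallback == "" { fallback = $1 " " $3 }
        END {
            if (best != "") print best
            else if (fallback != "") print fallback
            else exit 1
        }'
}

# Example (not executed here; iores is the file written by cdlpar.sh):
#   set -- `find_cd_slot < iores`
#   drcval=$1 slpar=$2
```

Reading both values from the same line avoids a mismatch between the detected Lpar and the DRC index when more than one storage controller appears on the bus.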

I made the cdmov.sh script slightly interactive, so that before deleting child devices the script shows you all the devices it is going to remove and prompts for a go-ahead. However, this shell script can easily be modified to operate in non-interactive mode.
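
One possible way to make the prompt optional is to wrap it in a small function that skips the question when a switch is set. This is only a sketch: the function name confirm and the ASSUME_YES variable are hypothetical and do not appear in the scripts below.

```shell
# confirm: return 0 when the removal may proceed.
# With ASSUME_YES=1 no question is asked (for example when run from cron);
# otherwise a single answer is read from stdin and only "y" is accepted.
confirm() {
    if [ "${ASSUME_YES:-0}" = "1" ]; then
        return 0
    fi
    printf 'Remove the listed devices? (y/n) '
    read -r ch || return 1
    [ "$ch" = "y" ]
}

# Example (not executed here):
#   if confirm; then sudo -u root rmdev -dl $pcix -R; fi
```

Anything other than "y", including end of input, is treated as a refusal, so an unattended run without ASSUME_YES=1 removes nothing.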
------------------------------------------------------------------------
#Script Name: cdlpar.sh
#Script Purpose: To detect Backup Device present on which Lpar
#Script Purpose: To get devices related information for CD on server level
#Script Purpose: To invoke real Dlpar operations
#Script Presence: To be present only on management Lpar
---------------------------------------------------------------------------------------

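#!/bin/ksh
# chk_err: exit with the status of the previous command if it failed.
# The status is saved first, because echo would otherwise reset $? to 0.
function chk_err
{
rc=$?
if [[ $rc != 0 ]]
then
echo "Exiting on errors... Please check the problem and resolve"
exit $rc
fi
}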


z=`hostname`
echo Following are the managed systems currently managed by this HMC

ssh hscadmin@HMC "lssyscfg -r sys -F name" > mansystems

cat mansystems

echo "please enter the managed system , on which you want to do the Dlpar operation"

read msys

case $msys in

9133-55A-SN65C155G) unitid=U787B.001.DNW9488
busid=3
;;
9133-55A-SN65C154G) unitid=U787B.001.DNW947F
busid=3
;;
9117-570-SN65EAFEE) unitid=U7879.001.DQD12AU
busid=2
;;
9117-570-SN65EB03E) unitid=U7879.001.DQD12AT
busid=2
;;
*) echo "please enter the correct choice for managed system"
echo "exiting from script"
exit 1
;;
esac

export unitid
export busid


echo "Following are the lpars on this managed system $msys"


ssh hscadmin@HMC "lssyscfg -r lpar -m $msys -F name" > lparsonsys

cat lparsonsys

echo

echo

# List the IO slots on the chosen unit and bus as drc_index,description,lpar_name
ssh hscadmin@HMC "lshwres -r io --rsubtype slot -m $msys --filter \"units=$unitid,buses=$busid\" -F drc_index,description,lpar_name" > iores1

# Look for an "Other Mass Storage Controller" slot first and
# fall back to a "Storage controller" slot; field 3 is the owning Lpar
if grep "Other Mass Storage Controller" iores1 1>/dev/null
then
slpar=`grep "Other Mass Storage Controller" iores1 | awk -F\, ' { print $3 } '`
else
slpar=`grep "Storage controller" iores1 | awk -F\, ' { print $3 } '`
fi

export slpar

echo " On this managed system $slpar is the Lpar with CD drive "

echo

echo

echo Now please enter target lpar to which you want to move CD

read tlpar

echo Getting desired DRC Index value .......

echo please wait......................

sleep 2

ssh hscadmin@HMC "lshwres -r io --rsubtype slot -m $msys --filter \"units=$unitid,buses=$busid\" -F drc_index,description,lpar_name" > iores

# Field 1 of the matching line is the DRC index of the slot owned by $slpar
if grep "Other Mass Storage Controller" iores 1>/dev/null
then
drcval=`grep "Other Mass Storage Controller" iores | grep $slpar | awk -F\, ' { print $1 }'`
else
drcval=`grep "Storage controller" iores | grep $slpar | awk -F\, ' { print $1 }'`
fi
echo $drcval

echo " Trying to move physical adapter resource from $slpar LPAR to $tlpar LPAR on system $msys"


echo please wait ................

sleep 2


echo First Removing all child devices on $slpar

if [ "$slpar" = "$z" ]
then
/home/hscadmin/cdmov.sh
n=$?
else
ssh hscadmin@$slpar /home/hscadmin/cdmov.sh
n=$?
fi

if [[ "$n" = 0 ]]

then
echo child devices removed successfully ......

echo please wait .......Performing DLPAR operation now.......

sleep 2
ssh hscadmin@HMC "chhwres -r io -m $msys -o m -p $slpar -t $tlpar -l $drcval"

k=$?

if [[ "$k" = 0 ]]

then echo " DLPAR operation completed successfully "

echo

echo

echo " Now running cfgmgr on $tlpar "

echo " Please wait ..................."

if [ "$tlpar" = "$z" ]
then
sudo -u root cfgmgr
nn=$?
else
ssh hscadmin@$tlpar "sudo -u root cfgmgr "
nn=$?
fi
# Report the cfgmgr result instead of always exiting with 0
exit $nn

else

echo " Dlpar Operation failed "

exit 1

fi

else
echo child devices on system could not be removed ....

echo please remove the problem manually...............

exit 1
fi

exit 0





--------------------------------------------------------------------
# Script Name: cdmov.sh
# Script Purpose: To delete all child devices and clean up
# Script Purpose: from Operating system before actual Dlpar operations
# Script Presence: To be present on all Lpars

----------------------------------------------------------------------


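#!/bin/ksh
cd /home/hscadmin
# chk_err: exit with the status of the previous command.
# The status is saved first, because echo would otherwise reset $? to 0.
function chk_err
{
rc=$?
if [[ $rc != 0 ]]
then
echo "Exiting on errors... Please check the problem and resolve"
exit $rc
else
exit 0
fi
}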

echo Now nullifying old devlist file

echo please wait ..................

sleep 2

> devlist
> devlist2

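# Walk up the ODM device tree: parent of cd0, then the parent of that device
z=`sudo -u root odmget -q name=cd0 CuDv | grep parent | awk -F\" '{ print $2 }'`

y=`sudo -u root odmget -q name=$z CuDv | grep parent | awk -F\" '{ print $2 }'`

# Take the fifth field of the lsslot line that mentions that device
pcix=`sudo -u root lsslot -c slot | grep $y | awk '{print $5}'`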

echo Detecting PCI device containing cdrom.......

sleep 2

echo The pci device $pcix contains cdrom as a child device

echo

echo

echo Now going to display which devices will be removed by this script

echo please wait .............

echo

echo Generating list of devices........

sleep 2

j=`sudo -u root odmget -q parent=$pcix CuDv | grep name | awk -F\" '{ print $2 }'`

for i in `sudo -u root odmget -q parent=$pcix CuDv | grep name | awk -F\" '{ print $2 }'`

do

echo $i >> devlist

x1=`sudo -u root odmget -q parent=$i CuDv | grep name | awk -F\" '{ print $2 }'`

for m in $x1
do
echo $m >> devlist
x2=`sudo -u root odmget -q parent=$m CuDv | grep name | awk -F\" '{ print $2 }'`
echo $x2 >> devlist
done
done
echo List of devices to be removed by removing $pcix is as follows:
echo
cat devlist | awk NF > devlist2
echo
echo
cat devlist2
read
echo "You want to remove $pcix and all above associated devices ?....."
echo 'press "y" to proceed or press "n" to exit safely'
read ch
case $ch in

y) echo you have selected yes ....
echo Now proceeding to delete devices....please wait
sleep 1
sudo -u root rmdev -dl $pcix -R
exit $?
#chk_err
;;
n) echo you have selected no
echo So nothing will be removed from system
exit 1
;;
*) echo invalid choice
exit 1
;;
esac



E) Solution Roll out:

The solution can be rolled out in many ways. As described earlier, the main steps are establishing SSH between the management Lpar and the HMC as well as between the management Lpar and the other Lpars participating in the solution, configuring Sudo, and then placing the cdmov.sh script on every Lpar. Once the solution is rolled out successfully, you are free from the headache of deleting devices from the operating system on Lpars, and of the manual Dlpar operations themselves, before each backup activity. You can also modify the solution to be fully non-interactive or argument-based so that you can schedule these shell scripts through cron, followed by scheduled mksysb or savevg operations on Lpars.
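
Placing cdmov.sh on every Lpar can itself be scripted once SSH keys are in place. The sketch below only prints the scp commands, so that the list can be reviewed before anything is copied. The function name rollout_cmds is hypothetical; the path /home/hscadmin/cdmov.sh and the hscadmin user are the ones used earlier in this article.

```shell
# rollout_cmds: read one Lpar name per line on stdin and print the scp
# command that would copy the script ($1) to the same path on each Lpar.
# Nothing is copied; this is a dry run.
rollout_cmds() {
    script=${1:-/home/hscadmin/cdmov.sh}
    while read -r lpar; do
        [ -z "$lpar" ] && continue
        echo "scp -p $script hscadmin@$lpar:$script"
    done
}

# Example (not executed here):
#   rollout_cmds < lparsonsys          # review the commands
#   rollout_cmds < lparsonsys | sh     # then run them
```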




Note: This article is one of my published articles in AIX UPDATE, UK. It was published in March 2007 edition of AIX Update.
